Network Teaming Gone Wild. Can’t ping VM

If you want to save yourself from the sob story and just learn the fix, it’s just vMotion.

This year I had a few problems with my network teaming. I don’t know if it affects the other load balancing algorithms, but to be exact I am using “Route based on IP hash”. This issue is either not mentioned on the web or a reference fix is hard to find. This is not a complaint against network teaming but rather a good samaritan contribution to the VMware community. I have been using teaming for four years now and I am very happy with my NIC teaming performance, and neither of the two outages I describe here is fair to blame on vSphere. The first incident was on ESXi 5.1 and the second on 5.1 U1.

First Incident: This happened at the beginning of 2013. The UPS feeding the Cisco 3750 switch, the same switch where the ESXi NIC team terminates, went down while the ESXi hosts, which are on a different UPS, stayed up.

Second Incident: This happened on just one VM, a print server. I know one of the guys in my group was messing with this switch earlier that day, so I am 70% sure that whatever he did was the root cause of that one VM’s NIC going down. This one I fixed over the phone. One of our admins called to let me know that, after two hours of battle, he was going to restore the VM from VDP. He sounded defeated, but I recognized the symptoms and was able to advise him to try the fix I describe further down.

Symptoms: The symptoms for these two events were identical. The NIC appears to be up but you can’t ping the VM, and even when you log in locally on the VM you cannot ping anything from it either. The first incident affected about 90% of the VMs. What made it unique and somewhat hard to troubleshoot is that it only affected some VMs and not all of them. The second incident affected only one VM. Rebooting the VM does not fix the issue. Removing the NIC and assigning a new one does not fix it either, as one would think it should.
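Since the tricky part is that only some VMs are affected, a quick sweep helps you see which ones are down before you start moving things around. This is only a rough sketch and not something I ran during either incident: the helper names `ping_once` and `unreachable_hosts` are made up for illustration, and it assumes the system `ping` command is available and that its exit code reflects reachability (on Windows, `ping` can return 0 for a "Destination host unreachable" reply, so treat the result as a hint).

```python
import platform
import subprocess


def ping_once(host, timeout_s=2):
    """Send one ICMP echo with the system ping; True if it exits with 0."""
    # Windows uses -n for the packet count, most other systems use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No answer in time, or ping is not installed: count it as down.
        return False
    return result.returncode == 0


def unreachable_hosts(hosts, probe=ping_once):
    """Return the hosts that fail the probe, in their original order."""
    return [host for host in hosts if not probe(host)]
```

Feed it the list of VM addresses you care about, for example `unreachable_hosts(["192.0.2.10", "192.0.2.11"])` (placeholder addresses). The `probe` argument is there so you can swap in a different check, or a fake one, without touching the network.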

The Fix: vMotion the VM to a different ESXi host. vMotion seems to jump-start the gears and get the juice going again. I wish I could say more about the fix but I really can’t; if you notice, I am just typing this to make the fix seem complicated when it’s not. One might ask, how did I know that vMotion would fix the issue? My technical answer is “my gut told me so” =). Honestly, I just remember stumbling on it. It’s a shame that on this issue “jump-start” and “gut” are the closest I get to being technical. Really, it’s just vMotion.

Notes: There are a few possible fixes that I was not able to test. One is to issue a “reset” on the VM instead of a “reboot”. Another is to briefly unplug the ethernet cable and plug it back in. A third is to reboot the ESXi host.

Reading material: If you want a more detailed explanation of how to troubleshoot “IP hash”, you should read Mike Da Costa’s article on the subject; my comments are at the bottom.
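For a feel of what “IP hash” is doing, here is a minimal sketch of the uplink calculation as it is commonly described in VMware’s documentation: XOR the source and destination IPv4 addresses, then take the result modulo the number of uplinks in the team. The function name `ip_hash_uplink` and the addresses are my own placeholders, and I am not claiming this calculation explains the outages above; it only shows how a flow gets pinned to one uplink.

```python
import ipaddress


def ip_hash_uplink(src_ip, dst_ip, uplink_count):
    """Pick an uplink index as (source IP XOR destination IP) mod uplink count.

    Illustrative only: based on the commonly documented description of
    "Route based on IP hash", not on VMware source code.
    """
    if uplink_count < 1:
        raise ValueError("uplink_count must be at least 1")
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count


# Same VM, two destinations, two uplinks in the team.
print(ip_hash_uplink("10.0.0.5", "10.0.0.20", 2))  # 5 XOR 20 = 17, 17 mod 2 = 1
print(ip_hash_uplink("10.0.0.5", "10.0.0.21", 2))  # 5 XOR 21 = 16, 16 mod 2 = 0
```

The takeaway is that the uplink is chosen per source and destination pair, so the same VM can use different physical NICs for different conversations.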

Summary: There it is. When you see the symptoms I described above and you are using IP hash NIC teaming, remember to include vMotion as the 4th or 5th option in your tool belt.