When you create a vNIC in UCSM on Cisco UCS, you have the option to pin that vNIC to Fabric A or Fabric B and an option to Enable Failover. Most of the servers we deploy on UCS blades run ESX 4, and with ESX we always create two vNICs, one on Fabric A and one on Fabric B, and then let ESX handle the NIC teaming and failover. With the new Palo adapter we create 4 vNICs (eth0-eth3) and assign two to SC/vMotion in a Standard vSwitch and two to VM networking in a Distributed vSwitch or Nexus 1000v.
I was curious about how the UCS level vNIC failover worked so I built a Windows 2008 R2 blade. In the Service Profile I presented only one vNIC to it and checked the Enable Failover option. I was wondering if Windows would see two NICs or one, because someone else had told me that Windows will see two NICs but will auto failover between them if the uplink on Fabric A goes down.
After I loaded the Cisco enic Palo drivers in Windows, it only saw one NIC. Here is a screen shot of Device Manager and the Network Connections window.
To test the failover I first SSH'd into our Cisco 3750E switch to see which 10Gb uplink the MAC address was coming across, and it was using Te1/0/1.
Cisco3750#show mac address-table | include 0025.b50c.95ab
1 0025.b50c.95ab DYNAMIC Te1/0/1
I then started an extended ping to the server and shut down switch port Te1/0/1. Failover happened almost immediately with only one ping missed. I then checked where the MAC address was and it was on the other uplink, for Fabric B.
Cisco3750#show mac address-table | include 0025.b50c.95ab
1 0025.b50c.95ab DYNAMIC Te1/0/2
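If you want to repeat this check without eyeballing the CLI each time, the two outputs above can be compared with a few lines of Python. This is only a minimal sketch: it assumes the four-column row layout shown in this post (VLAN, MAC, type, port), and the helper name `port_for_mac` is made up for illustration. How you collect the switch output (SSH library, copy and paste) is left out on purpose.

```python
import re
from typing import Optional

# One row of `show mac address-table` output as it appears in this post:
#   <vlan> <mac> <type> <port>
# Assumption: other IOS versions may add columns or header lines.
ROW = re.compile(
    r"^\s*(?P<vlan>\d+)\s+"
    r"(?P<mac>[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})\s+"
    r"(?P<type>\S+)\s+"
    r"(?P<port>\S+)\s*$",
    re.IGNORECASE,
)


def port_for_mac(cli_output: str, mac: str) -> Optional[str]:
    """Return the port the MAC was learned on, or None if it is not listed."""
    for line in cli_output.splitlines():
        match = ROW.match(line)
        if match and match.group("mac").lower() == mac.lower():
            return match.group("port")
    return None


# The two outputs captured before and after shutting Te1/0/1.
before = "1 0025.b50c.95ab DYNAMIC Te1/0/1"
after = "1 0025.b50c.95ab DYNAMIC Te1/0/2"
mac = "0025.b50c.95ab"

print(port_for_mac(before, mac), "->", port_for_mac(after, mac))
```

Prompt lines such as `Cisco3750#show mac address-table ...` do not match the row pattern, so the whole terminal capture can be passed in as-is.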
Wow!! that was cool, NIC failover without having to mess with NIC teaming drivers in Windows. This is a much cleaner solution than using the old Intel ProSet, HP or Broadcom tools to create the third virtual NIC. It has to be more efficient using hardware based failover as well.
I then wanted to see what would happen if I brought switch port Te1/0/1 back online. To my surprise it switched back over to Fabric A almost immediately after enabling the switch port, but this time there were 0 pings missed.
I checked the Windows event log and didn’t see any event where Windows detected a loss of network connectivity. This is really cool that Windows never knew anything about the uplink failure.
Very cool stuff.
16 thoughts on “Cisco UCS vNIC Failover”
How are you setting the fabric failover with the 4 vNICs on the Palo card? Are you still just setting them to not failover due to the MAC table issues of failing over with virtual switches in vSphere?
Have you played with QoS at all yet in vSphere?
Good Stuff! Thanks!
I have not been enabling failover on the ESX vNICs but now that you mention it I would like to try it. I have just been alternating which fabric the vNIC is on (eth0 and 2 on A and eth1 and 3 on B).
Makes sense, then ESX would never see a NIC disconnected unless the northbound uplinks for fabric A and B were both down. Like you were saying, there could be a MAC table issue, but there could also be a MAC table issue with the ESX failover. Man, it starts making your head hurt when you try and visualize all the different scenarios.
Interesting, great info. For the Menlo card you still do see two vNICs at the OS when you enable failover. Both of those presented can fail to either fabric. For the Palo the OS sees however many you configure (because it has virtualised hardware) and the failover occurs underneath. Makes sense.
You can use the UCS auto failover with the Emulex and QLogic cards as well. They built a mini switch into the card for each physical interface. That way UCS can switch the traffic if a NIC fails. You cannot manage or see the mini switch as far as I can tell.
Interesting… I have been told by Cisco that you can NOT enable fabric failover when using any of the cards with vSphere and any of the virtual switches. The issue was related to MAC address tables.
When the failover occurs, the MAC table needs to be relearned. That is fine if a virtual machine is talking on the network and the MAC is learned quickly. The problem comes up when a vm isn’t talking (and isn’t on the MAC table because of it). If a packet comes in for the “quiet” vm, then it is dropped because there is no known destination.
Make sense? Do I have wrong information?
I have been told the same, let ESX handle the NIC failover.
There is also a way to disable the MAC aging in UCS – http://www.unifiedcomputingblog.com/?p=55
Not sure if this would prevent the MAC table issue though.
Rodos, do you see two NICs in Windows only if you add one vNIC to the Service Profile?
Aaron and Jeremy, the VM MAC aging issue has been repaired in the latest code releases allowing the hardware failover to be used with VMware:
How are you setting the fabric failover with the Cisco 3825? Will it be the same?
I don’t understand the question, where does the 3825 sit in relation to the 3825?
I am trying to install Cisco UCM on VMware Workstation ACE Edition, but it gives a primary DNS server failure error on my home PC. I then tried it at my office with a fully working DNS server and still got the same issue. I am stuck in this situation.
FYI – the MAC handling with NIC failover is still a very real issue, and is why NIC failover should not be used. I believe Cisco’s bug ID is CSCsw39341, but don’t trust the “fixed in 1.1(1j)” note.
The root of the problem is that after Menlo/Palo fail a NIC from one fabric to the other, the burden is on the server/VM to generate a packet so that the upstream LAN switches move its MACs from one 6100 to the other. However, since the server/VM doesn’t know failover has occurred, it doesn’t know to do anything special. Contrast that with OS-level NIC teaming, where the OS/hypervisor sees a link down, and knows to send gratuitous ARPs or other L2 broadcast traffic out the remaining interfaces on behalf of its MAC(s).
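To make the stale-entry problem concrete, here is a toy model of a single upstream switch. It is a sketch under loud assumptions, not how the 3750 or the 6100 is implemented: the class and function names are invented, MAC aging is ignored, and an unknown destination is simply treated as flooded to every port.

```python
class UpstreamSwitch:
    """Toy L2 switch: learns source MACs per port, forwards by table lookup."""

    def __init__(self):
        self.mac_table = {}  # MAC -> port the MAC was last seen on

    def frame_from(self, src_mac, port):
        # Any frame sourced by the host (data, ARP, gratuitous ARP)
        # refreshes or moves the table entry.
        self.mac_table[src_mac] = port

    def egress_port(self, dst_mac):
        # None means "unknown unicast" (modelled here as flooded).
        return self.mac_table.get(dst_mac)


def delivered(switch, dst_mac, live_port):
    """True if a frame for dst_mac would reach the host's live uplink."""
    port = switch.egress_port(dst_mac)
    # Simplification: an unknown MAC is flooded, so it reaches the live port.
    return port is None or port == live_port


switch = UpstreamSwitch()
mac = "0025.b50c.95ab"

# Host talks while pinned to Fabric A: the switch learns it on Te1/0/1.
switch.frame_from(mac, "Te1/0/1")

# Fabric failover happens in hardware. The host is idle and sends nothing,
# so the switch still points at the old uplink and inbound frames are lost.
print(delivered(switch, mac, live_port="Te1/0/2"))

# The host finally sends any frame; the entry moves and traffic flows again.
switch.frame_from(mac, "Te1/0/2")
print(delivered(switch, mac, live_port="Te1/0/2"))
```

The first check prints `False` and the second `True`. The only thing that changes between them is that the host transmitted something, which is exactly what OS-level teaming triggers on link down and what hardware failover cannot trigger on behalf of an idle host.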
Now, this only affects idle servers. A workaround is to set each server to ping its default gateway ad infinitum.
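As a rough illustration of that workaround, something like the following could run on each server. It is a hedged sketch: the function names and the 192.0.2.1 gateway address are placeholders, and the only real commands used are the standard `ping -n 1` (Windows) and `ping -c 1` (Linux and most other systems). Whether the reply comes back does not matter here; the point is that the server keeps sourcing frames so the upstream switches relearn its MAC.

```python
import platform
import subprocess
import time
from typing import List, Optional


def ping_command(gateway: str, system: Optional[str] = None) -> List[str]:
    """Build a single-echo ping command for the given OS (default: this host)."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, "1", gateway]


def keepalive(gateway: str, interval_s: float = 1.0,
              rounds: Optional[int] = None) -> None:
    """Ping the gateway every interval_s seconds; rounds=None means forever."""
    sent = 0
    while rounds is None or sent < rounds:
        # Output and exit status are ignored on purpose: the outbound frame
        # is what refreshes the MAC tables upstream.
        subprocess.run(ping_command(gateway),
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       check=False)
        sent += 1
        time.sleep(interval_s)
```

Calling `keepalive("192.0.2.1")` with your real default gateway would loop until the process is stopped. Even then, a failover can go unnoticed upstream for up to one interval, so this narrows the window rather than closing it.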
Lowering the UCS MAC aging does not help, since UCS doesn’t broadcast unknown unicast traffic in end-host mode. Besides, the MAC aging time that’s the biggest concern is the one on the upstream LAN switches. You’d need to set 1-second MAC aging timers on UCS and all upstream switches AND run UCS in switch mode to ensure a 1-second response to failover. That’s not realistic.
Note that Cisco’s second gen Qlogic and Emulex mezz cards no longer have failover.
Moral of the story: don’t use it.
“Rodos, do you see two NICs in Windows only if you add one vNIC to the Service Profile?”
I can confirm that if you only have one vNIC in the service profile the OS does indeed see 2 NICs – the vNIC with its associated Service Profile defined MAC and the other physical NIC with its burnt-in MAC. This is on the QLogic cards anyway.
If anyone has a way to stop this second NIC being presented it would be great to know how.
Interesting that the next gen ones don’t have this feature anymore.
If failover is enabled on the VIC (Palo) then the OS only sees 1 NIC. If failover is enabled on Menlo (Qlogic/Emulex) then the OS will see 2 NICs. Not sure why this is the case but this is what I have experienced. I would have expected the OS to see only 1 NIC on all adapters, since the failover happens in hardware.
Great post. Helped me understand a lot of stuff and why people do what they do!
Aaron Delp, that made a lot of sense! Thanks!!