Recently I was involved in an implementation of a private cloud based on Hyper-V Server 2012 and System Center 2012 SP1. We built a two-node Hyper-V cluster (HP DL980 servers) dedicated to Fabric Management. Both nodes in the cluster have a total of four 10 Gbit interfaces (Emulex). Two of them are combined in a NIC team used for the host networks, and the other two are combined in a NIC team for the virtual machine networks. The TeamingMode is configured as SwitchIndependent and the LoadBalancingAlgorithm is configured with the HyperVPort setting.
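For reference, the team and virtual switch for the virtual machine networks were created along these lines; the team, switch, and adapter names below are placeholders, not the exact names used in this environment:
# Team the two VM-facing 10Gbit adapters (SwitchIndependent / HyperVPort)
New-NetLbfoTeam -Name "VMTeam" -TeamMembers "Emulex-NIC3","Emulex-NIC4" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort
# Bind an external Hyper-V virtual switch to the team interface
New-VMSwitch -Name "VMSwitch" -NetAdapterName "VMTeam" -AllowManagementOS $false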
To clarify, the network looks like this:
On this Hyper-V cluster we installed a guest cluster that consists of two Windows Server 2012 virtual machines, each with two vNICs (one LAN adapter and one cluster adapter). During the installation and configuration of this guest cluster both virtual machines resided on one host. As soon as the installation and configuration of the guest cluster were finished, we moved one of the virtual machines (that is, one of the cluster members) to another host. After the virtual machine was moved to the other host, the cluster member was evicted from the cluster after a couple of seconds.
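The move itself was nothing more than a live migration of the clustered virtual machine, roughly like this (VM and host names are illustrative):
# Live migrate the guest cluster member to the second Hyper-V host
Move-ClusterVirtualMachineRole -Name "GuestNode2" -Node "HyperVHost2" -MigrationType Live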
Although the cluster member was evicted with a message that network communication was not possible, both virtual machines could successfully ping each other on all networks. DNS was also functioning correctly, and all host networks were reachable from each host. Moving the virtual machine back to the host on which the other cluster member resides fixed the problem and the cluster returned online with all nodes up.
The cluster validation reports on the Hyper-V cluster and on the guest clusters did not point out any problems with the cluster configuration.
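For completeness, the kind of checks we ran over and over during troubleshooting looked roughly like this (node and cluster names are placeholders):
# From one guest cluster node: verify the other node still answers
Test-Connection -ComputerName "GuestNode2" -Count 4
# Inspect node and network state as seen by the guest cluster
Get-ClusterNode
Get-ClusterNetworkInterface
# Re-run a full validation of the guest cluster
Test-Cluster -Cluster "GuestCluster"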
Changing teaming modes and/or load balancing algorithms of the NIC team behind the Hyper-V virtual switch did not change (or solve) anything. However, when we deleted the NIC teams and connected the virtual switch to a single (stand-alone) adapter, the problem was gone. With this configuration of single NICs the guest cluster members could reside on different nodes without being evicted from the cluster. As soon as the NIC team was restored and the virtual switch was connected to the NIC team again, the cluster member on the other host was evicted once more.
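The changes we cycled through were essentially variations of the following; again, the team, switch, and adapter names are only examples:
# Try a different load balancing algorithm on the existing team
Set-NetLbfoTeam -Name "VMTeam" -LoadBalancingAlgorithm TransportPorts
# Point the virtual switch at a single stand-alone NIC instead of the team
Set-VMSwitch -Name "VMSwitch" -NetAdapterName "Emulex-NIC3"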
Microsoft Premier Support involved
After some days and nights of troubleshooting, Microsoft Premier Support was contacted and a case was logged for this issue. Premier Support engineers went onsite and performed some serious debug sessions on the host and guest clusters. They also tried to reproduce the customer's situation but did not encounter any problems at all with guest clusters. However, they were using other NICs than the customer was using. So Premier Support asked us to create a new NIC team on two different physical adapters (no Emulex NICs). The server was equipped with two Broadcom 1 Gbps NICs (which were not in use), so for this test we decided to create a team with these two Broadcom NICs.
After creating the team and configuring the virtual switch to use the team consisting of the Broadcom adapters, we moved the cluster member to another Hyper-V host and guess what… the cluster member was evicted from the cluster again. So it makes no difference whether we use a NIC team with Emulex or Broadcom interfaces; in both situations the guest cluster will fail.
Premier Support told us that they were using INTEL NICs in their test scenario, and that was the only difference with our setup. We decided to add two INTEL NICs to both Hyper-V hosts, combined both NICs in a NIC team, and pointed the virtual switch to the 'INTEL based' NIC team. Fingers crossed… We moved one of the cluster members to another Hyper-V host and the guest cluster did not evict any node!!! From this we could conclude the following:
- When we build a guest cluster (two virtual machines) in a Hyper-V environment and the virtual machines are connected to a virtual switch which is connected to a NIC team with Broadcom or Emulex physical adapters, we cannot separate the cluster members. When the cluster members are separated onto two hosts, one of the cluster members will be evicted from the cluster.
- The problem does not exist when the virtual switch is connected to a single (stand-alone) NIC.
- The problem does not exist when the NIC team consists of INTEL adapters.
Right now this case is still under investigation, but there is a workaround for NIC teams with Broadcom or Emulex adapters: disable checksum offloading on the Hyper-V hosts for the physical NICs that are members of the NIC team:
# Show the current checksum offload settings of a team member NIC
Get-NetAdapterChecksumOffload -Name "NameOfAdapter"
# Disable checksum offloading on that NIC
Disable-NetAdapterChecksumOffload -Name "NameOfAdapter"
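Since this has to be done for every physical NIC in the team on every Hyper-V host, a small loop can save some typing; the team name below is just an example:
# Disable checksum offloading on all physical members of the team
Get-NetLbfoTeamMember -Team "VMTeam" | ForEach-Object { Disable-NetAdapterChecksumOffload -Name $_.Name }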
An updated driver or firmware for Broadcom and Emulex is expected to solve this issue.
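Once a fixed driver or firmware has been installed, the workaround can presumably be reverted with the matching Enable cmdlet:
# Re-enable checksum offloading after updating the driver/firmware
Enable-NetAdapterChecksumOffload -Name "NameOfAdapter"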