A couple of weeks ago Windows Server 2012 R2 and System Center 2012 R2 reached the GA milestone. We started with a lab environment for validating our designs. During the deployment we experienced connectivity issues with VMs and vNICs. At random, a virtual machine or vNIC would lose connectivity completely. After a simple live migration the virtual machine would regain connectivity. We verified our VLAN configuration a couple of times, and then things got even weirder: after live migrating the virtual machine back to the host where it had lost connectivity, it was still accessible. Most virtual machines were functioning properly, and there was no clear pattern in which virtual machine was affected or when. Without a way to reproduce the issue on demand, it was hard to troubleshoot.
A week later I did an implementation at a customer site. The design was based on a two-node Windows Server 2012 R2 Hyper-V cluster for System Center workloads and a five-node Windows Server 2012 R2 Hyper-V cluster for production workloads. The nodes of the production cluster were deployed using the bare metal deployment process in System Center VMM 2012 R2. All the hosts were deployed successfully, but we had issues creating a cluster from these nodes. The cluster validation wizard showed connectivity issues between the nodes. As you might know from my previous blog on bare metal deployment, System Center VMM 2012 R2 can only create a NIC Team with the Logical Switch if a vSwitch is created on top of the NIC team. This requires vNICs in the ManagementOS for host connectivity. After validating the VLAN configuration we rebooted the host. Connectivity resumed once a host was rebooted, but afterwards other hosts lost connectivity again at random. We were experiencing a situation similar to the one in our lab environment.
There was another similarity between the two environments. Both the customer site and our lab consisted of an HP BladeSystem c7000 with BL460c Gen8 blades that contained HP FlexFabric 10Gb 2-port 554FLB Adapters. These BladeSystems use Virtual Connect technology for converged networking. We had upgraded our Virtual Connect to the latest version, 4.10, before implementing Windows Server 2012 R2, but the customer was still running version 3.75. The HP FlexFabric 10Gb 2-port 554FLB Adapter is based on Emulex hardware, and Microsoft provides an inbox driver for it with version number 10.0.430.570. I contacted my friend Patrick Lownds at HP, who provided me with a link to the Microsoft Windows Server 2012 R2 Supplement for HP Service Pack. Running this did not provide any driver updates. Because the adapter details showed that this is Emulex hardware, I searched the Emulex site, which offered a newer version of the driver. After installing the new driver, version 10.0.430.1003, the issue occurred again.
We submitted a case with Microsoft, and for the last week I have been debugging this issue with a Software Development Engineer from Microsoft (who verified my blog series on NIC Teaming about a year ago). I must say kudos to Silviu for his assistance every evening this week and to Don Stanwyck for communicating with HP. I also reached out to a couple of community members to find out whether the issue sounded familiar. Rob Scheepens (Sr. Support Escalation Engineer at Microsoft Netherlands) was aware of another customer with the same issue on exactly the same hardware, and yesterday evening I was contacted by yet another one. Same issue, same hardware. This morning I was pinged by Kristian Nese, who has a repro of the issue with 2x IBM OCe11102-N2-X Emulex 10GbE in a team (created from VMM) with Emulex driver version 10.0.430.570.
The issue is not solved yet, but I thought that a quick post would prevent a lot of people from wasting valuable time on troubleshooting. If you are affected, please submit a case with your hardware vendor, as this will raise the priority on their side. I'll update the blog with any progress or relevant information.
A possible temporary workaround seems to be configuring the NIC Team members in Active/Passive. I have not been able to test and confirm this.