In my last blog post I sent out a red alert about a killer Windows Update that had not been sufficiently tested. The net result was a full crash of a two-node System Center fabric management cluster. The fabric was still under construction, and the only backups were provisional: Virtual Machine exports of the most important virtual machines. As fellow Hyper-V MVP Aidan Finn put it unambiguously: “Something Has Gone Very Wrong With Microsoft Patch Testing”.
Where did it go wrong?
I was actually demonstrating the fantastic Cluster Aware Updating functionality in Windows Server 2012 clusters, which automatically moves all VMs off a host, updates it, reboots it, live migrates the VMs back to the updated cluster node and then moves on to the next node.
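To make that sequence concrete, here is a minimal Python simulation of the drain, update, reboot and fail-back loop described above. It is only a sketch of the order of operations, not how Cluster Aware Updating is implemented: the `Node` class, the `rolling_update` function and the round-robin placement of drained VMs are all hypothetical names and assumptions of mine.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node and the VMs it hosts."""
    name: str
    vms: list = field(default_factory=list)
    patched: bool = False


def rolling_update(nodes):
    """Patch nodes one at a time and return a log of the steps taken."""
    log = []
    for node in nodes:
        others = [n for n in nodes if n is not node]
        if not others:
            raise ValueError("need at least two nodes to drain a host")

        # Drain: move every VM to another node (round-robin is an assumption).
        moved = list(node.vms)
        for i, vm in enumerate(moved):
            target = others[i % len(others)]
            target.vms.append(vm)
            log.append(f"migrate {vm}: {node.name} -> {target.name}")
        node.vms.clear()

        # Update and reboot the node while it hosts nothing.
        node.patched = True
        log.append(f"update+reboot {node.name}")

        # Fail back: bring the same VMs home to the updated node.
        for i, vm in enumerate(moved):
            source = others[i % len(others)]
            source.vms.remove(vm)
            node.vms.append(vm)
            log.append(f"migrate {vm}: {source.name} -> {node.name}")
    return log


if __name__ == "__main__":
    cluster = [Node("node1", ["dc1", "vmm1"]), Node("node2", ["sql1"])]
    for step in rolling_update(cluster):
        print(step)
```

Note that every node is emptied and refilled through migrations, so any fault in the migration path is exercised on every host in turn.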
The problematic July Update Rollup KB2855336, which was one of the updates to be processed, is actually a collection of fixes, originally for 20 issues, that solve problems in several areas. A still unidentified part of that rollup caused a 0x000000D1 Stop error while live migrating a VM on a Windows Server 2012-based server. Because Cluster Aware Updating uses the Live Migration mechanism to place a host in maintenance mode, the combination with this update sent shockwaves through the cluster. In this case both cluster nodes crashed within minutes.
Ironically enough this same July Update Rollup also contained an important fix for a problem that has been around for some time: Active Directory database becomes corrupted when a Windows Server 2012-based Hyper-V host server crashes (KB2853952).
Assume that you have a Windows Server 2012-based virtualized domain controller on a Windows Server 2012-based Hyper-V host server. When the Hyper-V host server crashes or encounters a power outage, the Active Directory database may become corrupted.
This issue occurs because the guest system requests the Hyper-V server to turn off disk caching on a disk. However, the Hyper-V server misinterprets the request and keeps disk caching enabled.
If you try to disable the write caching manually you will see this error: “Windows could not change the write-caching setting for the device. Your device might not support this feature or changing the setting.” On a physical domain controller this has never been a problem.
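Why does a misinterpreted caching request matter so much? A database only survives a crash if data it believes is on disk really is on disk. The Python sketch below illustrates that general expectation: write, then explicitly ask the operating system to flush to stable storage before continuing. It is an illustration of the principle only, not how Active Directory writes its database, and `durable_write` is a hypothetical helper of mine. If a layer underneath keeps caching anyway, as the KB article describes for the Hyper-V host, the call returns successfully while the data can still be lost in a crash or power outage.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes to path and request a flush to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        # os.write may write fewer bytes than asked, so loop until done.
        while written < len(data):
            written += os.write(fd, data[written:])
        # Ask the OS to push its buffers for this file down to the device.
        # The guarantee is only as good as the layers below honouring it.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "commit.log")
        print(durable_write(target, b"transaction 1 committed\n"))
```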
CORRUPTED DOMAIN CONTROLLER
It takes very little imagination to guess what happened to Active Directory when you combine the full STOP of the fabric management cluster with AD domain controllers that were virtualized on that same Hyper-V cluster without the required updates and hotfixes.