Last week we implemented Update Rollup 7 for System Center Virtual Machine Manager 2012 R2 at one of our customers. After the implementation we experienced some strange issues on the NVGRE gateway cluster. When a tenant removes his network from the Azure Pack portal, the network is removed from VMM and the VMM database, but the resource stays online on the NVGRE cluster. This isn’t a problem until a failover occurs. Then that resource, and only that resource, fails to start on the other node. By itself that is not a big issue either: all other networks come online and start functioning normally.
But the cluster role is now in a failed state and starts playing tennis between the two nodes, trying to bring the resource completely online. And this becomes annoying, because with each failover of a node the connection for the tenant VMs drops for a second.
The solution for now is simple but not something you would like to do every day until the fix is there.
Open SQL Server Management Studio, connect to your VMM database and check whether the ID (the ID from the failed role on the NVGRE cluster) is still in the table:
If the ID is still in your database, something else is wrong. Please investigate that first; if you proceed anyway, you do so at your own risk.
If the ID is no longer in the table, just delete the failed resource from the NVGRE cluster. Then bring the cluster back online.
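If you have more than a handful of failed resources, the check above can be scripted. The following is only a minimal sketch of the decision logic, not something we ran in production. It assumes the IDs are GUID strings and that you have already copied the failed resource IDs from the cluster and the IDs from the VMM table into two lists by hand. The function names are hypothetical, and the script does not touch the cluster or the database.

```python
import uuid


def normalize_id(raw):
    """Parse a GUID string; braces, hyphens and letter case do not matter."""
    return uuid.UUID(raw.strip())


def classify_failed_resources(failed_ids, database_ids):
    """Split failed cluster resource IDs into two lists.

    safe_to_delete: ID is no longer in the VMM table, so the cluster
                    resource is an orphan and can be removed.
    investigate:    ID is still in the VMM table, so something else is
                    wrong and the resource should not be deleted blindly.
    """
    known = {normalize_id(value) for value in database_ids}
    safe_to_delete, investigate = [], []
    for raw in failed_ids:
        if normalize_id(raw) in known:
            investigate.append(raw)
        else:
            safe_to_delete.append(raw)
    return safe_to_delete, investigate


if __name__ == "__main__":
    # Made-up example values, only to show the call.
    failed = ["{3F2504E0-4F89-11D3-9A0C-0305E82C3301}"]
    in_database = ["6fa459ea-ee8a-3ca4-894e-db77e160355e"]
    orphans, suspicious = classify_failed_resources(failed, in_database)
    print("Safe to delete:", orphans)
    print("Investigate first:", suspicious)
```

Normalizing through `uuid.UUID` avoids false mismatches when one tool shows the ID in upper case with braces and the other in lower case without.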
We suspect that it might have something to do with the recent support for multiple IP addresses. If you haven’t installed Update Rollup 7 yet, our advice is to wait until a fix is available.
I would like to mention Bart Leving, who helped me address this issue and test it over and over again.
- Mark Scholman