For some time now I have been battling an issue with my 4-node Hyper-V cluster randomly and fairly frequently losing its connection to the cluster storage drives. I’m at my wits’ end with this one. Once or twice a month I have to bring the whole cluster down and back up to get the drives to re-attach. I am running a quorum witness drive in the cluster.

The SAN is connected via Fibre Channel to my Dell PowerStore, which is chugging along not even noticing the issues.

Any advice on what the issue might be, and possibly how to bring the disks back online without bringing down the whole house, would be much appreciated. I can’t seem to find anything on the net that’s exactly like my issue.
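Ideally I could restart just the disk resources instead of the whole cluster. Something like this sketch is what I’d love to have work (assuming the drives show up as failed Physical Disk resources in Failover Cluster Manager; the disk name below is a placeholder):

```powershell
# Run on any cluster node (FailoverClusters module ships with the feature).
Import-Module FailoverClusters

# Find disk resources that are offline or failed.
Get-ClusterResource |
    Where-Object { $_.ResourceType -eq 'Physical Disk' -and $_.State -ne 'Online' }

# Try to bring one back without touching the rest of the cluster.
# 'Cluster Disk 2' is a placeholder - use whatever shows in the list above.
Start-ClusterResource -Name 'Cluster Disk 2'
```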

Michael


Can you clarify what you mean by “in the cluster”?

Any clues to the cluster issue in the cluster logs or event logs?

What Hyper-V version is this, and is it standalone Hyper-V Server or Server 20xx with the Hyper-V role?
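If you want to dig into the logs, something like this should surface the relevant entries (a sketch; the 5120/5142 IDs are the usual CSV “I/O paused / no longer accessible” events, assuming your cluster disks are CSVs):

```powershell
# Generate cluster.log from every node; -TimeSpan is minutes to look back.
Get-ClusterLog -Destination C:\Temp -TimeSpan 2880

# Storage-related cluster events in the System log:
# 5120 = CSV I/O temporarily paused, 5142 = CSV no longer accessible.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 5120, 5142 } |
    Select-Object TimeCreated, Id, Message |
    Format-List
```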

The quorum drive is clustered, meaning it has been joined to the cluster and sits with the other drives in Failover Cluster Manager so all hosts can access it. If we have set it up wrong, forgive me; I’m a VMware guy. From the error logs I know it’s caused by messages saying one drive or another is having a blip in connectivity (one I can’t find a cause for), and the blips become more regular over several days until it’s too much for the cluster tolerances and it takes the drives offline.

I’ve followed all of the standard troubleshooting references for this kind of issue and can’t seem to find a smoking gun. Or a lucky inadvertent fix. :slight_smile:

All my servers are matched, one-year-old Dell servers connected over a Fibre Channel mesh to a Dell PowerStore. I think it’s 16 Gb to the PowerStore out the back, plus two 10 Gb Ethernet interfaces per server, one for management and one for VM data. Seems like a pretty standard setup. I can’t find whatever is slowing the data down enough to make the virtual environment cache and hang in some instances. Looking at the freshly updated and Dell-verified configuration on the SAN, the load barely taxes it or its interfaces.

Any additional questions please let me know. Any help or suggestions would be greatly appreciated.

Thanks!

Michael

With a Windows Failover cluster, you need to have quorum - a majority vote. With an even number of nodes, that means there needs to be a tie-breaker mechanism.

There are two options: a disk witness (a LUN/disk in the cluster, not a cluster role), or a file share witness (on a server external to the cluster).
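For reference, checking and setting the witness is a one-liner each way (a sketch; the disk resource name and share path are placeholders):

```powershell
# Show the current quorum configuration.
Get-ClusterQuorum

# Option 1: disk witness - pass the cluster disk resource to use.
Set-ClusterQuorum -NodeAndDiskMajority 'Cluster Disk 1'

# Option 2: file share witness on a server outside the cluster.
Set-ClusterQuorum -NodeAndFileShareMajority '\\fileserver\ClusterWitness$'
```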

I’d check HBA firmware and drivers. See if there are updates, and what’s in the changelog.
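A quick way to see what firmware and driver the FC HBAs are actually running before you go hunting for updates (a sketch; the WMI class is the standard FC HBA interface and needs the vendor driver installed to show up):

```powershell
# Fibre Channel HBA model, driver and firmware versions, per adapter.
Get-CimInstance -Namespace root/wmi -ClassName MSFC_FCAdapterHBAAttributes |
    Select-Object Model, DriverName, DriverVersion, FirmwareVersion

# Port WWNs, for matching against the PowerStore host mappings.
Get-InitiatorPort | Select-Object PortAddress, ConnectionType
```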

I’ve had a NIC that got a bugfix via a firmware and driver update in VMware. Before that it constantly caused issues with storage. Dell support was unable to do anything to help; VMware support (7 or 8 years ago, so a very different story from dealing with them now) figured it out after a day or two.

Sorry, to all those who answered what a quorum is: I know what it is. What I wasn’t understanding was the reference to it being clustered. It’s likely just the way it was explained, but I am fully familiar with quorums in a Microsoft cluster setup.

I would review your MPIO and Multipathing.

Ensure your PowerStore OS is version 3 or above; this allows manual mapping of active and standby WWNs.
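To sanity-check MPIO from the host side, something like this (a sketch; assumes the Windows in-box DSM is claiming the PowerStore LUNs):

```powershell
# Current MPIO timers - PDORemovePeriod and path verification matter
# most for transient path blips like yours.
Get-MPIOSetting

# Which hardware IDs the Microsoft DSM is set to claim.
Get-MSDSMSupportedHW

# Per-LUN path state and load-balance policy (legacy tool but handy).
mpclaim.exe -s -d
```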

Hyper-V and its clustering are more complex than the VMware counterpart, but if you’ve got this far, it’s something simple.

I’d expect something in one of the logs though.