Okay, we recently migrated our file server to a Server 2012 R2 VM running on ESXi 5.5 and hooked to an EMC VNXe. A few times a day, seemingly at random (though at least once in the morning and once in the afternoon), it locks up tighter than a drum. If you are logged in with Resource Monitor running (which I have had going for the past few days), you can watch the activity, and none of what we are seeing is indicative of an issue:

10-12 MB/s of disk I/O

10-15 MB/s of network I/O

CPU usage occasionally spikes briefly to 100% utilization, but that has been very rare in my observations.

On the VM host, watching in vCenter, we have seen only one issue: yesterday the host CPU spiked to 100%.

The only error in the event logs on the VM came today, when explorer.exe stopped responding.

Does anyone have any suggestions on where else to start looking? We have checked other tasks and processes, and nothing big is running at the times it locks up.


Which build of 5.5 are you running?
The only other thing I'd suspect is a physical RAM issue.
I've been caught by Shadow Copy before as well, on a really old box doing the same thing.

I'm starting a few test migrations in the next few days and don't want to hit the same issue.
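
If it helps, one quick way to see whether the lockups line up with the Shadow Copy schedule is from an elevated prompt on the guest (a rough sketch; the ShadowCopyVolume{GUID} task names are the defaults Windows creates for scheduled shadow copies):

    vssadmin list shadows
    schtasks /query /fo LIST /v | findstr /i ShadowCopyVolume

The first shows the creation times of the existing shadow copies; the second shows the scheduled tasks that trigger them and their next run times.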

2012 R2 Standard as well, on 5.5.0 build 1623387.

Ahh, didn't see the VNXe part…

Most of my issues with file server performance have come from insufficient RAM allocated for demand. Have you looked at your usage there?

The box has 64 GB of RAM and is maybe under 40% utilization, and the VM has 8 GB assigned to it. Again, no errors on the shadow copy, which is odd.

RAM utilization on the box is maybe 33%. I’ve seen it spike to 55% during the issue.

My Shadow Copy didn't log errors either, but the outages happened at the same times as Shadow Copy's schedule…

It basically locked up the I/O for the disk. It doesn't appear to be the same issue, though; that was a physical box…

How are you connected to the VNXe box?

FC or iSCSI? (Probably iSCSI.)

How many other devices are hitting the VNXe?
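
If you want to rule the array in or out, esxtop on the host is the quickest check (a sketch; the latency numbers are rules of thumb, not gospel):

    esxtop        (run over SSH on the ESXi host)
    u             (disk-device view; 'd' shows adapters instead)

Then watch DAVG/cmd (array-side latency) versus KAVG/cmd (kernel-side latency). Sustained DAVG in the tens of milliseconds during a lockup would point at the VNXe or the iSCSI path rather than the guest.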

What is the RAM utilization in the guest during those spikes? 22% of 64 GB is 14 GB; if you only have 8 GB allocated to the file server guest, that may be your culprit.
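
You could log it through the next spike with something like this from inside the guest (a sketch; adjust the interval and output path to taste):

    typeperf "\Memory\Available MBytes" "\Memory\Pages/sec" -si 5 -o c:\temp\ram_log.csv

That samples every 5 seconds to a CSV you can line up against the lockup window afterwards.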


@B-C I'm leaning more toward Shadow Copy as the culprit. I'm going to have my boss turn it off tomorrow to test.

I've thought about the VNXe box (iSCSI, to answer the question), but we ruled that out before the problem got as bad as it did.

@Sosipater The percentages I posted are for the guest. Server utilization spikes to 75% around those times, but not for the entire duration of the outage (3-10 minutes).

This feels like hardware. I'd check and reseat anything that can be reseated, and I'd do a thorough memory check.

That's on the list as well. Going to try Shadow Copy first, as that's a very easy thing to turn off versus moving all the VMs and then shutting down the box for maintenance.
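
For anyone following along, the test itself can be done per volume in the drive's Properties > Shadow Copies tab, or from the command line, roughly like this (a sketch; {GUID-from-query} is a placeholder for whatever task name the first command returns):

    schtasks /query /fo LIST | findstr /i ShadowCopyVolume
    schtasks /change /tn "ShadowCopyVolume{GUID-from-query}" /disable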

How big is the shadow copy storage area? Is that volume also on the VNXe?

What is the utilization of the VNXe during the spikes?

Is the VNXe doing snapshots at the same time?
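
vssadmin on the guest will tell you where the diff area actually lives and how big it is (quick sketch):

    vssadmin list shadowstorage

That lists, per volume, where the shadow copy storage sits and its used, allocated, and maximum sizes.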

The SAN is where I'd be looking, although physical hardware is still suspect. But if it happens at scheduled times, and those times sync with Shadow Copy, then you've got some resource/CPU/RAM contention.

My fix was virtualizing the server (a fresh reload, then moving the files into the new machine), and the problem went away on the virtual DAS solution. The shadow copies were re-scheduled, even running several more times a day.

@B-C Shadow Copy is also on the VNXe. Local drives would have been my preference; however, it wasn't my decision, so I'm dealing with what I've got. Shadow Copy is off today, so we'll see if the issue comes back.

Is the shadow copy storage on the same volume, or on a different one dedicated to Shadow Copy?

Leaving it on the same volume works, but it can create some pretty big I/O spikes, like you found.

But when it's all really the same physical disks underneath in a virtual environment, I wonder if it matters?

  • the virtualization experts might have the right planning for that one… (different datastores, probably)

So we've been running for two days now with Shadow Copy off, and no hangs. Now comes the search for why it was causing the hangs.
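
First stop for the "why" is probably the VSS events around the old schedule times; something like this from PowerShell on the guest pulls them quickly (a sketch):

    Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName='VSS'} -MaxEvents 50 |
        Format-Table TimeCreated, Id, LevelDisplayName -AutoSize

Cross-referencing those timestamps (plus any volsnap entries in the System log) against the lockup windows should show whether it was the snapshot creation itself or the diff-area I/O hitting the VNXe.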