Hey everyone,

Having an odd issue with my lab storage server after adding a new array. First, some background. I apologize in advance for the long post, but I want to provide as much relevant info as possible. Server specs are as follows:

OS
Ubuntu 14.04.5 LTS

Hardware
-2 x Supermicro SC-216 chassis
-1 x LSI SAS9211-8i (Flashed to IT Mode)
-1 x LSI SAS9200-8e (Flashed to IT Mode)
-Supermicro X9SCL Mobo
-Intel(R) Core™ i3-2130 CPU
-4 x 1GB Elpida EBJ10EE8BAFA-DJ-E Memory

MD RAID Groups
-2 x Silicon Power S60 60GB via SATA (RAID-1 OS Drive)
-2 x Samsung SM843T 480GB Enterprise SATA SSD (480GB RAID-1 block storage via LIO)
-2 x Samsung PM853T 960GB Enterprise SATA SSD (960GB RAID-1 block storage via LIO)
-4 x Seagate ST4000LM024 4TB (8TB RAID-10 block storage via LIO)
-4 x Seagate ST4000LM024 4TB (8TB RAID-10 EXT4 via CIFS/Samba)
-4 x Seagate ST4000LM024 4TB (8TB RAID-10 EXT4 via CIFS/Samba)
-4 x Seagate ST4000LM024 4TB (8TB RAID-10 EXT4 via CIFS/Samba)
-4 x Seagate ST4000LM024 4TB (8TB RAID-10 EXT4 via CIFS/Samba)
-4 x Seagate ST5000LM000 5TB (10TB RAID-10 EXT4 via CIFS/Samba)

The EXT4 logical volumes are pooled via MergerFS and exposed via CIFS/Samba as a single ~30TB share. The 5TB drives have not been added to the MergerFS FUSE mount yet, but are slated to be. SC216 Chassis #1 contains the mobo, LSI HBAs, QLogic QLE2462 FC HBA, the Samsung arrays, and 3 @ (4 x Seagate ST4000LM024 4TB). The LSI SAS9211-8i is connected to the SAS2 backplane in SC216 #1.
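
For reference, the pool is just a fuse.mergerfs fstab entry along these lines (the pool mount point and options here are illustrative, not my exact line):

# hypothetical fstab entry for the pooled share; branch glob and /mnt/pool are examples
/mnt/multimedia/MM0*  /mnt/pool  fuse.mergerfs  defaults,allow_other,use_ino,category.create=mfs  0  0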

The second SC216 chassis’ SAS2 backplane is connected to the 9200-8e HBA via an SFF-8088 to SFF-8088 cable and contains 1 @ (4 x Seagate ST4000LM024 4TB) and 1 @ (4 x Seagate ST5000LM000 5TB). I’ve had this system in “production” running flawlessly for almost 2 years now and, up until a long power outage a month or so ago, the server was sitting at over 635 days of continuous uptime. After powering down gracefully with 3 minutes of UPS reserve remaining, I powered it back up 2 hours later and it’s been 100% stable since. Current uptime is 45+ days.

I’ve grown the storage in this server from under 10TB to now over 50TB, with zero downtime. Everything has worked as expected until I added the most recent 4 x Seagate ST5000LM000 5TB array. Up to this point, the Seagate arrays have been very predictable. Now, before anyone comments, I’m well aware that these Seagate drives are not fast. They are not intended to be fast. However, they are fast enough for my needs in RAID-10 (archive/media storage).

Since these are typically shucked eBay drives, I do test each drive before adding it to an array. Typically, the first test is via USB in a Windows machine to ensure they’re not DOA, then I slap each drive individually into the storage server, dd the drive, partition it, add a temp file system, and mount the drive to test read/write speeds. If I suspect something is amiss, I’ll run badblocks. Truthfully, though, thus far I’ve had no significant outliers, and each drive tests between 120-135MB/s on long sequential writes (up to 40GB). Of the 20 4TB drives + 1 cold spare, I have not had a single drive failure (knock, knock).
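
For anyone curious, the per-drive burn-in is nothing fancy; roughly this (the device name is a placeholder - triple-check it before pointing dd at anything, since every step here is destructive):

DRIVE=/dev/sdX                                                  # placeholder, substitute the drive under test
dd if=/dev/zero of=$DRIVE bs=1M count=40960 oflag=direct        # ~40GB sequential write straight to the disk
parted -s $DRIVE mklabel gpt mkpart primary 0% 100%             # throwaway GPT label + single partition
mkfs.ext4 ${DRIVE}1 && mount ${DRIVE}1 /mnt/test                # temp filesystem for through-the-FS testing
dd if=/dev/zero of=/mnt/test/test.bin bs=1M count=40960 oflag=direct   # long sequential write, watch the MB/s
# badblocks -wsv $DRIVE                                         # only if something looks off (also destructive)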

Additionally, all 5 8TB RAID-10 arrays have assembled in predictable, undramatic fashion. Initial resync speeds for each near-2 RAID-10 array have been ~270-280MB/s, tapering to ~230MB/s near the end of the sync. This is in line with the theoretical maximum write speed of 250-270MB/s for the array, using an average of 125-135MB/s per drive. In fact, the newest array, consisting of the 5TB drives, started in very similar fashion. All 4 drives tested fine on reads/writes, and the initial resync hummed along at ~270MB/s for the first several hours. At approximately the 40% mark, the resync slowed dramatically, down to about 50MB/s. Since then (now at ~86%), the speed has hovered around 40-60MB/s.

md131 : active raid10 sdaa1[0] sdad1[3] sdac1[2] sdab1[1]
9767276032 blocks super 1.2 256K chunks 2 near-copies [4/4] [UUUU]
[=================>…] resync = 85.7% (8371273600/9767276032) finish=527.6min speed=44096K/sec

RAID speed limits have been set as follows:
$ cat /proc/sys/dev/raid/speed_limit_min
100000
$ cat /proc/sys/dev/raid/speed_limit_max
500000
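
(For anyone following along, those were bumped from the kernel defaults of 1000/200000; the values are KB/s per device, and setting them is just:)

echo 100000 > /proc/sys/dev/raid/speed_limit_min    # resync floor, KB/s per device
echo 500000 > /proc/sys/dev/raid/speed_limit_max    # resync ceiling, KB/s per device
sysctl -w dev.raid.speed_limit_min=100000 dev.raid.speed_limit_max=500000   # equivalent; not persistent across reboot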

I checked dmesg and syslog and there’s nothing of note. Curious, I looked at NMON and saw this:

│DiskName Busy Read WriteMB|0 |25 |50 |75 100|
│sdd 2% 0.0 0.3|W
│sdd1 2% 0.0 0.3|W
│sdm1 1% 1.4 0.0|R
│sdl 1% 0.6 0.0|R
│sdl1 1% 0.6 0.0|R
│sdc 2% 0.0 0.3|W
│sdc1 2% 0.0 0.3|WW
│dm-5 2% 0.0 0.3|WW
│sdy 2% 0.9 0.0|R
│sdy1 2% 0.9 0.0|R
│dm-6 2% 4.0 0.0|RR
│sdaa 12% 28.6 0.0|RRRRRRR
│sdaa1 12% 28.6 0.0|RRRRRRR
│sdab 13% 28.6 0.0|RRRRRRR
│sdab1 13% 28.6 0.0|RRRRRRR
│sdac 11% 28.7 0.0|RRRRRR
│sdac1 11% 28.7 0.0|RRRRRR
│sdad 100% 25.1 28.1|RRRRRRRRRRRRRRRRRRRRRRRRWWWWWWWWWWWWWWWWWWWWWWWWWW
│sdad1 100% 25.1 28.1|RRRRRRRRRRRRRRRRRRRRRRRRWWWWWWWWWWWWWWWWWWWWWWWWWW
│Totals Read-MB/s=237.7 Writes-MB/s=58.2 Transfers/sec=2125.6 WWWWWWWWW

Lights on the front of the disk array confirm this - sdad is lit constantly, while the others show intermittent, but rapid activity. Clearly sdad is getting hammered with I/O waits. What I don’t get is why the write behavior looks so lopsided. The array was configured as follows:

mdadm --create /dev/md131 --level=10 --chunk=256 --raid-devices=4 /dev/sdaa1 /dev/sdab1 /dev/sdac1 /dev/sdad1

So sdab1 should be a mirror of sdaa1, and sdad1 should be a mirror of sdac1. As I understand the near-2 implementation, the initial sync copies sdaa1 → sdab1 and sdac1 → sdad1, so I would expect to see roughly equal writes on sdab1 and sdad1, but that’s clearly not the case.
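
If anyone wants to sanity-check that pairing assumption, mdadm will show it directly - with a near=2 layout, RaidDevice slots 0/1 mirror each other and slots 2/3 mirror each other:

mdadm --detail /dev/md131                                     # layout (near=2) and the RaidDevice slot of each member
mdadm --examine /dev/sdaa1 /dev/sdab1 /dev/sdac1 /dev/sdad1   # per-member superblock, including each Device Role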

Finally, the server is under zero duress. All this thing does is CIFS and block storage LUNs via LIO. I run Zabbix in this environment and perf data looks like this:

CPU iowait time 2018-12-18 15:50:15 0.0084 %
Processor load (1 min average per core) 2018-12-18 15:50:10 0.27
Free memory (%) 2018-12-18 15:50:06 86.37 %
Incoming network traffic on bond0 2018-12-18 15:50:50 5.84 Kbps
Outgoing network traffic on bond0 2018-12-18 15:51:53 9.32 Kbps

All the CPU utilization is allocated to building the array at the moment. Periodically, iowait will jump to ~9%, but then immediately settles.

Drive temps are also fine, FWIW.
sdaa: 27°C sdab: 27°C sdac: 27°C sdad: 26°C

Again, I know these Seagate drives are not fast. I’ve worked on NetApp, EMC VNX, HP LeftHand, HP 3PAR, HP Nimble, etc. I know what fast drives are, so please don’t tell me that slow drives are my problem. What I’m experiencing is a discernible change, and that’s what I’m trying to elucidate. I feel like it’s something obvious that I’m just overlooking. The only thing I can think of is that sdad has some unquantifiable issue and is lagging way behind the other drives.
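
For what it’s worth, my first stop on sdad will probably be a quick SMART attribute dump before anything heavier, something like:

smartctl -H -A /dev/sdad
# usual suspects: 5 Reallocated_Sector_Ct, 197 Current_Pending_Sector,
# 198 Offline_Uncorrectable, and 199 UDMA_CRC_Error_Count (199 points at
# cabling/backplane rather than the media itself)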

If anyone knows of another tool to use or metric I’m failing to look at, please, I’m all ears.

Thanks, folks!

Bill

I would suggest using iostat to monitor disk I/O:

iostat -dmx 2 

Can you post this and lsblk?

Also, why are you building the md devices on partitions rather than whole disks?

Here is lsblk from my 8-disk RAID10 box, as well as the USB drive used for the CentOS install.

Note there is no 1 at the end of each device - the arrays are built on whole disks, not partitions.

[sysadmin@storage ~]$ lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda               8:0    0   2.7T  0 disk
└─md0             9:0    0  10.9T  0 raid10
  └─vg0-lv0     253:2    0  10.9T  0 lvm
sdb               8:16   0   2.7T  0 disk
└─md0             9:0    0  10.9T  0 raid10
  └─vg0-lv0     253:2    0  10.9T  0 lvm
sdc               8:32   0   2.7T  0 disk
└─md0             9:0    0  10.9T  0 raid10
  └─vg0-lv0     253:2    0  10.9T  0 lvm
sdd               8:48   0   2.7T  0 disk
└─md0             9:0    0  10.9T  0 raid10
  └─vg0-lv0     253:2    0  10.9T  0 lvm
sde               8:64   0   2.7T  0 disk
└─md0             9:0    0  10.9T  0 raid10
  └─vg0-lv0     253:2    0  10.9T  0 lvm
sdf               8:80   0   2.7T  0 disk
└─md0             9:0    0  10.9T  0 raid10
  └─vg0-lv0     253:2    0  10.9T  0 lvm
sdh               8:112  0   2.7T  0 disk
└─md0             9:0    0  10.9T  0 raid10
  └─vg0-lv0     253:2    0  10.9T  0 lvm
sdi               8:128  0 931.5G  0 disk
├─sdi1            8:129  0     1G  0 part   /boot
└─sdi2            8:130  0 930.5G  0 part
  ├─centos-root 253:0    0    50G  0 lvm    /
  ├─centos-swap 253:1    0   7.9G  0 lvm    [SWAP]
  └─centos-home 253:3    0 872.6G  0 lvm    /home
sdj               8:144  0   2.7T  0 disk
└─md0             9:0    0  10.9T  0 raid10
  └─vg0-lv0     253:2    0  10.9T  0 lvm

You seem to be writing twice to the same physical disk using different partitions in your case now, which would explain the slow sync.

edit: also meant to ask, are you using

watch cat /proc/mdstat

or just typing in cat /proc/mdstat

watch refreshes every 2 seconds, you can get a better view of overall performance if you watch this one for a bit.

Yeah, I know iostat. I’ll take a peek and see if it yields anything interesting. As for the partitioned drives, that was long the standard way of doing things with md RAID for many people. In fact, back when drives were small and folks were using fdisk, there was a specific partition type for exactly this purpose (fd, Linux raid autodetect). Is it necessary? No, of course not; I do it out of years of habit.

That said, I’ve been building mdadm arrays for about 10 or 12 years and I’ve never seen any evidence that it’s problematic to do it that way. The pros and cons have been debated, including here on the Spiceworks forums, but I don’t think I’ve ever seen a truly convincing reason to do it one way or the other. Most importantly, ALL of my current arrays and the dozens I’ve built over the years are built the same way, and none of them exhibit this behavior.

As far as writing twice, that’s just the nature of NMON. It clearly states at the top “Warning: contains duplicates” (which I accidentally truncated). You can see an example here: Monitor system resources on Linux through nmon | 2DayGeek

It’s always going to show the same workload on the partition that’s analogous to the disk, so I don’t think there’s any real duplication that’s causing issues here.

Here’s the output of lsblk:

NAME                                         MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda                                            8:0    0  55.9G  0 disk
├─sda1                                         8:1    0   1.9G  0 part
│ └─md0                                        9:0    0   1.9G  0 raid1  [SWAP]
└─sda2                                         8:2    0    54G  0 part
  └─md1                                        9:1    0    54G  0 raid1  /
sdb                                            8:16   0  55.9G  0 disk
├─sdb1                                         8:17   0   1.9G  0 part
│ └─md0                                        9:0    0   1.9G  0 raid1  [SWAP]
└─sdb2                                         8:18   0    54G  0 part
  └─md1                                        9:1    0    54G  0 raid1  /
sdc                                            8:32   0 447.1G  0 disk
└─sdc1                                         8:33   0 447.1G  0 part
  └─md2                                        9:2    0   447G  0 raid1
    ├─vg_vmc01_vm0-lv_vmc01_quorum (dm-4)    252:4    0     2G  0 lvm
    └─vg_vmc01_vm0-lv_vmc01_vm0 (dm-5)       252:5    0   445G  0 lvm
sdd                                            8:48   0 447.1G  0 disk
└─sdd1                                         8:49   0 447.1G  0 part
  └─md2                                        9:2    0   447G  0 raid1
    ├─vg_vmc01_vm0-lv_vmc01_quorum (dm-4)    252:4    0     2G  0 lvm
    └─vg_vmc01_vm0-lv_vmc01_vm0 (dm-5)       252:5    0   445G  0 lvm
sde                                            8:64   0 894.3G  0 disk
└─sde1                                         8:65   0 894.3G  0 part
  └─md4                                        9:4    0 894.1G  0 raid1
    └─vg_vmc02_vm0-lv_vmc02_vm0 (dm-3)       252:3    0 894.1G  0 lvm
sdf                                            8:80   0 894.3G  0 disk
└─sdf1                                         8:81   0 894.3G  0 part
  └─md4                                        9:4    0 894.1G  0 raid1
    └─vg_vmc02_vm0-lv_vmc02_vm0 (dm-3)       252:3    0 894.1G  0 lvm
sdg                                            8:96   0   3.7T  0 disk
└─sdg1                                         8:97   0   3.7T  0 part
  └─md128                                      9:128  0   7.3T  0 raid10
    └─vg_mm05-lv_mm05 (dm-0)                 252:0    0   7.3T  0 lvm    /mnt/multimedia/MM05
sdh                                            8:112  0   3.7T  0 disk
└─sdh1                                         8:113  0   3.7T  0 part
  └─md128                                      9:128  0   7.3T  0 raid10
    └─vg_mm05-lv_mm05 (dm-0)                 252:0    0   7.3T  0 lvm    /mnt/multimedia/MM05
sdi                                            8:128  0   3.7T  0 disk
└─sdi1                                         8:129  0   3.7T  0 part
  └─md128                                      9:128  0   7.3T  0 raid10
    └─vg_mm05-lv_mm05 (dm-0)                 252:0    0   7.3T  0 lvm    /mnt/multimedia/MM05
sdj                                            8:144  0   3.7T  0 disk
└─sdj1                                         8:145  0   3.7T  0 part
  └─md128                                      9:128  0   7.3T  0 raid10
    └─vg_mm05-lv_mm05 (dm-0)                 252:0    0   7.3T  0 lvm    /mnt/multimedia/MM05
sdk                                            8:160  0   3.7T  0 disk
└─sdk1                                         8:161  0   3.7T  0 part
  └─md129                                      9:129  0   7.3T  0 raid10
    └─vg_mm06-lv_mm06 (dm-6)                 252:6    0   7.3T  0 lvm    /mnt/multimedia/MM06
sdl                                            8:176  0   3.7T  0 disk
└─sdl1                                         8:177  0   3.7T  0 part
  └─md129                                      9:129  0   7.3T  0 raid10
    └─vg_mm06-lv_mm06 (dm-6)                 252:6    0   7.3T  0 lvm    /mnt/multimedia/MM06
sdm                                            8:192  0   3.7T  0 disk
└─sdm1                                         8:193  0   3.7T  0 part
  └─md129                                      9:129  0   7.3T  0 raid10
    └─vg_mm06-lv_mm06 (dm-6)                 252:6    0   7.3T  0 lvm    /mnt/multimedia/MM06
sdn                                            8:208  0   3.7T  0 disk
└─sdn1                                         8:209  0   3.7T  0 part
  └─md130                                      9:130  0   7.3T  0 raid10
    └─vg_mm07-lv_mm07 (dm-7)                 252:7    0   7.3T  0 lvm    /mnt/multimedia/MM07
sdo                                            8:224  0   3.7T  0 disk
└─sdo1                                         8:225  0   3.7T  0 part
  └─md130                                      9:130  0   7.3T  0 raid10
    └─vg_mm07-lv_mm07 (dm-7)                 252:7    0   7.3T  0 lvm    /mnt/multimedia/MM07
sdp                                            8:240  0   3.7T  0 disk
└─sdp1                                         8:241  0   3.7T  0 part
  └─md130                                      9:130  0   7.3T  0 raid10
    └─vg_mm07-lv_mm07 (dm-7)                 252:7    0   7.3T  0 lvm    /mnt/multimedia/MM07
sdq                                           65:0    0   3.7T  0 disk
└─sdq1                                        65:1    0   3.7T  0 part
  └─md127                                      9:127  0   7.3T  0 raid10
    └─vg_mm08-lv_mm08 (dm-1)                 252:1    0   7.3T  0 lvm    /mnt/multimedia/MM08
sdr                                           65:16   0   3.7T  0 disk
└─sdr1                                        65:17   0   3.7T  0 part
  └─md127                                      9:127  0   7.3T  0 raid10
    └─vg_mm08-lv_mm08 (dm-1)                 252:1    0   7.3T  0 lvm    /mnt/multimedia/MM08
sds                                           65:32   0   3.7T  0 disk
└─sds1                                        65:33   0   3.7T  0 part
  └─md127                                      9:127  0   7.3T  0 raid10
    └─vg_mm08-lv_mm08 (dm-1)                 252:1    0   7.3T  0 lvm    /mnt/multimedia/MM08
sdt                                           65:48   0   3.7T  0 disk
└─sdt1                                        65:49   0   3.7T  0 part
  └─md127                                      9:127  0   7.3T  0 raid10
    └─vg_mm08-lv_mm08 (dm-1)                 252:1    0   7.3T  0 lvm    /mnt/multimedia/MM08
sdu                                           65:64   0   3.7T  0 disk
└─sdu1                                        65:65   0   3.7T  0 part
  └─md126                                      9:126  0   7.3T  0 raid10
    └─vg_vmc01_data01-lv_vmc01_data01 (dm-2) 252:2    0   7.3T  0 lvm
sdv                                           65:80   0   3.7T  0 disk
└─sdv1                                        65:81   0   3.7T  0 part
  └─md126                                      9:126  0   7.3T  0 raid10
    └─vg_vmc01_data01-lv_vmc01_data01 (dm-2) 252:2    0   7.3T  0 lvm
sdw                                           65:96   0   3.7T  0 disk
└─sdw1                                        65:97   0   3.7T  0 part
  └─md126                                      9:126  0   7.3T  0 raid10
    └─vg_vmc01_data01-lv_vmc01_data01 (dm-2) 252:2    0   7.3T  0 lvm
sdx                                           65:112  0   3.7T  0 disk
└─sdx1                                        65:113  0   3.7T  0 part
  └─md126                                      9:126  0   7.3T  0 raid10
    └─vg_vmc01_data01-lv_vmc01_data01 (dm-2) 252:2    0   7.3T  0 lvm
sdy                                           65:128  0   3.7T  0 disk
└─sdy1                                        65:129  0   3.7T  0 part
  └─md129                                      9:129  0   7.3T  0 raid10
    └─vg_mm06-lv_mm06 (dm-6)                 252:6    0   7.3T  0 lvm    /mnt/multimedia/MM06
sdz                                           65:144  0   3.7T  0 disk
└─sdz1                                        65:145  0   3.7T  0 part
  └─md130                                      9:130  0   7.3T  0 raid10
    └─vg_mm07-lv_mm07 (dm-7)                 252:7    0   7.3T  0 lvm    /mnt/multimedia/MM07
sdaa                                          65:160  0   4.6T  0 disk
└─sdaa1                                       65:161  0   4.6T  0 part
  └─md131                                      9:131  0   9.1T  0 raid10
sdab                                          65:176  0   4.6T  0 disk
└─sdab1                                       65:177  0   4.6T  0 part
  └─md131                                      9:131  0   9.1T  0 raid10
sdac                                          65:192  0   4.6T  0 disk
└─sdac1                                       65:193  0   4.6T  0 part
  └─md131                                      9:131  0   9.1T  0 raid10
sdad                                          65:208  0   4.6T  0 disk
└─sdad1                                       65:209  0   4.6T  0 part
  └─md131                                      9:131  0   9.1T  0 raid10

Thanks!

Yes, I looked at nmon after posting.

You can also try glances, though it is more a total system overview. I think iostat will give you the most relevant data.
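
Something like this, limited to the members of the new array (device names taken from your lsblk), should make it obvious whether sdad’s await/%util is the outlier:

iostat -dmx sdaa sdab sdac sdad 2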

Another thing to consider is that an md resync will slow down a lot when there are disk errors on one of the disks involved. Could be one of the disks isn’t entirely ‘good’.

I’m with you - I’m also leaning toward a bad disk. I suppose it makes sense if the first part of the drive was fine before it started hitting bad blocks nearly halfway through. sigh Oh well, with 20 x 4TB drives and now 4 x 5TB drives that were never meant for this kind of abuse, it was inevitable, I suppose. I’ll run some SMART tests and, if nothing turns up, spend the holiday running badblocks and report back. Though sdad looks like a good place to start. :)
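
Rough plan, holding off on anything destructive until sdad is actually failed out of md131:

smartctl -t long /dev/sdad          # kick off the long self-test (many hours on a 5TB drive)
smartctl -l selftest /dev/sdad      # check the result once it finishes
# badblocks -wsv -b 4096 /dev/sdad  # destructive write pass, only after the drive is out of the array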

Thanks for reading, momurda!