Hey everyone,

Having an odd issue in my lab storage server after adding a new array. First some background, and I apologize in advance for the long post, but I hope to provide as much relevant info as possible. Server specs are as follows:
OS
Ubuntu 14.04.5 LTS

Hardware
- 2 x Supermicro SC-216 chassis
- 1 x LSI SAS9211-8i (flashed to IT mode)
- 1 x LSI SAS9200-8e (flashed to IT mode)
- Supermicro X9SCL mobo
- Intel Core i3-2130 CPU
- 4 x 1GB Elpida EBJ10EE8BAFA-DJ-E memory

MD RAID Groups
- 2 x Silicon Power S60 60GB via SATA (RAID-1 OS drive)
- 2 x Samsung SM843T 480GB Enterprise SATA SSD (480GB RAID-1 block storage via LIO)
- 2 x Samsung PM853T 960GB Enterprise SATA SSD (960GB RAID-1 block storage via LIO)
- 4 x Seagate ST4000LM024 4TB (8TB RAID-10 block storage via LIO)
- 4 x Seagate ST4000LM024 4TB (8TB RAID-10 EXT4 via CIFS/Samba)
- 4 x Seagate ST4000LM024 4TB (8TB RAID-10 EXT4 via CIFS/Samba)
- 4 x Seagate ST4000LM024 4TB (8TB RAID-10 EXT4 via CIFS/Samba)
- 4 x Seagate ST4000LM024 4TB (8TB RAID-10 EXT4 via CIFS/Samba)
- 4 x Seagate ST5000LM000 5TB (10TB RAID-10 EXT4 via CIFS/Samba)
The EXT4 logical volumes are pooled via MergerFS and exposed via CIFS/Samba as a single ~30TB share. The 5TB drives have not been added to the MergerFS FUSE mount yet, but are slated to be. SC216 chassis #1 contains the mobo, the LSI HBAs, a QLogic QLE2462 FC HBA, the Samsung arrays, and three of the 4 x Seagate ST4000LM024 4TB groups. The LSI SAS9211-8i is connected to the SAS2 backplane in SC216 #1.

The second SC216 chassis' SAS2 backplane is connected to the 9200-8e HBA via an SFF-8088 to SFF-8088 cable and contains one 4 x Seagate ST4000LM024 4TB group and the 4 x Seagate ST5000LM000 5TB group. I've had this system in "production" running flawlessly for almost 2 years now and, up until a long power outage a month or so ago, the server was sitting at over 635 days of continuous uptime. After powering down gracefully with 3 minutes of UPS reserve remaining, I powered it back up 2 hours later and it's been 100% stable since. Current uptime is 45+ days.

I've grown the storage in this server from under 10TB to now over 50TB, with zero downtime. Everything has worked as expected until I added the most recent 4 x Seagate ST5000LM000 5TB array. Up to this point, the Seagate arrays have been very predictable. Now, before anyone comments: I'm well aware that these Seagate drives are not fast. They are not intended to be fast. However, they are fast enough for my needs in RAID-10 (archive/media storage).

Since these are typically shucked eBay drives, I test each drive before adding it to an array. Typically, the first test is via USB in a Windows machine to make sure it's not DOA; then I slap each drive individually into the storage server, dd the drive, partition it, add a temporary file system, and mount it to test read/write speeds. If I suspect something is amiss, I'll run badblocks. Truthfully, though, I've had no significant outliers so far, and each drive tests between 120-135MB/s on long sequential writes (up to 40GB). Of the 20 4TB drives + 1 cold spare I have, I have not had a single drive failure (knock, knock).

Additionally, all five 8TB RAID-10 arrays have assembled in predictable, undramatic fashion. Initial resync speeds for each near-2 RAID-10 array have been ~270-280MB/s, tapering to ~230MB/s near the end of the sync. This is in line with the theoretical maximum write speed of 250-270MB/s for the array, using an average of 125-135MB/s per drive. In fact, the newest array, consisting of the 5TB drives, started in very similar fashion. All 4 drives tested fine on reads and writes, and the initial resync hummed along at ~270MB/s for the first several hours. At approximately the 40% mark, the resync slowed dramatically, down to about 50MB/s. Since then (now at ~86%), the speed has hovered around 40-60MB/s.
md131 : active raid10 sdaa1[0] sdad1[3] sdac1[2] sdab1[1]
      9767276032 blocks super 1.2 256K chunks 2 near-copies [4/4] [UUUU]
      [=================>...]  resync = 85.7% (8371273600/9767276032) finish=527.6min speed=44096K/sec
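For reference, the resync progress and the kernel's current rebuild rate can also be read straight from the md sysfs tree; a minimal sketch, assuming the md131 name from above:

# Refresh /proc/mdstat every 2 seconds instead of cat'ing it repeatedly
watch -n 2 cat /proc/mdstat

# The same information is exposed per-array under sysfs
cat /sys/block/md131/md/sync_action      # reads "resync" while the initial build is running
cat /sys/block/md131/md/sync_completed   # sectors completed / total sectors
cat /sys/block/md131/md/sync_speed       # current rate in KiB/s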
RAID speed limits have been set as follows:

$ cat /proc/sys/dev/raid/speed_limit_min
100000
$ cat /proc/sys/dev/raid/speed_limit_max
500000
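The same knobs are reachable through sysctl if anyone wants to confirm the resync isn't being throttled; a minimal sketch (the 200000 below is just an illustrative bump, not a value I've applied):

# View the md rebuild throttle (values are in KiB/s, per device)
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# Temporarily raise the floor to rule out throttling; not persistent across reboots
sysctl -w dev.raid.speed_limit_min=200000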
I checked dmesg and syslog and there's nothing of note. Curious, I looked at NMON and saw this:

│DiskName  Busy   Read WriteMB|0          |25         |50          |75       100|
│sdd         2%    0.0     0.3|W
│sdd1        2%    0.0     0.3|W
│sdm1        1%    1.4     0.0|R
│sdl         1%    0.6     0.0|R
│sdl1        1%    0.6     0.0|R
│sdc         2%    0.0     0.3|W
│sdc1        2%    0.0     0.3|WW
│dm-5        2%    0.0     0.3|WW
│sdy         2%    0.9     0.0|R
│sdy1        2%    0.9     0.0|R
│dm-6        2%    4.0     0.0|RR
│sdaa       12%   28.6     0.0|RRRRRRR
│sdaa1      12%   28.6     0.0|RRRRRRR
│sdab       13%   28.6     0.0|RRRRRRR
│sdab1      13%   28.6     0.0|RRRRRRR
│sdac       11%   28.7     0.0|RRRRRR
│sdac1      11%   28.7     0.0|RRRRRR
│sdad      100%   25.1    28.1|RRRRRRRRRRRRRRRRRRRRRRRRWWWWWWWWWWWWWWWWWWWWWWWWWW
│sdad1     100%   25.1    28.1|RRRRRRRRRRRRRRRRRRRRRRRRWWWWWWWWWWWWWWWWWWWWWWWWWW
│Totals Read-MB/s=237.7 Writes-MB/s=58.2 Transfers/sec=2125.6
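If it really is sdad itself dragging, per-drive checks outside of md should show it; a minimal sketch, assuming smartmontools and hdparm are installed (these aren't checks I've run yet, just the obvious next ones):

# SMART error counters on the suspect drive; nonzero reallocated/pending sectors would explain a lot
smartctl -a /dev/sdad | egrep -i 'reallocated|pending|uncorrect|crc'

# Raw sequential read from the drive, bypassing md, for comparison against its three peers
hdparm -tT /dev/sdad

# Any I/O errors or resets logged against the drive
dmesg | grep -i sdad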
The activity lights on the front of the disk array back up what NMON shows: sdad is lit constantly, while the others show intermittent but rapid activity. Clearly sdad is getting hammered with I/O waits. What I don't get is why the write behavior looks so lopsided. The array was configured as follows:

mdadm --create /dev/md131 --level=10 --chunk=256 --raid-devices=4 /dev/sdaa1 /dev/sdab1 /dev/sdac1 /dev/sdad1

So sdab1 should be a mirror of sdaa1, and sdad1 should be a mirror of sdac1. As I understand the near-2 implementation, the initial resync is done from sdaa1 → sdab1 and sdac1 → sdad1, so I would expect to see an equal number of writes on sdab1 and sdad1, but that's clearly not the case.
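One way to sanity-check that pairing, rather than inferring it from the create order, is to ask mdadm directly; a minimal sketch against the array above (for a 4-device near=2 layout, RaidDevice 0+1 and 2+3 should be the two mirror pairs):

# Shows "Layout : near=2" plus each member's RaidDevice number
mdadm --detail /dev/md131

# Per-member state is also visible in sysfs (dev-* entries)
grep . /sys/block/md131/md/dev-sd*/state 2>/dev/null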
Finally, the server is under zero duress. All this thing does is CIFS and block storage LUNs via LIO. I run Zabbix in this environment and the perf data looks like this:

CPU iowait time                           2018-12-18 15:50:15   0.0084 %
Processor load (1 min average per core)   2018-12-18 15:50:10   0.27
Free memory (%)                           2018-12-18 15:50:06   86.37 %
Incoming network traffic on bond0         2018-12-18 15:50:50   5.84 Kbps
Outgoing network traffic on bond0         2018-12-18 15:51:53   9.32 Kbps

Essentially all of the CPU utilization is going to building the array at the moment. Periodically, iowait will jump to ~9%, but it immediately settles.

Drive temps are also fine, FWIW:

sdaa: 27°C  sdab: 27°C  sdac: 27°C  sdad: 26°C

Again, I know these Seagate drives are not fast. I've worked on NetApp, EMC VNX, HP LeftHand, HP 3Par, HP Nimble, etc. I know what fast drives are, so please don't tell me that slow drives are my problem. What I'm experiencing is a discernible change, and that's what I'm trying to track down. I feel like it's something obvious that I'm just overlooking. The only thing I can think of is that sdad has some unquantifiable issue and is lagging way behind the other drives.

If anyone knows of another tool to use or a metric I'm failing to look at, please, I'm all ears.

Thanks, folks!

Bill
matt234 replied:

I would suggest using iostat to monitor disk I/O:

iostat -dmx 2

Can you post this and lsblk? Also, why are you using already-partitioned space to make the md devices?

Here is /proc/mdstat from my 8-disk RAID-10, as well as the USB drive for the CentOS install:
[sysadmin@storage ~]$ lsblk
NAME              MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda                 8:0    0   2.7T  0 disk
└─md0               9:0    0  10.9T  0 raid10
  └─vg0-lv0       253:2    0  10.9T  0 lvm
sdb                 8:16   0   2.7T  0 disk
└─md0               9:0    0  10.9T  0 raid10
  └─vg0-lv0       253:2    0  10.9T  0 lvm
sdc                 8:32   0   2.7T  0 disk
└─md0               9:0    0  10.9T  0 raid10
  └─vg0-lv0       253:2    0  10.9T  0 lvm
sdd                 8:48   0   2.7T  0 disk
└─md0               9:0    0  10.9T  0 raid10
  └─vg0-lv0       253:2    0  10.9T  0 lvm
sde                 8:64   0   2.7T  0 disk
└─md0               9:0    0  10.9T  0 raid10
  └─vg0-lv0       253:2    0  10.9T  0 lvm
sdf                 8:80   0   2.7T  0 disk
└─md0               9:0    0  10.9T  0 raid10
  └─vg0-lv0       253:2    0  10.9T  0 lvm
sdh                 8:112  0   2.7T  0 disk
└─md0               9:0    0  10.9T  0 raid10
  └─vg0-lv0       253:2    0  10.9T  0 lvm
sdi                 8:128  0 931.5G  0 disk
├─sdi1              8:129  0     1G  0 part   /boot
└─sdi2              8:130  0 930.5G  0 part
  ├─centos-root   253:0    0    50G  0 lvm    /
  ├─centos-swap   253:1    0   7.9G  0 lvm    [SWAP]
  └─centos-home   253:3    0 872.6G  0 lvm    /home
sdj                 8:144  0   2.7T  0 disk
└─md0               9:0    0  10.9T  0 raid10
  └─vg0-lv0       253:2    0  10.9T  0 lvm

Note there's no 1 at the end of each device. You seem to be writing twice to the same physical disk using different partitions in your case, which would explain the slow sync.

Edit: I also meant to ask, are you using

watch cat /proc/mdstat

or just typing cat /proc/mdstat? watch refreshes every 2 seconds; you can get a better view of overall performance if you watch it for a bit.

billnutz replied:

Yeah, I know iostat. I'll take a peek and see if it yields anything interesting. As far as the partitioned drives go, that was long the standard way of doing things with md RAID for many people. In fact, back when drives were small and folks were using fdisk, there was a specific partition type called Linux RAID autodetect (type fd) used exactly for this purpose. Is it necessary? No, of course not; I do it out of years of habit. That said, I've been building mdadm arrays for about 10 or 12 years and I've never seen any evidence that it's problematic to do it that way. The pros and cons have been debated, including here on the Spiceworks forums, but I don't think I've ever seen a truly convincing reason to do it one way or the other. Most importantly, ALL of my current arrays, and the dozens I've built over the years, are built the same way, and they do not exhibit this behavior.
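For anyone who does want to follow the partition-per-member convention, tagging the partition type is a one-liner; a minimal sketch with a hypothetical blank disk /dev/sdX (sgdisk comes from the gdisk package):

# GPT: one full-size partition, tagged with the Linux RAID typecode (fd00)
sgdisk --new=1:0:0 --typecode=1:fd00 /dev/sdX

# MBR: in fdisk, create the partition with n, then t and type fd (Linux RAID autodetect), then w
fdisk /dev/sdX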