Hi! I've been poking at this for over a week, trying to be as careful as possible and getting the info together. The array lives in a Dell Precision 7820 using Intel IMSM firmware RAID, and the machine is blinking its power light accordingly. I think I'm in a fairly good place after /dev/sdb went clicky. Maybe it'll just be some mdadm.conf edits and an --assemble. It's just a simple single partition, and it contains my /home, so I'm a little anxious.

I’ll start with
cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
unused devices:

and
/etc/mdadm.conf,

# definitions of existing MD arrays
ARRAY metadata=imsm UUID=ad4ce666:0ad11124:6baef5eb:73c408df
ARRAY /dev/md/Volume0 container=ad4ce666:0ad11124:6baef5eb:73c408df member=0 UUID=b56693ad:788460f8:a8b93419:703b25c4
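(For reference, an easy way to compare those config entries against what the disks actually report, without writing anything:

mdadm --examine --scan          # prints ARRAY lines built from the on-disk metadata
grep ^ARRAY /etc/mdadm.conf     # the entries above, for comparison

Looking at the conf above next to the --examine output further down, the UUIDs don't appear to match, which I suspect is part of my problem.)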

Then I got the info from each drive with ‘mdadm --examine /dev/sd[a-d]’

/dev/sdb is marked in the Intel IMSM BIOS utility as a spare, and it shows up as

/dev/sdb:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.0.00
    Orig Family : 00000000
         Family : abe1d35a
     Generation : 00000001
  Creation Time : Unknown
     Attributes : All supported
           UUID : 00000000:00000000:00000000:00000000
       Checksum : 57c3a6b4 correct
    MPB Sectors : 1
          Disks : 1
   RAID Devices : 0

  Disk00 Serial : WD-WCAW32827194
          State : spare
             Id : 01000000
    Usable Size : 1953522958 (931.51 GiB 1000.20 GB)

    Disk Serial : WD-WCAW32827194
          State : spare
             Id : 01000000
    Usable Size : 1953522958 (931.51 GiB 1000.20 GB)

The other drives all match UUIDs and other info, and seem totally sane and healthy. For example,

/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.2.01
    Orig Family : 6453e91b
         Family : 6453e91b
     Generation : 0003c05c
  Creation Time : Unknown
     Attributes : All supported
           UUID : bb0620fd:8274f3f7:498f64a4:dacbd25f
       Checksum : 8e30356f correct
    MPB Sectors : 2
          Disks : 4
   RAID Devices : 1

  Disk00 Serial : WD-WCC3F3ARCLKS
          State : active
             Id : 00000000
    Usable Size : 1953514766 (931.51 GiB 1000.20 GB)

[Volume0]:
       Subarray : 0
           UUID : 5a9f14bb:a252fd06:f08cf7cf:b920b29e
     RAID Level : 10
        Members : 4
          Slots : [__UU]
    Failed disk : 0
      This Slot : 0 (out-of-sync)
    Sector Size : 512
     Array Size : 3906994176 (1863.00 GiB 2000.38 GB)
   Per Dev Size : 1953499136 (931.50 GiB 1000.19 GB)
  Sector Offset : 0
    Num Stripes : 7630848
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : failed
    Dirty State : clean
     RWH Policy : off
      Volume ID : 1

  Disk01 Serial : 57E07H1MS:0
          State : active
             Id : ffffffff
    Usable Size : 1953514766 (931.51 GiB 1000.20 GB)

  Disk02 Serial : WD-WCC6Y0RL73N1
          State : active
             Id : 00000002
    Usable Size : 1953514766 (931.51 GiB 1000.20 GB)

  Disk03 Serial : WD-WCC6Y0LPDK7T
          State : active
             Id : 00000003
    Usable Size : 1953514766 (931.51 GiB 1000.20 GB)

blkid shows all 4 drives as TYPE="isw_raid_member".
None of the /dev/md* devices exist yet.
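For reference, those two checks are just (read-only, same device names as above):

blkid /dev/sd[a-d]   # each member reports TYPE="isw_raid_member"
ls /dev/md*          # errors with "No such file or directory" while nothing is assembled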

While getting this together, I did notice that /dev/sdb is a 3 Gb/s drive and the others are 6 Gb/s, so I'll find another drive and copy the partition table over like I did with this one.
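For the copy step, this is the sgdisk sequence I have in mind, assuming the healthy source is /dev/sda and the blank replacement shows up as /dev/sdb; the disk named in --replicate= is the one that gets overwritten and the trailing device is the source, so the names need triple-checking. (As far as I understand, IMSM uses whole disks as members, so the table copy may not even be strictly required.)

sgdisk --replicate=/dev/sdb /dev/sda   # copy /dev/sda's GPT onto /dev/sdb
sgdisk --randomize-guids /dev/sdb      # give the copy its own disk/partition GUIDs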

Here's where I lose a sense of the order of operations. Do I somehow bring the 3 remaining drives up as a degraded RAID 10, then get it to recognize the spare and let it rebuild itself? I know the /dev/md? device probably needs to show up before anything useful is going to happen. I've tried "mdadm --assemble --scan /dev/md126 /dev/sd[abcd]" (it returns "/dev/sd? not identified in config file" for each drive). I have plenty of tabs open for Stack Exchange and Server Fault with clues, but nothing that's been close enough to be comfortable moving forward.
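As far as I can tell from the IMSM docs, it's a two-step assembly: the container comes up first, then the member volume gets started out of the container. Here's a minimal sketch of the sequence as I understand it (the /dev/md0 name is just what I'm using for the container; nothing here writes to the disks beyond creating the md device nodes):

# 1) assemble the IMSM container from the surviving member disks
mdadm --assemble --verbose /dev/md0 /dev/sda /dev/sdc /dev/sdd

# 2) ask mdadm to start the member array (Volume0) from the container;
#    a 4-disk RAID 10 can run degraded as long as one disk of each mirror pair is present
mdadm -I /dev/md0

# 3) see what actually came up
cat /proc/mdstat

My guess is the "not identified in config file" errors come from --scan forcing mdadm to match the listed devices against /etc/mdadm.conf, whose UUIDs no longer match what --examine reports from the disks.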

It’ll be great to recover from this and learn how to deal with it in the future. I’ll be upgrading the drives to 3TB by the end of the year, so having a bit more experience should help with that.

Safety nets and other assistance appreciated.


I did find an article called "Rebuilding software RAID without mdadm.conf" that makes sense and lays out a good set of steps. I went ahead and tried --assemble:

mdadm --assemble --uuid=bb0620fd:8274f3f7:498f64a4:dacbd25f --verbose /dev/md0 /dev/sda /dev/sdc /dev/sdd

mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 3.
mdadm: added /dev/sdc to /dev/md0 as 2
mdadm: added /dev/sdd to /dev/md0 as 3
mdadm: added /dev/sda to /dev/md0 as 0
mdadm: Container /dev/md0 has been assembled with 3 drives

I noted that it hasn't been started, just assembled, and then checked the details for /dev/md0:

mdadm --detail /dev/md0
/dev/md0:
           Version : imsm
        Raid Level : container
     Total Devices : 3

   Working Devices : 3


              UUID : bb0620fd:8274f3f7:498f64a4:dacbd25f
     Member Arrays :

    Number   Major   Minor   RaidDevice

       -       8       32        -        /dev/sdc
       -       8        0        -        /dev/sda
       -       8       48        -        /dev/sdd

I’ll assume that’s normal when the array isn’t “running”. When I get a proper replacement drive and use sgdisk to copy the partition table to it, I’ll try to “mdadm --manage /dev/md0 --add /dev/sdb” and see what happens. I won’t run “mdadm --detail --scan >> /etc/mdadm/mdadm.conf” till the array is actually running.
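To put that plan into a sketch (my reading of the IMSM workflow is that the spare gets added to the container rather than the member array, and mdmon then uses it to rebuild the degraded volume once it's running; device names as above, so double-check them first):

# add the replacement disk to the IMSM container
mdadm --manage /dev/md0 --add /dev/sdb

# watch the member array rebuild
watch cat /proc/mdstat

# only once the volume is up and clean, record the working config
mdadm --detail --scan >> /etc/mdadm/mdadm.conf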

Here are a few steps that could help. These were pulled from the sources below.

Linux RAID Wiki
Intel IMSM RAID Recovery – Blog
Arch Wiki on Software RAID

Assemble the Container
bash
mdadm --assemble /dev/md0 /dev/sda /dev/sdc /dev/sdd

Bring Up the Member Array (e.g., RAID10 volume)
bash
mdadm --assemble --run /dev/md126 /dev/sda /dev/sdc /dev/sdd

Add a Replacement Disk
bash
sgdisk --replicate=/dev/sdb /dev/sda
sgdisk --randomize-guids /dev/sdb
mdadm --manage /dev/md126 --add /dev/sdb

Monitor Rebuild
bash
watch cat /proc/mdstat

Avoid --create unless starting fresh; it overwrites metadata
Use --assemble to avoid data loss
Always verify before adding or removing disks

Verify State

bash
cat /proc/mdstat
mdadm --detail /dev/md126
mdadm --examine /dev/sd[a-d]

/dev/md0: IMSM container
/dev/md126 (or similar): actual RAID volume (your data lives here)
Metadata type: isw_raid_member
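If it helps, a quick read-only way to confirm that layering on the live system (the md names are the ones used in this thread and may differ on yours):

lsblk -o NAME,TYPE,SIZE /dev/sd[a-d]   # member disks show the md devices stacked on top once assembled
mdadm --detail /dev/md0                # reports "Raid Level : container"
mdadm --detail /dev/md126              # reports the actual RAID 10 volume, once it has been started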

Post-Recovery Steps

Update mdadm.conf
bash
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

Rebuild initramfs (for boot arrays)
bash
update-initramfs -u

Thanks for the references, Jack!

I'm spending more time trying to edit my informative reply so it doesn't trip the "new users can only post 2 links" rule (there are none) than researching the problem. I've been using the "Intel Virtual RAID on CPU (Intel VROC) for Linux User Guide" PDF for lots of insight.

One rescued line summarizes my next steps. “The “update-subarray” stuff looks like what I’ll need to research next to figure out [Volume0]'s Failed disk and Disk01’s metadata.”
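From a first pass at the man page, the --update-subarray syntax looks roughly like the sketch below; the container name /dev/md0 and subarray index 0 are from my setup, and I still need to verify the exact form against man mdadm before running anything. It also only touches a few metadata fields (the volume name, for instance), so it may not be a direct fix for the Failed disk flag:

# example only: rename subarray 0 inside the IMSM container
# (syntax from my reading of the man page; verify before use)
mdadm --update-subarray=0 --update=name --name=Volume0 /dev/md0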

The block I pasted above that follows "The other drives all match UUIDs…" shows the result of
"mdadm -E /dev/md0"

The info that got lost was

cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sda[2](S) sdd[1](S) sdc[0](S)
15603 blocks super external:imsm
unused devices: <none>

Three drives, all marked as spares. That’s when I checked the metadata with -E and found the metadata issues with Disk01, first seen in my OP.
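For anyone following along, the per-disk check boils down to something like this (read-only, disk list matching my setup), which pulls the identity and state fields out of the --examine output:

for d in /dev/sda /dev/sdc /dev/sdd; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'Serial|Slot|State|Failed'
done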