A disk in one of my servers' software RAID groups failed yesterday.
The kernel was spewing SCSI errors:
kernel: ata2: status=0xd0 { Busy }
kernel: SCSI error : return code = 0x8000002
# mdadm --detail /dev/md0
# mdadm --detail /dev/md1
Both reported failed members on sdb (sdb*).
The procedure to rebuild the md groups is as follows:
Replace the bad disk (sdb in this scenario). Note that if you do not bring down the server to replace the disk, be sure to “remove” the failed partitions from the raid groups using mdadm before pulling it. Partitions are numbered from 1, so the members are sdb1 and sdb2.
# mdadm --remove /dev/md0 /dev/sdb1
# mdadm --remove /dev/md1 /dev/sdb2
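mdadm only removes members that are already marked faulty (or are spares). If the kernel has not failed a partition on the dying disk yet, it has to be marked faulty first. Here is a minimal sketch of doing both in one call; fail_and_remove and the DRY_RUN switch are names I made up for illustration, and only the mdadm invocation itself is real.

```shell
#!/bin/sh
# Hypothetical helper: mark a member faulty, then remove it from its array.
# Usage: fail_and_remove ARRAY MEMBER
# Set DRY_RUN=1 to print the mdadm command instead of running it.
fail_and_remove() {
    array=$1
    member=$2
    if [ -z "$array" ] || [ -z "$member" ]; then
        echo "usage: fail_and_remove ARRAY MEMBER" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "mdadm $array --fail $member --remove $member"
        return 0
    fi
    # mdadm manage mode accepts --fail and --remove in a single invocation.
    mdadm "$array" --fail "$member" --remove "$member"
}
```

Assuming sdb1 belongs to md0 and sdb2 to md1 (mirroring the sda1/sda2 layout), a dry run would be DRY_RUN=1 fail_and_remove /dev/md0 /dev/sdb1, followed by the same for /dev/md1 and /dev/sdb2.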
Read the good disk’s partition table (sda in this scenario).
# fdisk -l /dev/sda
Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   fd  Linux raid autodetect
/dev/sda2              14       19457   156183930   fd  Linux raid autodetect
Create an identical partition table on the replacement disk: partitions that start and end on the same cylinders as on the good disk, of type “fd”. Be sure to set the boot flag on the first partition, and don’t forget to write the changes.
# fdisk /dev/sdb
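As an alternative to retyping the layout in fdisk, the table can be dumped from the good disk and replayed onto the new one with sfdisk. This is a sketch rather than what I actually ran; copy_partition_table and DRY_RUN are hypothetical names, and it assumes the replacement disk is at least as large as the original. Double-check source and target, because the target's table is overwritten.

```shell
#!/bin/sh
# Hypothetical helper: copy the partition table from SRC to DST using sfdisk.
# Usage: copy_partition_table SRC DST
# Set DRY_RUN=1 to print the pipeline instead of running it.
copy_partition_table() {
    src=$1
    dst=$2
    # Refuse missing arguments or copying a disk onto itself.
    if [ -z "$src" ] || [ -z "$dst" ] || [ "$src" = "$dst" ]; then
        echo "usage: copy_partition_table SRC DST (SRC and DST must differ)" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sfdisk -d $src | sfdisk $dst"
        return 0
    fi
    # sfdisk -d dumps the table in a format sfdisk can read back on stdin.
    sfdisk -d "$src" | sfdisk "$dst"
}
```

In this scenario the call would be copy_partition_table /dev/sda /dev/sdb, and fdisk -l /dev/sdb afterwards should list the same partitions as sda.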
Add partitions back to the appropriate raid groups.
# mdadm --add /dev/md0 /dev/sdb1
# mdadm --add /dev/md1 /dev/sdb2
Ensure the raid groups are rebuilding properly.
# mdadm --detail /dev/md0
# mdadm --detail /dev/md1
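The kernel also reports array state in /proc/mdstat, where a degraded mirror shows an underscore in its status field (for example [U_] instead of [UU]). Below is a small sketch that lists arrays still missing a member; degraded_arrays is a hypothetical name, and the parsing assumes the usual mdstat layout of an "mdN : ..." line followed by a line ending in the [UU]-style field.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array whose status field
# contains "_" (a missing or still-rebuilding member).
# Usage: degraded_arrays [FILE]   (FILE defaults to /proc/mdstat)
degraded_arrays() {
    awk '
        # Remember the array name from lines such as "md0 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # The status field, e.g. [UU] or [U_], is the last field on its line.
        /\[[U_]+\]/ {
            if (name != "" && index($NF, "_") > 0) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Running degraded_arrays repeatedly (or watching cat /proc/mdstat) during the resync should show the list shrink to nothing once both mirrors are back to [UU].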
