Software raid 1 – Failing and recovering a disk

A disk in a software raid group failed in one of my servers yesterday.

The kernel was spewing SCSI errors:

kernel: ata2: status=0xd0 { Busy }
kernel: SCSI error : return code = 0x8000002

# mdadm --detail /dev/md0
# mdadm --detail /dev/md1

Both reported their sdb member partitions as failed.
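
For a quick look at every array at once, /proc/mdstat carries the same information. The helper below is only a sketch: the name degraded_arrays is my own, and it assumes the usual mdstat layout, where arrays are named mdN and a status field such as [U_] uses an underscore for a missing or failed member.

```shell
# Print the name of every md array whose member-status field (e.g. [U_])
# contains an underscore, meaning a member is missing or failed.
# Reads mdstat-formatted text from the file in $1 (default: /proc/mdstat).
# Assumption: arrays are named md0, md1, ... as in this post.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sda1[0] sdb1[1](F)"
        /^md[0-9]+ :/ { name = $1 }
        # Status line, e.g. "104320 blocks [2/1] [U_]"
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    ' "${1:-/proc/mdstat}"
}
```

Running degraded_arrays with no argument reads the live /proc/mdstat. No output means every array has all of its members.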

The procedure to rebuild the md groups is as follows:

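Replace the bad disk (sdb in this scenario). Note that if you do not bring down the server to replace the disk, be sure to “remove” its partitions from the raid groups using mdadm before pulling it. mdadm only removes members that are already marked failed or spare, so a partition that still shows as active has to be marked with --fail first.

# mdadm --remove /dev/md0 /dev/sdb1
# mdadm --remove /dev/md1 /dev/sdb2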

Read the good disk’s partition table (sda in this scenario).

# fdisk -l /dev/sda
Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   fd  Linux raid autodetect
/dev/sda2              14       19457   156183930   fd  Linux raid autodetect

Install an identical partition table on the newly replaced disk. Create partitions that start and end on the same cylinders listed above and are of type “fd”. Be sure to set the boot flag on the first partition, and don’t forget to write the changes.

# fdisk /dev/sdb
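
Retyping the table by hand in fdisk works, but the layout can also be copied in one step with sfdisk, which dumps a table in a format it can read back. The wrapper below is a sketch rather than what I ran at the time: clone_partition_table and the DRY_RUN switch are names I made up, and it assumes both disks use an MBR (DOS) label like the one in the fdisk listing above. Check the source and target order twice, since the target's table is overwritten.

```shell
# Copy the partition layout from a healthy disk to its replacement.
# Assumption: both disks carry an MBR (DOS) label.
# Set DRY_RUN=1 to print the command instead of running it.
clone_partition_table() {
    src=$1
    dst=$2
    # Refuse missing arguments and refuse to copy a disk onto itself.
    if [ -z "$src" ] || [ -z "$dst" ] || [ "$src" = "$dst" ]; then
        echo "usage: clone_partition_table SOURCE_DISK TARGET_DISK" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "sfdisk -d $src | sfdisk $dst"
        return 0
    fi
    # sfdisk -d writes the table as text; the second sfdisk applies it to $dst.
    sfdisk -d "$src" | sfdisk "$dst"
}
```

For this server the call would be clone_partition_table /dev/sda /dev/sdb, ideally after a first pass with DRY_RUN=1 to confirm the direction.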

Add partitions back to the appropriate raid groups.

# mdadm --add /dev/md0 /dev/sdb1
# mdadm --add /dev/md1 /dev/sdb2

Ensure the raid groups are rebuilding properly.

# mdadm --detail /dev/md0
# mdadm --detail /dev/md1
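
While the rebuild runs, /proc/mdstat shows a progress line for each array being recovered. The function below is a small sketch for pulling out just the percentage. rebuild_progress is a name I picked, and it assumes arrays named mdN and the usual "recovery = NN.N%" wording on the progress line.

```shell
# Print "<array> <percent>" for every md array that is currently rebuilding.
# Reads mdstat-formatted text from the file in $1 (default: /proc/mdstat).
# Assumption: the progress line contains "recovery = NN.N%".
rebuild_progress() {
    awk '
        # Array header line, e.g. "md1 : active raid1 sdb2[2] sda2[0]"
        /^md[0-9]+ :/ { name = $1 }
        # Progress line: extract the first NN.N% figure on it
        /recovery *=/ {
            if (match($0, /[0-9]+\.[0-9]+%/))
                print name, substr($0, RSTART, RLENGTH)
        }
    ' "${1:-/proc/mdstat}"
}
```

Wrapping this in a loop with sleep (or just running watch cat /proc/mdstat) gives a running view. When the function prints nothing, no array is recovering.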
