Thursday, November 24, 2011

Re-adding SATA disk to software RAID without rebooting...

For the second time, on one of the servers I'm maintaining, one of the SATA disks was suddenly disconnected from the server. Looking into the log files, I found the following error messages:
kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptbase_reply
kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
last message repeated 62 times
and then a lot of messages like the following one:
kernel: sd 0:0:1:0: SCSI error: return code = 0x00010000
kernel: end_request: I/O error, dev sdb, sector 1264035833
This triggered RAID to log the following type of messages:
kernel: raid5:md0: read error not correctable (sector 28832 on sdb2)
and finally to remove the failed disk from the array:
kernel: RAID5 conf printout:
kernel:  --- rd:3 wd:2 fd:1
kernel:  disk 0, o:1, dev:sda2
kernel:  disk 1, o:0, dev:sdb2
kernel:  disk 2, o:1, dev:sdc2
kernel: RAID5 conf printout:
kernel:  --- rd:3 wd:2 fd:1
kernel:  disk 0, o:1, dev:sda2
kernel:  disk 2, o:1, dev:sdc2
I have yet to find out what happened, but in the meantime the consequence of those errors was that one disk was disconnected and removed from the RAID array, and I received the following mail from the mdmonitor process on the server:
This is an automatically generated mail message from mdadm
running on mail.somedomain

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sdb2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc2[2] sdb2[3](F) sda2[0]
      1952989696 blocks level 5, 256k chunk, algorithm 2 [3/2] [U_U]
     
unused devices:
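The "[3/2] [U_U]" part of that output is what says the array is degraded: three members expected, two working, and an underscore in place of the missing one. A minimal sketch of checking for that by script follows. The function name and its optional file argument are made up for illustration, so that it can be tried on a saved copy of /proc/mdstat.

```shell
# Print every md array whose member map (e.g. [U_U]) shows a missing member.
# The optional argument is a hypothetical convenience for testing on a saved copy.
degraded_arrays() {
    mdstat_file="${1:-/proc/mdstat}"
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember which array we are in
        /\[[U_]+\]/ && name != "" {
            map = $NF                          # the member map is the last field
            if (index(map, "_") > 0) print name, map
            name = ""
        }
    ' "$mdstat_file"
}
```

Run against the mdstat text from the mail above, it should print the array name together with its member map, and nothing at all for a healthy array.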
Since this happened exactly at noon, which is a time when everybody uses the mail server, rebooting the server wasn't really an option, not unless I absolutely had to. So I decided to reboot it after work hours, and in the meantime I could either just wait or try to rebuild the RAID. If I waited, there was a risk of another disk failing, and that would bring the server down. So, as this had already happened once before, and I knew that the disk was OK and would be re-added after a reboot, I decided to try to do that immediately, on a live system.

So, the first thing is to ask the kernel to rescan the SATA/SCSI bus in order to find "new" devices. This is done using the following command:
 echo "- - -" > /sys/class/scsi_host/host0/scan
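The three dashes are wildcards for channel, target and LUN, so the host adapter is asked to look everywhere. The command above assumes the disk hangs off host0. If it is not obvious which host a disk belongs to, a small loop over all hosts is one option. This is only a sketch: the function name and the directory argument are invented here, and the default is the usual sysfs location.

```shell
# Ask every SCSI host to rescan all channels, targets and LUNs.
# The directory argument is hypothetical; by default the real sysfs path is used.
rescan_scsi_hosts() {
    scsi_host_dir="${1:-/sys/class/scsi_host}"
    for host in "$scsi_host_dir"/host*; do
        [ -e "$host/scan" ] || continue    # the glob matched nothing, or no scan file
        echo "- - -" > "$host/scan"
        echo "rescanned ${host##*/}"
    done
}
```

Writing to the real scan files needs root; a plain call of rescan_scsi_hosts without arguments does that.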
After this, the disk reappeared, but the problem was that its name was now /dev/sde and not /dev/sdb. To make the disk always get the same name I would need to mess with udev, which I was not prepared to do at that moment. (And, btw, I have recently read about a patch that allows you to do just that, to rename an existing device, but I think it was rejected on the grounds that this kind of stuff is better done in user space, i.e. by modifying udev rules.)
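The changed name matters less than it might seem, because md identifies its members by the superblock on the partition and not by the device name. Still, if a stable name is wanted without writing udev rules, many distributions already have udev create persistent symlinks under /dev/disk/by-id. Whether this particular server had them is an assumption. A sketch that lists the persistent names pointing at a given device node (function name and directory argument are invented for illustration):

```shell
# Print the persistent symlinks that resolve to the given device node.
# The second argument is hypothetical; udev normally fills /dev/disk/by-id.
persistent_names() {
    target="$(readlink -f "$1")"
    by_id_dir="${2:-/dev/disk/by-id}"
    for link in "$by_id_dir"/*; do
        [ -L "$link" ] || continue         # only symlinks are of interest
        if [ "$(readlink -f "$link")" = "$target" ]; then
            echo "$link"
        fi
    done
}
```

For example, persistent_names /dev/sde2 would list whatever by-id links currently point at that partition, and those links keep pointing at the same disk even when the sdX letter changes.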

Now, the only problem was to "convince" the RAID subsystem to re-add the disk. I thought that it would find the disk and attach it by itself, but eventually I just used the following command:
mdadm --manage /dev/md0 --add /dev/sde2
The command notified me that the disk was already a member of the array and that it was being re-added. Afterwards, the sync process started, which takes some time:
 # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde2[3] sdc2[2] sdb2[4](F) sda2[0]
      1952989696 blocks level 5, 256k chunk, algorithm 2 [3/2] [U_U]
      [=>...................]  recovery =  7.6% (74281344/976494848) finish=204.9min speed=73355K/sec
    
unused devices:
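Instead of reading that output by eye every few minutes, the progress line can be pulled out with sed. A small sketch, again with an invented function name and an optional file argument so it can be run on saved output:

```shell
# Extract the rebuild percentage and estimated time left from mdstat text.
# The optional argument is a hypothetical convenience for testing on a saved copy.
recovery_progress() {
    mdstat_file="${1:-/proc/mdstat}"
    sed -n 's/.*recovery = *\([0-9.]*\)%.*finish=\([0-9.]*\)min.*/\1% done, about \2 min left/p' "$mdstat_file"
}
```

Something like "watch -n 60 cat /proc/mdstat" gives the same information interactively; the function is only handy when the value is needed in a script or a mail.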
It would be ideal for transient errors like this one if the RAID subsystem recorded only the changes made while the disk was away and, when the disk is re-added, applied only those changes. But I didn't manage to find a way to do that, and I suspect that this functionality is not implemented at all.
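For what it's worth, md does have a feature along these lines: the write-intent bitmap. When an array carries a bitmap before a member drops out, re-adding that member with --re-add resynchronises only the regions marked dirty in the bitmap. Whether the kernel and mdadm on this particular server supported it is an assumption. The sketch below only prints the commands unless MD_RUN=1 is set; the wrapper, the variable and the function names are made up for illustration, while the mdadm options themselves are standard.

```shell
# Print the command by default; run it only when MD_RUN=1 (needs root).
run() { if [ "${MD_RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi; }

# One-time step on a healthy array: add an internal write-intent bitmap.
enable_bitmap() { run mdadm --grow "$1" --bitmap=internal; }

# After a transient dropout: re-add the member so that only the regions
# marked dirty in the bitmap are resynchronised.
readd_member() { run mdadm --manage "$1" --re-add "$2"; }
```

Here that would be enable_bitmap /dev/md0 once, and readd_member /dev/md0 /dev/sde2 after a dropout. The bitmap costs some write performance, so it is a trade-off rather than a free win.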

Anyway, after the synchronization process finished, this is the content of the /proc/mdstat file:
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde2[1] sdc2[2] sdb2[3](F) sda2[0]
      1952989696 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]
     
unused devices:
As you can see, sdb2 is still there. Removing it isn't possible because there is no corresponding device node:
# mdadm --manage /dev/md0 -r /dev/sdb2
mdadm: cannot find /dev/sdb2: No such file or directory
[root@mail ~]# mdadm --manage /dev/md0 -r sdb2
mdadm: cannot find sdb2: No such file or directory
So, I decided to wait until reboot.
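A possible alternative to waiting: the mdadm manual page describes two keywords that --remove accepts in place of a device name, "failed" and "detached", the latter meant for members that are no longer connected to the system. Whether the mdadm version on this server knew them is an assumption, so the removal itself is shown only as a comment. The function, whose name is invented here, just lists the members that mdstat marks as faulty.

```shell
# List array members that /proc/mdstat tags with (F), i.e. faulty.
# The optional argument is a hypothetical convenience for testing on a saved copy.
faulty_members() {
    mdstat_file="${1:-/proc/mdstat}"
    # Put every word on its own line, then keep names of the form name[N](F).
    tr ' ' '\n' < "$mdstat_file" | sed -n 's/^\([a-z0-9]*\)\[[0-9]*](F)$/\1/p'
}

# Assumption: needs an mdadm release that understands the keyword.
#   mdadm --manage /dev/md0 --remove detached
```

On the mdstat output shown above, faulty_members should report sdb2.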

Edit: I rebooted a few days ago, and after the reboot everything returned to its normal state, i.e. as it was before the disk was removed from the array!

[20121114] Update: This happened again, almost exactly at noon. Here is what was recorded in the log files:
Nov 14 12:00:02 mail kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptbase_reply
Nov 14 12:00:07 mail kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: Unhandled error code
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: SCSI error: return code = 0x00010000
Nov 14 12:00:08 mail kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: Unhandled error code
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: SCSI error: return code = 0x00010000
Nov 14 12:00:08 mail kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Nov 14 12:00:08 mail kernel: raid5: Disk failure on sdc2, disabling device. Operation continuing on 2 devices
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: Unhandled error code
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: SCSI error: return code = 0x00010000
Nov 14 12:00:08 mail kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Nov 14 12:00:08 mail kernel: raid5:md0: read error not correctable (sector 1263629840 on sdc2).
Nov 14 12:00:08 mail kernel: RAID5 conf printout:
Nov 14 12:00:08 mail kernel:  --- rd:3 wd:2 fd:1
Nov 14 12:00:08 mail kernel:  disk 0, o:1, dev:sda2
Nov 14 12:00:08 mail kernel:  disk 1, o:1, dev:sdb2
Nov 14 12:00:08 mail kernel:  disk 2, o:0, dev:sdc2
Nov 14 12:00:08 mail kernel: RAID5 conf printout:
Nov 14 12:00:08 mail kernel:  --- rd:3 wd:2 fd:1
Nov 14 12:00:08 mail kernel:  disk 0, o:1, dev:sda2
Nov 14 12:00:08 mail kernel:  disk 1, o:1, dev:sdb2
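The "return code = 0x00010000" in these messages is the packed SCSI result word, in which bits 16 to 23 hold the host byte. Here the host byte is 0x01, which the kernel itself prints as DID_NO_CONNECT a line later: the disk simply was not reachable. A small decoding sketch, with an invented function name and only the first few host byte values spelled out:

```shell
# Decode the host byte (bits 16..23) of a SCSI result word.
# Only a handful of values are named; everything else is printed as hex.
scsi_host_byte() {
    code=$(( ($1 >> 16) & 0xff ))
    case "$code" in
        0) echo DID_OK ;;
        1) echo DID_NO_CONNECT ;;
        2) echo DID_BUS_BUSY ;;
        3) echo DID_TIME_OUT ;;
        *) printf 'host byte 0x%02x\n' "$code" ;;
    esac
}
```

Calling scsi_host_byte 0x00010000 gives DID_NO_CONNECT, matching the log.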
And then the system re-scanned by itself and found the disk again, but it didn't re-add the disk to the array:
Nov 14 12:00:44 mail kernel: mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 6, phy 2, sas_addr 0x8a843926a69f9691
Nov 14 12:00:44 mail kernel:   Vendor: ATA       Model: WDC WD1001FALS-0  Rev: 0K05
Nov 14 12:00:44 mail kernel:   Type:   Direct-Access                      ANSI SCSI revision: 05
Nov 14 12:00:44 mail kernel: SCSI device sde: 1953525168 512-byte hdwr sectors (1000205 MB)
Nov 14 12:00:44 mail kernel: sde: Write Protect is off
Nov 14 12:00:44 mail kernel: SCSI device sde: drive cache: write back
Nov 14 12:00:44 mail kernel: SCSI device sde: 1953525168 512-byte hdwr sectors (1000205 MB)
Nov 14 12:00:44 mail kernel: sde: Write Protect is off
Nov 14 12:00:44 mail kernel: SCSI device sde: drive cache: write back
Nov 14 12:00:44 mail kernel:  sde: sde1 sde2
Nov 14 12:00:44 mail kernel: sd 0:0:4:0: Attached scsi disk sde
Nov 14 12:00:44 mail kernel: sd 0:0:4:0: Attached scsi generic sg2 type 0
So I had to manually issue the following command:
mdadm --manage /dev/md0 --add /dev/sde2
