Drive Replacement on the MD1000

Obviously, this needs some testing...

A drive failed; XFS got errors and took the filesystem offline.

  • unmounted filesystem

root@msufs02 ~# omconfig storage pdisk action=offline controller=1 pdisk=1:0:14
Operation disabled. Read, action=offline
.Refer to the documentation for more information.


root@msufs02 ~# omconfig storage pdisk action=online controller=1 pdisk=1:0:14
Operation disabled. Read, action=online
.Refer to the documentation for more information.

root@msufs02 ~# omconfig storage pdisk controller=1 pdisk=1:0:14 action=remove
Operation not supported. Read, action=remove
Refer to the documentation for more information.

root@msufs02 ~# omconfig storage pdisk controller=1 pdisk=1:0:14 action=rebuild
Operation disabled. Read, action=rebuild
.Refer to the documentation for more information.


root@msufs02 ~# omconfig storage controller controller=1 action=rescan
Operation not supported. Read, action=rescan
Refer to the documentation for more information.

root@msufs02 ~# omconfig storage globalinfo action=globalrescan
Operation not supported. Read, action=globalrescan
Refer to the documentation for more information.


April 29, 2008 disk failure and controller error:

Summary of the incident:

- about 7am this morning, the controller gave errors and the FS went offline; see /var/log/messages below.
- I noticed a message on the console on msurox about IRQ #17 ignored on msufs02 (syslog would have passed this message through). I investigated a bit.
- Shawn noticed the oddity that drive 1:0:0 in this array was listed as "1:0".
- drive 1:0:0 was not listed in the output of omreport storage pdisk vdisk=0 controller=1; only 29 drives (including 1:0:14) were listed. The LED on drive 1:0:0 showed solid green (normal).
- Drive 1:0:14 had a flashing LED. I hot-unplugged it (it was not running) and replaced it with a new drive. The new drive stayed off (LEDs blank).
- investigated various omconfig commands to bring the drive online. Didn't try the hot spare command.
- shutdown pe2950
- power cycled md1000
- started 2950 and went into LSI BIOS config (control-R)
- all drives have solid green LED
- disk 1:0:0 is listed in vdisk
- disk 1:0:14 is ready and not in a vdisk
- vdisk is "partially degraded"
- ran consistency check; saw that drive 1:0:0 is being accessed and 1:0:14 is not. stopped the check 'cause don't want to wait a day for it to finish
- added disk 1:0:14 as a global hot spare; rebuild begins
- exit
- control-alt-delete to reboot (this ain't windows!)
- rebuild continues, pausing while the controller initializes during boot
- OS comes up normally and mounts drive
- omreport now reports expected result for omreport storage pdisk vdisk=0 controller=1
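The missing-drive check above (1:0:0 absent from the listing, or shown as "1:0") could be scripted roughly as follows. This is only a sketch: the function name missing_pdisks is hypothetical, and it assumes each disk in the omreport output appears on a line of the form "ID : 1:0:14". The expected IDs are passed in by hand because the enclosure numbering for the full array is not recorded on this page.

```shell
#!/bin/sh
# Sketch: print every expected physical disk ID that is absent from an
# omreport-style listing read on stdin.
# ASSUMPTION: each disk is reported on a line such as "ID : 1:0:14".
# missing_pdisks is a hypothetical helper, not part of OMSA.
#
# Possible usage:
#   omreport storage pdisk vdisk=0 controller=1 | missing_pdisks 1:0:0 1:0:14

missing_pdisks() {
    # Extract the ID values from stdin, one per line, dropping trailing blanks.
    listed=$(sed -n 's/^ID[[:space:]]*:[[:space:]]*\([^[:space:]]*\).*$/\1/p')
    for id in "$@"; do
        # -x whole-line match, -F fixed string: "1:0" must not satisfy "1:0:0".
        printf '%s\n' "$listed" | grep -qxF -- "$id" || printf '%s\n' "$id"
    done
}
```

Because the match is on the whole ID, a drive that shows up truncated as "1:0" is reported as missing, which is the symptom Shawn spotted.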

We will test the removed disk with SpinRite (http://www.grc.com/sr/spinrite.htm) to verify that it has problems before returning it to Dell.

So it seems that the controller got into a weird state.  It is not clear whether this was caused by the failing disk.  Even with two disks offline, shouldn't the array have remained available?

For the next failed disk, I'll follow this procedure:

omconfig storage pdisk action=remove controller={controllerID} pdisk={diskID}
R&R (remove and replace) the drive
? omconfig storage pdisk action=assignglobalhotspare controller={controllerID} pdisk={diskID} assign=yes
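As a dry run of the procedure above, the following sketch only prints the commands for a given controller and disk; it does not execute anything. The function name replacement_plan is hypothetical, the controller= and pdisk= parameters follow the commands tried earlier on this page, and assign=yes on the hot spare command is an assumption about the omconfig syntax. The whole sequence is untested: action=remove was reported as "Operation not supported" during this incident, so it may be rejected again.

```shell
#!/bin/sh
# Sketch: print (do NOT run) the planned commands for replacing one disk.
# ASSUMPTIONS: controller=/pdisk= parameters as used earlier on this page;
# assign=yes for the hot spare command; replacement_plan is a hypothetical name.

replacement_plan() {
    controller=$1   # e.g. 1
    pdisk=$2        # e.g. 1:0:14
    echo "omconfig storage pdisk action=remove controller=$controller pdisk=$pdisk"
    echo "# physically remove and replace the drive in slot $pdisk"
    echo "omconfig storage pdisk action=assignglobalhotspare controller=$controller pdisk=$pdisk assign=yes"
    # Final check that the disk is listed again, as in the summary above.
    echo "omreport storage pdisk vdisk=0 controller=$controller"
}
```

Calling replacement_plan 1 1:0:14 prints the four steps for review, so the exact command lines can be checked before anything is typed at the root prompt.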

-Tom 

-- TomRockwell - 29 Apr 2008
Topic revision: r2 - 29 Apr 2008, TomRockwell