Background Info: Unexpected Power Loss on file servers

During a backup generator test on 12-may-09 at the MSU BPS building, most UPSs received an errant EPO (Emergency Power Off) signal and dropped power to file servers msufs02/03/04/05.

(Tom can correct and add details to this part) Two shelves each had 2 drives drop out of their RAID configuration. One shelf started rebuilding by itself. On the other shelf the 2 disks showed as Foreign Config; the foreign config was removed, the disks were reinserted as hot spares, and the controller automatically rebuilt the array.

Two shelves had 3 drives (the same 3 drive IDs #0,7,8) in Foreign Config State:
  • MD1000 SN FQ8MWD1 on msufs02 as Shelf C aka msufs02_1 Controller 1 Slot 3 Connector 0
  • MD1000 SN 8CJKWD1 on msufs04 as Shelf D aka msufs04_2 Controller 1 Slot 3 Connector 1
  • note that controller above is numbered id=1 as seen from the BIOS, but becomes controller=0 in omreport

We tried the same method (removing the Foreign Config) on msufs04, but that was a mistake: a raid6 array has only 2 parity disks, so it cannot rebuild by itself once 3 disks have lost their configuration. As we know now, removing the Foreign Config is also not the best practice in any case.

Both shelves were powered off before booting Linux; otherwise the boot process hung while trying to "spin up" the virtual disk.

Recovering from Drives showing up as "Foreign Config" in Perc 6e

This method was used after calling Dell Support on 12-may-2009 to recover msufs02.

This is from memory, and may not be 100% accurate:

  • Boot into the Perc BIOS (using CTRL-R when prompted during the boot sequence)
  • Select the proper controller (using arrow keys, then Enter)
  • Select the Foreign Config menu tab (use CTRL-N and CTRL-P)
    • Note that this menu tab only appears if there are drives in "Foreign Config"
  • In the tree view, highlight the Controller node (not the disk group node, the controller node) and press F2
  • From the pop-up menu select "Import Foreign Config" (or maybe it is "Foreign Config", Right-Arrow, "Import"), then Enter
    • Note that the "Foreign Config" menu tab has disappeared
  • That should be all that's needed; you can ESC out to the point where you can reboot

This is the method recommended to try first for recovering disks in "Foreign Config" state.
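For reference, OMSA's command line also has an import action for foreign configurations. We did not use it here (Linux would not boot with the shelves powered on, so the BIOS was the only route), and whether the installed OMSA version supports it on the Perc 6e is an assumption. A minimal dry-run sketch, where the function name import_foreign_config and the RUN switch are hypothetical, and the controller number is the one seen by omreport:

```shell
#!/bin/sh
# import_foreign_config CONTROLLER
# Print the omconfig command that asks the controller to import its foreign
# configuration. The command is only executed when RUN=1 is set; by default
# this is a dry run. CONTROLLER is the id as numbered by omreport.
import_foreign_config() {
    ctrl=$1
    cmd="omconfig storage controller action=importforeignconfig controller=${ctrl}"
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    else
        echo "would run: $cmd"
    fi
}

# Usage (dry run):
#   import_foreign_config 0
```

The dry-run default is deliberate: on a degraded array it seems safer to read the command back before letting it touch the controller.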

Recovering from Drives with lost configuration

This method was used to recover msufs04, as walked through by Dell Support on 13-may-2009. It amounts to manually recreating the array headers, without initializing the array. One tech support person warned us that there typically was a 50-50 chance of success. Success seems to hang on knowing the exact original configuration. Having all the disks in one raid6 array made things easier in our case.

The vdisk creation command (cf. /home/install/tools/fss-setup-perc6.sh) was:

do_omconfig storage controller action=createvdisk controller=1 \
size=max raid=r6 \
pdisk=\
1:0,\
1:1,\
1:2,\
1:3,\
1:4,\
1:5,\
1:6,\
1:7,\
1:8,\
1:9,\
1:10,\
1:11,\
1:12,\
1:13,\
1:14 \
stripesize=512kb readpolicy=ra writepolicy=wb name="Virtual Disk 1"

(note: Figured out the vdisk name using "omreport storage vdisk controller=0" on another msufs0x)
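The 15-entry pdisk list is easy to mistype by hand. A minimal sketch that generates it, assuming the disks sit in consecutive slots on one connector and were originally added in slot order (as in our case). The function name pdisk_list is hypothetical and is not part of fss-setup-perc6.sh:

```shell
#!/bin/sh
# pdisk_list CONNECTOR COUNT
# Print "CONNECTOR:0,CONNECTOR:1,...,CONNECTOR:(COUNT-1)", suitable for the
# pdisk= argument of omconfig. Assumes slot order equals the order in which
# the disks were originally added to the array.
pdisk_list() {
    connector=$1
    count=$2
    slot=0
    list=""
    while [ "$slot" -lt "$count" ]; do
        if [ -n "$list" ]; then
            list="${list},"
        fi
        list="${list}${connector}:${slot}"
        slot=$((slot + 1))
    done
    printf '%s\n' "$list"
}

# Example: the 15 disks of one MD1000 shelf on connector 1
pdisk_list 1 15
```

If the array was ever rebuilt onto hot spares, the slot order may no longer match the array order, and this list would need to be adjusted by hand.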

note: used an ipmi connection to view the BIOS from a quiet office during the service call:
[root@msurox ~]# sh /home/install/tools/ipmi.sh rac-msufs04 sol activate

This sequence is from memory, and may not be 100% accurate:

  • Boot into the Perc BIOS (using CTRL-R when prompted during the boot sequence)
  • Select the proper controller (using arrow keys, then Enter)
  • In the "PD Mgmt" menu tab, the disks 01:00:xx were a mix of 12 "Foreign" and 3 "Ready"
  • Select the Foreign Config menu tab (using CTRL-N and CTRL-P)
    • Note that this menu tab only appears if there are drives in "Foreign Config"
  • The second disk shows as Foreign
  • In the tree view, highlight the Controller node and press F2
  • From the pop-up menu select "Foreign Config", right-arrow, "Remove" (or maybe it was "Clear")
    • Note that the "Foreign Config" menu tab has disappeared
  • Select the "VD Mgmt" menu tab, and there is only one Vdisk listed
  • Select the controller node, press F2
  • From the small pop-up menu select "Create VDisk"
  • From the large pop-up dialog:
    • Press Enter to change the RAID level to RAID6
    • Press down-arrow and this selects the first disk in the array list
    • Press Space on each drive in the same order as originally created
    • Each time you press Space, an "x" shows the drive as included, and the next drive is highlighted
    • After the last drive has been added press Tab to select the "VD Size", leave it as default
    • Tab and type in the "VD Name" as "Virtual Disk 1"
    • Tab and Enter to change the stripe size to 512kb
    • Tab and Enter to select ReadAhead
    • Tab and Enter to select WriteBack
    • Tab ~4 times to skip the rest of the options and press Enter on "OK"
    • Another pop-up warns that you need to initialize the array unless you are recovering an existing one. We do not want to initialize in this case
  • We are back at the "VD Mgmt" menu tab, and there are now 2 vdisks listed
  • That should be all that's needed; you can ESC out to the point where you can reboot
  • This all worked and the /dcache1 area was mounted on msufs04
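Once Linux is back up, the state of the recreated vdisk can be checked from omreport before trusting the mounted area. A sketch, assuming that "omreport storage vdisk" prints one "State : <value>" line per virtual disk; that field layout is an assumption and may differ between OMSA versions. The function name count_not_ready is hypothetical:

```shell
#!/bin/sh
# count_not_ready
# Read "omreport storage vdisk" output on stdin and print the number of
# virtual disks whose State field is anything other than "Ready".
# Assumption: each vdisk is reported with a line "State   : <value>".
count_not_ready() {
    awk -F':' '
        $1 ~ /^State[ \t]*$/ {
            value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim surrounding blanks
            if (value != "Ready") bad++
        }
        END { print bad + 0 }
    '
}

# Usage on a live system (controller=0 as numbered by omreport):
#   omreport storage vdisk controller=0 | count_not_ready
```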

Success seems to depend on knowing the exact order of the physical disks in the virtual disk. Dell support wanted us to run a tool called "dset", downloaded from support.dell.com/dset, that would supposedly report everything about the disk arrays, including whether hot spares had ever been used. We ran it on msufs05 (yes, 05, for reference, while msufs04 was in the BIOS), but it took a very long time and crashed with a Python error. It was supposed to create a dset* file in /var/log/, but we never found one; see the output below.

It is not clear (to me) whether this "blind" approach would always work, especially if 2 disks were ever taken out at the same time and re-introduced as hot spares. There may be no guarantee that these 2 disks would be added back into the array in the right order. We were lucky this time...
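Since the recovery hinged on knowing the original configuration, one precaution is to record what omreport says about each controller while the arrays are healthy. A sketch, assuming OMSA's omreport is installed and working. The function name snapshot_raid_layout and the output directory are hypothetical, and it is not certain that omreport lists the disks in the order the controller stores them, so such a record helps but is no guarantee:

```shell
#!/bin/sh
# snapshot_raid_layout OUTDIR CONTROLLER...
# Save omreport's view of the virtual and physical disks of each listed
# controller into a dated text file under OUTDIR (hypothetical location).
# Controller ids are the ones used by omreport, not the BIOS numbering.
snapshot_raid_layout() {
    outdir=$1
    shift
    stamp=$(date +%Y%m%d)
    mkdir -p "$outdir" || return 1
    for ctrl in "$@"; do
        out="${outdir}/raid-layout-ctrl${ctrl}-${stamp}.txt"
        {
            echo "# host: $(uname -n)  controller: ${ctrl}  date: ${stamp}"
            omreport storage vdisk controller="$ctrl"
            omreport storage pdisk controller="$ctrl"
        } > "$out" || return 1
    done
}

# Usage on a file server with two controllers:
#   snapshot_raid_layout /root/raid-layout 0 1
```

Copying the resulting files off the server would keep them readable while the machine itself is down.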

[root@msufs05 ~]# ls -l /root/delldset*
-rwxr-xr-x  1 root root 24821407 May 13 11:59 /root/delldset_v1.7.0.119.bin

output:

Dell System E-Support Tool
@Copyright Dell Inc. 2004-2008  Version 1.7 build 119
* Getting Linux system summary information ...
Gathering Network Information ...
Gathering OS Summary Information ...
* Getting Linux operating system configuration information ...
Gathering Boot Information ...
Gathering Module Information ...
Gathering Memory Information ...
Gathering Storage Information ...
Gathering Network Information
Gathering Summary Information ...
Processing Syslog ...
* Collecting Dell OpenManage information ... [Please wait]
NOTE: There was a problem loading the OMSA driver/service
* Gathering chassis information...
* Gathering System Information...
* Gathering Motherboard Information...
sh: ./prereqcheck/sysreport: No such file or directory
* Gathering storage information...
Note: Scanning for supported SCSI or RAID controllers... [please wait]
* Collecting storage information...
Traceback (most recent call last):
  File "<string>", line 172, in <module>
  File "/root/bin/builddellsysteminfo/out1.pyz/dsetcmd", line 86, in Execute
  File "/root/bin/builddellsysteminfo/out1.pyz/dsetctrl", line 83, in CreateReport
  File "/root/bin/builddellsysteminfo/out1.pyz/dsetctrl", line 178, in __gethw
  File "/root/bin/builddellsysteminfo/out1.pyz/processing.storage.dsetstor", line 33, in GetData
  File "/root/bin/builddellsysteminfo/out1.pyz/processing.storage.dsetscsi", line 49, in GetData
  File "/root/bin/builddellsysteminfo/out1.pyz/shutil", line 46, in copyfile
IOError: [Errno 2] No such file or directory: 'gui/holder.xml'

-- PhilippeLaurens - 13 May 2009
Topic revision: r3 - 15 May 2009, PhilippeLaurens