Physical Inspection

We have had shipping damage on an MD1200. Check that the mounting tabs are OK and that drive removal is not obstructed.

RAID Initialization

The shelf should be fully initialized before being put into production.

A full initialization, which will blank out any existing data on the shelf, can be run with the "slow initialization" option in the BIOS setup tool. The shelf (vdisk) will be unavailable to the OS until this completes.

Background initialization (BGI) will automatically run on any new vdisk. The rate at which it proceeds can be adjusted to balance against other use of the controller; the default rate is 30% (?). If nothing else has priority, set the rate to 100%:

omconfig storage controller controller=0 action=setbgirate rate=100

The init seems to take about as long as reading or writing once through all the drives.
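
As a rough sanity check on that observation, the time for one pass over a drive is just its capacity divided by its sustained throughput. This is only a sketch: the function name and the 100 MB/s figure used below are illustrative assumptions, not measurements from this shelf.

```python
def full_pass_hours(capacity_bytes, throughput_bytes_per_s):
    """Hours needed to read or write a drive once from end to end."""
    return capacity_bytes / throughput_bytes_per_s / 3600.0
```

For example, full_pass_hours(2e12, 100e6) gives a bit over five and a half hours for a 2 TB drive at an assumed 100 MB/s. A BGI rate below 100% or competing user I/O would stretch this.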

There is some confusion over what RAID init does. How are bits on the drives actually changed during a BGI? (June 6, 2010: have asked Dell if data is rewritten during background init; the tech will email an answer if he can find it...)

Note that the BGI does not need to complete before the array can be used, but user I/O on the array takes priority and will halt or slow the BGI. It is possible to write data to parts of the array before the BGI has reached them. Such writes of course involve an update (creation) of RAID parity blocks.

Fill filesystem

It isn't clear whether a background initialization rewrites the user data areas of the array. So, suggest filling the filesystem once, then reading the files back and verifying their checksums.

Ah, think I lost the script I used for writing, but basically I just used perl to fill the filesystem with 4GB files that were just 1s and 0s (binary). Each file should have the same md5sum (except the truncated final files).
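
Since the original perl script is gone, here is a minimal stand-in sketch in Python. It is not the lost script: the function name, the fill_NNNNNN.dat file names, and the 0x55 (01010101) byte pattern are all assumptions chosen for illustration. It writes identical files until the filesystem reports that it is out of space.

```python
import errno
import os


def fill_filesystem(directory, file_size=4 * 1024**3, chunk_size=1024**2,
                    max_files=None):
    """Write identical pattern files into directory until the disk is full.

    Returns the list of paths written. The last file is normally truncated
    because the filesystem runs out of space part-way through it.
    """
    # Assumed pattern: alternating bits. Any fixed pattern would do, as long
    # as every file gets the same bytes and therefore the same md5sum.
    chunk = b"\x55" * chunk_size
    written = []
    index = 0
    while max_files is None or index < max_files:
        path = os.path.join(directory, "fill_%06d.dat" % index)
        remaining = file_size
        try:
            with open(path, "wb") as handle:
                written.append(path)
                while remaining > 0:
                    step = min(chunk_size, remaining)
                    handle.write(chunk[:step])
                    remaining -= step
        except OSError as exc:
            # Only "no space left on device" means we are done.
            if exc.errno != errno.ENOSPC:
                raise
            break
        index += 1
    return written
```

Calling fill_filesystem("/mnt/msufs10_a") would fill that mount with 4GB files; max_files is only there so the function can be tried out on a small scale first.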

find /mnt/msufs10_a -type f | xargs -I QQQ md5sum QQQ > msufs10_a.md5sum &
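
Once the md5sum run has finished, the output file can be checked for files whose checksum differs from the rest. The sketch below assumes the standard md5sum output layout (digest, a space, a mode character, then the file name); the function name is made up for this example.

```python
from collections import Counter


def odd_checksums(md5sum_lines):
    """Return (most common digest, [(digest, path), ...] for files that differ)."""
    entries = []
    for line in md5sum_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        digest, _, rest = line.partition(" ")
        # After the first space comes one mode character (' ' or '*'),
        # then the file name.
        entries.append((digest, rest[1:]))
    if not entries:
        return None, []
    common = Counter(digest for digest, _ in entries).most_common(1)[0][0]
    return common, [(d, p) for d, p in entries if d != common]
```

Feeding it the lines of msufs10_a.md5sum should list only the truncated final files; anything else in the list deserves a closer look.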

Patrol Read

It seems good to complete one full patrol read pass if there is time. This may reveal media errors, or even result in a disk being ejected from an array.

Patrol reads can be initiated using omconfig:

omconfig storage controller controller=0 action=startpatrolread

Check that the patrol read has finished by looking for the 99.99% progress entries in the exported controller log:

[root@msufs10 ~]# grep 99.99 /var/log/lsi_0605.log     
06/04/10 21:10:29: EVT#04040-06/04/10 21:10:29:  94=Patrol Read progress on PD 22(e0x29/s7) is 99.99%(65429s)
06/04/10 21:17:43: EVT#04049-06/04/10 21:17:43:  94=Patrol Read progress on PD 24(e0x29/s9) is 99.99%(327s)
06/04/10 21:21:28: EVT#04055-06/04/10 21:21:28:  94=Patrol Read progress on PD 25(e0x29/s1) is 99.99%(552s)
06/04/10 21:21:46: EVT#04056-06/04/10 21:21:46:  94=Patrol Read progress on PD 28(e0x29/s11) is 99.99%(570s)
06/04/10 21:23:49: EVT#04061-06/04/10 21:23:49:  94=Patrol Read progress on PD 2a(e0x36/s2) is 99.99%(693s)
06/04/10 21:24:08: EVT#04062-06/04/10 21:24:08:  94=Patrol Read progress on PD 2c(e0x36/s4) is 99.99%(712s)
06/04/10 21:24:41: EVT#04063-06/04/10 21:24:41:  94=Patrol Read progress on PD 33(e0x36/s0) is 99.99%(745s)
06/04/10 21:25:00: EVT#04064-06/04/10 21:25:00:  94=Patrol Read progress on PD 23(e0x29/s6) is 99.99%(764s)
06/04/10 21:26:08: EVT#04065-06/04/10 21:26:08:  94=Patrol Read progress on PD 2b(e0x36/s5) is 99.99%(832s)
06/04/10 21:28:06: EVT#04066-06/04/10 21:28:06:  94=Patrol Read progress on PD 21(e0x29/s8) is 99.99%(950s)
06/04/10 21:28:27: EVT#04067-06/04/10 21:28:27:  94=Patrol Read progress on PD 1d(e0x29/s2) is 99.99%(971s)
06/04/10 21:28:45: EVT#04068-06/04/10 21:28:45:  94=Patrol Read progress on PD 2e(e0x36/s8) is 99.99%(989s)
06/04/10 21:29:51: EVT#04069-06/04/10 21:29:51:  94=Patrol Read progress on PD 20(e0x29/s3) is 99.99%(1055s)
06/04/10 21:30:11: EVT#04070-06/04/10 21:30:11:  94=Patrol Read progress on PD 1e(e0x29/s5) is 99.99%(1075s)
06/04/10 21:31:01: EVT#04071-06/04/10 21:31:01:  94=Patrol Read progress on PD 32(e0x36/s1) is 99.99%(1125s)
06/04/10 21:31:58: EVT#04073-06/04/10 21:31:58:  94=Patrol Read progress on PD 35(e0x36/s11) is 99.99%(1182s)
06/04/10 21:32:31: EVT#04074-06/04/10 21:32:31:  94=Patrol Read progress on PD 27(e0x29/s10) is 99.99%(1215s)
06/04/10 21:32:59: EVT#04075-06/04/10 21:32:59:  94=Patrol Read progress on PD 26(e0x29/s0) is 99.99%(1243s)
06/04/10 21:33:25: EVT#04076-06/04/10 21:33:25:  94=Patrol Read progress on PD 1f(e0x29/s4) is 99.99%(1269s)
06/04/10 21:34:08: EVT#04077-06/04/10 21:34:08:  94=Patrol Read progress on PD 34(e0x36/s10) is 99.99%(1312s)
06/04/10 21:34:26: EVT#04078-06/04/10 21:34:26:  94=Patrol Read progress on PD 31(e0x36/s9) is 99.99%(1330s)
06/04/10 21:35:53: EVT#04080-06/04/10 21:35:53:  94=Patrol Read progress on PD 30(e0x36/s6) is 99.99%(1417s)
06/04/10 21:36:51: EVT#04081-06/04/10 21:36:51:  94=Patrol Read progress on PD 2d(e0x36/s3) is 99.99%(1475s)
06/04/10 21:37:21: EVT#04082-06/04/10 21:37:21:  94=Patrol Read progress on PD 2f(e0x36/s7) is 99.99%(1505s)
06/04/10 22:35:55: EVT#04129-06/04/10 22:35:55:  94=Patrol Read progress on PD 03(e0x0f/s2) is 99.99%(5019s)
06/04/10 22:55:15: EVT#04130-06/04/10 22:55:15:  94=Patrol Read progress on PD 0d(e0x0f/s10) is 99.99%(6179s)
06/04/10 22:57:30: EVT#04131-06/04/10 22:57:30:  94=Patrol Read progress on PD 0a(e0x0f/s9) is 99.99%(6314s)
06/04/10 22:57:42: EVT#04132-06/04/10 22:57:42:  94=Patrol Read progress on PD 19(e0x1c/s0) is 99.99%(6326s)
06/04/10 22:59:42: EVT#04133-06/04/10 22:59:42:  94=Patrol Read progress on PD 05(e0x0f/s4) is 99.99%(6446s)
06/04/10 23:01:08: EVT#04134-06/04/10 23:01:08:  94=Patrol Read progress on PD 0b(e0x0f/s1) is 99.99%(6532s)
06/04/10 23:03:59: EVT#04135-06/04/10 23:03:59:  94=Patrol Read progress on PD 17(e0x1c/s9) is 99.99%(6703s)
06/04/10 23:04:11: EVT#04136-06/04/10 23:04:11:  94=Patrol Read progress on PD 0e(e0x0f/s11) is 99.99%(6715s)
06/04/10 23:04:34: EVT#04137-06/04/10 23:04:34:  94=Patrol Read progress on PD 14(e0x1c/s8) is 99.99%(6738s)
06/04/10 23:04:57: EVT#04138-06/04/10 23:04:57:  94=Patrol Read progress on PD 0c(e0x0f/s0) is 99.99%(6761s)
06/04/10 23:05:03: EVT#04139-06/04/10 23:05:03:  94=Patrol Read progress on PD 09(e0x0f/s6) is 99.99%(6767s)
06/04/10 23:06:19: EVT#04140-06/04/10 23:06:19:  94=Patrol Read progress on PD 16(e0x1c/s6) is 99.99%(6843s)
06/04/10 23:06:23: EVT#04141-06/04/10 23:06:23:  94=Patrol Read progress on PD 07(e0x0f/s8) is 99.99%(6847s)
06/04/10 23:06:50: EVT#04142-06/04/10 23:06:50:  94=Patrol Read progress on PD 08(e0x0f/s7) is 99.99%(6874s)
06/04/10 23:08:12: EVT#04143-06/04/10 23:08:12:  94=Patrol Read progress on PD 06(e0x0f/s3) is 99.99%(6956s)
06/04/10 23:08:34: EVT#04144-06/04/10 23:08:34:  94=Patrol Read progress on PD 04(e0x0f/s5) is 99.99%(6978s)
06/04/10 23:12:42: EVT#04145-06/04/10 23:12:42:  94=Patrol Read progress on PD 12(e0x1c/s4) is 99.99%(7226s)
06/04/10 23:13:26: EVT#04146-06/04/10 23:13:26:  94=Patrol Read progress on PD 15(e0x1c/s7) is 99.99%(7270s)
06/04/10 23:13:32: EVT#04147-06/04/10 23:13:32:  94=Patrol Read progress on PD 1b(e0x1c/s11) is 99.99%(7276s)
06/04/10 23:14:19: EVT#04148-06/04/10 23:14:19:  94=Patrol Read progress on PD 10(e0x1c/s2) is 99.99%(7323s)
06/04/10 23:15:07: EVT#04149-06/04/10 23:15:07:  94=Patrol Read progress on PD 13(e0x1c/s3) is 99.99%(7371s)
06/04/10 23:16:35: EVT#04150-06/04/10 23:16:35:  94=Patrol Read progress on PD 11(e0x1c/s5) is 99.99%(7459s)
06/04/10 23:18:08: EVT#04151-06/04/10 23:18:08:  94=Patrol Read progress on PD 18(e0x1c/s1) is 99.99%(7552s)
06/04/10 23:19:31: EVT#04152-06/04/10 23:19:31:  94=Patrol Read progress on PD 1a(e0x1c/s10) is 99.99%(7635s)
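
With this many drives it is easy to miss one that never reached 99.99%. The following sketch pulls the progress entries out of the log and keeps the highest percentage seen per PD. It assumes the line format shown in the excerpt above; the function name is illustrative, and I have not checked whether the controller also logs a separate "complete" event.

```python
import re

PROGRESS = re.compile(
    r"Patrol Read progress on PD ([0-9a-fA-F]+)"
    r"\(e0x([0-9a-fA-F]+)/s(\d+)\) is ([\d.]+)%\((\d+)s\)"
)


def patrol_read_progress(lines):
    """Map PD id (hex string) -> (enclosure, slot, highest percent, seconds)."""
    latest = {}
    for line in lines:
        match = PROGRESS.search(line)
        if match is None:
            continue
        pd, encl, slot, pct, secs = match.groups()
        record = (encl, int(slot), float(pct), int(secs))
        # Keep the entry with the highest percentage for each PD.
        if pd not in latest or record[2] >= latest[pd][2]:
            latest[pd] = record
    return latest
```

Comparing the keys of the result against the full list of PDs, and checking that each percentage is 99.99, shows which drives still have a patrol read outstanding.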

Dump controller logs and look for errors

Use omconfig to dump the controller's internal log:

omconfig storage controller controller=0 action=exportlog

Then look for a file named /var/log/lsi_MMDD.log (for example, /var/log/lsi_0605.log in the patrol read check above).

Some strings to "grep -i" for: "medium", "error", "warning", "unexpected"
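
The same search can be done in one pass over the exported log. This is a small sketch equivalent to running "grep -i" for each string; the function name is an assumption, and the keyword list is just the one given above.

```python
KEYWORDS = ("medium", "error", "warning", "unexpected")


def suspicious_lines(lines, keywords=KEYWORDS):
    """Yield (line_number, line) for lines containing any keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(keyword in text for keyword in lowered):
            yield number, line.rstrip("\n")
```

Note that this also picks up the "corrected medium error" events, which is intended: those are worth reviewing even though the controller recovered from them.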

Device IDs

The controller uses hex IDs for the physical devices (PDs) and the enclosures (shelves). On the test setup with four 12-disk shelves, the controller log shows this:

T29: Total Device = 52

T29: PD   Flags    State Type Size     S N Vendor   Product          Rev  P C ID SAS Addr         Port Phy DevH BFw  BRev
T29: ---  -------- ----- ---- -------- - - -------- ---------------- ---- - - -- ---------------- ---- --- ---- ---- ----
T29: 3    f1c0000f 00020 00   e8e088af 0 0 0 SEAGATE  ST32000444SS     KS65 0 0 0d 5000c500103604ba 00   18  0d    NA   NA
T29:                                                                      1 0 15 5000c500103604b9 01   18  15
T29: 4    f1c0000f 00020 00   e8e088af 0 0 0 SEAGATE  ST32000444SS     KS65 0 0 0e 5000c5001044d6c6 00   19  0e    NA   NA
T29:                                                                      1 0 16 5000c5001044d6c5 01   19  16
T29: 5    f1c0000f 00020 00   e8e088af 0 0 0 SEAGATE  ST32000444SS     KS65 0 0 0f 5000c5001044ff7e 00   1a  0f    NA   NA
T29:                                                                      1 0 17 5000c5001044ff7d 01   1a  17
.
.
.
T29: 36   01c0000f 00020 0d   0 0 0 0 DELL     MD1200           1.01 0 0 6b 500c04f2a1a932bd 00   24  6b    NA   NA
T29:                                                                      1 0 78 500c04f2a1a9323d 01   24  78
T29: 100  00400005 00020 03   0 0 0 0 LSI      SMP/SGPIO/SEP    0729 0 0 ffff                0 00   ff  00    NA   NA

Lower in the logs, you can find a set of entries that include both the PD index in hex and the enclosure ID/slot number:

T29: EVT#01168-T29:  91=Inserted: PD 03(e0x0f/s2)

T29: EVT#01169-T29: 247=Inserted: PD 03(e0x0f/s2) Info: enclPd=0f, scsiType=0, portMap=10, sasAddr=5000c500103604ba,5000c500103604b9
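
To turn those entries into a lookup table from PD index to physical location, something like the sketch below can be used. It assumes the "Inserted: PD xx(e0xNN/sM)" format shown above, with the PD index and enclosure ID in hex and the slot number in decimal; the function name is made up for illustration.

```python
import re

INSERTED = re.compile(
    r"Inserted: PD ([0-9a-fA-F]+)\(e0x([0-9a-fA-F]+)/s(\d+)\)"
)


def pd_locations(lines):
    """Map PD index (int) -> (enclosure id (int), slot (int))."""
    locations = {}
    for line in lines:
        match = INSERTED.search(line)
        if match:
            pd, encl, slot = match.groups()
            # PD index and enclosure id are hex; the slot is decimal.
            locations[int(pd, 16)] = (int(encl, 16), int(slot))
    return locations
```

This makes it easier to walk up to the right shelf and slot when a later log entry only names a PD.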

Medium Errors

05/31/10  1:42:03: EVT#01493-05/31/10  1:42:03:  47=Background Initialization corrected medium error (VD 01/1 at a75eef08, PD 17(e0x1c/s9) at a75eef08)
05/31/10  1:42:03: EVT#01494-05/31/10  1:42:03:  47=Background Initialization corrected medium error (VD 01/1 at a75eef09, PD 17(e0x1c/s9) at a75eef09)
05/31/10  1:42:03: EVT#01495-05/31/10  1:42:03:  47=Background Initialization corrected medium error (VD 01/1 at a75eef47, PD 17(e0x1c/s9) at a75eef47)
05/31/10  1:42:03: EVT#01496-05/31/10  1:42:03:  47=Background Initialization corrected medium error (VD 01/1 at a75eef6d, PD 17(e0x1c/s9) at a75eef6d)

06/04/10 19:39:34: DEV_REC:Medium Error DevId[21] devHandle 48 RDM=807ab800 retires=0
06/04/10 19:39:34: prCallback: Medium Error on pd=21, StartLba=a7856f9b, ErrLba=a7856fda
06/04/10 19:39:34: EVT#03907-06/04/10 19:39:34: 110=Corrected medium error during recovery on PD 21(e0x29/s8) at a7856fda
06/04/10 19:39:34: EVT#03908-06/04/10 19:39:34:  93=Patrol Read corrected medium error on PD 21(e0x29/s8) at a7856fda
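
A single bad sector can show up in several events (as with PD 21 above, where event types 110 and 93 both report the same LBA), so when tallying it helps to count distinct LBAs per drive rather than log lines. A sketch, assuming the event wording shown in these excerpts and using a made-up function name:

```python
import re
from collections import defaultdict

CORRECTED = re.compile(
    r"corrected medium error.*?PD ([0-9a-fA-F]+)"
    r"\(e0x([0-9a-fA-F]+)/s(\d+)\) at ([0-9a-fA-F]+)",
    re.IGNORECASE,
)


def corrected_medium_errors(lines):
    """Map 'PD(e0xENCL/sSLOT)' -> sorted list of distinct error LBAs (hex)."""
    lbas = defaultdict(set)
    for line in lines:
        match = CORRECTED.search(line)
        if match:
            pd, encl, slot, lba = match.groups()
            lbas["%s(e0x%s/s%s)" % (pd, encl, slot)].add(lba.lower())
    return {key: sorted(values) for key, values in lbas.items()}
```

A drive whose list keeps growing between exports is the one to watch, and the one to have the numbers ready for when talking to the vendor.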

Update firmware

Update firmware to current levels.

Contacting Dell

Have contacted Dell about corrected medium errors on new disks... No warranty replacement until (or unless) the drive is ejected from the array by the controller.

-- TomRockwell - 07 Jun 2010
Topic revision: r3 - 31 Jul 2013 - 21:22:07 - JamesKoll
 
