This page is obsolete
Hardware maintenance is now logged at http://glpi.aglt2.org/
  MSU Hardware Repairs 
Until we have a better system, I'm recording hardware repairs here.  General service work and changes are recorded in a separate section below.
  Kernel Panics by Date 
Kernel crashes not attributed to a specific hardware problem.  (Machine check exceptions belong in the hardware errors section below.)
 
-  Jan 2008, dc2-102-4 has crashed several times while running ATLAS jobs.
-  Jan ??? 2008, dc2-102-5 crashed.  Rebuilt.
-  Jan 2008. Found many "Web100" crashes.  Changed to a different kernel to resolve them.
-  Mar 10, 2008. dc2-102-4 "nfs_access_zap_cache".  Another kernel crash on this stupid node.
  Hardware Errors and/or Parts Replacement by Node 
  dc2-102-4 
When the mpmemory program is run from the Dell Diagnostic, it says that DIMM4 and DIMM8 transitioned...
The System Event Log (SEL) shows memory ECC warnings:
bash-3.00# ipmitool -H 10.10.3.251 -U root -P PASSWORD sel list
   1 | 02/11/2008 | 03:57:38 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | 02/11/2008 | 05:13:15 | Memory #0x1b | Transition to Non-critical from OK
   3 | Pre-Init Time-stamp   | Physical Security #0x73 | General Chassis intrusion | Asserted
   4 | Pre-Init Time-stamp   | Physical Security #0x73 | General Chassis intrusion | Deasserted
   5 | 02/11/2008 | 06:39:04 | Memory #0x1b | Transition to Non-critical from OK
   6 | Pre-Init Time-stamp   | Physical Security #0x73 | General Chassis intrusion | Asserted
   7 | Pre-Init Time-stamp   | Physical Security #0x73 | General Chassis intrusion | Deasserted
bash-3.00# ipmitool -H 10.10.3.251 -U root -P PASSWORD sel elist
   1 | 02/11/2008 | 03:57:38 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
   2 | 02/11/2008 | 05:13:15 | Memory Mem ECC Warning | Transition to Non-critical from OK
   3 | Pre-Init Time-stamp   | Physical Security Intrusion | General Chassis intrusion | Asserted
   4 | Pre-Init Time-stamp   | Physical Security Intrusion | General Chassis intrusion | Deasserted
   5 | 02/11/2008 | 06:39:04 | Memory Mem ECC Warning | Transition to Non-critical from OK
   6 | Pre-Init Time-stamp   | Physical Security Intrusion | General Chassis intrusion | Asserted
   7 | Pre-Init Time-stamp   | Physical Security Intrusion | General Chassis intrusion | Deasserted
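For a node with a long SEL, the same listing can be filtered by script instead of read by eye. The following is a minimal sketch, not a tool used on this cluster: the function names `parse_sel` and `memory_events` are hypothetical, and it assumes the pipe-separated layout shown above, with record IDs printed in hex and "Pre-Init Time-stamp" standing in for the date and time columns.

```python
def parse_sel(text):
    """Parse `ipmitool sel list` / `sel elist` output into a list of dicts."""
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue  # shell prompts, blank lines, other non-record text
        try:
            record_id = int(fields[0], 16)  # assumption: IDs are hex
        except ValueError:
            continue
        if fields[1] == "Pre-Init Time-stamp":
            # Records logged before the BMC clock was set have no date/time.
            date, time, rest = None, None, fields[2:]
        else:
            date, time, rest = fields[1], fields[2], fields[3:]
        # Some records omit the trailing Asserted/Deasserted column.
        rest = rest + [None] * (3 - len(rest))
        events.append({
            "id": record_id,
            "date": date,
            "time": time,
            "sensor": rest[0],
            "description": rest[1],
            "state": rest[2],
        })
    return events


def memory_events(events):
    """Keep only records whose sensor column starts with 'Memory'."""
    return [e for e in events if e["sensor"].startswith("Memory")]
```

Feeding it the captured output of either command above and calling `memory_events(parse_sel(output))` should leave only the two memory records (IDs 2 and 5 in this log).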
  dc2-104-?  
-  Dec 15, 2007 Replaced DIMM
  dc2-104-? 
 
-  Dec 15, 2007 Replaced DIMM
  msufs01 
 
-  Jan 11, 2008 Replaced the left (active) EMM on the bottom MD1000 shelf of msufs01
  msufs04  
 
-  Jan 16, 2008 Replaced PERC 5/E card
  Cluster Service or Changes 
  2008 
 
-  Feb 28, 2008
   -  Put one Gore cable into the stacking ring
   -  Gore 10GE cables on msufs04 and msufs05
   -  Re-ziptied power cords on compute nodes to make them truly secure
 
-  Mar 7, 2008 Installed PERC 6/E cards in msufs01-msufs05.  Rearranged SAS cables to bundle at the left rear side.  Labeled shelves and cables A through D.
-  Mar 7, 2008 Put Gore CX4 cables on msufs01-msufs03
-- 
TomRockwell - 16 Jan 2008