Drill Preparation*

Check that switches have settings saved
List machines with known issues
Notify users of downtime (done)
Contact George, Ehren, LONCAPA (Stuart and Gerd) to notify them of work (done)
Contact fire dept to see if they want someone on hand? (probably not)
Preparations to video tape and document procedure
Ensure Senna will boot without network
Plug Senna into Juniper
Clean humidifier on Tuesday

Shutdown steps

Begin idling jobs down 2 days in advance
Shutdown dcache servers gracefully, setting the pool to rdonly, see ShutDownPoolNode)
Shutdown worker nodes down gracefully, except for one rack* (begin idling down worker nodes with ClusterControl about 48 hours in advance)
Shutdown UPS gracefully
Turn off power to racks 101-110, 111-120. DO NOT turn off racks 121-122 (LONCAPA)

Minimal Recovery steps

Check VESDA. central PDU, and CRAC status
Turn power back on
1. If power outage is caused by an EPO and you do not understand the cause, then you may want to power off all of the whips and switch on racks individually.
Turn on UPS
Start rack 101, verify senna and cap are up, verify KVM is up
Check that networking is functioning
1. traceroute 10.10.1.183 (bambi.local)
2. traceoute 10.10.136.11 (blade switch msu-sw-bn1)
3. traceroute 10.10.136.20 (dell switch msu-sw-101-top)
Bring up VMWare rack, turn on VM hardware, verify VMs are running
1. Check that DNS is working
2. Verify the following VMs are running:
  1. msuinfo, msuinfox
  2. msurxx
  3. msut3-rx6
  4. msurx6
  5. dcdmsu01,2,3
  6. msu-winsvc2
  7. omd-msu (will help diagnose problems)
Open up a browser window pointing to https://omd-msu.aglt2.org/aglt2msu/check_mk/index.py. It will be a mess of red right now, but should get better as the recovery goes on. If you can't figure out why something is going wrong, see if you can find the host in question and if it has any strange error messages.
Bring up PerfSonar boxes
1. Verify they are up
  1. ping psmsu01-06
Verify entire network is working
1. ping the blade switches
  1. python pingtest.py bladeswitches.csv
2. ping dell switches
  1. python pingtest.py dellswitches.csv
3. ping PDUs
  1. python pingtest.py pdus.csv
4. ping UPSs
  1. python pingtest.py upss.csv
Bring up minimal HEP cluster functionality
1. Verify hep1 is up, can log into machines and see home area
2. Bring up raida file server (hep4)
3. Verify /work/raida is visible from HEPCluster
Bring up dcache servers
1. UM uses MSU dCache, so this is a priority
2. Power on a rack of machines
  1. Ping the host
  2. Check that attached storage is mounted properly
    1. ssh root@hostname mount
3. Verify that they can see both network's gateways
  1. ssh root@hostname ping 10.10.128.1
  2. ssh root@hostname ping 192.41.236.1
4. dCache should start up automatically on boot, verify with UM
Power on and verify squids are running
1. ping -c 1 cache0
2. ping -c 1 cache1
Bring up tier 3 racks
1. Check that msut3-rx6/msu3/msu4/msut3-condor/data servers(list) are available
  1. python pingtest.py tier3servers.csv
2. Start green
3. Are work and home directories mounted correctly?
  1. /home/
  2. /msu/data/t3work1-8
  3. /msu/data/t3work1-4
  4. /msu/data/martin
  5. /msu/opt/cern/
Bring up compute nodes
1. Run shoot-node to all worker nodes*
2. Power on a rack
3. Start with ClusterControl
4. Verify with UM that they are running properly
Bring up tier 3 nodes
1. Decide whether to rebuild nodes*
2. Turn on nodes
3. Check that nodes are online
  1. python pingtest.py tier3nodes.csv
4. Check node status
  1. ssh root@mxut3-rx6
    1. rocks run host rack115 'hostname;/etc/check-v2.sh' >rack115status.txt
    2. rocks run host rack113 'hostname;/etc/check-v2.sh' >rack113status.txt
  2. There are probably a couple problem hosts, but as long as most are okay, go ahead and start the services.
5. Start condor services
  1. ssh root@msut3-condor 'condor service start'
  2. ssh root@msut3-rx6
    1. rocks run host rack115 '/etc/check-v2.sh && condor service start'
    2. rocks run host rack113 '/etc/check-v2.sh && condor service start'
6. Verify that a job can be submitted and run successfully
  1. From green
```
/msu/data/t3work1/scripts/runcommand.sh echo Hello world
Your job has been submitted. Details of your job can be found at
/msu/data/t3work7/tmp/kollshJn54Tuj/info.txt

The job output will be printed below:
---------------
Hello world
---------------
Reminder: Your job details are at
/msu/data/t3work7/tmp/kollshJn54Tuj/info.txt
Job 137895 has finished running.
```

Further checks

Verify noncritical servers are up and running
1. msucfe
2. hep2
3. hep3
4. hep5
5. www-aglt2-org
6. msubck
7. white
8. omd-msu
9. msu-cobbler
10. msurx6
11. blue
12. vSRA
13. OpenManage_MSU-2.0.0
14. psmsuvm01
15. psmsuvm02
16. hx1, hx3 (needed by pumplin for compiling)
Verify HEP cluster services are working
1. Printing
2. Web server (VM hep-pa-msu-edu)
Verify tier 3 services are working on green
1. cvmfs
```
s /cvmfs/atlas.cern.ch/
```
  Should see "repo" listed

Notes

dCache pool nodes may have buggy network card, doesn't see 10G, needs reboot (or reload kernel module for card [MIRCOM])
Use a script on msut3-rx6 at /export/rocks/install/tools to turn on PDU outlets
- ```
   sh power.sh (nodename)   
```
The following one liner may be handy for automating commands for many nodes:
- ```
seq 1 42 | xargs -I QQQ echo cc-102-QQQ 
```
The order in which servers come up impact the HEP desktop cluster services. In general, the yp server must come up before other servers, specifically hep4 (raid nfs export), and hep1 (service home web areas). If it doesn't come up first, you have to manually restart some services on messed up services.
- hep4 -
```
service ypbind restart; service nfs restart
```
- hep1 -
```
service ypbind restart; service httpd restart
```

More information

December2012Drill

-- JamesKoll - 11 Dec 2012

Topic revision: r12 - 21 Jun 2014, JamesKoll

AGLT2

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback