MSUFireDrill
These instructions use scripts and files found on senna at /home/koll/

* indicates a step with drill-specific instructions

Drill Preparation*

  • Check that switches have settings saved
  • List machines with known issues
  • Notify users of downtime (done)
  • Contact George, Ehren, LONCAPA (Stuart and Gerd) to notify them of work (done)
  • Contact the fire department to ask whether they want someone on hand (probably not)
  • Prepare to videotape and document the procedure
  • Ensure Senna will boot without network
  • Plug Senna into Juniper
  • Clean humidifier on Tuesday

Shutdown steps

  1. Begin idling jobs down 2 days in advance
  2. Shut down the dCache servers gracefully, setting each pool to read-only (rdonly); see ShutDownPoolNode
  3. Shut down the worker nodes gracefully, except for one rack* (begin idling them down with ClusterControl about 48 hours in advance); a sample per-rack shutdown loop is sketched after this list
  4. Shut down the UPS gracefully
  5. Turn off power to racks 101-110, 111-120. DO NOT turn off racks 121-122 (LONCAPA)
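
A rough sketch of the graceful worker-node shutdown in step 3, assuming the nodes are already idle and reachable over ssh as root; the cc-102-NN hostname pattern is only an example borrowed from the notes at the bottom of this page, and ClusterControl remains the normal tool for idling nodes down:
    # Illustrative only: halt one rack of already-idle worker nodes over ssh.
    # Substitute the real rack's hostname pattern for cc-102-$n.
    for n in $(seq 1 42); do
        ssh root@cc-102-$n 'shutdown -h now' &
    done
    wait    # wait for all of the ssh commands to return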

Minimal Recovery steps

  1. Check VESDA, central PDU, and CRAC status
  2. Turn power back on
    1. If the power outage was caused by an EPO and you do not understand the cause, you may want to power off all of the whips and switch the racks back on individually.
  3. Turn on UPS
  4. Start rack 101, verify senna and cap are up, verify KVM is up
  5. Check that networking is functioning
    1. traceroute 10.10.1.183 (bambi.local)
    2. traceroute 10.10.136.11 (blade switch msu-sw-bn1)
    3. traceroute 10.10.136.20 (dell switch msu-sw-101-top)
  6. Bring up the VMware rack, turn on the VM hardware, and verify the VMs are running
    1. Check that DNS is working (a quick check is sketched after this list)
    2. Verify the following VMs are running:
      1. msuinfo, msuinfox
      2. msurxx
      3. msut3-rx6
      4. msurx6
      5. dcdmsu01,2,3
      6. msu-winsvc2
      7. omd-msu (will help diagnose problems)
  7. Open up a browser window pointing to https://omd-msu.aglt2.org/aglt2msu/check_mk/index.py. It will be a mess of red right now, but should get better as the recovery goes on. If you can't figure out why something is going wrong, see if you can find the host in question and if it has any strange error messages.
  8. Bring up PerfSonar boxes
    1. Verify they are up
      1. ping psmsu01-06
  9. Verify the entire network is working (a shell sketch of the pingtest.py-style checks appears after this list)
    1. ping the blade switches
      1. python pingtest.py bladeswitches.csv
    2. ping dell switches
      1. python pingtest.py dellswitches.csv
    3. ping PDUs
      1. python pingtest.py pdus.csv
    4. ping UPSs
      1. python pingtest.py upss.csv
  10. Bring up minimal HEP cluster functionality
    1. Verify hep1 is up and that you can log into machines and see the home area
    2. Bring up raida file server (hep4)
    3. Verify /work/raida is visible from HEPCluster
  11. Bring up the dCache servers (a per-rack check loop is sketched after this list)
    1. UM uses MSU dCache, so this is a priority
    2. Power on a rack of machines
      1. Ping the host
      2. Check that attached storage is mounted properly
        1. ssh root@hostname mount
    3. Verify that they can see both networks' gateways
      1. ssh root@hostname ping 10.10.128.1
      2. ssh root@hostname ping 192.41.236.1
    4. dCache should start up automatically on boot, verify with UM
  12. Power on and verify squids are running
    1. ping -c 1 cache0
    2. ping -c 1 cache1
  13. Bring up tier 3 racks
    1. Check that msut3-rx6/msu3/msu4/msut3-condor/data servers(list) are available
      1. python pingtest.py tier3servers.csv
    2. Start green
    3. Are work and home directories mounted correctly?
      1. /home/
      2. /msu/data/t3work1-8
      3. /msu/data/t3work1-4
      4. /msu/data/martin
      5. /msu/opt/cern/
  14. Bring up compute nodes
    1. Run shoot-node on all worker nodes*
    2. Power on a rack
    3. Start with ClusterControl
    4. Verify with UM that they are running properly
  15. Bring up tier 3 nodes
    1. Decide whether to rebuild nodes*
    2. Turn on nodes
    3. Check that nodes are online
      1. python pingtest.py tier3nodes.csv
    4. Check node status
      1. ssh root@msut3-rx6
        1. rocks run host rack115 'hostname;/etc/check-v2.sh' >rack115status.txt
        2. rocks run host rack113 'hostname;/etc/check-v2.sh' >rack113status.txt
      2. There are probably a couple of problem hosts, but as long as most are okay, go ahead and start the services.
    5. Start condor services
      1. ssh root@msut3-condor 'service condor start'
      2. ssh root@msut3-rx6
        1. rocks run host rack115 '/etc/check-v2.sh && service condor start'
        2. rocks run host rack113 '/etc/check-v2.sh && service condor start'
    6. Verify that a job can be submitted and run successfully
      1. From green
        /msu/data/t3work1/scripts/runcommand.sh echo Hello world
        Your job has been submitted. Details of your job can be found at
        /msu/data/t3work7/tmp/kollshJn54Tuj/info.txt
        
        The job output will be printed below:
        ---------------
        Hello world
        ---------------
        Reminder: Your job details are at
        /msu/data/t3work7/tmp/kollshJn54Tuj/info.txt
        Job 137895 has finished running.
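
A quick way to check DNS for step 6, as an illustration only; omd-msu.aglt2.org is taken from the monitoring URL above, and the reverse lookup is only useful if PTR records exist for the private range:
    # Forward lookup of a host named on this page; it should resolve promptly.
    host omd-msu.aglt2.org
    # Reverse lookup of a private address used in the traceroute checks above.
    host 10.10.136.11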
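
The pingtest.py checks in steps 9, 13, and 15 use the script and CSV files from /home/koll/ on senna. A rough shell equivalent, assuming each CSV line begins with a pingable hostname or address:
    # Illustrative stand-in for "python pingtest.py bladeswitches.csv":
    # ping the host in the first column of each CSV line once and report.
    while IFS=, read -r host rest; do
        [ -z "$host" ] && continue
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            echo "$host up"
        else
            echo "$host DOWN"
        fi
    done < bladeswitches.csv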
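
For the dCache checks in step 11, the same thing can be done a rack at a time; the poolnodes.txt host list and the dcache mount pattern below are placeholders, not names taken from this page:
    # For each pool node, confirm the attached storage is mounted and that
    # both gateways answer (addresses are the ones listed in step 11).
    for h in $(cat poolnodes.txt); do
        echo "== $h =="
        ssh root@$h 'mount | grep dcache; ping -c 1 10.10.128.1 >/dev/null && echo "private gateway ok"; ping -c 1 192.41.236.1 >/dev/null && echo "public gateway ok"'
    done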

Further checks

  1. Verify noncritical servers are up and running
    1. msucfe
    2. hep2
    3. hep3
    4. hep5
    5. www-aglt2-org
    6. msubck
    7. white
    8. omd-msu
    9. msu-cobbler
    10. msurx6
    11. blue
    12. vSRA
    13. OpenManage_MSU-2.0.0
    14. psmsuvm01
    15. psmsuvm02
    16. hx1, hx3 (needed by pumplin for compiling)
  2. Verify HEP cluster services are working
    1. Printing
    2. Web server (VM hep-pa-msu-edu)
  3. Verify tier 3 services are working on green
    1. cvmfs
      ls /cvmfs/atlas.cern.ch/
      Should see "repo" listed
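
If the repo directory does not show up, one further check worth trying, assuming the standard CernVM-FS client tools are installed on green:
    # Ask the cvmfs client to (re)mount and verify the ATLAS repository.
    cvmfs_config probe atlas.cern.ch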

Notes

  • dCache pool nodes may have a buggy network card that doesn't see 10G and needs a reboot (or a reload of the kernel module for the card [MIRCOM])
  • Use a script on msut3-rx6 at /export/rocks/install/tools to turn on PDU outlets
    •    sh power.sh (nodename)   
  • The following one-liner may be handy for automating commands on many nodes (an example of combining it with ssh follows this list):
    • seq 1 42 | xargs -I QQQ echo cc-102-QQQ 
  • The order in which servers come up affects the HEP desktop cluster services. In general, the yp server must come up before the other servers, specifically hep4 (raida NFS export) and hep1 (home and web areas). If it doesn't come up first, you have to manually restart some services on the affected servers.
    • hep4 -
      service ypbind restart; service nfs restart
    • hep1 -
      service ypbind restart; service httpd restart
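
As an example of putting that one-liner to work, the generated names can be fed straight to ssh (uptime here is just a stand-in for whatever command needs to run on each node):
    # Generate cc-102-1 ... cc-102-42 and run a command on each over ssh.
    # The -n keeps ssh from swallowing the remaining hostnames on stdin.
    seq 1 42 | xargs -I QQQ echo cc-102-QQQ | xargs -I HOST ssh -n root@HOST uptime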

More information

December2012Drill

-- JamesKoll - 11 Dec 2012