Shutdown/Startup procedures for AGLT2 Clusters

Procedures to cleanly bring all AGLT2 activity to a halt

  • service cfengine3 stop
    • This prevents the changes below from being overwritten
  • Modify /etc/cron.d/modify_quota task on aglbatch to have fixed, low quotas for non-ATLAS jobs.
    • Change the script arguments from 1 to 0 0, as in the following line
    • 27,57 * * * * /bin/bash /root/condor_quota_config/new_modify_quota.sh 0 0
  • Stop auto pilots to AGLT2 and ANALY_AGLT2
    • You may want to delay the ANALY shutoff, depending on queuing information
    • AtlasQueueControl details this procedure
  • Notify pandashift and adcos that pilots are stopped, and the reason
    • Identify yourself as being with AGLT2, e.g., include your DN
  • Once the pilots are confirmed to be stopped and there are no further Idle jobs on gate04, do a peaceful shutdown of Condor on all worker nodes
    • cluster_control from a participating machine (aglbatch, umt3int0X) is best; a manual alternative is sketched after this list
      • Option 9 with the full machine list, for example, two parallel runs, one with UM_machines.txt, one with MSU_machines.txt
      • Option 2, and specify the Email address to receive shutdown notifications
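
If cluster_control is not available, the peaceful shutdown can be driven by hand. The following is only a minimal sketch, assuming passwordless root ssh to the worker nodes and the same machine-list files (UM_machines.txt / MSU_machines.txt); the script name is hypothetical, and unlike cluster_control it does not update the cluster_control DB.

    #!/bin/bash
    # Peacefully idle the condor startd on every node in a machine list.
    # Usage: ./idle_nodes.sh UM_machines.txt   (hypothetical helper name)
    LIST=${1:?usage: $0 machine_list_file}
    while read -r node; do
        [ -z "$node" ] && continue                     # skip blank lines
        echo "Idling condor on $node"
        ssh -n -o ConnectTimeout=10 "root@$node" \
            "condor_off -peaceful -subsys startd"      # same command as the per-machine method later on this page
    done < "$LIST"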

Now, wait for the actual down-time to arrive.

Stopping ONLY MSU machines, or ONLY UM machines

  • If storage is to be shut down, set an outage in OIM starting a few hours prior to the expected outage time, ending an hour or so afterwards
  • >36 hours prior to the "off" time, log in to AGIS and change the "maxtime" parameter from 259000 to 86400 for all of our Panda queues, specifically AGLT2_SL6, ANALY_AGLT2_SL6 and AGLT2_MCORE (in the case of MSU only).
  • ~26 hours prior to the "off" time (which leaves a few hours of cushion), start idling down the relevant machines
    • Use cluster_control option 2 with the list of machines as above
      • OR
    • Log in to each machine and do "condor_off -peaceful -subsys startd"
      • This method will NOT update the cluster_control DB, so the next cluster_control run will show all of the machines as idled
  • Log in to AGIS again and change the "maxtime" parameter back from 86400 to 259000 for all changed queues.
  • A few hours prior to the expected power outage for the storage servers, log in to head01 and set all affected disks "rdonly" from the PoolManager (a sketch follows this list)
  • When ready to shut down everything,
    • Power off the WN via "shutdown -h now"
    • Power off the storage servers
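
Setting the pools read-only can be scripted against the dCache admin interface. This is only a sketch under assumptions: it assumes the admin shell is reachable as admin@head01 on the ssh admin port (22224 here; adjust for your dCache version), that the affected pool names are listed one per line in a hypothetical pools.txt, and that your PoolManager release accepts the "psu set pool <name> rdonly" / "notrdonly" commands; verify against your dCache documentation before use.

    #!/bin/bash
    # Sketch: mark each listed pool read-only (or read-write again) in the PoolManager
    # via the dCache admin shell. Assumed: admin ssh on port 22224, pool names in pools.txt.
    MODE=${1:-rdonly}                 # use "notrdonly" when the servers are back up
    {
        echo "cd PoolManager"
        while read -r pool; do
            [ -z "$pool" ] && continue
            echo "psu set pool $pool $MODE"
        done < pools.txt
        echo ".."
        echo "logoff"
    } | ssh -p 22224 admin@head01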

Before powering off the storage servers, it is probably a good idea to set queues offline so that jobs won't fail by trying to fetch files from the offline pools.
  • curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --capath /etc/grid-security/certificates 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=ANALY_AGLT2_SL6-condor'
  • Run the same command for AGLT2_SL6-condor (a combined sketch follows this list)
  • Notify ADCOS shifts <atlas-project-adc-operations-shifts@cern.ch> that you have set the queues temporarily offline, and why
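
The two curl calls can be wrapped in a small helper so that the offline and online changes stay symmetric. This is a sketch only; the script name and structure are illustrative, while the URL, proxy path and queue names are taken from the commands above.

    #!/bin/bash
    # Set both of our Panda queues offline or online via the Panda controller.
    # Usage: ./aglt2_queue_state.sh setoffline   (or setonline)   -- hypothetical helper name
    ACTION=${1:?usage: $0 setoffline or setonline}
    PROXY=/tmp/x509up_u$(id -u)                   # valid grid proxy required
    for Q in ANALY_AGLT2_SL6-condor AGLT2_SL6-condor; do
        curl --cert "$PROXY" --cacert "$PROXY" --capath /etc/grid-security/certificates \
            "https://panda.cern.ch:25943/server/controller/query?tpmes=${ACTION}&queue=${Q}"
    done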

When power is back
  • If the pool servers were shut down, set the disks back to "notrdonly" in the head01 PoolManager once the servers are back up
  • If the Panda queues were set offline, set them back online (tpmes=setonline)
    • Notify ADCOS that the queues are back online
  • Power up all WN that were shut down
  • Run "/etc/check.sh" on all machines where condor should come back up. There are 2 ways to do this.
    • Use cluster_control option 3 on the machines
    • Log in, run it, and if the result is zero, do "service condor start" (a sketch of this loop follows the list)
      • This method will not update the cluster_control DB
  • service cfengine3 start
    • At the next cfengine run, the modified cron scripts will be reverted to their original content
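
As with the shutdown, the manual restart method can be looped over a machine list. A minimal sketch, assuming passwordless root ssh to the workers and the same machine-list files; it starts condor only where /etc/check.sh exits zero, and it does not update the cluster_control DB.

    #!/bin/bash
    # Run the health check on each node and start condor only where it passes.
    LIST=${1:?usage: $0 machine_list_file}
    while read -r node; do
        [ -z "$node" ] && continue                    # skip blank lines
        if ssh -n -o ConnectTimeout=10 "root@$node" "/etc/check.sh"; then
            ssh -n "root@$node" "service condor start"
            echo "$node: condor started"
        else
            echo "$node: check.sh failed, condor NOT started" >&2
        fi
    done < "$LIST"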

-- BobBall - 13 Feb 2011