Procedure for rebuilding a compute node

In general, rebuilding a compute node is fairly easy, and ROCKS should be maintained so that compute nodes can be rebuilt at any time. However, rebuilding nodes in a production cluster poses some difficulties, and there are some problems that may arise and should be checked for.

Steps to rebuild an active compute node (example for dc2-102-1):

  • verify that the ROCKS database is set (various checks)
    • minimally, verify that dbreport kickstart dc2-102-1 works
  • verify that the node is set to PXE boot
    • for a Dell node: ssh dc2-102-1 /home/install/tools/set-bootseq
  • set the node to install in the ROCKS db
    • rocks set host pxeboot dc2-102-1 action=install
  • shut down the node's condor processes peacefully (must be done on the condor admin node). This immediately prevents new jobs from starting, and condor_startd will exit once all existing jobs have finished naturally.
    • umopt1# condor_off -peaceful c-102-1
  • the condor_waiter script watches for condor_startd to exit and then runs the command given. The following reboots the node once all jobs have ended peacefully:
    • ssh dc2-102-1 '/home/install/tools/condor_waiter "shutdown -r now"'
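The steps above can be collected into a small dry-run sketch. It only prints the commands in order for a given node, so they can be reviewed and then run by hand on the appropriate machine. The function name rebuild_commands and its two arguments are hypothetical; the commands themselves are the ones listed above, and the sketch assumes the node's ROCKS name and its condor name are supplied separately, as in the example (dc2-102-1 and c-102-1).

```shell
#!/bin/sh
# Dry-run sketch: prints the rebuild commands for one node instead of running them.
# rebuild_commands is a hypothetical helper, not part of ROCKS or Condor.
rebuild_commands() {
    node="$1"         # ROCKS host name, e.g. dc2-102-1
    condor_name="$2"  # name passed to condor_off, e.g. c-102-1

    # 1. Minimal database check: the kickstart file must generate.
    printf '%s\n' "dbreport kickstart $node"
    # 2. Dell nodes only: make sure the node is set to PXE boot.
    printf '%s\n' "ssh $node /home/install/tools/set-bootseq"
    # 3. Set the node to install in the ROCKS db.
    printf '%s\n' "rocks set host pxeboot $node action=install"
    # 4. Run on the condor admin node: no new jobs start, running jobs finish.
    printf '%s\n' "condor_off -peaceful $condor_name"
    # 5. Reboot the node once condor_startd has exited.
    printf '%s\n' "ssh $node '/home/install/tools/condor_waiter \"shutdown -r now\"'"
}

# Example: list the commands for the node used in this procedure.
rebuild_commands dc2-102-1 c-102-1
```

Printing rather than executing is deliberate here: step 4 must be run on the condor admin node, while the other steps are run from wherever the ROCKS tools are available, so a single script cannot safely run all five in one place.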

-- TomRockwell - 02 Mar 2008

Topic revision: 09 Jul 2008, TomRockwell