Procedure for rebuilding a compute node
In general, rebuilding a compute node is fairly easy, and ROCKS should be maintained so that compute nodes can be rebuilt at any time. However, rebuilding nodes in a production cluster has some difficulties, and there are some problems that may arise and should be checked for.
Steps to rebuild an active compute node (example for dc2-102-1):
- verify that the ROCKS database is set up correctly for the node (various checks)
- minimally, verify that
dbreport kickstart dc2-102-1 works
- verify that the node is set to PXE boot
- for a Dell node:
ssh dc2-102-1 /home/install/tools/set-bootseq
- set the node to install in the ROCKS db
- shut down the node's condor processes peacefully (must be done on the condor admin node). This immediately prevents new jobs from starting, and condor_startd will exit once all existing jobs have finished naturally.
umopt1# condor_off -peaceful c-102-1
- the condor_waiter script watches for condor_startd to exit and then runs the given command, so the following will reboot the node once all jobs have peacefully ended:
ssh dc2-102-1 '/home/install/tools/condor_waiter "shutdown -r now"'
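The wait-then-run behaviour described in the last step can be sketched as below. This is only an illustration and not the actual /home/install/tools/condor_waiter script: the function name wait_then_run, its argument order, and the polling interval are assumptions, and it relies on pgrep being available on the node.

```shell
#!/bin/sh
# Hypothetical sketch of a "wait for a daemon to exit, then run a command"
# helper. This is NOT the real condor_waiter; names and arguments are
# assumptions made for illustration.
#
# Usage: wait_then_run PROCESS_NAME POLL_SECONDS COMMAND [ARGS...]
wait_then_run() {
    proc=$1
    interval=$2
    shift 2

    # pgrep -x succeeds while a process with exactly this name is running,
    # so keep sleeping until it stops matching.
    while pgrep -x "$proc" >/dev/null 2>&1; do
        sleep "$interval"
    done

    # The watched process is gone; run the requested command.
    "$@"
}

# Example (left commented out so that sourcing this file has no side effects):
# wait_then_run condor_startd 60 shutdown -r now
```

Because the command is only run after the watched process has disappeared, it is safe to start the helper right after condor_off -peaceful; the reboot is simply deferred until the last job has finished.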
- 02 Mar 2008