Build/Rebuild a ROCKS Worker Node
Recipe for rebuilding a ROCKS worker node, including defining the node on the frontend using the "nodeinfo" setup. The default ROCKS way of adding nodes uses "insert-ethers"; however, that is targeted more at building new hosts (it uses MAC discovery). This procedure is more reliable for moving existing nodes to a new frontend. More details are given in the reference at the end of this document.
NOTE that this section may well be out of date, and in fact differs dramatically from the procedure used recently at UM. See below for more recent steps.
- Check that the node is not already defined anywhere: no other machine should answer DHCP requests for it (extra DHCP responses can be quite confusing)
- rocks remove host cc-115-5
- rocks sync config
- Ensure that the node is registered in the .csv table at /export/rocks/install/tools/nodeinfo/nodeinfo.csv
- on the ROCKS frontend that the host is being declared to, run /export/rocks/install/tools/rocks-add-host cc-115-5
- can see that host is defined with "rocks list host" or "rocks dump | grep cc-115-5"
- node boot action should be "install", can list with "rocks list host boot"
- do "shoot-node" or "shoot-node-repartition" to initiate rebuild
- repartition wipes the drives, is used when a filesystem is possibly corrupted or if the partitioning is changed
- if node is accessible via ssh, the shoot-node command will try to connect and reboot it
- if node is offline, power it up
- /export/rocks/install/tools/shoot-node cc-115-5
- /export/rocks/install/tools/shoot-node-repartition cc-115-5
- can ssh to node during build:
- ssh -p 2200 cc-115-5
- Add a sleep timer to the post-install script at (where?)
- ROCKS launches a vnc window with the installer GUI
- Verify that cfengine completed
- You could tail /var/cfengine/promise_summary.log
- Run a ps command and confirm that all "rocks post" scripts have finished
- The full cf3 run log is in file "/tmp/cf3_initial_run.log.debug"
- Run check script (/etc/check-v2.sh)
- At UM, first mount lustre
- service lustre_mount_umt3 start
- The outputs should all be 0
- If not, see reference here
- start condor (service condor start)
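The frontend-side steps at the start of this procedure can be sketched as a small dry-run helper. This is only an illustration that assembles the commands quoted above in the order they are listed; the function names `rebuild_commands` and `print_plan` are hypothetical, and nothing is executed. Note that `rocks remove host` is only relevant if the host is already defined on that frontend.

```python
import shlex

# Tool directory as given in the steps above.
TOOLS = "/export/rocks/install/tools"


def rebuild_commands(host, repartition=False):
    """Return the frontend commands, in order, for rebuilding `host`.

    Hypothetical helper: it only builds argument lists and runs nothing.
    """
    shoot = "shoot-node-repartition" if repartition else "shoot-node"
    return [
        ["rocks", "remove", "host", host],   # clear any stale definition
        ["rocks", "sync", "config"],         # push the change to DHCP etc.
        [f"{TOOLS}/rocks-add-host", host],   # define the host from nodeinfo.csv
        ["rocks", "list", "host", "boot"],   # boot action should read "install"
        [f"{TOOLS}/{shoot}", host],          # start the rebuild
    ]


def print_plan(host, repartition=False):
    """Print the commands for review before running them by hand."""
    for argv in rebuild_commands(host, repartition):
        print(shlex.join(argv))
```

Calling `print_plan("cc-115-5")` prints the five commands for review; passing `repartition=True` swaps in shoot-node-repartition for the final step.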
Newer directions (10/16/2014) for building a WN from scratch
These directions were written with the newest R620 in mind. The assumption is that the IP address set has never been used before. Also, it is not clear how much of this applies to the MSU T3.
- Get the public IP assigned (ask Bob or Shawn once the IP is selected from the free set) and update local DNS in svn
- Wait for the next cf3 runs on the DNS servers to complete
- Update the cluster-control DB and nodeinfo base csv file
- These are in svn. The cluster-control nodeinfo.csv file should be modified both in prototype in svn, and in the cluster-control directory on data08
- Update the appropriate machine lists, for example, UM_machines.txt and machines.txt
- Add the machine to the Rocks DB. See sample command sets at this URL
- The particulars of the IP and MAC addresses should be adjusted accordingly
- Sync the rocks DB to the dhcpd service (rocks sync config)
- Make sure the chosen host name is properly detailed in the cf3 file condor_t2.cf
- condor_msut3.cf for the MSU T3
- Boot the machine to BIOS and make all appropriate changes as detailed here
- PXE boot the machine to build it within Rocks.
- Do the Post Build section above
- The check scripts must be updated. For example, some of the scripts check for specific versions of software that may have changed.
- Most tier 3 nodes are still configured to be built from the tier 2 frontend. They should be moved over to the new tier 3 frontend at some point.
The usage of nodeinfo is encapsulated in the script /export/rocks/install/tools/rocks-add-host, which uses the Python nodeinfo library at /export/rocks/install/tools/nodeinfo/nodeinfo.py. It makes use of a CSV text file that contains information on each node. The CSV file is located at
An example line from the CSV is shown below:
The elements of this CSV are defined below:
| Example value | Field |
| | Private hostname |
| c-115-5 | Public hostname |
| msulocal | Private domain |
| | Condor config?? |
| NO | Machine is up? |
| NO | Condor is on? |
| 115 | Rocks rack |
| 5 | Rocks rank? |
| T3 | Rocks membership |
| 10.10.128.250 | Private IP |
| 18.104.22.168 | Public IP |
| 00:21:9b:92:1d:3c | Private MAC |
| 00:21:9b:92:1d:3e | Public MAC |
| 22.214.171.124 | Ganglia mcast? |
| rac-115-5 | RAC name? |
| 10.10.130.250 | RAC IP? |
| ??? | RAC MAC? |
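To make the field list concrete, here is a minimal sketch of reading one such line into named fields. It is not the actual nodeinfo.py API. The field names are invented for illustration, the column order is assumed to match the order of the table above, and the sample line is assembled from the table values with a guessed private hostname.

```python
import csv

# Illustrative names only; order assumed to follow the table above.
FIELDS = [
    "private_hostname", "public_hostname", "private_domain", "condor_config",
    "machine_up", "condor_on", "rack", "rank", "membership",
    "private_ip", "public_ip", "private_mac", "public_mac",
    "ganglia_mcast", "rac_name", "rac_ip", "rac_mac",
]

# Hypothetical sample line; the private hostname "cc-115-5" is an assumption.
SAMPLE = (
    "cc-115-5,c-115-5,msulocal,,NO,NO,115,5,T3,10.10.128.250,18.104.22.168,"
    "00:21:9b:92:1d:3c,00:21:9b:92:1d:3e,22.214.171.124,rac-115-5,"
    "10.10.130.250,???"
)


def parse_node_line(line):
    """Split one CSV line into a dict keyed by FIELDS."""
    row = next(csv.reader([line]))
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
    return dict(zip(FIELDS, (value.strip() for value in row)))


def find_node(lines, hostname):
    """Return the first node whose private or public hostname matches."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        node = parse_node_line(line)
        if hostname in (node["private_hostname"], node["public_hostname"]):
            return node
    return None
```

A lookup such as `find_node([SAMPLE], "c-115-5")` returns the parsed record, which is a quick way to confirm that a host is registered before running rocks-add-host.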
The check script runs a set of scripts that check the configuration of a freshly built machine to make sure it was set up correctly. The check script is located at
It runs all of the scripts located in the directory
The output from these scripts is redirected to
The outputs from each script should be 0. If not, the specific script that failed should be investigated.
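The pass/fail rule above can be sketched as follows. This is not the actual check script. It assumes that the individual checks are shell scripts named `*.sh`, that "output" means the text each script prints, and that the directory is supplied by the caller; `failed_checks` is a hypothetical name.

```python
import os
import subprocess


def failed_checks(check_dir):
    """Run each *.sh script in check_dir and return the names that failed.

    Assumption: a check passes when its printed output is exactly "0".
    """
    failed = []
    for name in sorted(os.listdir(check_dir)):
        if not name.endswith(".sh"):
            continue  # assumption: only shell scripts are checks
        path = os.path.join(check_dir, name)
        result = subprocess.run(["sh", path], capture_output=True, text=True)
        if result.stdout.strip() != "0":
            failed.append(name)
    return failed
```

An empty list means every check printed 0; any names returned are the scripts to investigate individually.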
- 06 Nov 2012 -- JamesKoll
- 07 Nov 2012