Cluster Control

This is the main page for information about the Cluster Monitoring and Control tools. At this time, the state of two conditions is maintained and monitored using these tools. Those conditions are:

Power state, as reflected by response to ping packets
Condor state, i.e., whether the worker node is accepting jobs to run

NOTE that this is a MONITORING TOOL ONLY, intended to maintain up-to-date lists of these status indicators. However, other tools such as the push_cmd.sh script (see next paragraph) may initiate actions based upon these conditions that will ultimately change them; e.g., running the enhanced mode of the eval_T2_condor_ready.sh script on a worker node may result in setting the Condor state to running.

In addition, a tool (push_cmd.sh) is provided that takes, as input, any command that should be executable via ssh and runs it on the provided list of machines. The list is filtered by default to run on only those machines that have a Condor state accepting jobs to run. Further filters and options are provided by the help option (-h) of the tool.

Installation

InstallPage (the installation page is password protected)

Setup Script

Following installation, each of the selected machines (all of which must have the Condor schedd running) picks up the Cluster Control variables from a script named clusco_setup.sh (bash only) that is placed in /etc/profile.d. In particular, the path to the Cluster Control commands is added to the PATH variable.
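As a rough illustration of the PATH handling described above, a profile.d fragment might look like the sketch below. This is not the real clusco_setup.sh: the variable name CLUSCO_DIR is hypothetical, and the directory is only assumed to be the one that appears in the usage messages further down this page.

```shell
# Hypothetical sketch of an /etc/profile.d fragment; NOT the real clusco_setup.sh.
# CLUSCO_DIR is an illustrative name, and the directory is an assumption taken
# from the usage messages of the get/set scripts.
CLUSCO_DIR="${CLUSCO_DIR:-/atlas/data08/manage/cluster}"

# Append the directory to PATH only if it is not already there, so that
# sourcing the fragment more than once does not keep growing PATH.
case ":${PATH}:" in
    *":${CLUSCO_DIR}:"*) ;;
    *) PATH="${PATH}:${CLUSCO_DIR}" ;;
esac
export CLUSCO_DIR PATH
```

The membership test keeps the fragment safe to source repeatedly, which matters for nested login shells.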

Control Interface and command scripts

This interface is not yet implemented. When complete, it will provide access to the command scripts below.

get

This provides access to two shell scripts, one each to get the Power State or Condor State of the indicated machine.

get_condor_state.sh

Usage: /atlas/data08/manage/cluster/get_condor_state.sh machine
       Return the condor run/stop state of the specified machine.
       Returns zero if stopped/unknown, 1 if startd is running

get_power_state.sh

Usage: /atlas/data08/manage/cluster/get_power_state.sh machine
       Return the power on/off state of the specified machine.
       Returns zero if off/unknown, 1 if machine is running
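The usage text does not say whether "Returns" means the exit status or a printed value. The sketch below assumes it is the exit status, which would be the reverse of the usual shell convention (where 0 means success), and wraps both scripts in hypothetical helpers that behave like ordinary shell predicates. The helper names and the GET_CONDOR_STATE / GET_POWER_STATE variables are illustrative only.

```shell
# Assumption: "Returns" refers to the exit status of the script
# (0 = stopped/off/unknown, 1 = running). If the scripts print the value
# instead, compare their output rather than the exit status.
GET_CONDOR_STATE="${GET_CONDOR_STATE:-get_condor_state.sh}"
GET_POWER_STATE="${GET_POWER_STATE:-get_power_state.sh}"

# Succeeds (exit 0) only when the startd is reported as running.
condor_is_running() {
    rc=0
    "$GET_CONDOR_STATE" "$1" >/dev/null 2>&1 || rc=$?
    [ "$rc" -eq 1 ]
}

# Succeeds (exit 0) only when the machine is reported as powered on.
node_is_up() {
    rc=0
    "$GET_POWER_STATE" "$1" >/dev/null 2>&1 || rc=$?
    [ "$rc" -eq 1 ]
}
```

With these helpers a script could write `if condor_is_running bl-2-19; then ...; fi` without having to remember the inverted return values. Any other status, including "command not found", is treated as stopped or unknown.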

set

This provides access to two shell scripts, one each to set the Power State or Condor State of the indicated machine.

set_condor_state.sh

Usage: /atlas/data08/manage/cluster/set_condor_state.sh machine state
       Set the indicated Condor state as up or down
       State should be yes(up) or no(down) and is case independent
       If marking UP (yes), also set the power state UP

set_power_state.sh

Usage: /atlas/data08/manage/cluster/set_power_state.sh machine state
       Set the indicated machine as up or down
       State should be yes or no (case independent)
       Machines marked down are also marked with condor off
       No condor change is made to machines marked up
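Because the set scripts have side effects (marking a machine down also marks Condor off, and marking Condor up also marks the power state up), it can be worth validating the state argument before calling them. The following is a minimal sketch of such a check; normalize_state, set_power_checked and the SET_POWER_STATE variable are hypothetical names, not part of the Cluster Control tools.

```shell
# Hypothetical helper, not part of the Cluster Control tools.
# Prints "yes" or "no" for any capitalization of those words, fails otherwise.
normalize_state() {
    state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$state" in
        yes|no) printf '%s\n' "$state" ;;
        *) echo "state must be yes or no, got: $1" >&2; return 1 ;;
    esac
}

# Validate the state first, then pass machine and state to set_power_state.sh.
# Usage: set_power_checked machine state
set_power_checked() {
    state=$(normalize_state "$2") || return 1
    "${SET_POWER_STATE:-set_power_state.sh}" "$1" "$state"
}
```

The scripts already accept either case, so lowercasing is cosmetic; the useful part is rejecting a mistyped state before any DB change is attempted.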

cluster_control

The controlling script cluster_control is the best interface to all of the above commands, plus the push_cmd.sh tool documented below. It operates in two modes, interactive and command-line. The interactive version, i.e. the one invoked with no arguments whatsoever, presents a list of options (below) and then prompts for any additional parameters needed to perform the selected action.

*****************
Choose an Option
   1. What are the assumptions here?
   2. Peacefully Idle a Worker Node
   3. Start up Condor on a Worker Node
   4. Immediately Stop Condor on a Worker Node
   5. Get Power state of a Worker Node from the DB
   6. Change DB Power state of a Worker Node
   7. Get Condor state of a Worker Node from the DB
   8. Change DB Condor state of a Worker Node
   9. If doing many machines, name of file containing list []
   10. Execute a command on the machine list above
   11. Print help text
   12. Quit

The command-line version is designed for inclusion within scripts. Both versions take either an input file containing a list of machines to operate upon, one machine per line, or a comma-separated list of machines. A single wildcard (*) is allowed in each "machine" in the list. Special machine list names are T2WN, UMWN and MSUWN.
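The "one wildcard per entry" rule can be checked before a list is handed to the tools. The sketch below is a hypothetical pre-check, not the validation that cluster_control itself performs; the function name valid_machine_list is illustrative.

```shell
# Hypothetical pre-check, not the validation done inside cluster_control.
# Succeeds if every comma-separated entry is non-empty and contains at most
# one "*" wildcard.
valid_machine_list() {
    printf '%s\n' "$1" | awk -F, '
        {
            for (i = 1; i <= NF; i++) {
                entry = $i
                stars = gsub(/\*/, "", entry)   # count the wildcards in this entry
                if ($i == "" || stars > 1) bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }'
}
```

For example, `valid_machine_list 'bl-2-*,bl-6-8'` succeeds, while an entry such as `bl-*-*` is rejected because it contains two wildcards.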

NOTE: Both the interactive and command line versions of cluster_control use ONLY the local network machine names when given a list of machines to operate upon.

Option 10 is not (yet) available in the command-line version. Option 9 is not implemented in this mode.

 
cluster_control -h
Usage:  cluster_control  Full Interactive invocation, no arguments accepted
        cluster_control [-h|--help]
                        -c|--command <command number>
                        -f|--filter machine_names
                        [-s|--state ON|OFF]
                        [-e|--email email_address]

          command and filter are always required if any arguments are supplied
          at all, with the exception of the help command

          -h|--help       Print this help text and exit
          -c|--command <command_number>   The command to execute.  run the
                          full interactive command to see the legal list
          -f|--filter machine_names   Execute command_number on these machines.
                          This is a comma separated list, with each entry
                          including up to one wildcard.  There are 3 special
                          wildcards that over-ride any single machine(s);
                          T2WN, MSUWN and UMWN.  The first is exclusive, the
                          last 2 can be combined (but just equal T2WN then)
          -s|--state ON|OFF   The state to set in commands that require
                          this argument.
          -e|--email email_address   The address to which command completion
                          notifications are sent, for commands that require
                          an email address

          Unknown or invalid options cause cluster_control to abort with an
          error code.

Example commands

cluster_control -c 5 -f bl-2-19,bl-2-2
cluster_control -c 2 -f UMWN -e ball@umich.edu
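Since the help text describes -f as a comma-separated list, a script that starts from a one-host-per-line file needs to join the names first. A minimal sketch follows, assuming cluster_control is on the PATH; hosts_to_filter, power_state_from_file and the CLUSTER_CONTROL variable are hypothetical names introduced for illustration.

```shell
# Hypothetical helpers for scripted use; only -c and -f come from the help text.
CLUSTER_CONTROL="${CLUSTER_CONTROL:-cluster_control}"

# Join a one-host-per-line file into a comma-separated list, skipping blank
# lines and keeping only the first word of each line.
hosts_to_filter() {
    awk 'NF { printf "%s%s", sep, $1; sep = "," } END { if (sep) print "" }' "$1"
}

# Run option 5 (get the DB power state) for every host named in a file.
power_state_from_file() {
    "$CLUSTER_CONTROL" -c 5 -f "$(hosts_to_filter "$1")"
}
```

Because unknown or invalid options make cluster_control abort with an error code, a calling script can test the exit status of power_state_from_file directly.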

Maintenance Scripts

Various scripts are provided.

crontask.sh

This is a task that can be run daily via a crontab entry to validate the state of the maintained DataBase. No actual changes to the DataBase are made. It uses seed_up_states.sh in a "safe" mode.

crontask.sh reports

The task is set up to send an Email (default to aglt2-hardware list) with subject "Cluster Control DB Inconsistencies" if a discrepancy of any kind is found. An example of such an Email is shown here.

        For more information
 See https://hep.pa.msu.edu/twiki/bin/view/AGLT2/ClusterControl

 LocalName   PublicName  Subnet   Type    POWER  CONDOR
65c65
< "bl-6-8","bl-6-8","local","T2_T3_Share","YES","YES"
---
> "bl-6-8","bl-6-8","local","T2_T3_Share","YES","NO"
293c293
< "cc-106-41","c-106-41","msulocal","T2only","YES","YES"
---
> "cc-106-41","c-106-41","msulocal","T2only","NO","YES"
328c328
< "cc-117-2","c-117-2","msulocal","ALL12","YES","YES"
---
> "cc-117-2","c-117-2","msulocal","ALL12","NO","YES"

In the above report:

bl-6-8: Condor was stopped in such a way that the DB entry was not updated.
- Solution: set_condor_state.sh bl-6-8 NO
cc-106-41 and cc-117-2: Condor is apparently still running (as reported by condor_status), but the machine shows as down. A machine is treated as up only when all 4 ping packets sent to it are answered, so at least one packet was lost for each of these 2 machines.
- Solution: No action required; the problem was certainly transient.
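The all-or-nothing ping test explains why a single lost packet flags a healthy machine as down. The sketch below shows one way such a test could be written; it is an illustration only, not the code used by the tools. It assumes the Linux iputils ping, whose summary line has the form "4 packets transmitted, 4 received, ...", and the function names are hypothetical.

```shell
# Illustration only; the actual test used by the tools is not shown on this page.
# Assumes the iputils ping summary line, for example:
#   4 packets transmitted, 4 received, 0% packet loss, time 3004ms

# Reads ping output on standard input; succeeds only if the number of
# packets received equals the expected count given as $1.
all_pings_answered() {
    awk -v want="$1" '
        / packets transmitted/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^received/) got = $(i - 1)
        }
        END { exit (got == want ? 0 : 1) }'
}

# A host counts as up only if all 4 packets come back.
host_is_up() {
    ping -c 4 "$1" 2>/dev/null | all_pings_answered 4
}
```

A more tolerant variant could accept 3 of 4 replies, at the cost of being slower to notice a machine that is genuinely failing.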

seed_up_states.sh

Takes the DB and iterates through all machines in the list, checking their Power and Condor states. It is used both to set the initial state of the maintained DB and to monitor it via crontask.sh (above).

Admins, run this with caution.

Action Scripts

push_cmd.sh

Tool that takes a list of machines and runs a command for each of them, either on the machine itself via ssh or on the local host with the machine name appended as the last argument.

Usage:  /atlas/data08/manage/cluster/push_cmd.sh [-f filename|-m machine_list] \
            [-e UP|CONDOR|NONE] [-l|-r] [-h] command
  /atlas/data08/manage/cluster/push_cmd.sh executes commands on all the machines in a list.
  -h gives a brief command summary.
  If command type is -l, use ssh on the local NIC of the named host.
  If command type is -r (the default), run the command on this
     local host aglbatch, using the public NIC name of hosts from the
     named file as the last argument of the command.
  Use -f <filename> to name a file with a list of host names.
     Enter one host per line in the file, either public name or private.
     The domainname is not required.
     The default file name is $CWD/machines.txt
  Use -m <machine_list> to pass a list of comma-separated host names
     with each entry including up to one wildcard.  There are
     3 special wildcards that over-ride any single machine(s);
     T2WN, MSUWN and UMWN.  The first is exclusive, the
     last 2 can be combined (but just equal T2WN then).
  The -m and -f options are mutually exclusive.
  Use -e [UP|CONDOR|NONE] to exclude machines that are not UP,
     or not running CONDOR, or just exclude NONE at all in the list.
     ON is a synonym for UP
     RUN is a synonym for CONDOR
     By default, any machine not running CONDOR jobs will be excluded
  The balance of the command line will be the command to execute.
  For interactive machines, the $USER account will be used with sudo
  Command failures and machine exclusions are logged to
         $CWD/failed.list

Example commands

push_cmd.sh -f UM_machines.txt -e UP -l "date"
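Since failures and exclusions are logged to failed.list in the current directory rather than printed, a scripted run may want to surface that file afterwards. The following is a minimal sketch using only the options listed in the usage text. The wrapper name run_and_report and the PUSH_CMD variable are hypothetical, and setting aside any earlier failed.list is an assumption about what is acceptable locally; the page does not say whether push_cmd.sh appends to or overwrites that file.

```shell
# Hypothetical wrapper around push_cmd.sh; not part of the tools.
PUSH_CMD="${PUSH_CMD:-push_cmd.sh}"

run_and_report() {
    # Assumption: it is acceptable to set an older log aside so that only
    # entries from this run are reported.
    if [ -f failed.list ]; then
        mv failed.list failed.list.prev
    fi

    rc=0
    "$PUSH_CMD" "$@" || rc=$?

    # A non-empty failed.list means at least one machine failed or was excluded.
    if [ -s failed.list ]; then
        echo "Failed or excluded machines:" >&2
        cat failed.list >&2
        return 1
    fi
    return "$rc"
}
```

A call such as `run_and_report -m "bl-2-*" -e UP -l "date"` would then return non-zero whenever any machine in the list failed or was filtered out.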

-- BobBall - 30 May 2011

Topic revision: r8 - 10 Dec 2013, BobBall
