
Condor Batch System

This is the main page for administrative info about the Condor batch system(s) in use at AGLT2. User info is at CondorUser.

A description of the Condor setup at UM is given at Condor Configuration.

Rebooting Frontend While Preserving Running Jobs

The submit host maintains the condor shadow processes, one per job. There is no way to restart this host while jobs are running. If this is a requirement, get another batch system :-O

Client Health Check

We wish to have a health check for clients that can stop jobs from starting if problems are found with the host.

How clients are installed through ROCKS

The ROCKS 5 installer has a node condor-worker.xml that installs the Condor rpm and the various ancillary files needed for Condor, e.g., setups in /etc/profile.d.

The soft link "/opt/condor" always points to the version-dependent installation directory.

The directory /tmp/condor contains log, spool and execute sub-directories, and is excluded from tmpwatch cleanup.
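
How the exclusion is arranged depends on the OS image; the sketch below only illustrates the sort of tmpwatch invocation involved, and the flags and retention period shown are assumptions, not the actual node configuration.
# Illustrative excerpt from /etc/cron.daily/tmpwatch; -x excludes a path from cleanup
/usr/sbin/tmpwatch -umc -x /tmp/condor 240 /tmp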

Each worker node is individually configured for jobs via the condor_config.local file. The prototype for this file is specified using the host-specific Rocks 5 attribute "condor_conf" that points to a file in the include/file/condor directory of the Rocks 5 install tree. This prototype file is copied, with node- and site-specific modifications, to the /opt/condor/etc directory of the worker node during the Rocks 5 build process.

Changing the Condor configuration type

A text database file, nodeinfo.csv, is maintained with the current state of the condor configuration attribute "condor_conf" for all Rocks 5 worker nodes. If the value of the attribute for a node is changed, the /export/rocks/install/tools/nodeinfo/assemble_nodeinfo_csv.sh script must be run by an admin user (not root). This will generate a new version of the "nodeinfo.csv" file in a directory writable by the user, which must in turn be copied into the nodeinfo directory, replacing the now out of date version.

At any time, Condor can be stopped on a node, the nodeinfo.csv entry on the head node for the node can be changed to another type, and the condor_config.local file can be re-written using the following command:
/home/install/extras/condor_files/config_condor_prio.sh
This command remains to be updated for Rocks 5
When Condor is then restarted, the new configuration type will be in effect.
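
Taken together, a type change for one node might look like the sketch below. This is only a hedged outline: the node name and prototype file name are placeholders, the "rocks set host attr" syntax and the hosts on which each step runs are assumptions, and config_condor_prio.sh still needs its Rocks 5 update.
# On the head node, as an admin user (not root): change the attribute and rebuild nodeinfo.csv
rocks set host attr c-6-32 condor_conf new_prototype_file       # placeholder value; assumed Rocks 5 syntax
/export/rocks/install/tools/nodeinfo/assemble_nodeinfo_csv.sh
cp nodeinfo.csv /export/rocks/install/tools/nodeinfo/           # copy from wherever the script wrote it
# On the worker node, as root: stop Condor, re-write condor_config.local, restart Condor
service condor stop
/home/install/extras/condor_files/config_condor_prio.sh         # still to be updated for Rocks 5
service condor start
# From the Condor master, turn the daemons back on (see the startup changes below)
condor_on -name c-6-32.aglt2.org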

Condor startup changes, January 31, 2009

At Tom Rockwell's suggestion, the condor_config.local file on all compute nodes now includes the line
START_DAEMONS = False
This means that only the condor_master process starts on a compute node when the node boots and/or Condor is started. To actually let the node begin running Condor jobs, the following command must be issued from the Condor master node (aglbatch.aglt2.org, or msu-osg.aglt2.org):
condor_on -name <hostname>.aglt2.org

Shutting down Condor on a compute node without killing running jobs

A script has been written (actually two scripts, with the first transparently running the second) that "peacefully" sets a stop for the Condor processes on the designated compute node, waits until all jobs have completed, then completely stops Condor on the node and emails the designated contact person (an input to the script) that the node is idle. This is run by issuing the command
/home/install/extras/bin/notify_peaceful.sh
This command remains to be updated for Rocks 5
This can only be run by the root account, and only from aglbatch (or from msu-osg in the case of the MSU Tier-3). On aglbatch the alias 'node_stop' is defined for this command. The script will prompt for the required input if no arguments are supplied. An example of such a command is
node_stop bl-6-3 ball@umich.edu

This script internally makes use of 2 Condor commands:
condor_off -peaceful -subsystem startd -name <hostname>.aglt2.org
condor_off -subsystem master -name <hostname>.aglt2.org

condor_status is also used to get a count of active job slots, in any state, waiting for the value to drop to zero.
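
The production scripts are site-maintained; the following is only a minimal sketch of the logic described above, assuming the node name and notification address are passed as arguments and that a "mail" command is available on the host.
#!/bin/bash
# Minimal sketch of the peaceful-shutdown logic (NOT the production notify_peaceful.sh)
node=$1          # e.g. c-6-32
email=$2         # e.g. ball@umich.edu
# Ask the startd to stop accepting jobs and let running jobs complete
condor_off -peaceful -subsystem startd -name ${node}.aglt2.org
# Wait until the collector no longer reports any job slots for this node
while [ $(condor_status -format "%s\n" Name -constraint "Machine == \"${node}.aglt2.org\"" | wc -l) -gt 0 ]; do
    sleep 300
done
# Now stop Condor completely on the node and notify the contact person
condor_off -subsystem master -name ${node}.aglt2.org
echo "${node} is now idle and Condor has been stopped" | mail -s "Condor drained on ${node}" ${email}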

An example run of this script is shown below. User input to the script follows each prompt.

[aglbatch:~]# node_stop
Enter name of machine (either public or private) to be
peacefully shut down in Condor: c-6-32
Sent "Kill-All-Daemons-Peacefully" command to master c-6-32.aglt2.org
Enter notification Email, eg, ball@umich.edu: ball@umich.edu

Tom Rockwell has also found a one-line command to stop ALL nodes from starting new jobs, allowing running jobs to complete, while at the same time keeping the submit hosts from killing all of their shadow processes. This command is

condor_off -all -peaceful -subsystem startd

NOTE: This command will affect ALL nodes on the cluster

Using Cluster Control

As of Spring, 2012, the commands above have been incorporated into the Cluster Control suite of commands. From any UM Interactive machine, or from aglbatch, the cluster_control command can be issued, and the peaceful node shutdown option selected.

Constraints on Condor commands

It is possible to limit the output of condor_q and condor_status by adding a constraint to the command. For example, you can look only at Idle jobs, only at Running jobs, or only at jobs with a particular attribute. This feature is used, for example, in preparing this page for display. Some common constraints are:
  • JobStatus value of 1 (Idle), 2 (Running), or >2 (Held)
  • Is[Short | Medium | Test | Analy]Job will be TRUE if the job is in the specified queue
  • ImageSize > 200000 will look ONLY for jobs using more than 200MB of memory (ImageSize is recorded in KB)

If, for example, you wish to count all jobs of a given user that are running, no matter the submit host, use the following command:
condor_q -global -constraint 'JobStatus == 2' <username> | grep "\ R\ " | wc -l

Constraints can be strung together using operators such as OR (||) and AND (&&), and grouping using parentheses is also allowed. For example, to list all the running jobs of a user in the UM T3 Medium queue, issue the following command:
condor_q -global -constraint 'IsMediumJob == TRUE && JobStatus == 2' <username>

Compute node job slots will also accept a constraint-based query. For example, to see all UM T3 Medium queue slots that are idle, enter the following command:
condor_status -constraint 'Activity == "Idle" && IS_MEDIUM_QUEUE == TRUE'

To see all the possible variables in a job ClassAd that can be used in a constraint expression, use the "condor_q -long" command on a job. To see all the possible variables in a node job slot ClassAd that can be used in a constraint expression, use the "condor_status -long" command on a sample slot, e.g., "condor_status -l slot2@c-6-31.aglt2.org".
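
As a further (hedged) illustration, specific attributes can be pulled out directly with the -format option rather than listing the full ClassAd; IS_MEDIUM_QUEUE here is the same site-specific attribute used above.
# Print the slot name and current Activity for every UM T3 Medium queue slot
condor_status -format "%s " Name -format "%s\n" Activity -constraint 'IS_MEDIUM_QUEUE == TRUE'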

Site Startup/Shutdown procedures

It remains to write scripts to automate the Condor startup and shutdown procedures for AGLT2.

Site Startup

During the Rocks 5 refactor startup, the following procedures were followed. These revolved around files containing lists of machines, where each line of the list was the private-NIC name of a worker node. In all cases, it is assumed that the Condor service is started at the bootup of the node.

Some caveats of this:
  • Commands or scripts that must run on worker nodes should be launched from a user account, using ssh as root to the worker node
  • The /export/rocks/install/tools/nodeinfo files are used to translate between local and public names of worker nodes.
Current scripts
  • /atlas/data08/ball/admin/check_dccp.sh -- run on a worker node, this script ensures /pnfs is mounted, then does a dccp file copy to ensure dCache access works.
  • /atlas/data08/ball/admin/push_cmd.sh -- accepts a list of worker nodes and runs the specified command on each as root
  • Not a script, but we must also ensure the /tmp disk is available, e.g., "df -h|grep /tmp". If this command is run using push_cmd.sh, any failing node, along with the failing command, is appended to the "failed.list" file in the directory from which push_cmd.sh is run.
  • /root/cmd_exe.sh -- This exists on aglbatch.aglt2.org, accepts a list of worker nodes, accesses the nodeinfo.csv structure, and runs either a command on umopt1, or as ssh on the worker node (depending on a script argument). The structure of the accessible commands is quite limited in this script.
  • (root only) /atlas/data08/manage/tier3/2268-compute-power.sh on|off|status -- Controls power on the dc/dc2 compute nodes, sx-11-28, and bambi in the Tier3. Use it to power nodes on after an outage or in the event of cooling loss in the room. We do not set these nodes to power on automatically, because power outages often cause a loss of chilled-water cooling in the room. Confirm we are chilling, and then the nodes can be returned to service.
Once we are happy that dCache is accessible and the node is correctly configured, we run the "condor_on" command above to start the full set of Condor daemons on the target node.

For each worker node, an example sequence might be (a scripted version is sketched after this list):
  • ssh root@worker_node.local "df -h|grep /tmp"
  • ssh root@worker_node.local "/atlas/data08/ball/admin/check_dccp.sh"
  • condor_on -name worker_node.aglt2.org
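
A hedged sketch of such a wrapper is shown below, assuming the checks exit non-zero on failure, that machines.txt holds one private node name per line, and that the account running it can both ssh as root to the nodes and issue condor_on.
# Illustrative per-node startup loop (not a production script)
while read node; do
    ssh -n root@${node}.local "df -h|grep /tmp" || { echo "${node}: /tmp check failed"; continue; }
    ssh -n root@${node}.local "/atlas/data08/ball/admin/check_dccp.sh" || { echo "${node}: dCache check failed"; continue; }
    condor_on -name ${node}.aglt2.org
done < machines.txt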

Site Shutdown

The projected procedures to cleanly stop AGLT2 prior to a shutdown are as follows. Scripts to automate this are yet to be written.

  • As much in advance of a shutdown as possible, set up OIM and Nagios outages for our resources
  • Approximately 13 hours in advance, issue the commands found in AtlasQueueControl to set both our queues offline
  • When the number of Idle jobs on gate04 has dropped to zero, or at least sufficiently low, set all worker nodes to retire using the command above
  • After a short wait (~5 minutes), run "condor_status" and look through the output for any node that did not receive the peaceful shutdown command, then re-issue the command to that node (or simply re-issue it to every node, as sketched after this list). These commands use UDP packets and have been known to be missed by the target node when many of them are issued at once.
  • When the shutdown time arrives, condor_rm any job still running on a worker node from gate04.
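
Because re-issuing the peaceful shutdown is harmless on a node that already received it, a simple (hedged) alternative to parsing the condor_status output is to loop over the full node list again; machines.txt is the same style of node list used elsewhere on this page.
# Re-issue the peaceful retire command, one node at a time
sed 's/$/.aglt2.org/' machines.txt | xargs -n 1 condor_off -peaceful -subsystem startd -name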

Example Controlled shut down/restart, June 30, 2011

The following procedure was followed for the controlled shut down and restart on June 30, 2011. This took advantage of the Cluster Control DB, using the list of machines that were up in Condor before the shutdown, and maintaining that state without modification throughout.

Work performed on June 29

  • Comment this line in the crontab of root on gate02
    • 2,32 * * * * /bin/bash /root/tools/monitor_other_VO.sh
  • Ed stops the splitter for muon calibration
  • Turn off auto-pilots for both Analy and Prod queues (see AtlasQueueControl for details)

source /opt/osg/setup.sh
grid-proxy-init
(pw entered)
curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --capath /etc/grid-security/certificates 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=AGLT2_SL6-condor&comment="Stop new auto-pilots so we can idle down in time for 8am EDT scheduled outage."'
curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --capath /etc/grid-security/certificates 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=ANALY_AGLT2_SL6-condor&comment="Stop new auto-pilots so we can idle down in time for 8am EDT scheduled outage."'

  • Wait for all Idle jobs to be gone from gate04
  • Run the "minimal jobs" sequence of the quota modification crontab entry for root on aglbatch
    • /bin/bash /root/condor_quota_config/new_modify_quota.sh 0 0
  • Comment out the crontab entry so no further changes are made
    • 27,57 * * * * /bin/bash /root/condor_quota_config/new_modify_quota.sh 1 > /dev/null 2>&1
  • Set all Worker Nodes to a peaceful shutdown
    • Did NOT use the aglbatch "node_stop" command, as this would modify the Cluster Control DB
      • Scripted like so
        • push_cmd.sh -f machines.txt -r "condor_off -peaceful -subsys startd -name "
Work performed on June 30

  • Stop all jobs still running on gate02
    • condor_q -constr 'jobstatus==2 && JobUniverse != 12'|grep " R "|awk '{print $1}'|xargs -n 1 condor_rm
    • service condor stop
  • Stop all jobs still running on gate04
    • condor_q -constr 'jobstatus==2 && JobUniverse != 12'|grep " R "|awk '{print $1}'|xargs -n 1 condor_rm
    • service condor stop
  • Follow this same procedure on the 3 interactive machines, umt3int01/02/03
  • Umount /pnfs on all WN
    • push_cmd.sh -f machines.txt -l "umount -l /pnfs"
  • Stop access to Lustre pools from worker nodes
    • push_cmd.sh -f machines.txt -l "/atlas/data08/ball/admin/all_lustre_on_off.sh off"
  • Start serious work on network, etc
Bringing the site back online

As the Cluster Control DB was (likely) not modified from the last running state, the record of machines up or down in Condor is maintained there, and can be used as a basis for bringing all those machines up again. Some verification is best, for example, just running a "date" command on all these machines using the "push_cmd.sh" script. The following assumes this was done.

I leave out the fun with changing the osghome mount point as that is not generally relevant.

  • Remount pnfs
    • push_cmd.sh -f machines.txt -l "mount /pnfs"
  • Make Lustre file servers available again
    • push_cmd.sh -f machines.txt -l "/atlas/data08/ball/admin/all_lustre_on_off.sh on"
  • Re-enable Condor on all worker nodes
    • push_cmd.sh -f machines.txt -l "/etc/health.d/eval_T2_condor_ready.sh enh"
    • At 15 minute intervals, aglbatch will take the successfully checked machines and send them a "condor_on"
  • Re-enable the gate-keepers to submit condor jobs. As condor was stopped, this was done by rebooting them.
  • Check RSV. From gate04, manually run RSV probes for any that show as failed, e.g.
    • rsv-control --run --host gate02.grid.umich.edu org.osg.general.vdt-version
  • When all is ready, enable prod and analy for test jobs
    • Ask pandashift for Production test jobs
    • curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --capath /etc/grid-security/certificates 'https://panda.cern.ch:25943/server/controller/query?tpmes=settest&queue=ANALY_AGLT2_SL6-condor&comment=HC.Test.Me'
There is a choice here about what to do with gate02 jobs. For this outage, I chose to keep them limited to the level they would have if we were full up in ATLAS.

  • On aglbatch:
    • /bin/bash /root/condor_quota_config/new_modify_quota.sh 0
We could as well have chosen to let them have as many slots as possible. In that case, simply re-enable both the aglbatch and the gate02 cron tasks that were disabled during the shutdown sequence. In the case here, we wait until we have a reasonably full load of ATLAS jobs before re-enabling these tasks.

Quotas and Balancing of Condor Users

Four mechanisms are used to properly balance and prioritize jobs submitted to the Condor pools at AGLT2:
  • Modifications to the globus condor.pm file located at /opt/osg/globus/lib/perl/Globus/GRAM/JobManager/
    • /opt/osg is a soft link to the current osg:ce installation. At this writing this is osg-1.2.6
  • Assignment of every user to a Condor Accounting Group, each with an appropriate job slot quota
  • Interception and/or replacement of the condor_submit image, adding needed job parameters
  • condor_config.local file modifications of several types on each worker node

Modifications to condor.pm

The modifications to condor.pm are simply additions to the method that writes the Condor submit file. This file is then placed into the Condor queues using the "condor_submit" command. These modifications extract two pieces of information from the incoming globus submission:
  • The user name
  • The "queue", if specified, to which the job was submitted
    • analy for ATLAS Analysis jobs
    • splitter for ATLAS Muon Alignment tasks
Based upon this information, the Condor submit file is appropriately modified:

[gate01:JobManager]# diff condor.pm.orig condor.pm
355a356,406
> # Changes for usage at AGLT2
> #   This file is:
> # /opt/osg/globus/lib/perl/Globus/GRAM/JobManager/condor.pm
> #   /opt/osg is a soft link to whatever the currently installed
> #   osg directory may be.  At this writing, it is osg-1.2.6
> #
>     use English;
>
>     my $queue = $description->queue;
>     my $username = (getpwuid($UID))[0];
>     my $accgroup;
>     my $setThisQ;
>     my $queName;
>
>     if ($username eq 'usatlas1')      {
>       $accgroup = "group_gatekpr.$username";
>     } elsif ($username eq 'usatlas2') {
>       $accgroup = "group_gatekpr.$username";
>     } elsif ($username eq 'usatlas3') {
>       $accgroup = "group_gatekpr.$username";
>     } elsif ($username eq 'usatlas4') {
>       $accgroup = "group_gatekpr.$username";
>     } elsif ($username eq 'osg')      {
>       $accgroup = "group_gatekpr.$username";
>     } elsif ($username eq 'ivdgl')    {
>       $accgroup = "group_gatekpr.$username";
>     } else                            {
>       $accgroup = "group_VOgener.$username";
>     }
>
>     if ($queue eq 'analy')   {
>       $setThisQ = 'True';
>       $queName = 'Analysis';
>         print SCRIPT_FILE "priority = 5\n";
>     } elsif ($queue eq 'splitter') {
>       $setThisQ = 'False';
>       $queName = 'Splitter';
>       $accgroup = 'group_calibrate.muoncal';
>       print SCRIPT_FILE "priority = 10\n";
>     } else                   {
>       $setThisQ = 'False';
>       $queName = 'Default';
>     }
>
>     print SCRIPT_FILE "+AccountingGroup = \"$accgroup\"\n";
>     print SCRIPT_FILE "+IsAnalyJob = $setThisQ\n";
>     print SCRIPT_FILE "+localQue = \"$queName\"\n";
>     print SCRIPT_FILE "Rank = 32-SlotID\n";
>
> # End change
>
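
As a concrete illustration of the effect, an Analysis-queue job submitted by usatlas1 ends up with lines like the following added to its Condor submit file (values reconstructed from the code above):
priority = 5
+AccountingGroup = "group_gatekpr.usatlas1"
+IsAnalyJob = True
+localQue = "Analysis"
Rank = 32-SlotID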

An important item to note here is that Analysis jobs are submitted with a priority of 5. Production jobs arrive and are submitted with a priority of zero. Within the gatekpr Accounting Group, the higher priority Analysis tasks will pick up job slots ahead of the lower priority Production tasks. This higher priority is balanced by the limited number of available Analysis job slots, as detailed below.

Condor Accounting Groups

The Condor master machine is aglbatch.aglt2.org. The condor_config.local file on that machine is modified to specify the names of valid Accounting Groups, the quota of worker node job slots they have, and whether they are allowed to exceed this quota once it is reached. At this time, a subset of these configuration macros is shown below.

GROUP_NAMES = group_gatekpr, group_muon, group_hggs, group_hggsProd, group_han, \
    group_zh, group_ww, group_generic, group_sura, group_ligo, group_BSM, \
    group_VOgener, group_MSUsam, group_calibrate, group_splitter, \
    group_glow, group_CMS, group_eID, group_csc
#
GROUP_QUOTA_group_gatekpr = 2327
GROUP_QUOTA_group_muon = 75
GROUP_QUOTA_group_ww = 100
GROUP_QUOTA_group_calibrate = 150
#   More group quotas....
#
GROUP_AUTOREGROUP_group_gatekpr = TRUE
GROUP_AUTOREGROUP_group_muon = FALSE
GROUP_AUTOREGROUP_group_ww = TRUE
GROUP_AUTOREGROUP_group_calibrate = TRUE
#  More regroup settings....
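
To see how these groups, quotas, and priorities are actually being applied by the negotiator, condor_userprio can be run on aglbatch; this is a hedged example, and the exact columns displayed vary with the Condor version.
# Show all priority and accounting-group usage fields known to the negotiator
condor_userprio -all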

condor_submit changes

/opt/condor/bin/condor_submit is directly accessed on the gatekeeper machines, so it is replaced by a shell script, which in turn runs "real_condor_submit", the renamed "condor_submit" image. The shell script adds several local requirements we apply to all condor jobs at AGLT2. The relevant portion of this script is shown below. The full, current version of this file is kept in the AGLT2 svn repository.

       /opt/condor/bin/real_condor_submit $* \
          -append "+IsShortJob  = False" \
          -append "+IsMediumJob = False" \
          -append "+IsTestJob   = False" \
          -append "+IsLustreJob = False" \
          -append "+IsUnlimitedJob = False" \
          -append "+JobMemoryLimit = 4194000" \
          -append "Requirements = ( (TARGET.TotalDisk =?= UNDEFINED) || (TARGET.TotalDisk >= 10500000) )" \
          -append "Periodic_Remove = ( ( RemoteWallClockTime > (3*24*60*60 + 20*60) ) || (ImageSize > JobMemoryLimit) )"

Changes to condor_config.local on worker nodes

Worker node condor_config.local files are configured during the Rocks build in a variety of ways. The most basic configuration we have is shown below, where up to half of the job slots on a given worker node can run ATLAS Analysis jobs, and when they are not, they will run any other job. The other slots will run any job EXCEPT an Analysis job.

BaseTime = (72 * $(HOUR))
OffTime = (30 * $(MINUTE))
PREEMPT_VANILLA = ( $(ActivityTimer) > ($(BaseTime)-$(OffTime)) )
MaxJobRetirementTime = ( $(BaseTime) + $(OffTime) )
#
# Restrictions
# July 2009, allow up to half analysis jobs on this set of nodes
IsNotUserAnalyJob = ( TARGET.IsAnalyJob =!= True )
StartSLOT  = ((( RemoteWallClockTime < ( $(BaseTime) - $(OffTime)) ) =!= False) )
StartSLOTA = ( $(IsNotUserAnalyJob) && (( RemoteWallClockTime < ( $(BaseTime) - $(OffTime)) ) =!= False) )
START      = ((SlotID == 1) && ($(StartSLOTA)) && ($(START)))   || \
             ((SlotID == 2) && ($(StartSLOT))  && ($(START)))   || \
             ((SlotID == 3) && ($(StartSLOTA)) && ($(START)))   || \
             ((SlotID == 4) && ($(StartSLOT))  && ($(START)))   || \
             ((SlotID == 5) && ($(StartSLOTA)) && ($(START)))   || \
             ((SlotID == 6) && ($(StartSLOT))  && ($(START)))   || \
             ((SlotID == 7) && ($(StartSLOTA)) && ($(START)))   || \
             ((SlotID == 8) && ($(StartSLOT))  && ($(START)))

So, even-numbered slots run anything, while odd-numbered slots will only run jobs that are NOT Analysis jobs.
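
To check what START expression a given node's startd actually composed after the Rocks build, the configuration can be queried remotely from a machine in the pool; this is a hedged example, and c-6-31 is just a sample node.
condor_config_val -name c-6-31.aglt2.org -startd START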

Modification of condor_config.local on the Master Condor machine

aglbatch.aglt2.org is the master Condor machine, running the Collector and Negotiator daemons. A cron task running there dynamically modifies the Accounting Group quotas to fit the current situation. Primary considerations are:
  • non-ATLAS VO quotas should decrease as ATLAS usage increases
  • non-ATLAS VO quotas should decrease further as more Analysis jobs run
  • The quota of ATLAS jobs is decreased by the number of running jobs from gate02.grid.umich.edu
    • As the ATLAS quota is high, this maintains the correct number of T3 jobs
  • T3 job count may increase above the "floor" quota near 300 slots
  • T3 job count may not increase above some maximum value that protects against devastatingly high NFS access
The directory /root/tools/condor_quota_config on aglbatch contains the relevant files
  • condor_config.local
  • modify_quota.sh
  • autopilot_counter.pl
The cron-invoked shell script re-writes /opt/condor/etc/condor_config.local every 30 minutes, then prods the negotiator process to re-read the configuration, thus applying the new set of job quotas.
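
One (hedged) way to perform that prod is simply to run condor_reconfig on aglbatch itself, which tells the local daemons, including the negotiator, to re-read their configuration; the site script may do this differently.
# Illustrative end of the 30-minute cron cycle on aglbatch (the quota arithmetic is omitted):
#   1. compute the new GROUP_QUOTA_* values
#   2. re-write /opt/condor/etc/condor_config.local with those values
#   3. have the negotiator re-read its configuration:
condor_reconfig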

As new worker nodes are brought up or down, the base set of quotas must be manually modified to maintain the correct balance of running jobs across the cluster.

-- TomRockwell - 13 Aug 2008