
Setting up the timed queues

The users requested several timed queues, each of which would hold a job once it exceeded a certain amount of runtime. Each queue also had a set number of jobs it could run at one time.

There were three queues requested:

Short queue: Accepts jobs that run less than 3 hours. Will run as many jobs as it can.
Medium queue: Accepts jobs that run less than 2 days. Will run jobs on all but 50 of the job slots.
Long queue: Accepts jobs that run less than 7 days. Will only run up to 20 jobs.

The agreed-upon way to denote which queue a job should run in was:

Short queue: Add nothing to the condor submit script; this is the default queue.
Medium queue: Add the line "+IsMediumJob = True" to the condor submit script.
Long queue: Add the line "+IsLongJob = True" to the condor submit script.
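
As a rough illustration of this convention, the sketch below renders the text of a submit script for each queue. The helper name render_submit and the executable name my_job.sh are hypothetical; only the "+IsMediumJob = True" and "+IsLongJob = True" lines come from the agreement above.

```python
# Hypothetical helper: builds the text of an HTCondor submit script for one
# of the three queues. Only the "+Is...Job = True" lines come from this page;
# the executable name and the overall layout are assumptions.
QUEUE_LINES = {
    "short": None,                     # default queue: nothing to add
    "medium": "+IsMediumJob = True",
    "long": "+IsLongJob = True",
}


def render_submit(queue, executable="my_job.sh"):
    """Return submit-script text that places the job in `queue`."""
    if queue not in QUEUE_LINES:
        raise ValueError(f"unknown queue: {queue!r}")
    lines = [f"executable = {executable}"]
    if QUEUE_LINES[queue] is not None:
        lines.append(QUEUE_LINES[queue])
    lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_submit("medium"))
```

A short job gets no extra line at all, which is what makes the short queue the default.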

The built-in HTCondor macros/variables that are important for this project are:

JobStatus: Denotes the status of the job. A value of 2 means it is running and 5 means it is held.
SYSTEM_PERIODIC_HOLD: A macro that runs for each job on the login nodes every 15 minutes (this time can be changed). If it evaluates to true at runtime then the job is held.
SYSTEM_PERIODIC_HOLD_REASON: A macro that runs for each job on the login nodes every 15 minutes. It sets the hold reason based on a nested if-else statement.
SYSTEM_PERIODIC_REMOVE: A macro that runs for each job on the login nodes every 15 minutes. If it evaluates to true at runtime then the job is removed from the queue entirely.
RemoteUserCpu: A variable calculated on the worker nodes that is updated on the login nodes every 15 minutes. Contains how many CPU seconds a job has consumed. This was used instead of run time so that jobs are not penalized for landing on nodes that are slow because they are bogged down.
TARGET: A scope prefix that lets an expression evaluated on the worker nodes reference attributes of the job being matched. TARGET.X refers to attribute X in the ClassAd of the job that was submitted on the login node.
START: A macro that runs on the worker nodes to determine whether a job should be started. If it evaluates to true then the job will start, if not then the job will sit until it finds a worker node/job slot for which START evaluates to true.
SlotID: Each worker node has a number of slots that can run jobs (8 for PE 1950s and 12 for R610s); each of these has an ID that starts at 1 and goes up to (and including) the number of slots on the node. These slots are filled up from lowest to highest ID number unless modified.

The custom HTCondor macros/variables made for this project are:

IsUserMediumJob: A place-holder for "IsMediumJob =?= True". It's there simply so we don't have to type "IsMediumJob =?= True" many times.
IsUserLongJob: Similar to IsUserMediumJob, but for long jobs.
IsUserShortJob: A place-holder for "!$(IsUserMediumJob) && !$(IsUserLongJob)". It's there for readability and to shorten the amount of written code.

Plan for implementation

Due to the nature of HTCondor, the timing aspect of these queues must be done on the login/submit nodes, while the limits on the number of jobs running in each queue must be done on the worker nodes.

The HTCondor login/submit nodes have two built-in macros that periodically check admin-defined conditions of a job to see if it should be held or removed: SYSTEM_PERIODIC_HOLD and SYSTEM_PERIODIC_REMOVE, respectively. We will use SYSTEM_PERIODIC_HOLD to hold a job after a certain amount of time has passed (dependent on which queue it is in) and SYSTEM_PERIODIC_REMOVE to remove a job once it has been held for 24 hours (which is ample time for a user to fix their job and release it).

The HTCondor worker nodes have a built-in macro called START, which determines whether a job can be started on a specific job slot of that worker node. One of the things the condor negotiator checks when assigning jobs to worker nodes is whether START evaluates to true for that job. We can modify START to be whatever we want and will use it to set limits on the number of jobs that can run in each queue. We do this by saying which job slots can run short, medium, and long jobs. We set it so that every job slot on a worker node can run a short job and only one job slot on each worker node can run long jobs. In the configuration below, all but one job slot on each PE 1950 worker node can run medium jobs, while every job slot on an R610 worker node can run medium jobs.

Implementation

Check out the cfengine SVN repository

Start by checking out the cfengine SVN repository as described here: https://www.aglt2.org/wiki/bin/view/AGLT2/CfenginePolicyWorkflow

Implementing the time limits

Add the following lines to cfengine/masterfiles/stash/condor_msut3/58-host-submit.conf:

# Create some placeholder macros to shorten up later code.
# Comments are kept on their own lines so they do not become part of a macro's value.
# IsUserMediumJob: checks a job's ClassAd to see if IsMediumJob is set to true
# IsUserLongJob:   checks a job's ClassAd to see if IsLongJob is set to true
# IsUserShortJob:  a job that is not a medium or long job is a short job (i.e. short job is the default)
IsUserMediumJob = (IsMediumJob =?= true)
IsUserLongJob = (IsLongJob =?= true)
IsUserShortJob = ( !$(IsUserMediumJob) && !$(IsUserLongJob) )

# Hold a job if it is in the running state and it is
#   a short job that has used more than 3 hours of CPU time, or
#   a medium job that has used more than 2 days of CPU time, or
#   a long job that has used more than 7 days of CPU time.
SYSTEM_PERIODIC_HOLD = ( JobStatus == 2 ) && ( \
                       ( $(IsUserShortJob) && RemoteUserCpu > 3*60*60 ) || \
                       ( $(IsUserMediumJob) && RemoteUserCpu > 2*24*60*60 ) || \
                       ( $(IsUserLongJob) && RemoteUserCpu > 7*24*60*60 ) )

# Set the hold reason (so the user knows what went wrong)
SYSTEM_PERIODIC_HOLD_REASON = ifThenElse($(IsUserShortJob) && RemoteUserCpu > 3*60*60, "Job exceeded short queue time of 3 cpu hours.", \
                              ifThenElse($(IsUserMediumJob) && RemoteUserCpu > 2*24*60*60, "Job exceeded medium queue time of 48 cpu hours.", \
                              ifThenElse($(IsUserLongJob) && RemoteUserCpu > 7*24*60*60, "Job exceeded long queue time of 7 cpu days.", "Unknown periodic hold reason") ) )

# Remove a held job if it's been held for more than 24 hours
SYSTEM_PERIODIC_REMOVE = ( JobStatus == 5) && (CurrentTime - EnteredCurrentStatus > 24*60*60)
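
To make the hold and remove conditions easier to reason about, here is a minimal Python model of the expressions above. It is only a sketch for checking the logic by hand, not how HTCondor evaluates ClassAds; the function names and the use of a plain dict as a stand-in for a job ClassAd are assumptions made for illustration.

```python
# Simplified model of the hold/remove policy above. A job ClassAd is
# represented by a plain dict (an assumption made for illustration only).
RUNNING, HELD = 2, 5                 # JobStatus codes used on this page

LIMITS = {                           # CPU-time limits in seconds
    "short": 3 * 60 * 60,            # 10800
    "medium": 2 * 24 * 60 * 60,      # 172800
    "long": 7 * 24 * 60 * 60,        # 604800
}
HELD_GRACE = 24 * 60 * 60            # held jobs are removed after 24 hours


def queue_of(job):
    """Mirror IsUserMediumJob / IsUserLongJob / IsUserShortJob.

    "=?= True" is true only when the attribute exists and is exactly True,
    so a missing attribute counts as "not set". If both flags are set, the
    medium limit is reached first, so the job is treated as medium here.
    """
    if job.get("IsMediumJob") is True:
        return "medium"
    if job.get("IsLongJob") is True:
        return "long"
    return "short"


def should_hold(job):
    """Mirror SYSTEM_PERIODIC_HOLD: running and over its queue's CPU limit."""
    return (job["JobStatus"] == RUNNING
            and job["RemoteUserCpu"] > LIMITS[queue_of(job)])


def should_remove(job, now):
    """Mirror SYSTEM_PERIODIC_REMOVE: held for more than 24 hours."""
    return (job["JobStatus"] == HELD
            and now - job["EnteredCurrentStatus"] > HELD_GRACE)
```

For example, a running job with no queue attribute and 10801 CPU seconds is over the short limit, while the same job submitted with IsMediumJob set to True is not.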

Implementing the limits on # of jobs

Add the following lines to cfengine/masterfiles/stash/condor_msut3/55-host-pe1950.conf:

# Check what type of job is asking to start, need to add TARGET. to the front because this job is coming from the submit node
IsUserMediumJob = (TARGET.IsMediumJob =?= True)
IsUserLongJob   = (TARGET.IsLongJob   =?= True)
IsUserShortJob  = ( !$(IsUserMediumJob) && !$(IsUserLongJob) )

# Start the job if the start condition from higher level config files is met and
#   the job is a short job, or
#   the job is a medium job and this is not job slot number 8, or
#   the job is a long job and this is job slot number 8.
START = $(START) && ( \
  $(IsUserShortJob) || \
  ( (SlotID != 8) && $(IsUserMediumJob) ) || \
  ( (SlotID == 8) && $(IsUserLongJob) ) )

Add the following lines to cfengine/masterfiles/stash/condor_msut3/55-host-r610.conf:

# Check what type of job is asking to start, need to add TARGET. to the front because this job is coming from the submit node
IsUserMediumJob = (TARGET.IsMediumJob =?= True)
IsUserLongJob   = (TARGET.IsLongJob   =?= True)
IsUserShortJob  = ( !$(IsUserMediumJob) && !$(IsUserLongJob) )

# Start the job if the start conditions from higher level config files are met and
#   the job is a short or medium job, or
#   the job is a long job and this is job slot number 12.
START = $(START) && $(START_14_SLOT_MSU_RESERVE_SHORT) && ( \
  $(IsUserShortJob) || $(IsUserMediumJob) || \
  ( (SlotID == 12) && $(IsUserLongJob) ) )
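
The effect of the two START expressions can be checked with a small Python model that lists which slots on each node type would accept a job from each queue. This is a sketch under stated assumptions: it models only the custom part of START (it ignores the inherited $(START) and $(START_14_SLOT_MSU_RESERVE_SHORT) conditions), it assumes slot IDs run from 1 up to the slot count because the expressions compare against 8 and 12, and the function names are hypothetical.

```python
# Simplified model of the custom START conditions for the two node types.
# Assumption: slot IDs run from 1 to the number of slots on the node.
SLOTS_PER_NODE = {"pe1950": 8, "r610": 12}


def slot_accepts(node_type, slot_id, queue):
    """Return True if the custom part of START would accept the job."""
    last_slot = SLOTS_PER_NODE[node_type]
    if queue == "short":
        return True                      # every slot runs short jobs
    if queue == "medium":
        # PE 1950: every slot except the last one; R610: every slot.
        return node_type == "r610" or slot_id != last_slot
    if queue == "long":
        return slot_id == last_slot      # only the last slot runs long jobs
    raise ValueError(f"unknown queue: {queue!r}")


def eligible_slots(node_type, queue):
    """List the slot IDs on one node that accept jobs from `queue`."""
    count = SLOTS_PER_NODE[node_type]
    return [s for s in range(1, count + 1)
            if slot_accepts(node_type, s, queue)]


if __name__ == "__main__":
    for node in SLOTS_PER_NODE:
        for queue in ("short", "medium", "long"):
            print(node, queue, eligible_slots(node, queue))
```

Multiplying the per-node counts by the number of nodes of each type gives the cluster-wide cap for each queue.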

Create a test policy and test the queues

Create a test policy as described here: https://www.aglt2.org/wiki/bin/view/AGLT2/CfenginePolicyWorkflow

Pick one login node, one pe1950 worker node, and one r610 worker node. Switch their policy to the test policy as described here: https://www.aglt2.org/wiki/bin/view/AGLT2/CfenginePolicyWorkflow

Run a bunch of test jobs to see if each part of these new queues is working. Check that the time limits are working, that the number of jobs that can run at once matches what you would expect for each queue, and that the held jobs are removed after 24 hours.
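
Because the hold limits are based on RemoteUserCpu rather than wall-clock time, a test job that merely sleeps would presumably never reach them. One possible test payload is a script that keeps the CPU busy for a requested number of CPU seconds. The sketch below is only a suggestion; the script name cpu_burn.py and the function burn_cpu are hypothetical.

```python
# cpu_burn.py (hypothetical name): burn a requested amount of CPU time so
# that a test job accumulates CPU seconds instead of idling.
import sys
import time


def burn_cpu(cpu_seconds):
    """Busy-loop until this process has used `cpu_seconds` of CPU time.

    Returns the CPU time actually consumed, measured with
    time.process_time(), which counts CPU time rather than wall-clock time.
    """
    start = time.process_time()
    counter = 0
    while time.process_time() - start < cpu_seconds:
        counter += 1                     # trivial work to keep the CPU busy
    return time.process_time() - start


if __name__ == "__main__" and len(sys.argv) == 2:
    # Example: python cpu_burn.py 11000  (just over the 3 hour short limit)
    burn_cpu(float(sys.argv[1]))
```

Submitting this once per queue with a target slightly above each limit should show whether the job is held at the expected point.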

Update the T2 policy

Once you are done testing, switch the policy on the nodes you tested back to the T2 policy, then update the T2 policy as described here: https://www.aglt2.org/wiki/bin/view/AGLT2/CfenginePolicyWorkflow

-- ForrestPhillips - 04 Sep 2019

Topic revision: r3 - 10 Sep 2019, ForrestPhillips
