Job Queuing at Michigan State

NOTE: THIS PAGE IS PRETTY MUCH OUT OF DATE

The queuing system at Michigan State has not yet been established.

Job Queuing at the University of Michigan

The University of Michigan hardware uses Condor (current version 7.0.5) as its batch queuing system. Because the Tier2 and Tier3 hardware share the same head node (umopt1), we implement group quotas to maintain the correct allocation and use of the resource. Roughly speaking:
  • 240 job slots have ONLY Tier2 access (including T2 Analysis jobs)
  • 56 job slots have ONLY Tier3 access
    • An additional 16 slots have 4 hr time limits (Medium queue)
    • An additional 6 slots have 1 hr time limits (Short queue)
    • An additional 3 slots have 30 minute time limits (Test queue)
  • 1072 job slots have shared Tier2/Tier3 access
  • 400 job slots allow access to any submitted job (including T2 Analysis jobs and other VOs)
  • 32 job slots are dedicated T2 Analysis-job slots
  • The pool as a whole is managed using the Condor "Group Quota" mechanism.

Condor jobs may only be submitted to our queues from the 3 interactive machines, umt3int01/02/03.

The usual "condor_submit" command for starting a job accesses the pool of slots with 3-day time limits. Upon special request only, longer jobs than this can be run.

To access the medium, short and test queues, special commands have been created so that job owners do not need to remember the required Condor job parameters. These commands are:
  • condor_submit_medium
  • condor_submit_short
  • condor_submit_test
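
As an illustration, and assuming a submit description file named myjob.sub (a placeholder name), the wrapper commands presumably take the same submit file argument as condor_submit itself:

    # default pool, 3-day time limit
    condor_submit myjob.sub

    # medium (4-hour), short (1-hour) and test (30-minute) queues
    condor_submit_medium myjob.sub
    condor_submit_short  myjob.sub
    condor_submit_test   myjob.sub

The wrapper commands simply supply the queue-specific Condor job parameters on your behalf.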

Condor jobs cannot write to a user's AFS space. The Condor log files, at a minimum, must instead be placed in an NFS directory. Send an email to aglt2-help@umich.edu if you require such a directory and do not already have one.
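
As a minimal sketch of what this looks like in a submit description file (the directory shown is only a placeholder for whatever NFS directory you have been assigned):

    # log, output and error must point at an NFS directory, not AFS
    log    = /nfs/mydir/condor/myjob.log
    output = /nfs/mydir/condor/myjob.out
    error  = /nfs/mydir/condor/myjob.err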

In general, when running a Condor job with input files, it is best to copy the files to a /tmp directory on the compute node where the job runs. At job completion, any directory created this way should be cleaned up/deleted; automatic cleanup is performed on any such directory that is at least 5 days old. For an example of how this can be done (a rough sketch also follows this list), look at
  • Examples directory /afs/atlas.umich.edu/opt/localTools/condor
    • Condor submit job file condorJob2
    • Executable script examples athenaCondor.csh and athenaCondor.sh
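
The following is only a rough sketch of the stage-to-/tmp pattern, not a copy of the athenaCondor scripts above; every path and program name in it is a placeholder:

    #!/bin/sh
    # Create a per-job scratch directory on the compute node
    WORKDIR=/tmp/${USER}_condor_$$
    mkdir -p "$WORKDIR"

    # Copy the input files from NFS (placeholder path) into the scratch area
    cp /nfs/mydir/input.data "$WORKDIR"/
    cd "$WORKDIR"

    # Run the actual payload (placeholder program name)
    myAnalysisProgram input.data > result.out

    # Copy results back to NFS, then delete the scratch directory
    cp result.out /nfs/mydir/results/
    cd /
    rm -rf "$WORKDIR"

Cleaning up at the end of the script keeps the compute nodes' /tmp areas from filling; the automatic 5-day purge is only a backstop.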

Group Quota Mechanism

If all the regular users queue more jobs than their quota, then each is guaranteed to get, once equilibrium is reached, at least as many processors as their quota specifies. Processors that remain available after all quotas are satisfied are split among the active users in the usual (rather obscure) Condor fashion; this tends to favor small-quota groups. Condor requires that the sum of all quotas be less than or equal to the total number of available processors, so the quotas below sum to fewer than that total, guaranteeing that all quotas can be met.

Note that the actual split is based upon Accounting Groups, which may include multiple users as part of a single group. No sub-group quotas are possible.
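
For illustration only: on Condor pools that use the group-quota mechanism, a job is typically associated with an accounting group through an attribute such as the one below in the submit description file. Whether AGLT2 users must set this themselves, or whether it is added for you by the submit machinery, is not covered here; the group and user names are placeholders.

    # Hypothetical example: tag the job with an accounting group
    +AccountingGroup = "group_higgs.myusername"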

In general, the policy we have adopted gives equal access to users who regularly access the system. We specify "regularly access" so that we can maximize the per-group quota by minimizing the divisor in "available CPUs / count of regular users". If you are not a "regular user" but are about to begin such use, you can notify me of the expectation and the quotas can quickly be modified.

Any user who is not a "regular user" falls into a "generic" category, for which a small quota is maintained to ensure access.

This "default" policy will be modified for short durations so that results to be presented at conferences, or other special needs, can be accommodated. Such needs should be brought up as much in advance as possible.

The following tables are out of date. New groups have been added, old groups deleted, and users have been moved around as needed.
Accounting Group      Default Quota   Current Quota
Tier2                 172             152
WW Group              20              20
zhengguo/zhangpei     0               20
Higgs Group           20              20
Muon Group            20              20
Generic Group         4               4

Group Membership:
  • WW Group -- aww, xuefeili, daits, [zhengguo, zhangpei]
  • Temporary zh Group -- zhengguo, zhangpei
  • Higgs Group -- qianj, armbrusa, dharp, jpurdham, liuhao, strandbe, rthun
  • Muon Group -- dslevin, diehl, desalvo
  • Generic Group -- anyone not listed above

-- BobBall - 26 Jun 2007
