Condor Setup at Michigan

Overview

Condor is a batch system from the University of Wisconsin for running jobs on CPU farms and/or ad hoc groups of desktop machines.

You control a Condor job by submitting a "job" file that specifies which program or script to run, where to write output, the job's CPU requirements, and other configuration details. Condor places the job in a queue and runs it according to your requirements.

Michigan has 2 Condor batch queues at present: umrocks and umopt1. The umrocks CPU farm has about 100 dual-core AMD Athlon CPUs and umopt1 has 8 dual-core AMD Opterons (and will be expanding in fall 2006). To use these queues, log onto the head node of the given system (umrocks or umopt1) and submit your Condor jobs there.

Some useful Condor commands

Here are the most common Condor commands. For a full list, see the Condor manuals.

$ condor_submit : submit job file to queue

$ condor_q: get list of jobs on the queue. Gives output:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
82333.0   zarzhit         4/2  19:32   0+03:44:50 I  20  512.0 runAOD4421aJob 000
 

ST = flag indicating if job is idle (I), running (R), or held (H)

$ condor_q -analyze : report on availability of queue to run a job. If your job never runs, this will tell you why.

$ condor_rm : Remove a job from condor queue

$ condor_status: Gives list/status of machines in condor queue

$ condor_status -submitters: Gives list of condor users

$ condor_hold : put a job in held (suspended) state.

$ condor_release : release a held job
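
When many jobs are queued, it can help to tally the ST column of the condor_q listing rather than read it by eye. The following is a minimal sketch, not part of the Condor tools: count_states is a hypothetical helper name, and it assumes the column layout of the sample listing above (ST as the sixth whitespace-separated field). Here it is fed the sample line; on a head node one would pipe condor_q into it instead.

```shell
#!/bin/sh
# count_states: hypothetical helper that tallies jobs by ST flag.
# Assumes the condor_q layout shown above, where ST is field 6
# (ID, OWNER, date, time, RUN_TIME, ST, ...).
count_states() {
    awk '
        # Only lines that start with a job ID such as 82333.0 are jobs.
        $1 ~ /^[0-9]+\.[0-9]+$/ { n[$6]++ }
        END { printf "idle=%d running=%d held=%d\n", n["I"], n["R"], n["H"] }
    '
}

# Demo on the sample listing from this page.
sample=' ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
82333.0   zarzhit         4/2  19:32   0+03:44:50 I  20  512.0 runAOD4421aJob 000'
printf '%s\n' "$sample" | count_states

# On a head node: condor_q | count_states
```

For the sample line this prints idle=1 running=0 held=0.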

Condor job file example:

#
#  Sample condor job file
#  
Universe         = vanilla
# These don't work on umrocks/umopt1
# Notification   = Complete
# Notify_user  = diehl@umich.edu

GetEnv         = True
Executable     = /bin/echo
Arguments      = Hello World
Initialdir     = /net/data08/diehl/condor
Output         = condor$(Process).out
Log            = condor$(Process).log
Error          = condor$(Process).err
Queue

Here is the meaning of the flags:

Universe
vanilla is for any executable program; standard is for a program built with condor_compile, which allows the job to be moved from one CPU to another while executing (a feature not needed on our clusters).

Notification
Complete means notify the user when the job is complete. Does not work at UM.

Notify_user
Email address of user to notify.

GetEnv
Pass the user's environment variables to the Condor process.

Executable
Command or script for Condor to run.

Arguments
Parameters to be passed to the command or script.

Initialdir
Directory where Condor runs the script and where the output files go.

Output/Log/Error
Names of the output, log, and error files, respectively. $(Process) inserts the Condor sub-job number into the file name, so that each job gets unique file names when multiple jobs are submitted.

Queue
This tells Condor to submit the job to the queue. If a number N is supplied (e.g. Queue 5), then N jobs are submitted, with sub-job numbers 0 through N-1.
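
To illustrate Queue with a number, here is a sketch that writes a job file requesting three sub-jobs. The file name multi.condor is an assumption chosen for illustration; the other settings are copied from the sample job file above. The here-document delimiter is quoted so that the shell leaves $(Process) for Condor to expand.

```shell
#!/bin/sh
# Write a job file that queues three sub-jobs (Process = 0, 1, 2).
# "multi.condor" is a placeholder file name.
cat > multi.condor <<'EOF'
Universe    = vanilla
GetEnv      = True
Executable  = /bin/echo
Arguments   = Hello from sub-job $(Process)
Initialdir  = /net/data08/diehl/condor
Output      = condor$(Process).out
Log         = condor$(Process).log
Error       = condor$(Process).err
Queue 3
EOF

echo "Wrote multi.condor; submit it with: condor_submit multi.condor"
```

Once submitted, each sub-job should write its own files: condor0.out, condor1.out, and condor2.out, with matching .log and .err files.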

Stupid condor problems

  1. Condor cannot write to AFS. Hence, do not put your Output, Log, or Error files on AFS; otherwise your job will become Held and just sit there indefinitely. The problem is that Condor does not get AFS tokens due to some bug or other. Write your output to NFS (e.g. /data08).
  2. At Michigan, Condor email notification does not work. Hence, you can omit the "Notification" and "Notify_user" job options. The problem is that the Condor head nodes are not authorized to send email at Michigan.
  3. Presently (Sept-20-2006), the Condor queue on umopt1 does not actually run jobs because of a configuration error.
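
Since a job whose files land on AFS ends up Held without much warning, a quick check of the job file before submitting can save time. The sketch below is not a Condor feature: check_afs is a hypothetical helper, and it assumes AFS is mounted under /afs, which is the usual convention but may not hold everywhere. It only inspects absolute paths given for Initialdir, Output, Log, and Error; relative paths are not resolved.

```shell
#!/bin/sh
# check_afs FILE: hypothetical pre-submit check, not a Condor command.
# Succeeds (exit 0) if no Initialdir/Output/Log/Error value starts with /afs/.
check_afs() {
    awk -F'=' '
        BEGIN { bad = 0 }
        tolower($1) ~ /^[ \t]*(initialdir|output|log|error)[ \t]*$/ {
            val = $2
            sub(/^[ \t]+/, "", val)          # trim leading blanks
            if (val ~ /^\/afs\//) { print "AFS path: " $0; bad = 1 }
        }
        END { exit bad }
    ' "$1"
}

# Demo with a throwaway job file; /afs/example is a made-up path.
tmp=$(mktemp)
printf 'Initialdir = /afs/example/condor\nOutput = condor.out\n' > "$tmp"
if check_afs "$tmp"; then
    echo "paths look fine"
else
    echo "move these to NFS (e.g. /data08) before submitting"
fi
rm -f "$tmp"
```

In real use one would call check_afs on the job file and run condor_submit only if it succeeds.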

-- EdwardDiehl - 19 Sep 2006
Topic revision: r6 - 20 Sep 2006, EdwardDiehl