This document helps UM Tier3 users diagnose problems with their Condor jobs.

Submit Machines

Tier 3 users can submit their Condor jobs from the following machines:

umt3int01.aglt2.org (SL6; jobs run on the SL6 queue, which has 80 cores available)

umt3int02.aglt2.org / umt3int03.aglt2.org / umt3int04.aglt2.org / umt3int05.aglt2.org (SL7; jobs run on the SL7 queue, which has 1000 cores shared with the Tier2, and is allowed to overflow, i.e. use more cores, if the Tier2 is not busy)

To run your jobs on either the SL6 or SL7 nodes, it is safest to compile your code on a machine running the corresponding OS first.
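
You can check which OS release a machine is running with, for example:

-bash-4.2$ cat /etc/redhat-release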

Queue Options

Users can submit their jobs to four different kinds of queues by using different submission commands (a minimal submit-file sketch follows the list):

condor_submit (submits to the regular queues; jobs are allowed to run for up to 3 days)

condor_submit_short (submits to the short queue, which has 12 reserved cores; jobs are allowed to run for up to 2 hours)

condor_submit_medium (jobs are allowed to run for up to 4 hours)

condor_submit_unlimited (submits to a queue with 16 reserved cores; jobs are allowed to run for an unlimited time)
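
For illustration, here is a minimal sketch of a submit file and its submission to the short queue. The file name myjob.sub, the executable run.sh, and all values are placeholders, but the submit-file commands themselves are standard HTCondor:

# myjob.sub: a minimal example (names and values are placeholders)
universe   = vanilla
executable = run.sh
arguments  = $(Process)
output     = job.$(Process).out
error      = job.$(Process).err
log        = job.log
queue 10

-bash-4.2$ condor_submit_short myjob.sub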

Resource Options

Users can request different resources by using different submission commands (see the sketch after this list):

condor_submit_mcore (jobs are allowed to use 8 CPU cores per job)

condor_submit_lmem (jobs are allowed to use up to 6 GB of memory per job)
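
These wrappers presumably set the standard HTCondor resource-request commands on your behalf; the lines below are only an assumption about their effect, based on the limits quoted above, so verify them against the wrapper scripts themselves:

# Assumed effect of condor_submit_mcore (check /usr/local/bin/condor_submit_mcore):
request_cpus   = 8
# Assumed effect of condor_submit_lmem (check /usr/local/bin/condor_submit_lmem):
request_memory = 6144   # 6 GB, expressed in MB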

For more details about the submission options, check the contents of the files /usr/local/bin/condor_submit* on any of the interactive machines, e.g. with less /usr/local/bin/condor_submit_short.

Things to avoid during submission

Before submitting a job, make sure of the following (a workflow sketch follows the list):

1) User jobs must be submitted from the user's NFS directory, such as /atlas/data19/username, not from the AFS home directory (/afs/atlas.umich.edu/home/username) or from a Lustre directory (/lustre/umt3/). Otherwise, the job submission will fail.

2) The job scripts must not refer to (read or write) any files stored in AFS directories. When a job runs on a worker node, it does not carry the user's AFS token, so the worker node is not allowed to read or write the user's AFS directories; jobs that try will end up in the held status.
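
A minimal sketch of a safe submission workflow, where the username and the submit file myjob.sub are placeholders from the examples above:

-bash-4.2$ cd /atlas/data19/username    # submit from NFS, not from AFS or Lustre
-bash-4.2$ condor_submit myjob.sub      # no paths under /afs in the executable, input or output

If jobs do end up held, condor_q -hold lists them together with the reason each one was held:

-bash-4.2$ condor_q -hold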

What if jobs stay idle for hours after submission?

When this happens, it is very likely that your jobs have no matching resources because of the resource requirements specified in your submit script.

To debug this:

1) Get the job IDs of your idle jobs (replace xuwenhao with your own username):
-bash-4.2$ condor_q -constraint ' Owner=="xuwenhao" && JobStatus==1 '
277490.430 xuwenhao        5/1  10:43   0+00:00:00 I  10  97.7 run 430
277490.431 xuwenhao        5/1  10:43   0+00:00:00 I  10  97.7 run 431

2) Analyze the job:
-bash-4.2$ condor_q -analyze 277490.430

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

    ( ( TARGET.TotalDisk >= 21000000 && TARGET.IsSL7WN is true &&
        TARGET.AGLT2_SITE == "UM" && ( ( OpSysAndVer is "CentOS7" ) ) ) ) &&
    ( TARGET.Arch == "X86_64" ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.Memory >= RequestMemory ) && ( ( TARGET.HasFileTransfer ) ||
      ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )


Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ( OpSysAndVer is "CentOS7" ) )  0                   MODIFY TO "SL7"
2   TARGET.AGLT2_SITE == "UM"         1605                 
3   ( TARGET.Memory >= 4096 )         2020                 
4   TARGET.IsSL7WN is true            2684                 
5   TARGET.TotalDisk >= 21000000      2706                 
6   ( TARGET.Arch == "X86_64" )       2710                 
7   ( TARGET.Disk >= 3 )              2710                 
8   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "local" ) )
                                      2710                 

This job has no matching resources because it requires the target (worker node) OS to be "CentOS7", while our cluster's worker nodes run either SL6 or SL7. Change this requirement in your job script, as sketched below.
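
Assuming the offending constraint comes from an explicit requirements line in your submit file (the exact line in your script may differ), the fix suggested by condor_q -analyze would look like this:

# Before: no worker node matches
requirements = ( OpSysAndVer == "CentOS7" )
# After: matches the SL7 worker nodes
requirements = ( OpSysAndVer == "SL7" )

Then remove the idle jobs with condor_rm and resubmit them after editing the submit file.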

There are some other condor_q options for checking the matching status of your jobs:
-bash-4.2$ condor_q -better-analyze:summary 

-- Schedd: umt3int02.aglt2.org : <10.10.1.51:9618?...
Analyzing matches for 2659 slots
               Autocluster    Matches     Machine     Running     Serving            
 JobId        Members/Idle Requirements Rejects Job  Users Job  Other User Available Owner
------------- ------------ ------------ ----------- ----------- ---------- --------- -----
272670.343    116/0                1153         141     116/116        846        50 aaronsw
277490.0      432/432                 0           0           0          0         0 xuwenhao


-bash-4.2$ condor_q -better-analyze:summary -constraint 'Owner=="xuwenhao"'

-- Schedd: umt3int02.aglt2.org : <10.10.1.51:9618?...
Analyzing matches for 2656 slots
               Autocluster    Matches     Machine     Running     Serving            
 JobId        Members/Idle Requirements Rejects Job  Users Job  Other User Available Owner
------------- ------------ ------------ ----------- ----------- ---------- --------- -----
277490.0      432/432                 0           0           0          0         0 xuwenhao
-bash-4.2$ 

More details can be viewed with the command "condor_q -help".

-- WenjingWu - 01 May 2019
