XRD-4 INTERNAL 12 2 7.2 24 20

XRD-4 TRAY A 12 3 7.2 36 30

XRD-4 TRAY B 12 3 7.2 36 30

XRD3 INTERNAL 12 2 7.2 24 20

XRD2 INTERNAL 12 2 7.2 24 20

XRD1 INTERNAL 12 2 7.2 24 20

MSU4.MSULOCAL 15 0.75 7.2 11.25 9

GYTHEIO 8 4 7.2 32 25.6

AGOGE 6 4 7.2 24 19.2

2 2 7.2 4 2

CYNISCA 6 4 7.2 24 19.2

2 2 7.2 4 2

MSUT3-DS FAST A 12 0.6 15k 7.2 6

MSUT3-DS FAST B 12 0.6 15k 7.2 6

MSUT3-DS FAST C 12 0.6 15k 7.2 6

MSUT3-DS FAST D 12 0.6 15k 7.2 6

288.85 241


Home Area

Users' home areas are on a relatively small disk, and space usage is controlled by quotas. Quotas are set with a soft limit of 10 GB and a hard limit of 20 GB.

Note that this area is not backed up (no T3 user area is currently backed up). Keep copies of important files (documents, source code, etc.) on other systems.

NFS "work" Areas

These Network File System areas are available from all nodes:

path size comment
/msu/data/t3work1 12 TB  
/msu/data/t3work2 12 TB  
/msu/data/t3work3 19 TB  
/msu/data/t3work4 19 TB  
/msu/data/t3work5 19 TB  
/msu/data/t3work6 19 TB  
/msu/data/t3work7 28 TB  
/msu/data/t3work8 28 TB  
/msu/data/t3fast1 5.5 TB  
/msu/data/t3fast2 5.5 TB  
/msu/data/t3fast3 5.5 TB  
/msu/data/t3fast4 5.5 TB  



Condor Usage

The cluster can execute about 500 batch jobs at once. Jobs that process data files stored on the NFS storage areas can easily overwhelm the ability of the storage systems to service data access requests. This makes the jobs run inefficiently, lengthening the time they take to complete, and can also affect other users' use of the cluster. The number of jobs that a disk system can service efficiently varies greatly depending on what the jobs do.

Using the condor batch system, we can limit the total number of running jobs that need a given storage area. To do this, add the option "concurrency_limits" to your job description file (submit file). Each of the NFS storage areas has a total limit of 10000. In your job, specify a limit of M = 10000/N, where N is the total number of jobs that can run efficiently at once. A good starting point is N = 50, which gives M = 200. If the storage requirements of your jobs are lower, you can reduce M (condor will then allow more of the jobs to run at once).

For example, if the job reads from the /msu/data/t3work1 area, add this to the submit script:

concurrency_limits = DISK_T3WORK1:200

In the above example, T3WORK1 can be replaced by T3WORK2, T3FAST1, T3FAST2, T3FAST3 or T3FAST4. To allow more jobs to run at once, reduce 200 to a smaller integer. Note that this setup is voluntary: if you don't use this job option, condor can't help control the disk utilization.
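The arithmetic above can be sketched as a small Python helper. This is only an illustration, not part of the cluster tooling: the function names are hypothetical, the result is rounded up so that no more than N jobs run at once, and the limit name is built as "DISK_" plus the upper-cased last path component, a pattern confirmed above only for the areas listed in the preceding paragraph.

```python
import posixpath

TOTAL_LIMIT = 10000  # total limit per NFS storage area, as stated above


def per_job_limit(n_jobs, total=TOTAL_LIMIT):
    """Return M = total / N, rounded up so at most n_jobs run at once."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    return (total + n_jobs - 1) // n_jobs  # integer ceiling division


def concurrency_line(area_path, n_jobs):
    """Build the submit-file line for a storage area path.

    Assumption: the limit name is DISK_<AREA>, e.g. DISK_T3WORK1
    for /msu/data/t3work1.
    """
    area = posixpath.basename(area_path.rstrip("/")).upper()
    return f"concurrency_limits = DISK_{area}:{per_job_limit(n_jobs)}"


print(concurrency_line("/msu/data/t3work1", 50))
# prints: concurrency_limits = DISK_T3WORK1:200
```

Rounding up is the conservative choice: for N = 30 it gives M = 334, which lets 29 jobs run, whereas rounding down would still allow 30.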

Check Jobs Waiting for Resource

To see whether a job is waiting to run because of a concurrency limit, run the command condor_q -better-analyze [jobid].

Debugging I/O

Run `top` to see your I/O wait (the "wa" value in the CPU summary line). If it is more than a couple of percent, a disk somewhere is being overutilized. Check http://msurxx.msulocal/ganglia/?c=MSU%20Server&m=load_one&r=week&s=descending&hc=4&mc=2 and see whether any of the boxes are red. Click on a red box and look at the third plot on the right, which shows CPU utilization. Orange means the CPU is waiting on disk I/O. If a visible amount of orange is present, the disks on that server are being read and/or written too heavily. Identify which disk areas are hosted on that server, using the table of paths and servers below. Check which condor jobs are dominating the queue and ask the user(s) whether they are using the affected disk. If they are, they should resubmit with appropriate concurrency limits.
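As an alternative to reading the value off `top`, the I/O wait percentage can be computed from two samples of the aggregate "cpu" line in /proc/stat on a Linux node. The sketch below is illustrative only: the function names are hypothetical, and it relies on the standard /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal, ...), in which iowait is the fifth value.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(value) for value in parts[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Guest time is already counted in user/nice, so sum only the
    # first eight fields (user .. steal).
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Sample the local node twice, `interval` seconds apart (Linux only)."""
    with open(path) as stat_file:
        before = parse_cpu_line(stat_file.read())
    time.sleep(interval)
    with open(path) as stat_file:
        after = parse_cpu_line(stat_file.read())
    return iowait_percent(before, after)


# Worked example with made-up counter values:
first = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
second = parse_cpu_line("cpu  200 0 100 1500 200 0 0 0 0 0")
print(f"{iowait_percent(first, second):.1f}%")
# prints: 15.0%
```

On a live node, `sample_iowait()` returns the same quantity averaged over the sampling interval; a result above a couple of percent corresponds to the warning sign described above.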

path server
/msu/data/t3work1 msu3
/msu/data/t3work2 msu3
/msu/data/t3work3 msut3-xrd-1
/msu/data/t3work4 msut3-xrd-2
/msu/data/t3work5 msut3-xrd-3
/msu/data/t3work6 msut3-xrd-4
/msu/data/t3work7 msut3-xrd-3
/msu/data/t3work8 msut3-xrd-4
/msu/data/t3fast1 msut3-d2
/msu/data/t3fast2 msut3-d2
/msu/data/t3fast3 msut3-d2
/msu/data/t3fast4 msut3-d2
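When tracing an overloaded server back to storage areas (or a file path to its server), the table above can be encoded as a simple lookup. This is a sketch for illustration; the mapping is copied from the table, while the function names are hypothetical.

```python
# Storage area -> hosting server, copied from the table above.
AREA_SERVER = {
    "/msu/data/t3work1": "msu3",
    "/msu/data/t3work2": "msu3",
    "/msu/data/t3work3": "msut3-xrd-1",
    "/msu/data/t3work4": "msut3-xrd-2",
    "/msu/data/t3work5": "msut3-xrd-3",
    "/msu/data/t3work6": "msut3-xrd-4",
    "/msu/data/t3work7": "msut3-xrd-3",
    "/msu/data/t3work8": "msut3-xrd-4",
    "/msu/data/t3fast1": "msut3-d2",
    "/msu/data/t3fast2": "msut3-d2",
    "/msu/data/t3fast3": "msut3-d2",
    "/msu/data/t3fast4": "msut3-d2",
}


def areas_on(server):
    """Return the storage areas hosted on the given server, sorted."""
    return sorted(area for area, host in AREA_SERVER.items() if host == server)


def server_for(path):
    """Return the server hosting a file path, or None if it is not listed."""
    for area, host in AREA_SERVER.items():
        if path == area or path.startswith(area + "/"):
            return host
    return None


print(areas_on("msut3-xrd-3"))
# prints: ['/msu/data/t3work5', '/msu/data/t3work7']
print(server_for("/msu/data/t3work1/someuser/file.root"))
# prints: msu3
```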

On previously queued jobs

Add requirement to queued jobs?

-- TomRockwell - 31 May 2011
Topic revision: r9 - 09 May 2016 - 16:29:11 - RichardDrake

