
MultiCore Condor Setup

Introduction

AGLT2 implements a mix of static and dynamic job slots for MultiCore jobs. At the time of this writing, we use 10 static slots and sufficient cores for up to 230 dynamic (8-core) slots.

The static slots are documented elsewhere.

The dynamic slot configuration builds on the work of Will Kellogg-Strecker.

Implementation on Condor

As adapted at AGLT2, our configuration consists of the following lines, added to the Condor configuration files of the Worker Nodes under /etc/condor/config.d. This slot setup also addresses an issue where Analysis jobs could flood the dynamically configured machines, saturating them with IO. Although that problem is not yet fully solved (do we need some IO-related parameter in the job submissions?), it allows us to restrict the number of Analysis jobs that can run on these machines by means of a new HTCondor resource that we have named "AnalyTask". Analysis jobs come in with "RequestAnalyTask=1", while all other jobs have "RequestAnalyTask=0". See our Wiki page on HTCondor-CE configuration for details of that implementation.

# Dynamic slot based upon cpus.
# Also has consumable Analysis job slots, up to $(MaxAnaly) per machine
# The number of allowed Analysis jobs is computed, with the allowed fraction based upon node IO capability
MaxAnaly = int($(DETECTED_CORES) * 9 / 10)
MACHINE_RESOURCE_AnalyTask = $(MaxAnaly)
JOB_DEFAULT_REQUESTAnalyTask = 0
SLOT_TYPE_3 = cpus=$(DETECTED_CORES), AnalyTask=100%
SLOT_TYPE_3_PARTITIONABLE = True
SLOT_TYPE_3_CONSUMPTION_POLICY = False
SLOT_TYPE_3_CONSUMPTION_AnalyTask = ifThenElse(target.RequestAnalyTask =!= undefined, target.RequestAnalyTask, 0)

#### Prior to introduction of the AnalyTask Resource, this was the setup
#### Dynamic slot based upon cpus.
### SLOT_TYPE_3 = 100%
### SLOT_TYPE_3_PARTITIONABLE = True

# Rank non-partitioned machines higher, so that single-core jobs preferentially match non-mp8 machines
NEGOTIATOR_POST_JOB_RANK = ifThenElse(PARTITIONED =?= True, 40, 100)
NEGOTIATOR_PRE_JOB_RANK = ifThenElse(PARTITIONED =?= True, 40, 100)

# Capacities
#
NUM_SLOTS = 1
NUM_SLOTS_TYPE_3 = 1
SlotWeight = Cpus

# Likely also needed on the schedd machine; also placed on the Master
CONSUMPTION_POLICY = False
CLAIM_PARTITIONABLE_LEFTOVERS = False

# A local, useful Definition
PARTITIONED = True
#
MEMORY = $(NUM_SLOTS_TYPE_3) * $(DETECTED_CORES) * $(ST_1_MEM)
# (note, we have ST_1_MEM = 4096)

# Basic START rule
#
START = $(IS_T2_USER_GRP)

# Attributes
#
STARTD_ATTRS = $(STARTD_ATTRS), CPU_TYPE, PARTITIONED
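
As a quick sanity check once a worker node has picked up this configuration, the consumable resource and the PARTITIONED attribute can be inspected from the command line; the slot and host names below are only examples:

# On the worker node itself: confirm the configured values
condor_config_val MACHINE_RESOURCE_AnalyTask SLOT_TYPE_3
# From anywhere in the pool (hypothetical slot/host name):
condor_status -long slot3@bl-10-1.aglt2.org | grep -Ei 'AnalyTask|PARTITIONED'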

Job matching and monitoring are also assisted by parameters added to the submit file: pre-HTCondor-CE, by the condor.pm code on the production gatekeeper; post-HTCondor-CE, by the JobRouter. These consist of the following lines.

+localQue = "MP8"
priority = 25
request_cpus=8
+Slot_Type = "mp8"

# Add this to the Requirements expression for these pilots
#  (( TARGET.Cpus == 8 && TARGET.CPU_TYPE =?= "mp8" ) || TARGET.PARTITIONED =?= True )
# Note that this Requirements statement can be further modified by the "condor_submit" script or the JobRouter
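
To see which machines such a pilot can currently match, the same expression can be handed to condor_status as a constraint. A minimal sketch (the output columns are just a suggestion):

condor_status -constraint '(Cpus == 8 && CPU_TYPE =?= "mp8") || PARTITIONED =?= True' \
   -autoformat Name Cpus PARTITIONED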

Job Matching Problems

Initially, these cores could be used by ANY T2 job, from any VO, including jobs from the "atlasconnect" account. Unfortunately, this was not a well-controlled situation. The atlasconnect jobs are bursty in nature and quickly consume the dynamic cores, pushing down the running MCore job count. Analysis jobs behave much the same way, and Production jobs can run for very long times. Keeping the MCore job count within reasonable limits therefore meant periodically draining all these competing jobs, which in turn meant we were not running efficiently or at a high occupancy level across our total set of cores. All of this was managed using the Condor Group Quota mechanism.

After that initial experience, we tried various mechanisms within the Group Quotas to control the situation. We banned non-ATLAS VOs and atlasconnect from running at all on the dynamically configured cores, leaving only Production and Analysis jobs to compete. The result, though, was still unsatisfactory.

A Solution Using a Modified "condor_submit" Script

Finally, we switched to direct control via the Condor Requirements expression, informed by the current running-job situation, to define which pilots could run on the dynamically configured machines. This is working well, in our view, and keeps a high level of MCore job occupancy on the dynamically configured machines.

The MCore fill is maintained between high-water and low-water counts of running MCore jobs. The MCore job count is examined every 15 minutes via cron, and the desired state is recorded in a text file maintained in /var/tmp. For example, the current content is as follows:
[root@gate04 ~]# cat /var/tmp/mp8Prio.txt
#  lastMP8Prio=1 indicates we had dropped below low water, and we are currently filling.  
#  lastMP8Prio=0 indicates we are in a non-forced-fill situation of some kind
lastMP8Prio=1
#  fracAllowRunDynamic is the per-cent of Condor jobs queued that are allowed to 
#    consider the Dynamic core machines in match-making
fracAllowRunDynamic=5
#  This is the total number of queued/Idle MCore jobs
idleMP8Count=89

The second piece of this puzzle is within the AGLT2 condor_submit script itself. The mp8Prio.txt file is sourced if it exists; default values are chosen if it does not exist or is otherwise inaccessible. A random percentage is then thrown. If "lastMP8Prio" is 1, or the thrown fraction is higher than the "fracAllowRunDynamic" value, then this clause is added to the submitted job's "Requirements" macro:
TARGET.PARTITIONED =!= True
Using this method, we maintain a relatively high fill fraction at all times.
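
Distilled from the full wrapper script shown below, the gating logic amounts to a few lines of bash (a sketch only; "appendClause" is an illustrative name, not part of the actual script):

. /var/tmp/mp8Prio.txt
(( ranFrac = RANDOM * 100 / 32767 ))
if (( lastMP8Prio == 1 || ranFrac > fracAllowRunDynamic )) ; then
  appendClause=" && (TARGET.PARTITIONED =!= True)"
fi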

Again, with the advent of HTCondor-CE, this is applied at 15-minute intervals by the JobRouter using the scripts below. Note in particular the modification of the PATH for these cron tasks so that the "reconfig" command is correctly executed.
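
For reference, the evaluation script is driven from cron. A hypothetical /etc/cron.d entry is sketched below (the actual crontab is not shown on this page); the script itself also appends /usr/sbin to its PATH before invoking condor_ce_reconfig:

*/15 * * * * root /root/tools/evaluate_mp8_fill.sh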

Scripts

/root/tools/evaluate_mp8_fill.sh

#!/bin/bash
#----------------------------------------------------------
#
# "jobStateProg" returns the following values based upon condor_status output:
#   0 - $totalRun      total jobs running
#   1 - $totalSlots    total number of job slots on the system
#   2 - $totalCores    total number of available Cpus on the system
#   3 - $freeSlots     Idle job slots
#   4 - $freeCores     Idle Cpus
#   5 - $atlasRun      CPU count of jobs running as the gatekpr group
#   6 - $runningA      Jobs running under the opportunisticA group
#   7 - $runningB      Jobs running under the opportunisticB group
#   8 - $otherVORun    Jobs run by other VOs, coming in from gate02
#   9 - $calibRun      Jobs running as muon calibration
#   10 - $T3Run        Anything left over is a T3 job
#   11 - $idleMP8      Number of idle MP8 jobs
#   12 - $maxMP8cores  Total available cores for MP8 jobs
#   13 - $usedMP8cores Number of running MP8 cores
#
# NOTE: This "jobStateProg" is the same perl script that runs on the
#       T2 Condor Master
#
condor_ce_dir="/etc/condor-ce/config.d"
condor_ce_mp8_prio_file="${condor_ce_dir}/60-aglt2-mp8prio.conf"
jobStateProg="/usr/bin/perl /root/tools/job_state_counter.pl"
mp8PrioFile="/var/tmp/mp8Prio.txt"
minIdleMP8=19
(( tooLongIdle = 3 * 3600 ))
rightNow=`date +%s`
stateLog="/var/log/mp8PrioLog.txt"
#
if [ ! -e $stateLog ] ; then
  touch $stateLog
fi
#
if [ ! -e $mp8PrioFile ] ; then
  cat > ${mp8PrioFile} <<EOF
noneQueuedTime=$rightNow
lastMP8Prio=0
fracAllowRunDynamic=0
idleMP8Count=0
EOF
fi
. ${mp8PrioFile}
#
countSet=`$jobStateProg`
stErr=$?
#
if [ $stErr -ne 0 ] ; then
  exit $stErr
fi
#
countWord=( $countSet )
totalSlots=${countWord[1]}
totalCores=${countWord[2]}
freeSlots=${countWord[3]}
freeCores=${countWord[4]}
ATLASrun=${countWord[5]}
runningA=${countWord[6]}
runningB=${countWord[7]}
otherVORun=${countWord[8]}
calibRun=${countWord[9]}
T3Run=${countWord[10]}
idleMP8=${countWord[11]}
maxMP8cores=${countWord[12]}
usedMP8cores=${countWord[13]}
#
# Max (and min) fraction of prod and analy pilots that will be allowed
#  on MCore machines
#
if [ $totalCores -ne 0 ] ; then
  (( maxFreeFrac = maxMP8cores * 100 / totalCores ))
else
  maxFreeFrac=50
fi
minFreeFrac=10
#
# Let's set the thresholds higher again.  Target will be 75% initially,
# which leaves 480 cores out of 1920 for Analysis.
#
# Test, change from 72/78 to 88/94 on 3/19/2014
#
(( lowWater = maxMP8cores * 88 / 100 ))
(( highWater = maxMP8cores * 94 / 100 ))
(( freeMP8cores = maxMP8cores - usedMP8cores ))
if [ $totalCores -ne 0 ] ; then
  (( freeFrac = freeMP8cores * 200 / totalCores ))
  if (( freeFrac > maxFreeFrac )) ; then
    freeFrac=${maxFreeFrac}
  fi
  if (( freeFrac < minFreeFrac )) ; then
    freeFrac=${minFreeFrac}
  fi
else
  freeFrac=50
fi
#
setMP8prio=0
#
if (( idleMP8 > minIdleMP8 )) ; then
  noneQueuedTime=$rightNow
  if (( usedMP8cores < lowWater )) ; then
      setMP8prio=1
  elif (( lastMP8Prio == 1 && usedMP8cores < highWater )) ; then
      setMP8prio=1
  fi
else
  if (( rightNow - noneQueuedTime > tooLongIdle )) ; then
    freeFrac=99
  fi
fi
#
cat > ${mp8PrioFile} <<EOF
noneQueuedTime=${noneQueuedTime}
lastMP8Prio=${setMP8prio}
fracAllowRunDynamic=${freeFrac}
idleMP8Count=${idleMP8}
EOF

# Stuff for HTCondor-CE
#   Limited to machines where the configuration directory exists
#
if [ -e /etc/init.d/condor-ce ] ; then
  ranFrac=$RANDOM
  (( ranFrac = ranFrac * 100 ))
  (( ranFrac = ranFrac / 32767 ))
  if (( setMP8prio==1 || ranFrac>freeFrac )) ; then
    LastAndFrac="True"
  else
    LastAndFrac="False"
  fi
  if (( idleMP8 > minIdleMP8 )) ; then
    IdleMP8Pressure="True"
  else
    IdleMP8Pressure="False"
  fi
#
  echo "# Based upon the usual 15minute computation of mp8 stuff, Add these line, on this basis" > ${condor_ce_mp8_prio_file}
  echo "# lastMP8Prio == 1 || RandomFrac > TargetFrac" >> ${condor_ce_mp8_prio_file}
  echo "LastAndFrac = ${LastAndFrac}" >> ${condor_ce_mp8_prio_file}
  echo "# idleMP8Count > idleMP8Threshold" >> ${condor_ce_mp8_prio_file}
  echo "IdleMP8Pressure = ${IdleMP8Pressure}" >> ${condor_ce_mp8_prio_file}
  export PATH=${PATH}:/usr/sbin
  /usr/bin/condor_ce_reconfig -daemon job_router
fi

echo "${rightNow} ${setMP8prio} `date`" >> ${stateLog}

/root/tools/job_state_counter.pl

#!/usr/bin/perl
#
# Return the following values based upon condor_status output:
# $totalRun     total jobs running
# $totalSlots   total number of job slots on the system
# $totalCores   total number of cores on the system
# $freeSlots    Idle job slots
# $freeCores    Number of free cores on the system
# $atlasRun     Jobs running as the gatekpr group
# $opporARun    Jobs running under the opportunistic A group
# $opporBRun    Jobs running under the opportunistic B group
# $otherVORun   Jobs run by other VOs, coming in from gate02
# $calibRun     Jobs running as muon calibration
# $T3Run        Anything left over is a T3 job
# $idleMP8      Number of Idle MP8 jobs
# $maxMP8cores  Max number of cores available for MP8 jobs
# $usedMP8cores cores used for MP8 jobs
#
#   $ENV{CONDOR_CONFIG}="/etc/condor/condor_config";
#--------------------
$condor_status = "/usr/bin/condor_status";
$condor_q = "/usr/bin/condor_q";
#--------------------
$condor_args = "-format \"%s \" Name -format \"%d \" Cpus -format \"%s \" State -format \"%s \" Activity -format \"%d \" SlotTypeId -format \"%d \" TotalCpus -format \"%d \" TotalSlotCpus -format \"%s \" AccountingGroup -format '\n' In 2\>\&1";
#--------------------
$condor_q_args = "-name gate04.aglt2.org -constr \'(jobprio == 25 || jobprio == 26) && jobstatus == 1\' -format \"%d\n\" ClusterId";
#--------------------
#
# Initialize variables -----
#
# condor_status group
#
$totalRun = 0;
$totalSlots = 0;
$totalCores = 0;
$freeSlots = 0;
$freeCores = 0;
$atlasRun = 0;
$opporARun = 0;
$opporBRun = 0;
$otherVORun = 0;
$calibRun = 0;
$T3Run = 0;
$maxMP8cores = 0;
$usedMP8cores = 0;
#
# condor_q group -----
#
$idleMP8 = 0;
#
# Begin reading -----
#
open (FID, "$condor_status $condor_args | ");
foreach $line (<FID>) {
#    print "----- Input line is $line";
    chomp $line;
# Split the input line
    ($Owner,$Cores,$State,$Activity,$SlotType,$TotalCpus,$SlotCpus,$AccGroup) = split(/\s+/,$line);
#    print "$Owner,$Cores,$State,$Activity,$SlotType,$TotalCpus,$SlotCpus,$AccGroup\n";
    if ( $Owner eq "Error:") {
        exit 1;
    }
    $totalSlots++;
    $totalCores+=$Cores;
# Accumulations on running jobs
    if ( $State eq "Claimed" && $Activity eq "Busy" ) {
        $totalRun++;
# Split out the actual accounting group.  We will then count on these
        ($LongGroup,$balance,$balance2) = split(/\./,$AccGroup);
        ($discard,$Group) = split(/\_/,$LongGroup);
        $lvl2Group = "${Group}.${balance}";
#       print "     Found group $Group and two-level group $lvl2Group\n";
        if ( $Group eq "gatekpr" ) {
            $atlasRun+=$Cores;
        }
        elsif ( $lvl2Group eq "gatekpr.prod" ) {
            $atlasRun++;
        }
        elsif ( $lvl2Group eq "gatekpr.other" ) {
            $atlasRun++;
        }
        elsif ( $Group eq "opporA" ) {
            $opporARun++;
        }
        elsif ( $Group eq "opporB" ) {
            $opporBRun++;
        }
        elsif ( $Group eq "VOgener" ) {
            $otherVORun++;
        }
        elsif ( $Group eq "calibrate" ) {
            $calibRun++;
        }
        else {
            $T3Run++;
        }
        if ( $Cores > 1 ) {
            $usedMP8cores+=$Cores;
        }
    }
# Accumulations on Idle jobs and slots
    elsif ( ($State eq "Unclaimed" || $State eq "Owner") && $Activity eq "Idle" ) {
        $freeSlots++;
        $freeCores+=$Cores;
    }
# SlotType 3 is the main Dynamic slot on a machine.
# slotType 2 is a static MP8 slot
#
    if ( $SlotType == 3 ) {
        $maxMP8cores+=$TotalCpus;
    }
    if ( $SlotType == 2 ) {
        $maxMP8cores+=$SlotCpus;
    }
}
close (FID);
#
open (FID, "$condor_q $condor_q_args | ");
foreach $line (<FID>) {
    $idleMP8++;
}
close (FID);
#
# Return values to the caller
#
print "$totalRun $totalSlots $totalCores $freeSlots $freeCores $atlasRun $opporARun $opporBRun $otherVORun $calibRun $T3Run $idleMP8 $maxMP8cores $usedMP8cores\n";

/usr/bin/condor_submit

NOTE that this file is now obsolete with the introduction of HTCondor-CE for the grid protocol. See our Wiki page on HTCondor-CE configuration, and in particular the settings of the macro "RequestAnalyTask".

#!/bin/bash
# Looking for value localQue in the submission
# If it is not set, then set a default value for it
# For globus jobmanager-condor jobs, IsAnalyJob, localQue, Rank
#  and AccountingGroup have been pre-set correctly.
#
# The number of idle MP8 jobs will be used by condor_submit to
#   determine if atlasconnect jobs should be excluded from the
#   machines running Dynamic Slots
# The file specified here is also specified in the tools/evaluate_mp8_fill.sh
#   script.  If it is changed here, it must also be changed there
#
idleMP8Threshold=50
prioFile="/var/tmp/mp8Prio.txt"
#
if [ -e ${prioFile} ] ; then
  . ${prioFile}
else
  lastMP8Prio=1
  fracAllowRunDynamic=20
  idleMP8Count=0
fi
#
myRequirement=" ( (TARGET.TotalDisk =?= UNDEFINED) || (TARGET.TotalDisk >= 21000000) ) "
#
isQueHere=0
queIs="Default"
for arg in "$@"
do
  if [ -f "$arg" ]
  then
      foundQue=`grep -c localQue $arg`
      if [ $foundQue -ne 0 ]
      then
          isQueHere=1
          queIs=`grep localQue $arg | awk '{print $3}'|sed s/\"//g`
      else
          queIs="Default"
      fi
#  else
#      echo $arg is NOT a file
  fi
done
if [[ "$queIs" == "MP8" ]] ; then
  maxMem=33552000
else
  maxMem=4194000
fi
#
# More restrictions that should benefit the number of MCore jobs running
# With either Default or Analysis queues, if we are either prioritizing
# MultiCore jobs, or the randomly thrown fraction is greater than threshold,
#   don't let them run on Multicore capable machines.
# For atlasconnect jobs (and muon calibration), if there are queued
#   Multicore jobs, ban them completely from the Dynamic machines.
#
ranFrac=$RANDOM
(( ranFrac = ranFrac * 100 ))
(( ranFrac = ranFrac / 32767 ))
noFlag=0
#
if [[ $LOGNAME == "atlasconnect" || $LOGNAME == "muoncal" ]] ; then
  noFlag=1
  if (( idleMP8Count > idleMP8Threshold )) ; then
    myRequirement=" ( ((TARGET.TotalDisk =?= UNDEFINED) || (TARGET.TotalDisk >= 21000000)) && (TARGET.PARTITIONED =!= True) ) "
  fi
fi
#
if [[ "$queIs" == "Default" || "$queIs" == "Analysis" ]] ; then
  if [ $noFlag -eq 0 ] ; then
    if [ ${lastMP8Prio} -eq 1 ] ; then
      myRequirement=" ( ((TARGET.TotalDisk =?= UNDEFINED) || (TARGET.TotalDisk >= 21000000)) && (TARGET.PARTITIONED =!= True) ) "
    elif (( ranFrac > fracAllowRunDynamic )) ; then
      myRequirement=" ( ((TARGET.TotalDisk =?= UNDEFINED) || (TARGET.TotalDisk >= 21000000)) && (TARGET.PARTITIONED =!= True) ) "
    fi
  fi
fi
#
if [ $LOGNAME == "rsv" ]
then
   /usr/bin/real_condor_submit $*
else
   localRequire=`grep -i requirements $*|grep -v -E -i "#+\ *"`
   if [ "$localRequire" == "" ]
   then
      fullRequire=$myRequirement
   else
      extractRequire=`echo $localRequire|awk '{$1=""; $2=""; print}'`
      fullRequire="$myRequirement && ($extractRequire )"
   fi
#
   if [ $isQueHere -eq 1 ]
   then
       /usr/bin/real_condor_submit $* \
          -append "+IsShortJob  = False" \
          -append "+IsMediumJob = False" \
          -append "+IsTestJob   = False" \
          -append "+IsLustreJob = False" \
          -append "+IsUnlimitedJob = False" \
          -append "+JobMemoryLimit = ${maxMem}" \
          -append "Requirements = ( $fullRequire )" \
          -append "Periodic_Remove = ( ( RemoteWallClockTime > (3*24*60*60 + 5*60) ) || (ImageSize > JobMemoryLimit) )"
   else
       /usr/bin/real_condor_submit $* \
          -append "+IsShortJob  = False" \
          -append "+IsMediumJob = False" \
          -append "+IsTestJob   = False" \
          -append "+IsAnalyJob  = False" \
          -append "+IsLustreJob = False" \
          -append "+IsUnlimitedJob = False" \
          -append "+localQue = \"Default\"" \
          -append "+JobMemoryLimit = ${maxMem}" \
          -append "Requirements = ( $fullRequire )" \
          -append "Periodic_Remove = ( ( RemoteWallClockTime > (3*24*60*60 + 5*60) ) || (ImageSize > JobMemoryLimit) )"
   fi
fi

-- BobBall - 03 Mar 2014