Workflows for modifying HTCondor configuration

When modifying HTCondor there are two broad phases that any steps taken fall into: the testing phase and the implementation/dissemination phase. In the testing phase you test your modifications on one or more machines to make sure they work as expected. Once you have sorted out all the bugs and are confident the modifications work as intended, you disseminate them to every machine.

There are two ways to approach the testing phase, but only one way to approach the dissemination phase.
  1. Turn off cfengine on whichever machine the tests are being done on, edit the config files directly, and then use condor_reconfig to update HTCondor's internal configuration (HTCondor does not re-read its configuration every time a config file is edited). Once the modifications appear to work, create a cfengine test policy, turn cfengine back on on the machine, and test everything again. This approach is quicker for modifications that might take a lot of attempts to get right.
  2. Do all of the testing using a cfengine test policy. This approach is quicker for modifications that will only take a couple of tries to get right (such as modifying the time limits of the short, medium, and long queues).

The dissemination phase can only be done using cfengine.

Testing Approach 1: Turn off cfengine for most of the testing

  1. If editing the configuration of a submit node, notify users not to use it.
  2. Turn off cfengine using "sudo service cfengine3 stop"
  3. If editing the configuration of a worker node, add "TARGET.IsTestJob =?= True" to the START variable/macro of 55-host.conf like so:
    START = $(START) && (TARGET.IsTestJob =?= True) && <everything else that was already there>
    Then use "condor_reconfig". This edit makes it so that only jobs with "+IsTestJob = True" in their submit script will run (so make sure this is added to any test jobs you run). In addition, add "Requirements = (Machine == "c--.aglt2.org")" to your submit script to make sure jobs only run on this specific node. If you need to test multiple nodes, modify this statement to something like "Requirements = (Machine == "c-113-2.aglt2.org") || (Machine == "c-113-3.aglt2.org")" (see the example submit file fragment after this list).
  4. Make changes to HTCondor configuration files.
  5. Use "condor_reconfig" or "condor_restart" depending on which variables/macros were modified (check the list here to see if "condor_restart" is necessary).
  6. Test your changes by running test jobs.
  7. Repeat steps 4 and 5 until everything works as expected.
  8. Create a test policy as described in "Testing Approach 2" (below) and test the test policy to see if it works as expected.
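
For reference, here is a minimal sketch of the relevant lines of a test submit file. The executable name and the machine names are placeholders; only the "+IsTestJob" line and the Requirements expression are the point of the example.

    # hypothetical test submit file fragment; adjust the machine names to the node(s) under test
    executable   = test_job.sh
    +IsTestJob   = True
    Requirements = (Machine == "c-113-2.aglt2.org") || (Machine == "c-113-3.aglt2.org")
    queue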

Testing Approach 2: Use cfengine for all of the testing

Follow the steps here for checking out the cfengine SVN repository, as well as setting up and using a test policy. Modify and use said test policy to test changes to HTCondor configuration files.
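
As a rough illustration of the test cycle on a single node, the commands below (run as root) switch the node to a test policy named "mytest" and run it immediately. The test policy path shown is an assumption based on the production path /var/cfengine/policy/T2; use whatever path the linked instructions produce.

    # point the node at the test policy (path shown is an assumption)
    echo "/var/cfengine/policy/mytest" > /var/cfengine/policy_path.dat
    # run cfengine immediately against the test policy
    cf-agent -f failsafe.cf -!DPolicyPath_mytest ; cf-agent -!DPolicyPath_mytest
    # when testing is finished, point back at the production policy
    echo "/var/cfengine/policy/T2" > /var/cfengine/policy_path.dat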

Dissemination

To disseminate configuration changes to all machines, follow the instructions here for putting the test policy into production.

Examples

Modifying the time limits of the short, medium, and long queues

To edit the time limits of the short, medium, and long queues, follow these steps:
  1. Since these are part of the submit node configuration, pick a submit node to test on (white is meant for testing purposes, so you should probably pick it)
  2. Checkout cfengine as described here.
  3. The time limits are configured in cfengine/masterfiles/stash/condor_msut3/58-host-submit.conf.
  4. For each time limit there are two places in this file that need to be updated: SYSTEM_PERIODIC_HOLD and SYSTEM_PERIODIC_HOLD_REASON.
  5. Edit the appropriate places for each of the time limits you'd like to modify. Note that the values are given in seconds; for example, a 3 hour limit is written as 3*60*60 (3 hours * 60 minutes/hour * 60 seconds/minute = 10,800 seconds). An illustrative sketch of what these lines might look like appears after this list.
  6. Create and sync a test policy as described here.
  7. On the submit node, change the policy by editing /var/cfengine/policy_path.dat to point to the test policy. (Remember to reset this to /var/cfengine/policy/T2 once testing is done)
  8. Either wait for cfengine's hourly run, or use the command "cf-agent -f failsafe.cf -!DPolicyPath_mytest ; cf-agent -!DPolicyPath_mytest" on the submit machine to run cfengine immediately. Note that each instance of "mytest" in this command should be replaced with the name of the test policy.
  9. Submit jobs from this node to make sure the time limits work as intended. Repeat steps 5, 6, and 8 until everything works as intended.
  10. Undo step 7.
  11. Disseminate the changes to all of the submit nodes as described here.
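
For orientation, here is a hedged sketch of what the pair of settings for one queue might look like in 58-host-submit.conf. The attribute name JobQueue and the exact expressions are assumptions for illustration; the real file may structure the limit differently.

    # illustrative only: hold a running short-queue job once it exceeds 3 hours (3*60*60 = 10800 seconds)
    SYSTEM_PERIODIC_HOLD = (JobQueue =?= "short") && (JobStatus == 2) && \
                           ((time() - EnteredCurrentStatus) > 3*60*60)
    SYSTEM_PERIODIC_HOLD_REASON = "Exceeded the 3 hour wall time limit of the short queue"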

Modifying the limits on the number of concurrently running short, medium, and long jobs

To edit the limits on the number of concurrently running short, medium, and long jobs, follow these steps:
  1. Since these are part of the worker node configuration, pick a worker node of each type to test on (at time of writing, this would be one PE1950 and one R610).
  2. Checkout cfengine as described here.
  3. The limits are configured in cfengine/masterfiles/stash/condor_msut3/55-host-.conf.
  4. These limits work by restricting which job slots a short, medium, or long job can run on. Currently the setup is as follows. Short jobs can run on all job slots. On PE1950s, medium jobs can run on all slots except slot 8. On R610s, medium jobs can run on all job slots. Long jobs can only run on the last job slot of each machine (slot 8 for PE1950s and slot 12 for R610s). This means that as many short jobs as possible can run, the number of medium jobs that can run at once is only about 30 below the total number of slots, and only about 50 long jobs can run at once.
  5. Figure out which assignment of queues to job slots will roughly give the limits you want.
  6. Modify the START statement in 55-host-*.conf to include "TARGET.IsTestJob =?= True" as described above, then modify it further so that the appropriate job slots only run the intended jobs (see the illustrative sketch after this list).
  7. Create and sync a test policy as described here.
  8. On the worker nodes you will test on, change the policy by editing /var/cfengine/policy_path.dat to point to the test policy. (Remember to reset this to /var/cfengine/policy/T2 once testing is done)
  9. Either wait for cfengine's hourly run, or use the command "cf-agent -f failsafe.cf -!DPolicyPath_mytest ; cf-agent -!DPolicyPath_mytest" on the appropriate worker nodes to run cfengine immediately. Note that each instance of "mytest" in this command should be replaced with the name of the test policy.
  10. Submit jobs to these worker nodes, adding "Requirements = (Machine == "c--.aglt2.org")" to your submit script to make sure jobs only run on these specific nodes. If you need to test multiple nodes, modify this statement to something like "Requirements = (Machine == "c-113-2.aglt2.org") || (Machine == "c-113-3.aglt2.org")".
  11. Once everything seems to be working as intended, undo step 8 (the policy_path.dat edit).
  12. Put the changes into production as described here.
  13. Use cluster_control to run "cf-agent -f failsafe.cf -!DPolicyPath_T2 ; cf-agent -!DPolicyPath_T2" on all of the worker nodes.
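
As a rough illustration of step 6, here is a sketch of slot-restricted START logic for a PE1950 (8 slots). It assumes jobs carry a JobQueue attribute naming their queue; that attribute name is an assumption, so use whatever attribute the real configuration keys on.

    # illustrative sketch only; JobQueue and the exact slot split are assumptions
    # while testing, only jobs marked +IsTestJob = True may start at all
    # slot 8 is reserved for long jobs; slots 1-7 take short and medium jobs
    START = $(START) && (TARGET.IsTestJob =?= True) && \
            ( (SlotID == 8 && TARGET.JobQueue =?= "long") || \
              (SlotID <  8 && TARGET.JobQueue =!= "long") )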

Adding/modifying concurrency limits

The concurrency limits are managed by the condor_negotiator daemon and are defined in 20-job-rules.conf. Because they are managed by the condor_negotiator, they cannot be tested without affecting the entire HTCondor system on the tier3. So we must be very careful with them. Follow these steps to add or modify the concurrency limits:
  1. Checkout cfengine as described here.
  2. Add/modify the concurrency limits inside masterfiles/stash/condor_msut3/20-job-rules.conf by changing the names, adjusting the numbers, or adding new "XXX_LIMIT = YYY" lines (see the example after this list).
  3. Create and sync a test policy as described here.
  4. Change the policy on msut3-condor by editing /var/cfengine/policy_path.dat to point to the test policy. (Remember to reset this to /var/cfengine/policy/T2 once testing is done)
  5. Either wait for cfengine's hourly run, or use the command "cf-agent -f failsafe.cf -!DPolicyPath_mytest ; cf-agent -!DPolicyPath_mytest" on msut3-condor to run cfengine immediately. Note that each instance of "mytest" in this command should be replaced with the name of the test policy.
  6. Make sure everything works as intended by running some jobs with the new concurrency limits.
  7. Once everything seems to be working as intended, undo step 4.
  8. Put the new changes into production as described here.
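
For example, a minimal version of step 2 might look like the following; the limit name MYAPP and the value 10 are placeholders.

    # in masterfiles/stash/condor_msut3/20-job-rules.conf
    MYAPP_LIMIT = 10

Jobs then opt in to a limit by naming it in their submit file, e.g.

    concurrency_limits = MYAPP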