The PanDA Auto Exclusion process for ANALY_AGLT2

Introduction

The procedures here were documented by D. van der Ster in this talk; viewing it requires a CERN login.

The PanDA auto-exclusion process was turned on on November 14, 2010. Test jobs are sent to analysis queues; if they fail, the site is turned off, blocking further submissions until the problem is resolved. Emails notify all grid sites of the failure, and again when the site's test jobs are succeeding once more.

Auto-exclusion can be enabled or disabled on a site-by-site basis.

Monitoring

  • Look here to see the current set of test jobs
  • Look here for recently failing auto-exclusion analysis jobs at AGLT2.
    • Or, just click "gangarobot" from the PandaMon analysis page
  • This shows a plot of Hammer Cloud auto-exclusions

Diagnosis

  • Look at the Ganga Robot page
    • Check for a ? next to our site, and click that. This is the HC "incidents" page for ANALY_AGLT2.
    • May look something like this: ANALY_AGLT2 (brokeroff): Needs jobs in template 96
  • Go back to the HC homepage, click the current test for template 96. Click through to ANALY_AGLT2 jobs (the "Link" column).
    • The "Test" column on the "incidents" page arrives at this same link.
  • Analyze the jobs list for problems that are stopping successful completion of the HC test.
  • History of queue commands (online, offline, etc) for all sites is shown here

Recovery

  • An email is sent notifying the list (and the site) that the site is offline.
  • Coordinate fixes to repair whatever may be wrong.
  • The site must pass 3 successive good jobs; an email is then sent announcing that the site is once again passing.
  • Set the site back to online. This may or may not require test jobs from PanDA shifters.
    • Queue state change document
    • Short form is in our own Twiki here
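
The "3 successive good jobs" rule above can be illustrated with a short shell function. This is a sketch of the counting logic only, not HammerCloud code: the function name passes_recovery and the outcome label "ok" are hypothetical, and the real test framework may apply additional conditions.

```shell
# Sketch only: not HammerCloud code. Job outcomes are passed oldest-first
# as arguments; "ok" is a hypothetical label for a good job.
passes_recovery() {
    streak=0
    for outcome in "$@"; do
        if [ "$outcome" = "ok" ]; then
            streak=$((streak + 1))   # extend the current run of good jobs
        else
            streak=0                 # any failure resets the run
        fi
    done
    # Succeed only if the most recent jobs form a run of at least 3 good ones.
    [ "$streak" -ge 3 ]
}
```

For example, `passes_recovery fail ok ok ok` succeeds, while `passes_recovery ok ok fail ok` does not, because the failure resets the count.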

Jarka Schovancova sent the following information on Feb 15, 2010, concerning a procedure to turn the analysis queue back online following a site shutdown:

ANALY queues are handled by DAST shifters; ADCoS steps in only occasionally.

To ensure that queues are properly tested before they are put back into production, an ANALY queue should always be set to 'brokeroff'. This status prevents users from submitting jobs to ANALY_AGLT2 and triggers automatic test jobs. When a couple of automatic test jobs in a row finish OK, HammerCloud sets the corresponding queue online.

This procedure has been modified recently. A comment with the content HC.Test.Me must be added to the curl command, and the comment must not be quoted in any way. After getting your grid proxy, an example command of this type would be:
curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --capath /etc/grid-security/certificates 'https://panda.cern.ch:25943/server/controller/query?tpmes=settest&queue=ANALY_AGLT2_SL6-condor&comment=HC.Test.Me'
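
For repeated use, the command above could be wrapped in a small shell helper. The following is a minimal sketch, not an official tool: the function names settest_url and settest_queue are hypothetical, and the URL, query parameters, and proxy path are copied unchanged from the example command. The comment is appended as the bare string HC.Test.Me.

```shell
# Sketch only: helper names are hypothetical; the URL, parameters, and
# proxy location are copied from the example command in this document.

# Print the request URL for the given queue name.
settest_url() {
    queue="$1"
    # The comment is the bare string HC.Test.Me, with no quoting of its own.
    printf '%s\n' "https://panda.cern.ch:25943/server/controller/query?tpmes=settest&queue=${queue}&comment=HC.Test.Me"
}

# Send the request, using the grid proxy as both client certificate and
# CA file, as in the example command. Requires a valid grid proxy.
settest_queue() {
    proxy="/tmp/x509up_u$(id -u)"
    if [ ! -r "$proxy" ]; then
        echo "No readable grid proxy at $proxy" >&2
        return 1
    fi
    curl --cert "$proxy" --cacert "$proxy" \
         --capath /etc/grid-security/certificates \
         "$(settest_url "$1")"
}
```

Running `settest_url ANALY_AGLT2_SL6-condor` prints the URL for inspection without contacting the server; `settest_queue ANALY_AGLT2_SL6-condor` sends the request.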

Disabling auto-exclusion

  • Refer to this document if we wish to have auto-exclusion turned off
    • Disable via curl http://hammercloud.cern.ch/atlas/autoexclusion/disable/ANALY_AGLT2
    • Enable via curl http://hammercloud.cern.ch/atlas/autoexclusion/enable/ANALY_AGLT2
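
The two commands above differ only in the action segment of the URL, so a helper can build either one. This is a sketch under that assumption: the function name autoexclusion_url is hypothetical, and only the URL pattern shown above is used.

```shell
# Sketch only: the function name is hypothetical; the URL pattern is the
# one shown in the enable/disable commands in this document.
autoexclusion_url() {
    action="$1"   # "enable" or "disable"
    site="$2"     # for example ANALY_AGLT2
    case "$action" in
        enable|disable) ;;
        *)
            echo "usage: autoexclusion_url enable|disable SITE" >&2
            return 2
            ;;
    esac
    if [ -z "$site" ]; then
        echo "usage: autoexclusion_url enable|disable SITE" >&2
        return 2
    fi
    printf '%s\n' "http://hammercloud.cern.ch/atlas/autoexclusion/${action}/${site}"
}
```

Usage would then be `curl "$(autoexclusion_url disable ANALY_AGLT2)"`; any action other than enable or disable is rejected before a request is made.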

Reference: Incidents Page

-- BobBall - 18 Nov 2010
Topic revision: r9 - 17 Sep 2013, BobBall