Procedures followed to bring gate03 online as a test gatekeeper

NOTE: This page is changing as the procedures and tests evolve. This note will be removed once testing is complete and the odd comments here and there below are cleaned up.

Bringing up the cloned gate03

On gate03 (the clone of gate01), ran the following in single-user mode as it came up for the first time:

vdt-control --off
chkconfig condor off
reboot (to get correct IP and hostname)

  • Had to rebuild VMware Tools: there were no NICs, and services failed at startup.
  • Turned off the 3 cfengine services.
  • Commented out crontab entries with a ## prefix.

Differences between Pacman and RPM install

See this URL: https://twiki.grid.iu.edu/bin/view/Documentation/Release3/RPMWhatsNew

Some components are not yet packaged as RPMs and you still need to get them from existing Pacman installs. In particular, GUMS and the Gratia Collector are not yet provided via RPM.

  • $VDT_LOCATION no longer exists
  • $VDT_LOCATION/setup.sh no longer exists and isn't needed. For user jobs that still expect $OSG_GRID/setup.sh to exist, a dummy has been placed in /etc/osg/wn-client/setup.sh, and you can set $OSG_GRID to /etc/osg/wn-client.
  • $GLOBUS_LOCATION no longer exists
  • The single config.ini file has been replaced by a directory of files in /etc/osg/config.d/*.ini. They are read in alphabetical order.
  • configure-osg has been renamed osg-configure.
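Because the split configuration depends on read order, it can help to list the files in that order before editing them. This is a minimal sketch, assuming the files live in /etc/osg/config.d and that a plain byte-order sort matches the order osg-configure uses. The function name list_ini_in_order is hypothetical and not part of any OSG tool.

```shell
# Print the .ini files in a directory in plain alphabetical (byte) order.
# Hypothetical helper; the default directory is an assumption.
list_ini_in_order() {
    dir="${1:-/etc/osg/config.d}"
    for f in "$dir"/*.ini; do
        # Skip the unexpanded pattern when the directory has no .ini files.
        [ -e "$f" ] || continue
        basename "$f"
    done | LC_ALL=C sort
}

# Usage:
#   list_ini_in_order /etc/osg/config.d
```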

Creating the CE

Follow directions here and make Resource Group AGLT2_TEST. Next, follow directions here and create AGLT2_TEST_CE for gate03. Following this, an OSG ticket is created, and the gatekeeper in the new resource group is marked as Inactive until the next week's OSG management meeting. I attended this meeting, answered questions for the admins, and the resource was then activated.

This activation is necessary to initiate reporting to the BDII, else releases are not Tagged as available. To see this, the following commands are useful:
  • lcg-info --vo atlas --list-ce --attr Tag|grep gate03.aglt2.org
  • lcg-info --vo atlas --list-ce --attr Tag|grep -A 100 gate03.aglt2.org
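One caveat with the commands above: grep treats the dots in gate03.aglt2.org as regex wildcards. A small sketch of a fixed-string filter follows. The function name ce_context is hypothetical, and the 100-line default simply mirrors the -A 100 used above.

```shell
# Filter text on stdin for a literal host name, plus N lines of trailing context.
# Hypothetical helper; -F makes grep match the host as a fixed string, not a regex.
ce_context() {
    host="$1"
    after="${2:-100}"
    grep -F -A "$after" -- "$host"
}

# Usage:
#   lcg-info --vo atlas --list-ce --attr Tag | ce_context gate03.aglt2.org
```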

  • Following edit of /opt/osg/osg/etc/config.ini, did
    • [gate03:etc]# configure-osg -v
    • [gate03:etc]# configure-osg -c
  • Must also disable:
    • condor-cron
    • osg-rsv
  • Should now be able to start condor and do
    • vdt-control --on

  • Go into gums servers linat02/03/04 and add the "Host To Group Mappings" for gate03, identical to gate01 mapping.

  • Add new gate03 hostcert and httpcert (service)
  • Add rsv account as shown in these directions:

SchedConfig Changes

Copy AGLT2-condor.py to AGLT2_TEST-condor.py with gate01 -> gate03. This will look like a production queue.
(
Later, to run the osg-wn-client rpm set on the WN, change the envsetup entry from
'envsetup' : 'source /afs/atlas.umich.edu/OSGWN/setup.sh;'
to
'envsetup' : 'source /etc/osg/wn-client/setup.sh;'
)
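The copy-and-rename step can be sketched with sed. This is a minimal sketch, assuming the only differences are the literal strings named above. The function names make_test_queue and use_rpm_envsetup are hypothetical.

```shell
# Create the test queue definition from the production one,
# replacing every gate01 with gate03.
make_test_queue() {
    src="$1"   # e.g. AGLT2-condor.py
    dst="$2"   # e.g. AGLT2_TEST-condor.py
    sed 's/gate01/gate03/g' "$src" > "$dst"
}

# Print a queue definition with envsetup pointed at the rpm wn-client setup script.
# Writes to stdout so the result can be reviewed before replacing the file.
use_rpm_envsetup() {
    sed 's|/afs/atlas\.umich\.edu/OSGWN/setup\.sh|/etc/osg/wn-client/setup.sh|' "$1"
}

# Usage:
#   make_test_queue AGLT2-condor.py AGLT2_TEST-condor.py
#   use_rpm_envsetup AGLT2_TEST-condor.py | diff AGLT2_TEST-condor.py -
```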

Hi Gianfranco,
Could you please add AGLT2_TEST queue to MC HC test?
Thanks, Yuri (ADCoS expert)

Is it enough to add it to the 2 tests that do not trigger auto-exclusion, or do you want the full suite
(including auto-exclusion/inclusion)?
For gate03, the 2 tests are sufficient.  It can take up to 24 hours to begin such testing.

OK, I have added the queue to template 450 (PFT Evgen_trf 16.6.5.1) and 164 (PFT Reco_trf 16.6.5.5.1), 
both of which are not used for auto-exclusion.

Test plan

  • Submit test jobs to gate03 in the standard way
    • Works, both via globus tests and submission from splitter.
  • Set the queue in test state
    • [ball@gate01:~]$ curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --capath /etc/grid-security/certificates 'https://panda.cern.ch:25943/server/controller/query?tpmes=setmanual&queue=AGLT2_TEST-condor'
      • Set queue nickname='AGLT2_TEST-condor', siteid='AGLT2_TEST' to manual
    • [ball@gate01:~]$ curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --capath /etc/grid-security/certificates 'https://panda.cern.ch:25943/server/controller/query?tpmes=settest&queue=AGLT2_TEST-condor&comment=HC.Test.Me'
      • Changed status of queue nickname='AGLT2_TEST-condor', siteid='AGLT2_TEST' from offline to test
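The two requests above differ only in the tpmes value and the optional comment, so the URL can be assembled by a small helper. This is a sketch only: panda_queue_url is a hypothetical name, and the endpoint and parameter names are copied from the commands above rather than from any API documentation.

```shell
# Build the PanDA controller URL for a queue state change.
panda_queue_url() {
    action="$1"        # e.g. setmanual or settest
    queue="$2"         # e.g. AGLT2_TEST-condor
    comment="${3:-}"   # optional; must already be URL-safe (no spaces)
    url="https://panda.cern.ch:25943/server/controller/query?tpmes=${action}&queue=${queue}"
    if [ -n "$comment" ]; then
        url="${url}&comment=${comment}"
    fi
    printf '%s\n' "$url"
}

# Usage, with the same proxy file and CA path as above:
#   proxy=/tmp/x509up_u$(id -u)
#   curl --cert "$proxy" --cacert "$proxy" --capath /etc/grid-security/certificates \
#        "$(panda_queue_url settest AGLT2_TEST-condor HC.Test.Me)"
```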

  • Asked pandashift for a batch of test jobs....
    • Got a few, but the HC testing above is the true test of the gate03 workings.
    • HC testing succeeding.

  • Upgrade condor to 7.8.1
    • rpm -Uvh ...
    • cd /usr/bin; mv condor_submit real_condor_submit; cp -p new_condor_submit condor_submit
    • Test jobs continue to successfully run. Moving on.
    • Note: grid jobs submitted from splitter to gate03, and running on Condor 7.6.6 WN, will not correctly transfer output files back to splitter
      • The same is true for locally run jobs on gate03; their output files are called _condor_stdout and _condor_stderr
      • There is no such issue if the WN is running 7.8.1
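The wrapper swap in the upgrade step can be chained so that nothing is moved unless the wrapper is actually present. This is a minimal sketch: swap_in_submit_wrapper is a hypothetical name, and it assumes new_condor_submit already sits in the same directory, as in the command above.

```shell
# Rename the real condor_submit and put the wrapper in its place.
# Each step runs only if the previous one succeeded.
swap_in_submit_wrapper() {
    bindir="${1:-/usr/bin}"
    # Run in a subshell so the caller's working directory is untouched.
    (
        cd "$bindir" &&
        [ -f new_condor_submit ] &&
        mv condor_submit real_condor_submit &&
        cp -p new_condor_submit condor_submit
    )
}

# Usage (as root):
#   swap_in_submit_wrapper /usr/bin || echo "wrapper not installed"
```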

  • Submit test jobs from gate03 that use the osg_wn_client rpm set. Make sure they work.
    • Success

  • Convert gate03 to rpm set and again submit jobs to osg_wn_client rpm set.
    • OSG 3.x comes with GT5. When we switch, please, send me (Jose Caballero) and John Hover an email so we can adjust the pilot factory.
yum --enablerepo=osg install empty-ca-certs
yum --enablerepo=osg install osg-ce-condor
yum --enablerepo=osg install globus-gram-job-manager-managedfork

Carefully check over all of these files:
[gate03:yum.repos.d]# cd /etc/osg/config.d
[gate03:config.d]# ll
total 48
-rw-r--r-- 1 root  866 Oct 31  2011 01-squid.ini
-rw-r--r-- 1 root 1698 Mar  9 16:54 10-misc.ini
-rw-r--r-- 1 root 2370 Oct 20  2011 10-storage.ini
-rw-r--r-- 1 root  341 Aug 29  2011 15-managedfork.ini
-rw-r--r-- 1 root 1204 Dec  7  2011 20-condor.ini
-rw-r--r-- 1 root 1453 Feb 23 13:05 30-cemon.ini
-rw-r--r-- 1 root 8003 Apr  2 15:30 30-gip.ini
-rw-r--r-- 1 root 1884 Oct 31  2011 30-gratia.ini
-rw-r--r-- 1 root  339 Dec  7  2011 40-localsettings.ini
-rw-r--r-- 1 root 1442 Mar  9 16:54 40-network.ini
-rw-r--r-- 1 root 2325 Aug 29  2011 40-siteinfo.ini

Preparing for the rpm install

The RSV must be installed separately from the CE

Actions taken

rpm install methods

  • On gate02 do yum install osg-ca-certs
  • On gate01 and gate03 do yum install empty-ca-certs
    • gate01: remember to set the queues to "brokeroff" to drain activated jobs
      • Not needed on gate03

  • Reference this page; it indicates that the following should be performed:
    • edit /etc/yum.repos.d/osg.repo to add/change the following lines:
      • exclude=condor empty-condor*
      • enabled=0
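After editing, a quick check that both lines are present can catch a typo before running yum. This is a rough sketch: repo_has_settings is a hypothetical name, and it only checks that each line appears somewhere in the file, not which repository section it sits in.

```shell
# Succeed only if both expected lines appear verbatim in the repo file.
repo_has_settings() {
    repo="${1:-/etc/yum.repos.d/osg.repo}"
    # -F: fixed string (the * is literal); -x: match the whole line; -q: quiet.
    grep -Fxq 'exclude=condor empty-condor*' "$repo" &&
    grep -Fxq 'enabled=0' "$repo"
}

# Usage:
#   repo_has_settings /etc/yum.repos.d/osg.repo && echo "osg.repo looks right"
```

With enabled=0 in place, installs from this repository need --enablerepo=osg, as in the yum commands used earlier on this page.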

  • When ready, choose "yum install osg-ce-condor" as the rpm to install.
-- BobBall - 23 Jul 2012
Topic revision: r7 - 17 Sep 2013, BobBall