Install or Upgrade OSG at AGLT2
The main difference between these instructions and the usual documentation is that we use worker node and wlcg-client installations in AFS as well as certificates in AFS which are kept up to date by gate02.
For full information on how to install OSG, please refer to this page:
OSGCE
For a short tutorial, see:
https://twiki.grid.iu.edu/bin/view/ReleaseDocumentation/ComputingElementHandsOn
- Most of our config should come over when you run extract-config during an upgrade (more below)
- Ignore the parts of this tutorial regarding CA setup. Make symlinks as noted below.
- authorization_method in config.ini is "prima" (see the config.ini fragment after this list)
- Also ignore the parts about configuring RSV and RSV certs, on gate02 at least.
- The needed host certs are pushed automatically into /etc/grid-security from umopt1
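For reference, the relevant fragment of config.ini looks roughly like this (a sketch only; the section and option names follow OSG 1.2 configure-osg conventions and the gums host is ours from below, so verify against the config.ini that extract-config produces):
[Misc Services]
authorization_method = prima
gums_host = linat04.grid.umich.edu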
Changelog
The following are the commands I used to install OSG100 on gate02; there are some site-specific issues:
Updated May 12, 2009, for OSG101 install -- B.Ball
Updated June 13, 2009, for OSG104 -- B.Ball
No changes required to fundamental procedure outlined below.
August 8 2009 - bmeekhof
Renamed topic; edited according to experience upgrading OSG 1.0.4 to OSG 1.2.0 on gate02, following the linked tutorial.
August 23 2009 - bmeekhof
Updated after installation on gate01. Additional info about updating AFS installations of Pacman, OSGWN and opt/WLCG-client and setting CA locations.
November 2, 2010 - Bob Ball
Upgrade OSG 1.2.6 to 1.2.15
January 17, 2011 - Bob Ball
Upgrade OSG 1.2.15 to 1.2.16
April 5, 2011 - Bob Ball
Upgrade OSG 1.2.16 to 1.2.19
October 21, 2011 - Bob Ball
Upgrade OSG 1.2.19 to 1.2.23
October 21, 2011 -- Bob Ball
Install OSGWN 1.2.23
November 7, 2011 -- Bob Ball
Upgrade OSG 1.2.23 to 1.2.24
November 15, 2011 -- Bob Ball
Upgrade OSG 1.2.24 to 1.2.25 on gate02, and apply 1.2.25 gratia security fix on gate01
March 8, 2012 -- Bob Ball
Upgrade OSG to 1.2.28 on both gate01 and gate02.
February 27, 2016 -- Directions used most recently for an OSG 3.3 upgrade
Prepare for install
Turn off the existing OSG services
Source the existing OSG install.
source /opt/OSG104/setup.sh
vdt-control --off
Log out to clear the env variables, or log in to a new shell.
Set up the env variables.
This is important; don't forget it or you'll be re-installing. Setting OLD_VDT_LOCATION ensures your old configuration gets pulled in, but we will also have to run "extract-config" later to set up config.ini.
export VDTSETUP_CONDOR_LOCATION=/opt/condor
export VDTSETUP_CONDOR_CONFIG=/opt/condor/etc/condor_config
export VDT_GUMS_HOST=linat04.grid.umich.edu
export OLD_VDT_LOCATION=/opt/OSG104/
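A quick sanity check before proceeding (a suggested check, not part of the original procedure):
env | grep -e '^VDT' -e '^OLD_VDT'
All four variables above should appear in the output.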
Install the software
Install pyOpenSSL
"We have identified a reporting bug in OSG 1.2 that could impact accounting (for WLCG) and monitoring since it impacts the ability to publish RSV records to the GOC RSV database and WLCG SAM. The current monitoring system shows that all the Tier-2s running 1.2 have either fixed this problem or are aware of it. A VDT update will be available early next week.
The bug stems from a newly introduced dependency in the RSV Gratia probe on pyOpenSSL. If your site is already running pyOpenSSL, it should not be affected. If you are not running pyOpenSSL, this means that your site is not reporting Gratia accounting data. The workaround is to install pyOpenSSL. Alternatively, as noted above, this will be available in a soon-to-be-released VDT update. "
(message dated Friday Aug 14 2009)
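The workaround itself is a one-line install; pyOpenSSL is the stock package name on RHEL/SL (assuming it is available in your configured repos):
yum install pyOpenSSL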
You'll need admin AFS tokens to do this: "kinit admin" and "aklog". Note that AFS paths sometimes use .atlas.umich.edu when we need the RW volume.
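For example (the tokens command is just a verification step):
kinit admin
aklog
tokens   ## confirm the admin token is present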
Install latest Pacman
Install pacman (AFS):
cd /afs/.atlas.umich.edu/opt
wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-latest.tar.gz
tar -xzvf pacman-latest.tar.gz
rm pacman   ## remove the old symlink
ln -s pacman-x.xx pacman
rm pacman-latest.tar.gz
vos release opt
cd /afs/atlas.umich.edu/opt/pacman/
source setup.sh
(the first pacman source wants you to be in the local directory)
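A quick check that the new pacman is the one in your path (assuming this pacman release supports the -version flag):
which pacman
pacman -version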
Update AFS installations of OSG Worker Node and OSG WLCG client
Source /afs/atlas.umich.edu/opt/pacman/setup.sh if you have not already.
Please read
LocalDQ2Tools#The_Installation_Procedure for information about updating this and what you have to do to remount the /opt volume as read-write in AFS.
UPDATE: Or... use /afs/.atlas.umich.edu to get the RW volume, as documented below, and fix the paths in the files afterward.
You should probably save a copy of the current installation and delete the existing files.
Then install the worker node and wlcg clients using pacman (note the "." in the AFS path to use the RW volume, and note that we fix up the paths to use the RO volume for actual use):
OSGWN updated May 25, 2010 to osg 1.2.9 version
cd /afs/.atlas.umich.edu/OSGWN
pacman -allow trust-all-caches -get http://software.grid.iu.edu/osg-1.2:wn-client
sed -i 's/\.atlas\.umich\.edu/atlas.umich.edu/g' `grep -RIl "\.atlas\.umich\.edu" *`
### NOTE: for the 10/21/2011 update to OSGWN, the OSGWN volume was
### remounted rw, all files were moved to the directory old_OSGWN,
### and the pacman command was run on an "empty" directory.
###
### The content of the new dccp/bin directory contained ONLY dccp,
### so all the old lsm files were copied from the old_OSGWN tree to
### the new location
ln -s /afs/atlas.umich.edu/OSG_certificates/certificates globus/share/certificates
ln -s /afs/atlas.umich.edu/OSG_certificates/certificates globus/TRUSTED_CA
cd /afs/.atlas.umich.edu/opt/WLCG-client
pacman -get http://www.mwt2.org/caches/osg-1.2:wlcg-client
sed -i 's/\.atlas\.umich\.edu/atlas.umich.edu/g' `grep -RIl "\.atlas\.umich\.edu" *`
ln -s /afs/atlas.umich.edu/OSG_certificates/certificates globus/share/certificates
ln -s /afs/atlas.umich.edu/OSG_certificates/certificates globus/TRUSTED_CA
Check/fix the openssl symlink so it points as shown below (don't do the install on a host with /opt/globus, so that it picks up the right path):
/afs/atlas.umich.edu/OSGWN/globus/bin/openssl -> /usr/bin/openssl
/afs/atlas.umich.edu/WLCG-client/globus/bin/openssl -> /usr/bin/openssl
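If a symlink is wrong, a sketch of the fix (done on the RW .atlas paths, then released as below):
cd /afs/.atlas.umich.edu/OSGWN/globus/bin
rm -f openssl
ln -s /usr/bin/openssl openssl
## repeat for /afs/.atlas.umich.edu/opt/WLCG-client/globus/bin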
Be sure to release the volumes:
vos release opt
vos release OSGWN
Install OSG
Install OSG in /opt on Compute Elements (gate01,gate02):
mkdir /opt/osg-1.2 ; cd /opt/osg-1.2
pacman -allow trust-all-caches -get http://software.grid.iu.edu/osg-1.2:ce
Install managedfork
Install into /opt/osg-1.2:
cd /opt/osg-1.2
pacman -allow trust-all-caches -get http://software.grid.iu.edu/osg-1.2:ManagedFork
These instructions were not performed when upgrading to osg-1.2; I'm not sure whether they are needed or are already part of the upgrade:
source $VDT_LOCATION/setup.sh
$VDT_LOCATION/vdt/setup/configure_globus_gatekeeper --managed-fork y --server y
Install Job Manager for condor
Install in /opt/osg-1.2:
cd /opt/osg-1.2
pacman -allow trust-all-caches -get http://software.grid.iu.edu/osg-1.2:Globus-Condor-Setup
These instructions were not performed when upgrading to osg-1.2; I'm not sure whether they are needed or are already part of the upgrade:
##uncomment this line in the condor.pm
vi $VDT_LOCATION/globus/lib/perl/Globus/GRAM/JobManager/condor.pm
# $requirements .= " && Arch == \"" . $description->condor_arch() . "\" ";
Do post-install
source /opt/osg-1.2/setup.sh
vdt-post-install
gate02 is the machine that updates our AFS certs. It may be necessary to run the setupca command below if you are not upgrading. There is no longer a vdt-questions.sh to run (the reference to it has been removed below).
See the notes in the post-install/README file on CA certificates. Edit the value of cacerts_url in the configuration file at /opt/osg-1.2/vdt/etc/vdt-update-certs.conf:
cacerts_url = http://software.grid.iu.edu/pacman/cadist/ca-certs-version
cd /opt/osg-1.2
source /opt/osg-1.2/setup.sh
vdt-ca-manage setupca --location local --url osg
At AGLT2 -- point the installation at our AFS certificates:
rm /opt/osg-1.2/globus/share/certificates
rm /opt/osg-1.2/globus/TRUSTED_CA
gate02 (updates certificates, RW):
ln -s /afs/atlas.umich.edu/Certificates/certificates /opt/osg-1.2/globus/share/certificates
ln -s /afs/atlas.umich.edu/Certificates/certificates /opt/osg-1.2/globus/TRUSTED_CA
gate01 (RO):
ln -s /afs/atlas.umich.edu/OSG_certificates/certificates /opt/osg-1.2/globus/share/certificates
ln -s /afs/atlas.umich.edu/OSG_certificates/certificates /opt/osg-1.2/globus/TRUSTED_CA
Copy auth files from post-install. The files will have the correct values as long as you set OLD_VDT_LOCATION before the installation.
cp /opt/osg-1.2/post-install/gsi-authz.conf /etc/grid-security/
cp /opt/osg-1.2/post-install/prima-authz.conf /etc/grid-security/
vi /etc/grid-security/prima-authz.conf
logLevel info
Setup config.ini
For an upgrade (be sure you set the env vars before you started), you will first need to run:
source /opt/osg-1.2/setup.sh (if not sourced already)
extract-config
Copy extracted-config.ini to /opt/osg-1.2/osg/etc/config.ini and check it over. Then check that it verifies, and apply the config:
configure-osg -v
configure-osg -c
Modify your sudoers file
Runas_Alias GLOBUSUSERS = ALL, !root
globus ALL=(GLOBUSUSERS) \
NOPASSWD: /opt/osg-1.2/globus/libexec/globus-gridmap-and-execute \
-g /etc/grid-security/grid-mapfile \
/opt/osg-1.2/globus/libexec/globus-job-manager-script.pl *
globus ALL=(GLOBUSUSERS) \
NOPASSWD: /opt/osg-1.2/globus/libexec/globus-gridmap-and-execute \
-g /etc/grid-security/grid-mapfile \
/opt/osg-1.2/globus/libexec/globus-gram-local-proxy-tool *
globus ALL=(GLOBUSUSERS) \
NOPASSWD: \
/opt/osg-1.2/globus/libexec/globus-job-manager-script.pl *
globus ALL=(GLOBUSUSERS) \
NOPASSWD: \
/opt/osg-1.2/globus/libexec/globus-gram-local-proxy-tool *
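After editing, it is worth validating the sudoers syntax before logging out (a suggested safeguard, not in the original procedure):
visudo -c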
Check perms on containercert/key
Make sure that under /etc/grid-security, both containercert.pem and containerkey.pem belong to the user globus:
[gate02:monitoring]# ls -l /etc/grid-security/container*|grep -v old
-r--r--r-- 1 globus osg 1302 Jul 9 11:43 /etc/grid-security/containercert.pem
-r-------- 1 globus osg 887 Jul 9 11:43 /etc/grid-security/containerkey.pem
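If the ownership or modes differ, something like the following restores them (values taken from the listing above):
chown globus:osg /etc/grid-security/containercert.pem /etc/grid-security/containerkey.pem
chmod 444 /etc/grid-security/containercert.pem
chmod 400 /etc/grid-security/containerkey.pem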
Check your services; enable the ones you want with vdt-control --enable
Turn off condor and turn on anything needed. gate02 needs to run the cert and CRL update services. I didn't need to run vdt-register-service in an upgrade. This is for gate02:
vdt-control --enable fetch-crl
vdt-control --enable vdt-update-certs
vdt-control --disable condor-cron
vdt-register-service --name condor-cron --disable
NOTE: gate01 is the opposite: enable condor-cron, disable fetch-crl and vdt-update-certs.
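That is, on gate01 (mirroring the gate02 commands above):
vdt-control --enable condor-cron
vdt-control --disable fetch-crl
vdt-control --disable vdt-update-certs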
Double-check that it's all good. Our two gatekeepers have different requirements. gate02 needs these:
vdt-control --list
[gate02:osg-1.2]# vdt-control --list
Service | Type | Desired State
------------------------+--------+--------------
fetch-crl | cron | enable
vdt-rotate-logs | cron | enable
vdt-update-certs | cron | enable
globus-gatekeeper | inetd | enable
gsiftp | inetd | enable
mysql5 | init | enable
globus-ws | init | enable
gums-host-cron | cron | enable
MLD | init | do not enable
condor-cron | init | do not enable
apache | init | enable
tomcat-55 | init | enable
gratia-condor | cron | enable
edg-mkgridmap | cron | do not enable
Gate01 needs these:
[gate01:afs]# vdt-control --list
Service | Type | Desired State
------------------------+--------+--------------
fetch-crl | cron | do not enable
vdt-rotate-logs | cron | enable
vdt-update-certs | cron | do not enable
globus-gatekeeper | inetd | enable
gsiftp | inetd | enable
mysql5 | init | enable
globus-ws | init | do not enable
gums-host-cron | cron | enable
MLD | init | enable
condor-cron | init | enable
apache | init | enable
tomcat-55 | init | enable
gratia-condor | cron | enable
edg-mkgridmap | cron | do not enable
osg-rsv | init | enable
Make sure mysql is started up before globus-ws
This was not necessary in the upgrade to osg-1.2; it appears to be fixed in the distribution, and services started in the correct order without the modifications below. The init files from the dist setup put mysql at order 90 and tomcat-55, apache, and globus-ws at order 99. The init file is now named mysql5.
sed '/^# chkconfig:/c # chkconfig: 345 97 09' --in-place=.ORI /etc/rc.d/init.d/mysql
sed '/^# chkconfig:/c # chkconfig: 345 98 04' --in-place=.ORI /etc/rc.d/init.d/globus-ws
chkconfig mysql reset
chkconfig globus-ws reset
Start the services
vdt-control --on
Modify the crontab for root on gate02 (vdt-control should put these in, but you will need to adjust the timing)
This applies to gate02 only.
- fetch-crl.cron should run at 8 minutes after every hour
- vdt-update-certs-wrapper should run at 12 minutes after every hour
8 * * * * /opt/osg-1.2/fetch-crl/share/doc/fetch-crl-2.6.6/fetch-crl.cron
12 * * * * /opt/osg-1.2/vdt/sbin/vdt-update-certs-wrapper --vdt-install /opt/osg-1.2 --called-from-cron
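To confirm the timing took effect (a suggested check):
crontab -l | grep -e fetch-crl -e vdt-update-certs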
Make a symlink for OSG104 RSV probes from gate01
The RSV probes won't find this binary unless you make the symlink below:
ln -s /opt/osg-1.2/osg/bin/osg-version /opt/osg-1.2/osg-version
Update various other scripts
I did not do 2) when updating to OSG 1.2.0.
1) Following
directions here, add this on a one-time-only basis to /etc/security/limits.conf:
globus hard nofile 16384
2) Still following those directions, add to GLOBUS_OPTIONS in /opt/OSG104/setup.sh:
-Dorg.globus.wsrf.container.persistence.dir=/home/GRAM4_metadata
This directory is created with 777 permissions.
3) Bring these startup scripts into line:
sed -i s/OSG104/osg-1.2/g /etc/init.d/gsisshd
sed -i s/OSG104/osg-1.2/g /etc/init.d/gsi_sshd
sed -i s/OSG104/osg-1.2/g /etc/syslog-ng/syslog-ng.conf
Note that for the first two, the file /etc/sysconfig/vdt.conf is now defined; it specifies the location of the VDT, like so:
export VDT_CURRENT=/opt/osg-1.2
The gsisshd and gsi_sshd startups now source this file, and then branch accordingly.
syslog-ng.conf cannot do this, and so must be modified by hand.
gate01 now employs the same setup.
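A minimal sketch of the pattern those init scripts use (the variable name comes from /etc/sysconfig/vdt.conf above; the actual script contents may differ):
## at the top of /etc/init.d/gsisshd (sketch)
VDT_CURRENT=/opt/osg-1.2                                      ## fallback if vdt.conf is absent
[ -f /etc/sysconfig/vdt.conf ] && . /etc/sysconfig/vdt.conf   ## overrides the fallback
. $VDT_CURRENT/setup.sh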
Verify the site
Do these as a normal user with your grid cert.
source /opt/osg-1.2/setup.sh
grid-proxy-init
cd /opt/osg-1.2/verify
./site_verify.pl
Some commands to verify the services:
grid-proxy-init
##verify managedfork
time globus-job-run gate02.grid.umich.edu:2119/jobmanager-managedfork /bin/hostname
##verify jobmanager-condor
time globus-job-run gate02.grid.umich.edu:2119/jobmanager-condor /bin/hostname
##verify globus-ws
globusrun-ws -submit -F gate01.aglt2.org:9443 -S -s -c /bin/bash -c 'export CONDOR_CONFIG=/opt/condor/etc/condor_config; condor_q'
Example of setting up RSV Probes
vdt-control --off osg-rsv
perl osg-rsv/bin/misc/cleanup-rsv.pl --reset
./osg-rsv/setup/configure_osg_rsv --user rsvuser --init --server y --ce-probes \
--ce-uri "gate01.aglt2.org gate02.grid.umich.edu" --srm-probes --srm-uri "head01.aglt2.org" \
--srm-dir /pnfs/aglt2.org/dq2 --srm-webservice-path "srm/managerv2" --gratia --grid-type "OSG" \
--consumers --verbose --setup-for-apache --proxy /tmp/x509up_u55625
vdt-control --on osg-rsv
Upgrade OSG 1.2.6 to OSG 1.2.15
This upgrade was performed on November 2, 2010, and went very smoothly. Instructions were followed from
this URL. This particular URL is linked from
this master URL.
Pre-upgrade steps
# Save some files:
cd /root
mkdir osg1.2.15_up
crontab -l > osg1.2.15_up/crontab_l
vdt-control --list > osg1.2.15_up/vdt-control-list.txt
cp -p /opt/osg-1.2.6/osg/etc/config.ini osg1.2.15_up/
#
# Check some links so we can ensure they are correctly set at the end
[gate02:~]# ll /opt/osg/globus|grep TRUST
lrwxrwxrwx 1 root root 30 Apr 30 17:26 TRUSTED_CA -> /opt/certificates/certificates
[gate02:~]# ll /opt/osg/globus/share|grep cert
lrwxrwxrwx 1 root root 30 Apr 30 17:27 certificates -> /opt/certificates/certificates
#
[gate01:~]# ll /opt/osg/globus|grep TRUST
lrwxrwxrwx 1 root 50 Sep 2 12:32 TRUSTED_CA -> /afs/atlas.umich.edu/OSG_certificates/certificates/
[gate01:~]# ll /opt/osg/globus/share|grep cert
lrwxrwxrwx 1 root 50 Sep 2 12:32 certificates -> /afs/atlas.umich.edu/OSG_certificates/certificates/
#
# Make sure that condor is cleaned. Auto-pilots were previously stopped as this is
# a scheduled outage.
condor_q -constr 'jobstatus==1'|grep " I "|awk '{print $1}'|xargs -n 1 condor_hold
condor_q -constr 'jobstatus==2'|grep " R "|awk '{print $1}'|xargs -n 1 condor_rm
service condor stop
export VDTSETUP_CONDOR_LOCATION=/opt/condor
export VDTSETUP_CONDOR_CONFIG=/opt/condor/etc/condor_config
Actual upgrade steps
This is a summary of the steps explained in the URL above.
cd VDT_LOCATION
source setup.sh
vdt-control --off
cp -a $VDT_LOCATION BACKUP_LOCATION
# Get the latest version of the vdt-updater script:
pacman -update VDT-Updater
# Note: If you do not yet have the updater script (look for $VDT_LOCATION/vdt/update/vdt-updater),
# then fetch it with this command:
pacman -get http://vdt.cs.wisc.edu/vdt_200_cache:VDT-Updater
cp -a $VDT_LOCATION NEW_BACKUP_LOCATION
vdt/update/vdt-updater
cp osg/etc/config.ini /tmp/config.ini-backup
pacman -update osg-version
pacman -update osg-config
cp /tmp/config.ini-backup osg/etc/config.ini
# After updating, re-source the setup.sh file to load any changes in the environment:
source setup.sh
vdt-post-install
# On a CE, you will also need to reconfigure your system:
configure-osg -v
configure-osg -c
# Get rid of the gratia probes for gate02 running from gate01
cd /opt/osg/osg-rsv/submissions/probes
mv gate02*gratia* /root/osg1.2.15_up
# Note that the srmcp-srm-probe is also different, having been modified to use
# a dCache token-controlled area. Compare to /root/srmcp-srm-probe
# Directory is /opt/osg/osg-rsv/bin/probes
vdt-control --on
service condor start
Upgrade OSG 1.2.15 to OSG 1.2.16
A smooth upgrade. Also added Rack 110 and 119 workers, and bl-5 workers, as sub-clusters 7-9.
This was a small step in versions. Instructions were therefore followed from
this URL instead of the path followed for the 1.2.15 upgrade.
Upgrade OSG 1.2.16 to OSG 1.2.19
A smooth upgrade following the directions, with two complications and one change.
- print_local_time = TRUE (or anything) is no longer supported for rsv times in config.ini
- The max value of SI00 is 5000, whereas we had 6700 for the sub-cluster where it was needed, so that is now reset to 5000
- org.osg.gratia.condor and org.osg.gratia.metric probes were disabled for rsv on gate02. This is made possible by the new rsv-control command documented here.
- rsv-control --disable --host gate02.grid.umich.edu org.osg.gratia.condor org.osg.gratia.metric
- This was followed by a gate01 reboot that actually turned off these probes.
Upgrade OSG 1.2.19 to OSG 1.2.23
Pre-upgrade note:
Directions
here look straightforward. However, condor.pm must be modified, as I understand it has changed in this release.
Post-upgrade note:
Modified condor.pm to not invoke the new condor_account_groups.pm. This was the only real change to condor.pm in this update, on both gate01 and gate02.
gate02 updated smoothly and by the book
gate01 updated with one modification to the procedure. Before the last step, "vdt-control --on", a check of the rsv probes showed that the same two probes as in the 1.2.19 update were once again enabled. Disabled them.
- rsv-control --disable --host gate02.grid.umich.edu org.osg.gratia.condor org.osg.gratia.metric
Upgrade OSG 1.2.23 to OSG 1.2.24
The upgrade went smoothly on both gatekeepers.
On gate01 the rsv metrics were again disabled. In addition, the global timeout was changed from 1200 to 720 seconds, and the srmcp-readwrite condor-cron interval was changed from "28 *" to "13,28,43,58 *". The following two files were edited to achieve this.
- /opt/osg/osg-rsv/etc/rsv.conf (timeout)
- /opt/osg/osg-rsv/meta/metrics/org.osg.srm.srmcp-readwrite.meta (periodicity)
Both gate01 and gate02 were rebooted following the updates. rsv probes that failed during the downtime were run and the report was fully green.
Upgrade OSG 1.2.24 to OSG 1.2.25
Upgraded only gate02, following the directions. The total outage was approximately 20 minutes.
On gate01, perform only the gratia fix outlined at
https://ticket.grid.iu.edu/goc/viewer?id=11248
Upgrade to OSG 1.2.28
Upgraded following the directions. No changes in condor.pm or in config.ini.
The srmcp-readwrite rsv probe required a second change that perhaps should have been there all along. The change is shown in this output from the diff command:
[gate01:probes]# diff srmcp-srm-probe srmcp-srm-probe.orig
103c103
< my $srmcp_cmd = "$o{'srmcpCmd'} -space_token=5904816 -streams_num=1 -srm_protocol_version=".
---
> my $srmcp_cmd = "$o{'srmcpCmd'} -streams_num=1 -srm_protocol_version=".
109c109
< $srmcp_cmd = "$o{'srmcpCmd'} -space_token=5904816 -streams_num=1 -srm_protocol_version=".
---
> $srmcp_cmd = "$o{'srmcpCmd'} -streams_num=1 -srm_protocol_version=".
The metric interval changes made in the upgrade to version 1.2.24 were retained in this update and did not require re-implementation.
The rsv probe disable for gate02 made in the upgrade to version 1.2.23 was again performed.
Upgrade OSG 3.3
The HTCondor repo is installed, but not active, on aglbatch. From there we can see that the URL for the repo is
http://research.cs.wisc.edu/htcondor/yum/stable/rhel6/ so browse there and download the needed rpms:
condor-8.4.11-1.el6.x86_64.rpm
condor-classads-8.4.11-1.el6.x86_64.rpm
condor-cream-gahp-8.4.11-1.el6.x86_64.rpm
condor-external-libs-8.4.11-1.el6.x86_64.rpm
condor-procd-8.4.11-1.el6.x86_64.rpm
condor-python-8.4.11-1.el6.x86_64.rpm
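For example, to fetch them all in one go (a sketch, assuming the rpms sit directly under that URL):
base=http://research.cs.wisc.edu/htcondor/yum/stable/rhel6
for rpm in condor-8.4.11-1.el6.x86_64.rpm condor-classads-8.4.11-1.el6.x86_64.rpm \
    condor-cream-gahp-8.4.11-1.el6.x86_64.rpm condor-external-libs-8.4.11-1.el6.x86_64.rpm \
    condor-procd-8.4.11-1.el6.x86_64.rpm condor-python-8.4.11-1.el6.x86_64.rpm; do
    wget $base/$rpm
done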
The test case is on gate03, but all gatekeepers are treated identically following confirmation of success on gate03.
# Stop cfengine
service cfengine3 stop
# Stop Condor without terminating the shadow/WN processing
condor_off -fast
# Update condor
cd /atlas/data08/ball/admin/condor_rpms_8.4.11
yum localupdate condor-8.4.11-1.el6.x86_64.rpm condor-classads-8.4.11-1.el6.x86_64.rpm \
condor-external-libs-8.4.11-1.el6.x86_64.rpm condor-procd-8.4.11-1.el6.x86_64.rpm \
condor-python-8.4.11-1.el6.x86_64.rpm condor-cream-gahp-8.4.11-1.el6.x86_64.rpm
# Update osg
yum --enablerepo=osg update
# Now, here, watch the yum output for .rpmnew files, check each one thoroughly to understand
# it, and make any needed cf3 changes to config files. When all is happy....
# Run cf-agent to re-establish anything needing it
cf-agent -Kf failsafe.cf; cf-agent -K
# Verify the osg configuration is clean....
osg-configure -v
# Then apply it
osg-configure -c
# And then reboot.
reboot
Clean install of OSG 3.3
gate02 choked. So a new VM was cloned from the old one, Cobbler-built, and then the following steps were taken to do a full osg 3.3 install. This resulted in OSG 3.3.23. For now, this is just a "history" dump. It left condor and condor-ce stopped.
48 yum install yum-plugin-priorities
50 rpm -Uvh https://repo.grid.iu.edu/osg/3.3/osg-3.3-el6-release-latest.rpm
51 yum --enablerepo=osg-empty install empty-ca-certs
54 yum --enablerepo=osg install condor
56 yum --enablerepo=osg install osg-ce-condor
60 mkdir /root/saves
61 cp -ar /etc/condor/config.d /root/saves/condor_config.d
64 cp -ar /etc/condor-ce/config.d /root/saves/condor-ce_config.d
65 cp -ar /etc/osg/config.d /root/saves/osg_config.d
67 yum --enablerepo=osg install rsv
68 cf-agent -Kf failsafe.cf; cf-agent -K
69 service cfengine3 stop
75 cf-agent -Kf failsafe.cf; cf-agent -K
82 service autofs start
84 osg-configure -v
87 reboot
88 exit
90 yum install ruby
91 yum install rubygems
92 yum install rubygem-json.x86_64 rubygem-pg
93 yum install rubygem-activesupport.noarch
94 gem install activerecord -v 2.3.18
95 gem list
Manually edit a gums server into /etc/lcmaps.db
[root@gate02 osg]# chkconfig gums-client-cron on
[root@gate02 osg]# service gums-client-cron start
Enabling periodic gums-host-cron: [ OK ]
# Run it once manually
[root@gate02 osg]# [[ ! -f /var/lock/subsys/gums-host-cron ]] || /usr/bin/gums-host-cron
yum install tomcat6
chkconfig tomcat6 on
Run cf-agent
Check that the http certs are owned by tomcat. They were not, so:
chown tomcat.tomcat /etc/grid-security/http/*.pem
service tomcat6 start
chkconfig --add gratia-probes-cron
chkconfig gratia-probes-cron on
service gratia-probes-cron start
Clean install of OSG 3.4
In December 2017, an SL7.4 gatekeeper, gate01.aglt2.org, was built from Cobbler, utilizing a full install from cfengine3, including all OSG repos. It was set to use OSG 3.4 via the resolved link from the generic rpm osg-3.4-el7-release-latest.rpm to osg-release-3.4-2.osg34.el7. So: a Cobbler run to build the gatekeeper, then cfengine3 runs, multiple times, until no errors are returned, to configure the gatekeeper.
Currently (12/21/2017) this is running with only a small sub-cluster in 30-gip.ini (/etc/osg/config.d) and has about 40 WN slots backing it. The OIM Resource AGLT2_PROD is defined, but AGIS is broken, and so no PandaQueue can be cloned until January.
Following multiple initial cf-agent runs, the directions
at this OSG URL were followed to get the machine going. This seems to boil down to "yum install osg-ce-condor". Upon install, condor-cron is enabled to run, but none of rsv, condor-ce, or condor were enabled. We found the following manual actions were required to get this gatekeeper going:
- systemctl enable rsv
- systemctl enable condor-ce
- systemctl enable condor
- osg-configure -v
- osg-configure -c
- systemctl start rsv
- systemctl start condor
- systemctl start condor-ce
- rsv-control --run --all-enabled
- chkconfig --add gratia-probes-cron
- chkconfig gratia-probes-cron on
- service gratia-probes-cron start
Various submit tests were successfully performed.
Notes on installing osg 3.4 on gate02, April, 2018
The repo would not install from cfe, and was manually installed.
AFTER-THE-FACT NOTE: CFENGINE WAS NOT PROPERLY CONFIGURED TO SAY THAT GATE02 WAS SL7, HENCE THIS ERROR IN SETTING UP THE REPO.
Had to manually "yum install osg-version"
The rsv service does not need to run, but we see that the "org.osg.general.vo-supported" probe is now deprecated (on both gate01 and gate02).
Remember to dump from the old gate02, and re-copy to the new gate02, the condor job plot files from /var/www/html/Monitoring (see the copy sketch after this list):
- count_MP8_logs directory
- count_logs directory
- MPmon_count.log
- count_jobs.log
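For example (old-gate02 is a hypothetical hostname for the retired VM; run this from the new gate02):
scp -rp old-gate02:/var/www/html/Monitoring/{count_MP8_logs,count_logs,MPmon_count.log,count_jobs.log} /var/www/html/Monitoring/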
From /opt/condor/scan, compile the .c program and restore scanHistoryTime.
The ruby install documented
here does not work; ruby crashes.
REMEMBER: run the rsv user cron once, manually, to generate a grid proxy.
Manually mount /pnfs after cfe adds it to /etc/fstab
CFEngine issues on initial gatekeeper build
The lines of policy in osg_ce_condor.cf where the cron intervals are modified for 2 probes do not work with the out-of-the-box rpms, because they do not take into account that the intervals are commented out. This should be fixed at some point.
--
WenjingWu - 09 Jul 2008