Tier2 Services at UM

Services for Tier2 job submission and remote monitoring are distributed across several physical machines at UM. Below is a breakdown of these machines, the services each one hosts, and some information on debugging those services.

Contributing Machines

Web Access

Web access (http) to AGLT2 services is via linat01.grid.umich.edu, an SLC3.08 system. This portion would not build/work under SLC4, hence the split. The web server is not a system service; it is started from a cron entry in the dq2 user account.

su - dq2; crontab -l
X509_USER_PROXY=/tmp/x509up_u55621
DDM_HOME=/opt/dq2
# HTTPD
*/10 * * * * [ -e "$HOME/.profile" ] && . $HOME/.profile; source $DDM_HOME/config/AGLT2/environment.sh; source $DDM_HOME/server_env.sh; $DDM_HOME/httpd/bin/apachectl -k start > /dev/null 2>&1
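
A quick way to confirm that the cron-started httpd is actually up (a minimal sketch; the process pattern is assumed from the apachectl path in the crontab above):

# check for dq2-owned Apache processes under /opt/dq2
pgrep -u dq2 -f /opt/dq2/httpd || echo "DQ2 httpd is not running"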

The ATLAS scripts (eg, free disk space) accessed by the web server on linat01 are located at

/opt/dq2/httpd/htdocs/dq2

The LRC

The LRC (Local Replica Catalog) is based on umfs02.grid.umich.edu. Loss of this mysql database is not a disaster of biblical proportions, but neither is it something any one of us wants to go through. mysql runs here as a system service.

Samples of some interesting commands
mysql -u root -p localreplicas
(requires pw)
show databases;
show tables;
describe t_lfn;
select * from t_pfn limit 10;
select * from mysql.user;
quit;
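
Given the pain of losing this catalog, a periodic dump is cheap insurance. A minimal sketch (the dump location and schedule are assumptions, not current practice):

# dump of the LRC database on umfs02; destination path is only an example
mysqldump -u root -p localreplicas > /root/backups/localreplicas-$(date +%F).sql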

An LRC error

On Sep 25 there was an apparent LRC crash. The crashed MyISAM tables were repaired, the mysql service restarted, and the problem apparently cleared. Following are the error messages and the solution.
#  The errors:
On dq2.aglt2.org: /var/log/dq2/dq2.log shows:

2007-09-25 08:19:39,962 - INFO - Exception occurred LRCFileReplicaCatalog exception [Failed querying catalog [(145, "Table './localreplicas/t_pfn' is marked as crashed and should be repaired")]]
(the same message is repeated many times)


#  The solution:
service mysql stop
cd /var/lib/mysql/localreplicas/
myisamchk -r *.MYD
myisamchk -r *.MYI
service mysql start
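
If stopping mysql is undesirable, the same repair can usually be done on the running server. This is an alternative sketch, not what was done at the time:

mysqlcheck -u root -p --repair localreplicas
# or, from the mysql client on the localreplicas database:
#   REPAIR TABLE t_pfn;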

File Transfer Services (DQ2)

The DQ2 file transfer services are located on dq2.aglt2.org. Services should be started at boot time (but often are not). For example, the fetcher must be checked every day to ensure it is running (psgrep fetcher).
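
A plain-ps equivalent of that daily check, in case the psgrep helper is not at hand (a sketch, assuming the fetcher runs under the dq2 account):

ps -u dq2 -f | grep "[f]etcher" || echo "fetcher is not running"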

The DQ2 activity log is maintained at /var/log/dq2/dq2.log. All entries are also logged remotely by the central USATLAS syslog-ng server. This service initially conflicted with our own local syslog-ng, so the local service was renamed to syslog-ng.local.

All services are started and maintained under the dq2 account. The current DQ2 version (7/2/2007) is 0.3. Services are monitored, started, and stopped using the appropriate dashboard commands, eg,
[dq2:dq2]$ su - dq2
[dq2:dq2]$ source /opt/dq2/setup.sh
[dq2:dq2]$ dashb-agent-list
SERVICE GROUP             STATUS     SERVICES                 
dq2agents                 FINISHED   'agents',                
dq2udpserver              FINISHED   'udpserver',             
dq2fetcher                FINISHED   'fetcher',               

These non-running services leave behind lock files in the /tmp directory, all of whose names begin with ".s.dashboard", eg,
[dq2:dq2]$ dashb-agent-stop dq2agents
dashboard.common.InternalException: Failed to terminate process 21840 (it did not exist). You might want to remove lock file by hand: /tmp/.s.dashboard.dq2agents.lock

These lock files can be removed by hand in order to get the processes restarted correctly.
The services start in reverse alphabetical order, and should be stopped in alphabetical order; a complete restart sequence is sketched below.

[dq2:dq2]$ dashb-agent-start dq2udpserver
.STARTED
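
Putting the above together, a full restart of all three agent groups might look like the following (a sketch; the stale-lock removal is only needed when dashb-agent-stop complains as shown above):

# stop in alphabetical order
dashb-agent-stop dq2agents
dashb-agent-stop dq2fetcher
dashb-agent-stop dq2udpserver
# clear any stale lock files left by dead processes
rm -f /tmp/.s.dashboard.*.lock
# start in reverse alphabetical order
dashb-agent-start dq2udpserver
dashb-agent-start dq2fetcher
dashb-agent-start dq2agents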

On August 31, 2007, the dq2 startup service was implemented to handle these agents. See "service dq2 help" for details. Thanks to Charles Waldman at U. Chicago for the original script source.
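
Assuming the init script supports the usual verbs in addition to help (an assumption; consult "service dq2 help" first), day-to-day use would look like:

service dq2 status
service dq2 restart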

There is a mysql DB in which active file transfers are recorded and maintained. When a transfer completes, the LRC DB on umfs02.grid.umich.edu is updated appropriately and the corresponding information is removed from this volatile dq2 DB. Should the volatile DB ever become corrupted, it can simply be recreated with the command "dq2site-recreate-database" and will re-populate itself. Some sample DB access:
mysql -u root -p
(enter the pw)
mysql> use _dq2;
mysql> show tables;
mysql> select count(*), state, site_id, src_site_id from file group by state, src_site_id, site_id;
       (Transfer jobs stuck in HOLD state will show with status 14 [hold_max_attempts_reached] here)
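
To look only at the transfers stuck in HOLD, a narrower form of the same query can be used (a sketch built from the table and columns shown above):

mysql> select count(*), site_id, src_site_id from file where state = 14 group by src_site_id, site_id;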

For Proxy Credential renewal directions, see here.

syslog-ng

The syslog-ng service handles log reporting to the central server at the University of Chicago. To restart it:

service syslog-ng restart
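
Since the local instance was renamed to syslog-ng.local (see the DQ2 section above), both can be checked; this assumes the init scripts support the status verb:

service syslog-ng status
service syslog-ng.local status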

Panda Job Submission

Local jobs are queued using Condor, currently (7/2/2007) at installed version 6.8.5. The gatekeeper for this is gate01.aglt2.org. All pilots running here are submitted locally as user usatlas1, and a cron job checks at 15-minute intervals that the pilots are still running and submitting properly.

On September 17, 2007, a second cron job was added, running local pilots for the ANALY_AGLT2 Panda site. Its crontab listing is also shown below.

The actual pilot scripts are in the directory ~usatlas1/data/pilot_local/submit, named pilot_local_submit.py and anal_local_submit.py, respectively.

[gate01:~]# su - usatlas1
[gate01] /atlas/data08/OSG/HOME/usatlas1 > crontab -l
*/15 * * * * /bin/bash /atlas/data08/OSG/HOME/usatlas1/check_pilot.sh
5,20,35,50 * * * * /bin/bash /atlas/data08/OSG/HOME/usatlas1/check_anal_pilot.sh
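
To verify by hand that the submitters these cron jobs should be keeping alive are actually running (a sketch, using the script names given above):

ps -u usatlas1 -f | grep -e "[p]ilot_local_submit" -e "[a]nal_local_submit"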

Typically new pilots are sent as a tar file via email to Shawn McKee (smckee@umich.edu), but it is also possible to check them out of the Panda repository using CVS (directions for doing this are not documented here). As user usatlas1, do the following:
cd data/pilot_local/submit
# place the pilot tar file here, eg, pilot2-SPOCK26.tgz
cd panda
mv pilot2 pilot2.sep21           # rename the old pilot2 directory, eg, by date
tar zxf ../pilot2-SPOCK26.tgz    # unpack the tar file here
cp -arvp pilot2/* ../../submit
cd ../../submit

###  Now, do this only if a new local submit pilot was distributed
###  This is in file pilot_local_submit.py
###  Modifications are then made to the submission intervals
[submit]# diff old new
9,11c9,11
< jobsub_interval=300 # seconds # every 5 minutes
< jobqueue_limit=100 # original 100
< jobsub_bunch=60 # original 60
---
> jobsub_interval=60 # seconds # every 5 minutes
> jobqueue_limit=160 # original 100
> jobsub_bunch=12 # original 60

Similar changes, plus many more, were made for the analysis pilots in the file anal_local_submit.py.

Pilot logs are kept in ~usatlas1/data/pilot_local/submit, named with a process id and date, eg:
 local_condor_submit_7698_2007-6-19-9.stderr
 local_condor_submit_7698_2007-6-19-9.stdout
 
 local_anal_condor_submit_26368_2007-9-17-14.stderr
 local_anal_condor_submit_26368_2007-9-17-14.stdout

Note on condor_submit: To effectively use the Analysis queue with local pilots, the $CONDOR_BIN/condor_submit command was RENAMED to "real_condor_submit" and the wrapper script below was substituted in its place to insert the correct directives into all submitted jobs.
#!/bin/bash
# Choose an AccountingGroup based on the submitting user
case $USER in
        usatlas1 | usatlas2 | usatlas3 | usatlas4 | osg | ivdgl )
           gpadd=group_gatekpr;;
        ligo )
           gpadd=group_ligo;;
        * )
           gpadd=group_generic;;
esac

# Looking for value localQue in the submission file(s)
# If it is not set, then a default value is appended below

isQueHere=0
for arg in "$@"
do
  if [ -f "$arg" ]
  then
      foundQue=$(grep -c localQue "$arg")
      if [ "$foundQue" -ne 0 ]
      then
          isQueHere=1
      fi
  fi
done
#
if [ $isQueHere -eq 1 ]
then
    /opt/condor/bin/real_condor_submit "$@" -append "+AccountingGroup = \"${gpadd}\""
else
    /opt/condor/bin/real_condor_submit "$@" -append "+AccountingGroup = \"${gpadd}\"" -append "+localQue = \"Default\""
fi
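
One way to confirm the wrapper is doing its job is to submit a test job and inspect the resulting ClassAd (a sketch; the submit file name and job id are placeholders):

condor_submit test.sub     # goes through the wrapper above
condor_q -l <job_id> | grep -e AccountingGroup -e localQue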

MonALISA monitoring

The MonALISA daemon (MLD) runs as a service on gate01.aglt2.org as well. If AGLT2 stops showing up on the MonALISA monitoring pages, simply restart the service:
service MLD restart

-- BobBall - 02 Jul 2007