Restarting the MSU OSG Grid

How to restart the system after an outage.

Bring Up and Check Services

Cluster Services

General cluster services are required, for instance Kerberos, YP, NFS, AFS. The exact list depends somewhat on what is going on, but the standard cluster services are required and are not covered here.

MSU2

msu2.aglt2.org is the dCache admin node and runs the pnfs server. Boot it first if it is down.

root@msu2 ~# service pnfs start
Starting pnfs services (PostgreSQL version): 
 Shmcom : Installed 8 Clients and 8 Servers
 Starting database server for admin (/opt/pnfsdb/pnfs/databases/admin) ... O.K.
 Starting database server for data1 (/opt/pnfsdb/pnfs/databases/data1) ... O.K.
 Starting database server for test (/opt/pnfsdb/pnfs/databases/test) ... O.K.
 Starting database server for dzero-cache (/opt/pnfsdb/pnfs/databases/dzero-cache) ... O.K.
 Waiting for dbservers to register ... Ready
 Starting Mountd : pmountd 
 Starting nfsd : pnfsd 

postgresql and pnfs services should start automatically at boot. /pnfs/fs should be mounted:

root@msu2 ~# df /pnfs/fs
Filesystem           1K-blocks      Used Available Use% Mounted on
localhost:/fs           400000     80000    284000  22% /pnfs/fs

root@msu2 ~# mount | grep pnfs
localhost:/fs on /pnfs/fs type nfs (rw,udp,intr,noac,hard,nfsvers=2,addr=127.0.0.1)
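The mount check above can be wrapped into a quick one-shot test. This is a minimal sketch (not a site script); run it on msu2 after booting:

```shell
# Sanity check that /pnfs/fs came back after boot; if not, the pnfs
# init script (shown above) needs to be run
if mount | grep -q '/pnfs/fs'; then
    echo "/pnfs/fs is mounted"
else
    echo "/pnfs/fs is NOT mounted - try: service pnfs start"
fi
```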

Start the dCache services. Note that the replica manager is not currently in use; you will see a message about it not starting:

Note (Nov 2008): use "service dcache start".

root@msu2 ~# /opt/d-cache/bin/dcache-core start
Starting dcache services: 
Starting lmDomain  Done (pid=7810)
Starting dCacheDomain  Done (pid=7860)
Starting pnfsDomain  Done (pid=7910)
Starting dirDomain  Done (pid=7960)
Starting adminDoorDomain  Done (pid=8025)
Starting httpdDomain  Done (pid=8080)
Starting utilityDomain  Done (pid=8143)
Starting gPlazma-msu2Domain  Done (pid=8204)
Starting infoProviderDomain  Done (pid=8262)
Batch file doesn't exist : /opt/d-cache/config/replica.batch, can't continue ...
***TDR***   in dcache-srm start/stop script doing start
Using CATALINA_BASE:   /opt/d-cache/libexec/apache-tomcat-5.5.20
Using CATALINA_HOME:   /opt/d-cache/libexec/apache-tomcat-5.5.20
Using CATALINA_TMPDIR: /opt/d-cache/libexec/apache-tomcat-5.5.20/temp
Using JRE_HOME:       /opt/d-cache/jdk1.6.0_03

Pinging srm server to wake it up, will take few seconds ...
Done

MSU4

msu4.aglt2.org provides a RAID pool area. Boot the system if it is down, then start the dcache-core and dcache-pool services:

Note: If msu4 reports that the pnfs service is not available but it is running on msu2, suspect firewall issues.
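One way to distinguish a firewall problem from a pnfs problem is to check whether msu4 can see the pnfs RPC daemons at all. The pmountd/pnfsd daemons register with the portmapper on msu2, so rpcinfo should list a mountd/nfs service from msu4's side. This is a hedged sketch, assuming rpcinfo is installed:

```shell
# Run on msu4: query the portmapper on msu2 for the pnfs NFS daemons.
# No output from rpcinfo usually means the portmapper port is blocked.
HOST=msu2.aglt2.org
if rpcinfo -p "$HOST" 2>/dev/null | grep -qE 'mountd|nfs'; then
    echo "pnfs RPC services visible on $HOST"
else
    echo "cannot see pnfs RPC services on $HOST - check iptables on both hosts"
fi
```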

The dcache services mount /pnfs/msu-t3.aglt2.org/ (this mount is not done in /etc/fstab). If they have been stopped, you can also drop the mount for a fresh start.

Note (Nov 2008): use "service dcache start".

root@msu4 ~# /opt/dcache/bin/dcache-core start
/pnfs/msu-t3.aglt2.org/ not mounted - going to mount it now ... 
Starting dcache services: 
Starting dcap-msu4Domain  Done (pid=561037)
Starting gridftp-msu4Domain  Done (pid=561104)
Starting gsidcap-msu4Domain  Done (pid=561168)

root@msu4 ~# /opt/dcache/bin/dcache-pool start
start dcache pool: Starting msu4Domain  Done (pid=561339)

The pool should be mounted and available:

root@msu4 ~# df -h | grep pool
/dev/sdb              4.1T  575G  3.6T  14% /dpool/pool1

root@msu4 ~# time ls -l /pnfs/msu-t3.aglt2.org/dzero/cache/upload/

...

real    0m34.674s
user   0m0.130s
sys    0m0.877s

You can look at /var/log/dcache/msu4Domain.log to see that the pool is healthy.
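A quick scan of the tail of that log catches most startup problems. A minimal sketch (the log path is from this page; the error keywords are a guess at typical dCache log noise):

```shell
# Look for recent trouble in the msu4 pool domain log
LOG=/var/log/dcache/msu4Domain.log
if [ -r "$LOG" ]; then
    tail -n 200 "$LOG" | grep -iE 'error|exception|fail' || echo "no recent errors in $LOG"
else
    echo "$LOG not readable - is the pool domain running on this host?"
fi
```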

msu-osg

msu-osg.aglt2.org provides the frontend services for the compute element, e.g. it is the grid gatekeeper. It runs as a VMWare server client on msu1.aglt2.org. If it needs to be booted, log in to msu1 and see the script /root/start-vmware-client.sh, which starts the clients from the command line in a way that releases them from the terminal (you can start them, log out, and they continue to run). You can also start them from the VMWare GUI console (the vmware command). You may need to start the vmware services first (service vmware start).

The running client processes are called vmware-vmx:

[root@msu1 ~]# ps auxw | grep vmware-vmx
root     30590 77.3  1.2 328976 198644 ?     S<sl 16:15   0:36 /usr/lib/vmware/bin/vmware-vmx -C /vmware-disks/MSUROX/MSUROX.vmx -@ ""
root     30609 92.3  0.8 272020 144600 ?     S<sl 16:15   0:34 /usr/lib/vmware/bin/vmware-vmx -C /vmware-disks/MSU-OSG/MSU-OSG.vmx -@ ""

Condor

Log in to msu-osg and check that the condor server is running (this is just the three slots on the frontend; when the workers are up, you will see them as well):

[root@msu-osg ~]# condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@msu-osg.aglt LINUX      X86_64 Unclaimed Idle     0.920   654  0+00:00:04
slot2@msu-osg.aglt LINUX      X86_64 Unclaimed Idle     0.000   654  0+00:00:05
slot3@msu-osg.aglt LINUX      X86_64 Unclaimed Idle     0.000   654  0+00:00:06

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     3     0       0         3       0          0        0

               Total     3     0       0         3       0          0        0
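If condor_status produces no output or hangs, the daemons likely did not come back after the reboot. A hedged sketch of the check (the "service condor start" fallback assumes the stock Condor init script is installed on msu-osg):

```shell
# condor_status exits nonzero when the collector is unreachable,
# which usually means the condor daemons are not running
if condor_status >/dev/null 2>&1; then
    echo "condor collector responding"
else
    echo "condor not responding - try: service condor start"
fi
```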

Grid services

The globus grid services do not need to be started; the gatekeeper is run from xinetd. You can check the log at /msu/opt/osg/globus/var/globus-gatekeeper.log for connections.
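To watch for incoming connections after the restart, tail that log. A minimal sketch (the log path is from this page; the fallback message assumes the log lives on the /msu mount):

```shell
# Show the most recent gatekeeper connections
LOG=/msu/opt/osg/globus/var/globus-gatekeeper.log
if [ -r "$LOG" ]; then
    tail -n 20 "$LOG"
else
    echo "$LOG not readable - check that /msu is mounted"
fi
```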

Compute Nodes

The nodes dc2-102-22 to dc2-102-42 (20 nodes) are Dell DZero compute nodes currently in ROCKS. They can be reinstalled as needed; ROCKS is set up to give them their proper condor config and to put them into the ganglia "MSU OSG" cluster for monitoring.

-- TomRockwell - 17 Jun 2008
Topic revision: r12 - 19 May 2009, TomRockwell