Creation of New dCache Headnodes (Dell R610) in January 2011

As part of our Fall 2010 procurements we purchased 2 Dell R610 nodes to host the dCache services (head01 and head02). Each node has 48GB of RAM, 2xE5620 processors, one 10GE Myricom NIC, and 2x600GB SAS drives in a RAID-1 configuration. This page describes the work done to get these nodes into production as the new head01 and head02 instances.

I will discuss the steps taken for each in separate sections below.

General Items for Creating New Versions of Existing Nodes

There are a few general things that need to be done when creating a new version of existing nodes (a copy-step sketch follows the list):
  • Many services may be running on existing nodes. Make sure to copy the /var/spool/cron/* files (and supporting files) to the new system and make sure they run
  • Copy the useful script files from /root (*.sh, *.pl, *.py)
  • Make sure the needed software running on the old nodes is available on the new nodes (compare the /etc/init.d services running, for example)
  • Copy the security information over (/etc/grid-security, /root/.ssh, /root/.globus, etc.)
  • Consider retiring/removing unused services and software during the transition
  • Check the old vs new network configuration; replicate resiliency of the old system if bonding is in use
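
A minimal sketch of the copy step, assuming root ssh access from the new node to the old one ("head01-old" is just a placeholder name and the glob patterns may need adjusting):

# pull the usual crontabs, root scripts and security files from the old node
for p in /var/spool/cron /root/.ssh /root/.globus /etc/grid-security; do
    rsync -a head01-old:$p/ $p/
done
scp "head01-old:/root/*.sh /root/*.pl /root/*.py" /root/
# afterwards compare 'chkconfig --list' output and the installed RPM sets by hand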

Creation of a new head01.aglt2.org

The new head01 was installed with SL5.5/x86_64 and assigned temporary IPs: 192.41.231.105/10.10.2.105/10.10.3.105 (Public/Private/iDRAC6).

This allowed the use of 'yum update' to make sure the system was fully updated. I also updated /afs/atlas.umich.edu/hardware/Dell_BIOS_FW/r610update.sh to include the most recent BIOS/firmware versions and then updated the system. Next I updated to OMSA 6.4.

I verified selinux was set to disabled in /etc/sysconfig/selinux.

I copied over the whole /etc/grid-security directory from the old head01 to this machine. I needed to recreate the certificates.afs directory as a soft-link and remove the 'vomsdir' so I could reinstall it via the lcg-vomscerts (V6.3) rpm.

I needed to add the postgres user (uid=26) to the system. I copied the entries from the existing head01 for /etc/passwd and /etc/group and updated the /etc/shadow and /etc/gshadow files appropriately.
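
Equivalently, the account could have been created directly; a sketch is below, where the gid of 26 is an assumption (the stock Red Hat value) since only the uid comes from the old node:

groupadd -g 26 postgres
useradd -u 26 -g 26 -d /var/lib/pgsql -s /bin/bash -c "PostgreSQL Server" postgres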

The most recent PostgreSQL files were installed using the following YUM repo:

cat /etc/yum.repos.d/pgdg-90-centos.repo
[pgdg90]
name=PostgreSQL 9.0 $releasever - $basearch
baseurl=http://yum.pgrpms.org/9.0/redhat/rhel-5-$basearch
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-PGDG

[pgdg90-source]
name=PostgreSQL 9.0 $releasever - $basearch - Source
failovermethod=priority
baseurl=http://yum.pgrpms.org/srpms/9.0/redhat/rhel-5-$basearch
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-PGDG

Then I did:
yum install postgresql90.x86_64 postgresql90-server.x86_64 postgresql90-contrib.x86_64 postgresql90-devel.x86_64 postgresql90-plpython.x86_64

This installs postgresql into /var/lib/pgsql/9.0

In addition I needed to migrate the other following files:

  • /root/rsync-certificates.sh (needed for a crontab entry)
  • /root/ccc (directory and subdirectories) used for dCache consistency checking
  • /var/www/html/ccc (directory and subdirectories) used for presenting consistency checking results via the web
  • /var/spool/cron/* (root and other's crontabs)
  • /root/.ssh
  • /root/.globus
  • /etc/yp.conf and /etc/sysconfig/network (to get the right yp setup)

The set of installed RPMs also needed to match as closely as possible between the old and new nodes. I checked on each system like this:

rpm -qa --qf "%{NAME}.%{ARCH}\n" | sort > head01_rpms_old.txt
(or output to head01_rpms_new.txt on head01 NEW)

These can then be compared using the comm utility to determine which packages are unique on each node. To try to get head01(new) as close as possible to the old node I did:

yum --enablerepo=dag install `cat rpms_on_old_head01.txt`

Where the rpms_on_old_head01.txt are the result of using comm and isolating packages only on old head01.
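
For reference, a sketch of the comm step itself, using the file names above (both files are already sorted):

# lines unique to the first file = packages present only on old head01
comm -23 head01_rpms_old.txt head01_rpms_new.txt > rpms_on_old_head01.txt
# and the reverse: packages present only on the new node
comm -13 head01_rpms_old.txt head01_rpms_new.txt > rpms_only_on_new_head01.txt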

Needed to start 'ypbind' since it wasn't running.

The /pnfs mount needed to be added to the /etc/fstab.

The list of crontab entries on head01(old) is:

root
*/5 * * * * /usr/bin/perl /afs/atlas.umich.edu/Certficates/gums_test/ch_gums_server.pl
20 * * * * /bin/bash /root/rsync-certificates.sh
# Setup ping to "Switch" DNS for gums
* * * * * date >> /root/ping_gums.log; /bin/ping -c2 gums.aglt2.org 2>&1 >> /root/ping_gums.log
7 0 * * * /bin/sh /root/ccc/run_ccc.sh &> /root/ccc/run_ccc.log

Some problems on the "new" head01 for the above:
  • The ch_gums_server.pl script fails with
[root@head01 ~]# /usr/bin/perl /afs/atlas.umich.edu/Certficates/gums_test/ch_gums_server.pl
Can't locate object method "server_status" via package "linat02.grid.umich.edu" (perhaps you forgot to load "linat02.grid.umich.edu"?) at /afs/atlas.umich.edu/Certficates/gums_test/ch_gums_server.pl line 100.
There were two occurrences where server_status was not prefixed with a '$'; these were fixed and the code in AFS was updated.
  • The run_ccc.sh script won't be tested until after the cut-over (the directory was copied over to head01(new)).

postgres
0 4 * * *  vacuumdb --all --analyze

Tested for user postgres and works OK.

usatlas1
0 3  * * * find /pnfs/aglt2.org/*/loadtest -type f -mtime +3 -exec rm -v {} \; >> clean_testfiles.log 2>&1
30 3 * * * find /pnfs/aglt2.org/* -type d -mindepth 2 -mtime +7 -empty -exec rmdir {} \; >> clean_testfiles 2>&1

There was a problem here on both the new and old nodes. The automounter was NOT working correctly on head01(old); after restarting autofs the ~usatlas1 directory was visible again. The last successful run on head01(old) was November 24, 2009! The automounter entries also needed to be added on head01(new): I copied over /etc/auto.master, /etc/auto.atlas and /etc/auto.net, edited /etc/auto.master to remove the auto.home reference, restarted autofs and tested OK.
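
A quick sketch of the post-fix check (the log file name follows from cron running the find jobs in the user's home directory):

service autofs restart
ls -d ~usatlas1                               # home area should now mount via the automounter
su - usatlas1 -c 'ls -l clean_testfiles.log'  # the cron jobs log here once they run again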

Installing PostgreSQL on New Nodes

The next step was to get the appropriate version of PostgreSQL installed. We downloaded the needed RPMS (an upgrade from the 8.3.7 version on old head01) from http://yum.pgsqlrpms.org/8.3/redhat/rhel-5Server-x86_64/ to ~smckee/postgres_rpms/:

root@head01 ~/ccc# ls ~smckee/postgres_rpms/*el5*
/afs/atlas.umich.edu/home/smckee/postgres_rpms/compat-postgresql-libs-4-1PGDG.rhel5.x86_64.rpm
/afs/atlas.umich.edu/home/smckee/postgres_rpms/postgresql-8.3.9-1PGDG.rhel5.x86_64.rpm
/afs/atlas.umich.edu/home/smckee/postgres_rpms/postgresql-devel-8.3.9-1PGDG.rhel5.x86_64.rpm
/afs/atlas.umich.edu/home/smckee/postgres_rpms/postgresql-libs-8.3.9-1PGDG.rhel5.x86_64.rpm
/afs/atlas.umich.edu/home/smckee/postgres_rpms/postgresql-plpython-8.3.9-1PGDG.rhel5.x86_64.rpm
/afs/atlas.umich.edu/home/smckee/postgres_rpms/postgresql-server-8.3.9-1PGDG.rhel5.x86_64.rpm
/afs/atlas.umich.edu/home/smckee/postgres_rpms/postgresql-contrib-8.3.9-1PGDG.rhel5.x86_64.rpm
/afs/atlas.umich.edu/home/smckee/postgres_rpms/uuid-1.5.1-3.el5.x86_64.rpm

To allow this we needed to remove the existing postgresql-libs-8.1 (i386 and x86_64 versions) first. Once the 'rpm -ivh postgresql*.rpm' finished we had to do two things:

  • Run 'service postgresql initdb' (to create the initial directory/database setup)
  • Copy the existing postgresql.conf and pg_hba.conf from the /var/lib/pgsql/data directory on the original head01 to the new head01 and change ownership to postgres.postgres.

At this point we could successfully start postgres via 'service postgresql start'. Attempting to connect using psql without arguments would fail:
[root@head01 ~]# psql
psql: FATAL:  database "root" does not exist

On the original head01 we had set up a "dummy" root DB at some point. This was for some kind of monitoring or analysis activity and we need to determine if it is still required on the new head01. Since the plan is to use the pg_dumpall command over the network (or alternatively a backup), this original configuration should be transferred in any case. NOTE: we ended up NOT using pg_dumpall since the "warm standby" configuration is more effective.
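
If the convenience DB is still wanted on the new node, either of the following (standard PostgreSQL commands) covers it:

su - postgres -c "createdb root"      # recreate the "dummy" root database
su - postgres -c "psql -d postgres"   # or skip it and connect to an existing database explicitly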

Using pg_dumpall to transfer the old DB to the new host?

We can use the pg_dumpall script (part of a standard PostgreSQL install) to migrate the old DB to the new host. The following command should be possible:

pg_dumpall -h 192.41.230.44 -p 5432 | psql

This command would be run on the new host as user postgres against the new (empty) postgres install. As noted above we did the install and migrated the pg_hba.conf and postgresql.conf files appropriately. However, when we tried this the pg_dumpall failed because it claimed there was no billing DB.

We tracked the problem down to the default encoding. On SL4.x systems the LANG environment variable is set to:

LANG=en_US.iso885915

But the default on SL5.x seems to be:

LANG=en_US.UTF-8

This difference appears when the postgres DB is initialized. On the SL4.x system the encoding becomes 'LATIN9' whereas on the SL5.x system it becomes 'UTF8'. The commands to create the DBs via pg_dumpall fail because of this encoding difference. To fix this I made sure the part of the /etc/init.d/postgresql script responsible for the 'initdb' action has LANG=en_US.iso885915 on the SL5.x destination system, and I recreated the DB. The pg_dumpall command then runs successfully, transferring about 15MB/sec. For the existing DBs on old head01, the network data transfer took around 70 minutes; however there is additional work (indexing, etc.) which seems to add another few hours to the transfer.
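
A condensed sketch of the check and re-initialization described above (standard commands; the data.utf8 name is just illustrative, and the LANG edit itself lives in the init script):

su - postgres -c "psql -l"     # encoding column shows LATIN9 on SL4 vs UTF8 on a default SL5 initdb
service postgresql stop
mv /var/lib/pgsql/data /var/lib/pgsql/data.utf8
# with LANG=en_US.iso885915 set in the initdb stanza of /etc/init.d/postgresql:
service postgresql initdb
service postgresql start
su - postgres -c "pg_dumpall -h 192.41.230.44 -p 5432 | psql"   # re-run the transfer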

Using A Backup to Transfer the DB

NOTE: An alternate way to do this is by making a checkpointed backup. This is actually better because we can test at some point in time and, once things are working, do a new unpack and catch up to the current state of the DB. See the PostgreSQL documentation for details. If archiving is enabled, the following script (appropriately edited) can be used to make a safe backup of the existing postgres DB. Note that both the postgres and root users should have ssh-key access to the destination server so this can happen without responding to password prompts.

#!/bin/bash
#
# Tell PostgreSQL a base backup is starting (forces a checkpoint and marks the WAL position)
psql -U postgres -c "SELECT pg_start_backup('checkpoint_pg');"
rm -f /tmp/pg_backup.tar
# Archive the data directory, following symlinks but skipping the live WAL segments
tar -chf /tmp/pg_backup.tar --exclude='/var/lib/pgsql/data/pg_xlog/*' /var/lib/pgsql/data
# End backup mode so the stop-point WAL segment gets archived
psql -U postgres -c "SELECT pg_stop_backup();"
sleep 20
# Ship the tar-ball to the destination host (needs the ssh-key access noted above)
scp /tmp/pg_backup.tar c-3-27:/tmp/
exit

This backup tar-ball can be restored on the destination host in the following way:

  • First move the existing data directory somewhere: mv /var/lib/pgsql/data /var/lib/pgsql/data.orig
  • Unpack the tarball: cd /; tar -xf /tmp/pg_backup.tar
  • Recreate the archive_status directory: mkdir /var/lib/pgsql/data/pg_xlog/archive_status
  • Copy any needed WAL files into pg_xlog: cp /var/lib/pgsql/archive/* /var/lib/pgsql/data/pg_xlog
  • Edit the /var/lib/pgsql/data/postgresql.conf (and other config files) as needed
  • Make sure user postgres owns the files: chown -R postgres.postgres /var/lib/pgsql/data

Alternatively (for step 4) you can use a recovery.conf file like the following. This will allow you to roll forward to any point in time:

# NOTE that the basename of %p will be different from %f; do not
# expect them to be interchangeable.
#
#
restore_command = 'cp /var/lib/pgsql/archive/%f %p'
#
#
#---------------------------------------------------------------------------
# OPTIONAL PARAMETERS
#---------------------------------------------------------------------------
#
# By default, recovery will rollforward to the end of the WAL log.
# If you want to stop rollforward before that point, you
# must set a recovery target.
#
# You may set a recovery target either by transactionId, or
# by timestamp. Recovery may either include or exclude the
# transaction(s) with the recovery target value (ie, stop either
# just after or just before the given target, respectively).
#
#recovery_target_time = '2004-07-14 22:39:00 EST'
recovery_target_time = '2010-02-08 10:35:00 EST'
#

At this point the postgres service on the destination can be started. It should catch back up to the content of the last WAL file specified. Note that the recovery.conf file is renamed recovery.done once the system completes reading the needed WAL files.
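
A couple of quick checks once the restored server has been started (the pg_log location is the same one used by the log-cleanup cron entry later on this page):

service postgresql start
ls /var/lib/pgsql/data/recovery.done          # appears once WAL replay has finished
tail -n 50 /var/lib/pgsql/data/pg_log/*.log   # look for the archive-recovery-complete message
su - postgres -c "psql -l"                    # databases should match the source at the recovery target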

For testing we can get to a consistent DB (at any point in time) and run tests. After that we can delete the /var/lib/pgsql/data area and restore (again) from the backup tar-ball. This way we can catch back up to whatever is "current", assuming we have continued to archive the WAL files from the online DB.

Installing and Configuring dCache on New Head01

Then we downloaded the needed dCache RPMS (an upgrade from the existing 1.9.5-10 version) into ~smckee/dCache and installed them.

dcache-server-1.9.5-15.noarch.rpm
dcap-1.9.3-7.x86_64.rpm
dcache-srmclient-1.9.5-3.noarch.rpm

Here is the list of files that were migrated from head01(old) to head01(new):

[root@head01 opt]# diff -r d-cache d-cache.orig
Only in d-cache/config: authorized_keys.new
Only in d-cache/config: dCacheSetup.new
Only in d-cache/config: log4j.properties.new
Only in d-cache/config: passwd.new
Only in d-cache/config: PoolManager.conf.new
Only in d-cache/config: server_key.new
Only in d-cache/config: server_key.pub.new
Only in d-cache/etc: dcache.kpwd.new
Only in d-cache/etc: dcachesrm-gplazma.policy.new
Only in d-cache/etc: LinkGroupAuthorization.conf.new
Only in d-cache/etc: node_config.new
Only in d-cache/etc: srm_setup.env.new

For each of the above files I list the related changes:
  • dcache.kpwd - Use the dcache.kpwd.template as the source and append all the info from the dcache.kpwd.new file
  • dcachesrm-gplazma.policy - Use the dcachesrm-gplazma.policy.new directly (copy it)
  • LinkGroupAuthorization.conf - Use the .new version directly (copy it). NOTE: Need to see if all areas are setup right (AGLT2_HOTDISK for example)
  • node_config - Use the node_config.new directly (copy it) This is the file that determines the specific dCache services for this node.
  • srm_setup.env - Use the srm_setup.env.new (copy it). Points to java in /usr/java/latest.
  • authorized_keys - Use the .new version (copy it)
  • server_key and server_key.pub - Use the .new versions (copy them)
  • log4j.properties - Use the installed version instead of the .new
  • passwd - Use the .new version (copy it and change protection to 600)
  • PoolManager.conf - Use the .new version (copy it)
  • dCacheSetup - Merge starting from the /etc/dCacheSetup.template and put in needed changes from the .new file. NOTE: The following line was present in the version from the old head01 but not in the .template file:
    • Missing in new template:
      PermissionHandlerDataSource=diskCacheV111.services.PnfsManagerFileMetaDataSource

Some notes here: the dCacheSetup file is (re)created when you run install/install.sh from the dCache install location, so I kept the original to compare against after running install.sh. The passwd and key files are simply migrated from the old head01. Some policy- and functionality-related files live in the etc directory. The node_config file specifies what type of dCache node this is and what services it will run.

Note we had to soft-link /etc/ssh/ssh_host_key to /opt/d-cache/config/host_key before running the install.sh.
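
For reference, the link itself (paths exactly as above):

# create /opt/d-cache/config/host_key pointing at the system ssh host key
ln -s /etc/ssh/ssh_host_key /opt/d-cache/config/host_key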

Here is the result of running the install/install.sh on head01(new):

[root@head01 d-cache]# install/install.sh
INFO:This node will need to mount the name server.
INFO:Skipping ssh key generation

 Checking MasterSetup  ./config/dCacheSetup O.k.

   Sanning dCache batch files

    Processing adminDoor
    Processing authdoor
    Processing chimera
    Processing dCache
    Processing dir
    Processing door
    Processing gPlazma
    Processing gridftpdoor
    Processing gsidcapdoor
    Processing httpd
    Processing httpdoor
    Processing info
    Processing infoProvider
    Processing lm
    Processing nfsv41
    Processing pnfs
    Processing pool
    Processing replica
    Processing srm
    Processing statistics
    Processing utility
    Processing xrootdDoor


 Checking Users database .... Ok
./dCacheSetup: line 291: gPlazmaRequestTimeout: command not found
 Checking Security       .... Ok
 Checking JVM ........ Ok
 Checking Cells ...... Ok
 dCacheVersion ....... Version production-1.9.5-15

INFO:Already Mounted head02.aglt2.org
WARNING:deleting previous version of Tomcat at /opt/d-cache/libexec/apache-tomcat-5.5.20
INFO:installing tomcat and axis ...
INFO:Done installing tomcat and axis
INFO:modifying java options in /opt/d-cache/libexec/apache-tomcat-5.5.20/bin/catalina.sh ...
INFO:modifying system CLASSPATH in /opt/d-cache/libexec/apache-tomcat-5.5.20/bin/setclasspath.sh ...
WARNING:Removing previous srm webapp directory
INFO:Creating srm webapp directory
INFO:Creating srm webapp deployment file
INFO:Done creating srm webapp deployment file
INFO:Starting up tomcat ...
Using CATALINA_BASE:   /opt/d-cache/libexec/apache-tomcat-5.5.20
Using CATALINA_HOME:   /opt/d-cache/libexec/apache-tomcat-5.5.20
Using CATALINA_TMPDIR: /opt/d-cache/libexec/apache-tomcat-5.5.20/temp
Using JRE_HOME:       /usr/java/jdk1.6.0_17
INFO:Done starting up tomcat
INFO:deploying srm v2 application using axis AdminClient ...
- Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled.
Processing file /opt/d-cache/etc/srmv1-deploy.wsdd
<Admin>Done processing</Admin>
- Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled.
Processing file /opt/d-cache/etc/srmv2.2-deploy.wsdd
<Admin>Done processing</Admin>
INFO:Done deploying srm v2 application using axis AdminClient
INFO:creating config files and adding configuration info into /opt/d-cache/srm-webapp/WEB-INF/web.xml ...
INFO:done creating config files and adding configuration info into /opt/d-cache/srm-webapp/WEB-INF/web.xml
INFO:enabling GSI HTTP in tomcat by modifying /opt/d-cache/libexec/apache-tomcat-5.5.20/conf/server.xml ...
INFO:commenting out AJP CoyoteConnector on port 8009...
INFO:Done commenting out AJP CoyoteConnector on port 8009
INFO:turning off sending of Multi Refs in /opt/d-cache/srm-webapp/WEB-INF/server-config.wsdd
INFO:Done turning off sending of Multi Refs in /opt/d-cache/srm-webapp/WEB-INF/server-config.wsdd
INFO:shutdown Tomcat
Using CATALINA_BASE:   /opt/d-cache/libexec/apache-tomcat-5.5.20
Using CATALINA_HOME:   /opt/d-cache/libexec/apache-tomcat-5.5.20
Using CATALINA_TMPDIR: /opt/d-cache/libexec/apache-tomcat-5.5.20/temp
Using JRE_HOME:       /usr/java/jdk1.6.0_17
INFO:installing config for startup/shutdown script
INFO:Installation complete
INFO:please use /opt/d-cache/bin/dcache start|stop|restart srm to startup, shutdown or restart srm server

I checked the "diffs" on the directory tree comparing the original RPM install tree (d-cache.orig) with the post-config/installed tree (d-cache):

Files d-cache.orig/bin/dcache and d-cache/bin/dcache differ
Files d-cache.orig/classes/cells.jar and d-cache/classes/cells.jar differ
Files d-cache.orig/classes/cells-protocols.jar and d-cache/classes/cells-protocols.jar differ
Files d-cache.orig/classes/chimera/chimera-core.jar and d-cache/classes/chimera/chimera-core.jar differ
Files d-cache.orig/classes/dcache-common.jar and d-cache/classes/dcache-common.jar differ
Files d-cache.orig/classes/dcache.jar and d-cache/classes/dcache.jar differ
Files d-cache.orig/classes/gplazma/gplazma.jar and d-cache/classes/gplazma/gplazma.jar differ
Files d-cache.orig/classes/infoDynamicSE.jar and d-cache/classes/infoDynamicSE.jar differ
Files d-cache.orig/classes/javatunnel.jar and d-cache/classes/javatunnel.jar differ
Files d-cache.orig/classes/srm.jar and d-cache/classes/srm.jar differ
Files d-cache.orig/classes/srm-tomcat.jar and d-cache/classes/srm-tomcat.jar differ
Files d-cache.orig/classes/xrootd-tokenauthz.jar and d-cache/classes/xrootd-tokenauthz.jar differ
Only in d-cache/config: acls
Only in d-cache/config: adminDoorSetup
Only in d-cache/config: authdoorSetup
Only in d-cache/config: authorized_keys
Only in d-cache/config: authorized_keys.new
Only in d-cache/config: chimeraSetup
Only in d-cache/config: dCacheSetup
Only in d-cache/config: dCacheSetup~
Only in d-cache/config: dCacheSetup.new
Only in d-cache/config: dirSetup
Only in d-cache/config: doorSetup
Only in d-cache/config: gPlazmaSetup
Only in d-cache/config: gridftpdoorSetup
Only in d-cache/config: gsidcapdoorSetup
Only in d-cache/config: head01.poollist
Only in d-cache/config: host_key
Only in d-cache/config: httpdoorSetup
Only in d-cache/config: httpdSetup
Only in d-cache/config: infoProviderSetup
Only in d-cache/config: infoSetup
Only in d-cache/config: lmSetup
Only in d-cache/config: log4j.properties.new
Only in d-cache/config: meta
Only in d-cache/config: nfsv41Setup
Only in d-cache/config: passwd
Only in d-cache/config: passwd.new
Only in d-cache/config: pnfsSetup
Files d-cache.orig/config/pool.batch and d-cache/config/pool.batch differ
Only in d-cache/config: PoolManager.conf.new
Only in d-cache/config: poolSetup
Only in d-cache/config: relations
Only in d-cache/config: replicaSetup
Only in d-cache/config: server_key
Only in d-cache/config: server_key.new
Only in d-cache/config: server_key.pub
Only in d-cache/config: server_key.pub.new
Only in d-cache/config: srmSetup
Only in d-cache/config: statisticsSetup
Only in d-cache/config: utilitySetup
Only in d-cache/config: xrootdDoorSetup
Files d-cache.orig/docs/images/bg.jpg and d-cache/docs/images/bg.jpg differ
Only in d-cache/etc: dcache.kpwd
Only in d-cache/etc: dcache.kpwd.new
Files d-cache.orig/etc/dcachesrm-gplazma.policy and d-cache/etc/dcachesrm-gplazma.policy differ
Only in d-cache/etc: dcachesrm-gplazma.policy.new
Only in d-cache/etc: dcachesrm-gplazma.policy.orig
Only in d-cache/etc: LinkGroupAuthorization.conf
Only in d-cache/etc: LinkGroupAuthorization.conf.new
Only in d-cache/etc: node_config
Only in d-cache/etc: node_config~
Only in d-cache/etc: node_config.new
Only in d-cache/etc: sedEm8vlI
Only in d-cache/etc: sedGcrVL3
Only in d-cache/etc: sedL8VRYB
Only in d-cache/etc: sedrTVxcp
Only in d-cache/etc: sedSlBCcM
Only in d-cache/etc: sedWfJuwM
Only in d-cache/etc: sedWLfTo7
Files d-cache.orig/etc/srm_setup.env and d-cache/etc/srm_setup.env differ
Only in d-cache/etc: srm_setup.env.new
Only in d-cache/etc: srm_setup.env.orig
Only in d-cache/jobs: adminDoor
Only in d-cache/jobs: adminDoor.lib.sh
Only in d-cache/jobs: authdoor
Only in d-cache/jobs: authdoor.lib.sh
Only in d-cache/jobs: chimera
Only in d-cache/jobs: chimera.lib.sh
Only in d-cache/jobs: dCache
Only in d-cache/jobs: dCache.lib.sh
Only in d-cache/jobs: dir
Only in d-cache/jobs: dir.lib.sh
Only in d-cache/jobs: door
Only in d-cache/jobs: door.lib.sh
Files d-cache.orig/jobs/generic.lib.sh and d-cache/jobs/generic.lib.sh differ
Only in d-cache/jobs: gPlazma
Only in d-cache/jobs: gPlazma.lib.sh
Only in d-cache/jobs: gridftpdoor
Only in d-cache/jobs: gridftpdoor.lib.sh
Only in d-cache/jobs: gsidcapdoor
Only in d-cache/jobs: gsidcapdoor.lib.sh
Only in d-cache/jobs: httpd
Only in d-cache/jobs: httpd.lib.sh
Only in d-cache/jobs: httpdoor
Only in d-cache/jobs: httpdoor.lib.sh
Only in d-cache/jobs: info
Only in d-cache/jobs: info.lib.sh
Only in d-cache/jobs: infoProvider
Only in d-cache/jobs: infoProvider.lib.sh
Only in d-cache/jobs: lm
Only in d-cache/jobs: lm.lib.sh
Only in d-cache/jobs: nfsv41
Only in d-cache/jobs: nfsv41.lib.sh
Only in d-cache/jobs: pnfs
Only in d-cache/jobs: pnfs.lib.sh
Only in d-cache/jobs: pool
Only in d-cache/jobs: pool.lib.sh
Only in d-cache/jobs: replica
Only in d-cache/jobs: replica.lib.sh
Only in d-cache/jobs: srm
Only in d-cache/jobs: srm.lib.sh
Only in d-cache/jobs: statistics
Only in d-cache/jobs: statistics.lib.sh
Only in d-cache/jobs: utility
Only in d-cache/jobs: utility.lib.sh
Files d-cache.orig/jobs/wrapper2.sh and d-cache/jobs/wrapper2.sh differ
Only in d-cache/jobs: xrootdDoor
Only in d-cache/jobs: xrootdDoor.lib.sh
Only in d-cache/libexec: apache-tomcat-5.5.20
Files d-cache.orig/libexec/chimera/chimera-nfs-run.sh and d-cache/libexec/chimera/chimera-nfs-run.sh differ
Files d-cache.orig/libexec/wait-for-cells.sh and d-cache/libexec/wait-for-cells.sh differ
Files d-cache.orig/share/dCacheConfigure/utils/split_quoted_variable and d-cache/share/dCacheConfigure/utils/split_quoted_variable differ
Files d-cache.orig/share/lib/daemon and d-cache/share/lib/daemon differ
Files d-cache.orig/share/lib/services.sh and d-cache/share/lib/services.sh differ
Only in d-cache: srm-webapp

Details of the differences are in the attached file dcache_diffs_after_config.txt

The system cannot be fully tested yet because turning off the network (to make sure it doesn't interfere with production) seems to prevent dCache from starting.

The system was put into "warm standby" mode by removing and re-unpacking the /var/lib/pgsql/data area and using the following recovery.conf:
#restore_command = 'cp /var/lib/pgsql/archive/%f %p'
restore_command = 'pg_standby -d -s 20 -t /tmp/pgsql.trigger /var/lib/pgsql/archive %f %p 2>>/var/lib/pgsql/standby.log'

This will send debug info to the standby.log file and wake up every 20 seconds to check for new WAL files. Once the file /tmp/pgsql.trigger exists it will exit recovery mode. On the day of the transition we need to do the following for head01:

  • Once the OIM outage is active we stop dCache on head01(old)
  • Then we stop the postgresql on head01(old) and power down
  • Handle head02(old)->head02(new) transition first (see below)
  • Create the /tmp/pgsql.trigger file on head01(new) to exit recovery mode and verify it does (echo "smart" > /tmp/pgsql.trigger)
  • Change the IP addresses to be the correct ones (public/private) on head01(new)
  • Power down head01(new)
  • Physically swap head01(new) with the existing head01(old)
  • Change the switch port configuration to make sure VLAN 4010 is untagged and VLAN 4001 is tagged
  • Power up head01(new) and verify network connectivity
  • Re-enable the dcache service via 'chkconfig dcache on' and start it, verifying it starts correctly
  • Update the Raritan dongle labels after the switch
  • Rebuild and test all storage nodes via ROCKS and manual checking of Domain.log files...

Creation of a new head02.aglt2.org

The new head02 (old c-3-27) was connected to our Raritan KVM system on channel 41 with a virtual-media capable dongle, which allowed remote installation of Scientific Linux 5.4 x86_64 via an .iso file. A custom install was created from the SL5.4 installation procedure; it took about 50 minutes to run to completion, reboot and finalize its configuration. The system was set up as head02.aglt2.org but kept the old IP addresses for the time being: to temporarily allow network access, the /etc/sysconfig/network-scripts/ifcfg-eth1 configuration was changed to use the original c-3-27 IP addresses (public/private) and the eth0/eth1 interfaces were started. This allowed the use of 'yum update' to make sure the system was fully updated.

After the system was updated we used /afs/atlas.umich.edu/hardware/Dell_BIOS_FW/1950update.sh to update the BIOS/firmware on this system. We then installed the OMSA software via /afs/atlas.umich.edu/hardware/Dell_OMSA/setup.sh. NOTE: Neither head01(new) nor head02(new) would update its BIOS from the repo; we had to run the BIOS update .bin file directly (see ~smckee/DELL/PE1950).

I copied over the whole /etc/grid-security directory from the old head02 to this machine.

In addition I needed to migrate the following files:
  • /var/www/html/* to the equivalent location (had to also install 'httpd' and enable on the new system)
  • /var/www/cgi-bin to the equivalent location
  • Needed ruby and ruby-libs to support Hiro's scripts for ownership checking and checksum for dCache
  • The /root/dcache_adm_script "tree"
  • Added sysstat package

Here is the list of "cron" jobs for root on head02:


*/5 * * * * cd /root/dcache_adm_script/dCache/RoutineMT;perl dcache_routine_maintain.pl;
15 0-23/2 * * * cd /root/dcache_adm_script/dCache/srm_err_report; perl report.pl;cd /root/dcache_adm_script/dCache/space_stat;perl get_sp_info.pl
*/5 * * * * cd /root/dcache_adm_script/dCache/srm_err_report;perl cacti_report.pl;cd /root/dcache_adm_script/dCache/space_stat;perl stat_space.pl
*/10 * * * * cd /root/dcache_adm_script/dCache;ruby monitorDirOwnership.rb
30 0-23/4 * * * cd /root/dcache_adm_script/dCache/stat_fileno_inpool; perl stat_fileno.pl
0 */4 * * * cd /root/dcache_adm_script/dCache/stat_pool_allocation;perl stat_poollist.pl
20 * * * * /bin/bash /root/rsync-certificates.sh
0 22 * * * /usr/bin/find /var/lib/pgsql/data/pg_log/*.log -mtime +10 -exec rm {} \;

Testing of these cron jobs revealed we were missing the following packages:
  • perl-DBD-MySQL-3.0007-2.el5
  • mysql-5.0.77-3.el5
  • compat-readline43-4.3-3
  • amanda-backup_client-2.6.1p1-1afs.SL5
  • amanda-afs-0.0.3-4
  • perl-DBD-Pg-1.49-2.el5_3.1
  • perl-DBI-1.52-2.el5

There were also a number of Perl packages installed in /usr/lib/perl5/site_perl/5.8.5 on head02(old). I copied these over to /usr/lib/perl5/site_perl/5.8.8 on head02(new), as sketched after the list:
  • Amanda
  • CGI
  • dCache
  • File
  • GD
  • IPC
  • Mail
  • RepHot
  • Time
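
A sketch of the copy, run on head02(new); "head02-old" is just a placeholder for however the old node is reachable:

for d in Amanda CGI dCache File GD IPC Mail RepHot Time; do
    scp -rp head02-old:/usr/lib/perl5/site_perl/5.8.5/$d /usr/lib/perl5/site_perl/5.8.8/
done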

Also the postgres user has one cron entry:
0 */12 * * *  vacuumdb --all --analyze

I just copied /var/spool/cron/* from head02(old) to head02(new) to get these entries and removed the commented out entries.

Here is the list of "services" that are turned on for head02(old):
root@head02 ~# chkconfig --list | grep ":on" | sort
acpid           0:off   1:off   2:off   3:on    4:on    5:on    6:off
anacron         0:off   1:off   2:on    3:on    4:on    5:on    6:off
atd             0:off   1:off   2:off   3:on    4:on    5:on    6:off
cfenvd          0:off   1:off   2:on    3:on    4:on    5:on    6:off
cfexecd         0:off   1:off   2:on    3:on    4:on    5:on    6:off
cfservd         0:off   1:off   2:on    3:on    4:on    5:on    6:off
chimera-nfs-run.sh      0:off   1:off   2:on    3:on    4:on    5:on    6:off
cpuspeed        0:off   1:on    2:on    3:on    4:on    5:on    6:off
crond           0:off   1:off   2:on    3:on    4:on    5:on    6:off
cups            0:off   1:off   2:on    3:on    4:on    5:on    6:off
dataeng         0:off   1:off   2:off   3:on    4:on    5:on    6:off
dcache          0:off   1:off   2:on    3:on    4:on    5:on    6:off
dkms_autoinstaller      0:off   1:off   2:off   3:on    4:on    5:on    6:off
dsm_om_connsvc  0:off   1:off   2:off   3:on    4:on    5:on    6:off
dsm_om_shrsvc   0:off   1:off   2:off   3:on    4:on    5:on    6:off
gmond           0:off   1:off   2:on    3:on    4:on    5:on    6:off
httpd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
instsvcdrv      0:on    1:off   2:off   3:on    4:on    5:on    6:on
iptables        0:off   1:off   2:on    3:on    4:on    5:on    6:off
irqbalance      0:off   1:off   2:off   3:on    4:on    5:on    6:off
isdn            0:off   1:off   2:on    3:on    4:on    5:on    6:off
jexec           0:on    1:on    2:on    3:on    4:on    5:on    6:on
lldpd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
lm_sensors      0:off   1:off   2:on    3:on    4:on    5:on    6:off
lvm2-monitor    0:off   1:on    2:on    3:on    4:on    5:on    6:off
mdmonitor       0:off   1:off   2:on    3:on    4:on    5:on    6:off
messagebus      0:off   1:off   2:off   3:on    4:on    5:on    6:off
microcode_ctl   0:off   1:off   2:on    3:on    4:on    5:on    6:off
mongrel_cluster 0:off   1:off   2:on    3:on    4:on    5:on    6:off
monit           0:off   1:off   2:on    3:on    4:on    5:on    6:off
mptctl          0:off   1:off   2:on    3:on    4:on    5:on    6:off
netfs           0:off   1:off   2:off   3:on    4:on    5:on    6:off
network         0:off   1:off   2:on    3:on    4:on    5:on    6:off
nfslock         0:off   1:off   2:off   3:on    4:on    5:on    6:off
ntpd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
openafs-client  0:off   1:off   2:on    3:on    4:on    5:on    6:off
openibd         0:off   1:off   2:on    3:on    4:on    5:on    6:off
portmap         0:off   1:off   2:on    3:on    4:on    5:on    6:off
postfix         0:off   1:off   2:on    3:on    4:on    5:on    6:off
postgresql      0:off   1:off   2:on    3:on    4:on    5:on    6:off
pound           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rawdevices      0:off   1:off   2:off   3:on    4:on    5:on    6:off
readahead       0:off   1:off   2:off   3:off   4:off   5:on    6:off
readahead_early 0:off   1:off   2:off   3:off   4:off   5:on    6:off
rephot          0:off   1:off   2:on    3:on    4:on    5:on    6:off
rpcgssd         0:off   1:off   2:off   3:on    4:on    5:on    6:off
rpcidmapd       0:off   1:off   2:off   3:on    4:on    5:on    6:off
snmpd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
sshd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
syslog-ng       0:off   1:off   2:on    3:on    4:on    5:on    6:off
sysstat         0:off   1:on    2:on    3:on    4:on    5:on    6:off
xinetd          0:off   1:off   2:off   3:on    4:on    5:on    6:off
ypbind          0:off   1:off   2:on    3:on    4:on    5:on    6:off
yum-autoupdate  0:off   1:off   2:off   3:on    4:on    5:on    6:off

I noticed on the new head02 that selinux was in permissive mode rather than disabled. I changed this to disabled in /etc/sysconfig/selinux.

One major set of services to migrate is the pound and mongrel_cluster services, used to provide dCache file checksums. These are located in Hiro's area on head02(old) in /home/hito. I migrated this whole area via scp between the nodes. I then set up Hiro's account (hito) using the same uid/gid and migrated the /etc/init.d/mongrel_cluster and /etc/init.d/pound service init entries. This was tested (using the "monit.d" tests) and worked fine.
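
For completeness, a sketch of enabling and starting the migrated services (the service names are the init.d entries above):

chkconfig mongrel_cluster on
chkconfig pound on
service mongrel_cluster start
service pound start
# the monit.d checks mentioned above then confirm both services respond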

See head01 info above for the PostgreSQL DB migration. The same setup ("warm standby") was used for the head02(old) and head02(new) systems.

For dCache we installed the v1.9.5-15 RPMS and copied the following files from head01(new):

  • dCacheSetup - Copy, no changes
  • srm_setup.env - Copy, no changes

The node_config file was copied from head02(old) and had ADMIN_NODE changed to NAMESPACE_NODE.

Found we needed to copy /etc/exports from head02(old) to head02(new) so that /pnfs could be NFSv3 mounted.

Updated the dcache.conf file on head02(new) to reflect the change in the dcap doors. We now have 3 doors (door, door1, door2) on ports 22125, 22136, 22137. The /root/dcache.conf file was updated and then put into the system as below:

[root] # mount localhost:/ /mnt
[root] # mkdir /mnt/admin/etc/config/dCache
[root] # touch /mnt/admin/etc/config/dCache/dcache.conf
[root] # touch /mnt/admin/etc/config/dCache/'.(fset)(dcache.conf)(io)(on)'
[root] # echo "<door host>:<port>" > /mnt/admin/etc/config/dCache/dcache.conf
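
Since the '>' above writes a single line, the remaining doors are appended in the same placeholder style and the namespace mount can then be dropped (a sketch, not a transcript):

[root] # echo "<door1 host>:<port>" >> /mnt/admin/etc/config/dCache/dcache.conf
[root] # echo "<door2 host>:<port>" >> /mnt/admin/etc/config/dCache/dcache.conf
[root] # cat /mnt/admin/etc/config/dCache/dcache.conf   # expect one host:port line per dcap door
[root] # umount /mnt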

On the day of the transition we need to do the following for head02:

  • Once the OIM outage is active we stop dCache on head02(old) after stopping head01(old) (see above)
  • Then we stop the postgresql on head02(old) and power down
  • Create the /tmp/pgsql.trigger file on head02(new) to exit recovery mode and verify it does (echo "smart" > /tmp/pgsql.trigger)
  • Change the IP addresses to be the correct ones (public/private) on head02(new)
  • Power down head02(new)
  • Physically swap head02(new) with the existing head02(old)
  • Change the switch port configuration to make sure VLAN 4010 is untagged and VLAN 4001 is tagged
  • Power up head02(new) and verify network connectivity
  • Re-enable the dcache service via 'chkconfig dcache on' and start it, verifying it starts correctly
  • Update the Raritan dongle labels after the switch
  • Continue with head01(new) as above

-- ShawnMcKee - 17 Jan 2011