Upgrade Planning for AGLT2 SL5 Systems
We need to upgrade our remaining SL5 systems to SL6 soon. We should use this page to track which systems still need upgrading and list the relevant services and options.
ATGRID (in the grid.umich.edu network) provides our central syslog-ng service. All our systems are set up to report to port 5140 on atgrid via syslog or equivalent. Currently this system is a VM (hardware version 9) with two disks in LVM (160GB and 250GB), 12GB of RAM and 4 CPUs. The syslog-ng has been replaced by a commercial product.
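To check that a host can still reach the central collector after a rebuild, a test message can be sent to port 5140. The following is a minimal sketch, not our actual client configuration: the FQDN atgrid.grid.umich.edu and the use of UDP are assumptions (this page only gives the host name and the port), and the helper name make_central_logger is made up for illustration.

```python
import logging
import logging.handlers
import socket


def make_central_logger(host="atgrid.grid.umich.edu", port=5140, name="aglt2.test"):
    """Return a logger that forwards records to a remote syslog collector.

    Assumptions (not stated on this page): the collector's FQDN and that
    it accepts plain syslog over UDP on the given port.
    """
    handler = logging.handlers.SysLogHandler(
        address=(host, port), socktype=socket.SOCK_DGRAM
    )
    # Prefix each message with the logger name so it is easy to find centrally.
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


# Example use on a freshly rebuilt host:
#   log = make_central_logger()
#   log.info("SL6 rebuild complete, syslog forwarding test")
```

UDP gives no delivery confirmation, so the message still has to be looked up on the collector side to confirm it arrived.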
My recommendation is to revamp our logging around a few components (see http://www.elasticsearch.org/overview/). ElasticSearch + Kibana was presented at our HEPiX meeting in Fall 2013: http://indico.cern.ch/event/247864/session/6/contribution/55
We need to decide how best to proceed with the above components. The upgraded ATGRID could just run a central rsyslog service (see HEPiX slides) or it could also run the other components through Kibana.
NOTE: Demo version now up in new VM at http://18.104.22.168/kibana
Currently the NFS server for /atlas/data08 and the OSG HOME area (on SSDs). The machine is in the grid.umich.edu network. Should be a straightforward reinstall as SL6. Need to save the relevant configuration details and make sure the NFS areas are not overwritten. This is set for upgrade on May 6.
Our Kerberos servers, AFS DB servers and GUMS servers. These are VMs. Could clone for backup and try recreating them one at a time.
BEWARE, the entire content of the /root directory should be saved on linat03 prior to any rebuild.
These three are done. Some notes on the procedure and issues encountered:
- Took a backup (mysqldump) of gums database and restored after build. Created 'gums'@'localhost' user with password recorded. Gums config was changed to reference database via localhost rather than via the public hostname (should be more efficient over a local socket). Cfengine config seeds an initial copy of the config file with empty password to be filled in by user.
- For Kerberos slaves, just rebuilt and did a kprop from the master. Had to remake the keytab on each after the build, and manually copy the master key stashfile (cfengine will report that the keyfile needs to be installed if it is not found, and won't start krb5kdc until the keyfile and principal db exist). For the master, without switching another machine to be master first, it's best to preserve and re-install the /etc/krb5.keytab file and the contents of /var/kerberos/krb5kdc. The kadmin master needs to exist to remake a keytab. I don't think I tried a kprop from one of the others, just put the old principal db files back.
- For AFS Db machines, need to keep and copy back /usr/afs/etc/rxkad.keytab before starting server processes (cfengine policy checks before starting and reports on the action needed). Had issue where we synced a bad (empty) copy of the databases by starting the server on linat02 without the keytab, and then later installed it. It appears that linat02 then became coordinator and synced an empty db to the rest. Restoring from the backup taken pre-install resolved it.
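The linat02 incident suggests checking for the restored key material before any server process is started by hand. A minimal sketch of such a pre-start check follows; it only mirrors the idea of the cfengine checks described above and is not the actual policy. The paths are the ones named in the notes; the service labels, function names and the root parameter (used for testing against a scratch directory) are hypothetical.

```python
import os

# Paths taken from the notes above. The service labels are made up for
# this sketch and do not correspond to real init script names.
REQUIRED = {
    "afs-db": ["/usr/afs/etc/rxkad.keytab"],
    "kdc": ["/etc/krb5.keytab", "/var/kerberos/krb5kdc"],
}


def missing_or_empty(paths, root="/"):
    """Return the paths that are absent, zero-length files, or empty directories."""
    problems = []
    for path in paths:
        full = os.path.join(root, path.lstrip("/"))
        if os.path.isdir(full):
            if not os.listdir(full):
                problems.append(path)
        elif not os.path.isfile(full) or os.path.getsize(full) == 0:
            problems.append(path)
    return problems


def safe_to_start(service, root="/"):
    """True only when every required path for the service is present and non-empty."""
    return not missing_or_empty(REQUIRED[service], root)
```

A check like this would have refused to start the AFS server processes on linat02 until rxkad.keytab was back in place. It says nothing about whether the restored files are the correct ones, so the pre-install backup is still needed.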
UMCFE and MSUCFE
These are the cfengine policy servers. The first step is to make the policy server list fault tolerant, i.e., if a host is at UM and umcfe does not respond, then msucfe will be consulted instead. This is necessary so that the policy servers themselves can bootstrap their policy.
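The intended fallback order can be illustrated outside of cfengine. This is a sketch of the selection logic only, not cfengine policy: the short host names come from this page, the reversed order for MSU is assumed by symmetry, 5308 is assumed as the cf-serverd port (the cfengine default), and the function names are hypothetical.

```python
import socket

# Preferred server first, per site. The MSU order is an assumption made
# by symmetry; this page only spells out the UM case.
POLICY_SERVERS = {
    "UM": ["umcfe", "msucfe"],
    "MSU": ["msucfe", "umcfe"],
}


def responds(host, port=5308, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def pick_policy_server(site, probe=responds):
    """Return the first policy server for the site that answers, or None."""
    for host in POLICY_SERVERS[site]:
        if probe(host):
            return host
    return None
```

Passing the probe as a parameter lets the ordering be exercised without touching the network, e.g. pick_policy_server("UM", probe=lambda h: h == "msucfe") simulates umcfe being down.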
UMROCKSI, MSUINFO, and MSUINFOX
umrocksi was updated; msuinfo/x were already at SL6. A new cfengine bundle "named" was created. The existing bundle "resolv" and the new "named" are together in dns.cf. The MSU zone files from bundle "bind9" were transferred to management by the named bundle. The new bundle has a common named.conf template influenced by the default named.conf and the existing named.conf from all 3 servers. The new bundle was applied to all 3 DNS servers and the bind9 bundle deleted. The arrangement of each site slaving zones from the other was retained. Any of the three DNS servers should work equally well for any host at either site.
There is probably not a need for two DNS servers at MSU since they can use the one here as secondary. We should consider removing msuinfox.
SQUIDS (cache, cache2, cache3 at UM, cache0, cache1 at MSU)
The squids at both sites were all running SL5 and should be rebuilt to SL6. cache, cache2 and cache3 are now complete.
SVN SERVER ndt
This is done. Some notes:
- Previously the system was one large 100GB volume with one /
partition. I created a new 16GB "server-generic" sized volume and
built onto that. The existing disk was resized a little bit larger
and made into an LVM physical volume. Then two LV were made - one
100GB volume to mount at /repos for the SVN repository and one 5GB
volume to mount at /home/rancid. This will make it easier to size
those volumes larger if we ever need to do so.
- Currently root login is not enabled since it listens on a public IP.
If at all possible I'd like to discontinue using SVN as root. If
there's a workflow that absolutely cannot be adjusted then I guess
it's not possible and we'll go back to root login.
- As before, I set up the keytab so you can log in to the system using a
- Pretty much everything is now managed by cfengine - viewvc configs, the webserver and /var/www/html/dq2 setup for the storage queries done from aglt2se.shtml on gate02, and very basic rancid setup (user/group and crontab). The configs and install in /home/rancid are not managed.
- 02 Mar 2014