Automated Test Programs on AGLT2

Cluster Related

PNFS mount point test

Purpose Make sure every compute node has "/pnfs/aglt2.org" mounted, and that every gridftp door node has "/pnfs/aglt2.org" mounted and also has "/pnfs/ftpBase", which is a soft link to "/pnfs/aglt2".

Frequency every 4 hours

Alert Sends email to aglt2-hardware@umich.edu

cron service

umopt1.aglt2.org (check all compute nodes at UM)
0 0-23/4 * * * /usr/bin/perl /home/install/wuwj_extras/bin/check_pnfs_UM_wn.pl 2> /dev/null 1> /dev/null

msurox.aglt2.org (check all compute nodes and MSU fs servers at MSU)
0 0-23/4 * * * /usr/bin/perl /home/install/extras/bin/check_pnfs_MSU.pl 2> /dev/null 1> /dev/null

umopt1.grid.umich.edu (check fs servers at UM; runs as user wuwj, whose AFS token must be renewed every 30 days, because the UM fs servers do not allow passwordless root ssh)
0 0-23/4 * * * /usr/bin/perl /home/install/wuwj_extras/bin/check_pnfs_UM_sv.pl 2> /dev/null 1> /dev/null
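
The two conditions in the Purpose above can be sketched as a local check. This is only an illustration in Python, not the Perl scripts named in the cron entries (which are not shown on this page and which check remote nodes). The function name check_pnfs and its parameters are hypothetical; only the two paths come from this page.

```python
import os

def check_pnfs(node_is_door, mount_point="/pnfs/aglt2.org", link="/pnfs/ftpBase"):
    """Return a list of problems found on the local node (empty if healthy).

    node_is_door is a hypothetical flag: True for a gridftp door node,
    which must also have the ftpBase soft link.
    """
    problems = []
    # Every node needs the PNFS mount.
    if not os.path.ismount(mount_point):
        problems.append(f"{mount_point} is not mounted")
    # Door nodes additionally need the soft link to exist.
    if node_is_door and not os.path.islink(link):
        problems.append(f"{link} is missing or is not a symbolic link")
    return problems
```

Calling check_pnfs(node_is_door=True) on a door node would return an empty list when both conditions hold; a real check would collect the returned messages from every node and mail them to the alert address.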

Host Cert expiration check

Purpose Check the expiration date of all host certs, which are stored on umopt1.aglt2.org

Frequency every day

Alert
Sends email about expiring certs (less than a month before expiration) to aglt2-hardware@umich.edu
Displays the expiration date of each cert on this web page: monitor_cert

cron service

umopt1.aglt2.org (check all certs for UM and MSU nodes)
0 5 * * * /usr/bin/perl /home/install/extras/bin/check_cert.pl
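
The "less than a month" rule can be sketched as follows. This is a Python illustration, not the check_cert.pl script itself. It assumes the notAfter dates have already been extracted as openssl-style strings, and it treats "a month" as 30 days; the host names in the usage note and the function names are hypothetical.

```python
import ssl

SECONDS_PER_DAY = 86400

def days_until_expiry(not_after, now):
    """not_after: date string such as 'Jan  5 09:34:43 2018 GMT'; now: epoch seconds."""
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY

def expiring_certs(certs, now, threshold_days=30):
    """certs maps host name -> notAfter string.

    Returns (host, days_left) pairs below the threshold, soonest first.
    The 30-day threshold is an assumption standing in for "a month".
    """
    report = []
    for host, not_after in certs.items():
        days_left = days_until_expiry(not_after, now)
        if days_left < threshold_days:
            report.append((host, days_left))
    return sorted(report, key=lambda item: item[1])
```

For example, with now set to midnight on 1 Jan 2009 GMT, a cert for a hypothetical host "node-a" that expires on 11 Jan 2009 would be reported with 10 days left, while one that expires in June would not be reported.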

dCache Related

check dead pools of dCache

Purpose Check whether any pools are in the dead state

Frequency every 5 minutes

Alert Sends email to Wenjing and Shawn if any fs pools become dead.

Cronjob

head02.aglt2.org
*/5 * * * * cd /root/dcache_adm_script/dCache/check_poolstate;perl report_dead.pl

cleandb

Purpose Clean stale entries from the SRM database. A stale entry left behind by a failed write would otherwise prevent a user from writing a file with the same name to dCache.

Frequency every 10 minutes

Alert None

Cronjob

head02.aglt2.org
*/10 * * * * cd /root/dcache_adm_script/dCache/clean_sp_db/;/usr/bin/perl cleandb.pl

srm put/get report/statistics

Purpose
Report the success and failure rates of SRM PUT/GET requests within each space token area
Classify error messages
Rotate the SRM requests DB (delete entries more than 4 hours old)

Frequency every 4 hours

Alert Sends email to Wenjing, Shawn and Bob if there are any unusual (fatal) failures.

Cronjob

head02.aglt2.org
0 0-23/4 * * * cd /root/dcache_adm_script/dCache/srm_err_report; perl report.pl
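
The rate calculation and the 4-hour rotation can be sketched as below. This Python illustration is not report.pl; the record layout (a dict with "time", "token", "type" and "ok" keys) and both function names are assumptions, since the SRM database schema is not shown on this page. Error classification is left out.

```python
from collections import defaultdict

def success_rates(records):
    """Return {(space_token, request_type): fraction of requests that succeeded}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [succeeded, failed]
    for rec in records:
        index = 0 if rec["ok"] else 1
        totals[(rec["token"], rec["type"])][index] += 1
    return {key: ok / (ok + failed) for key, (ok, failed) in totals.items()}

def rotate(records, now, max_age_seconds=4 * 3600):
    """Keep only records no older than max_age_seconds (4 hours by default)."""
    return [rec for rec in records if now - rec["time"] <= max_age_seconds]
```

A run would first compute success_rates over the current records for the report, then call rotate so that the next run only sees requests from the latest 4-hour window.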

stat_fileno

Purpose Compare the number of files reported by each pool cell with the number of files registered in the PNFS DB, to see whether any pools failed to register files in PNFS

Frequency every day

Alert Displays the file counts from each pool cell and from the PNFS DB on this monitor page: FileNO_Stat

Cronjob

head02.aglt2.org
0 8 * * * cd /root/dcache_adm_script/dCache/stat_fileno_inpool; perl stat_fileno.pl poollist
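
The comparison itself amounts to a per-pool difference of two counts. A minimal Python sketch follows, assuming the counts have already been gathered into two dictionaries keyed by pool name; how stat_fileno.pl actually obtains them is not shown on this page, and the function name is hypothetical.

```python
def find_mismatches(pool_counts, pnfs_counts):
    """Return {pool: (files_in_pool_cell, files_in_pnfs_db)} for pools whose counts differ.

    A pool missing from one side is treated as having a count of 0 there.
    """
    mismatches = {}
    for pool in sorted(set(pool_counts) | set(pnfs_counts)):
        in_pool = pool_counts.get(pool, 0)
        in_pnfs = pnfs_counts.get(pool, 0)
        if in_pool != in_pnfs:
            mismatches[pool] = (in_pool, in_pnfs)
    return mismatches
```

A pool that holds more files than PNFS has registered would show up with the first number larger than the second, which is the case this test is meant to catch.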

Stat Usage of Typical Pools

Purpose
For each space token, list all affiliated pools and their usage.
For each fs node, list the groups of all its pools and their usage.

Frequency every 8 hours

Alert Displays the statistics on this web page: Typical Pool Usage

Cronjob

head01.aglt2.org
0 */8 * * * cd /root/dcache_admin_script/stat_dcache_pools;perl stat_poollist.pl
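
The schedules on this page use ranges and step values in the cron time fields, such as "0-23/4", "*/5", "*/10" and "*/8". As a reading aid, the sketch below expands one field into the values at which it fires. It is a simplified Python illustration that handles only "*", single numbers, ranges, steps and comma lists, and it is not how cron itself is implemented; the function name is hypothetical.

```python
def expand_field(field, low, high):
    """Expand one cron time field into the sorted list of values it matches.

    low and high are the limits of the field, e.g. 0 and 23 for hours.
    """
    values = set()
    for part in field.split(","):
        spec, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if spec == "*":
            start, end = low, high
        elif "-" in spec:
            first, last = spec.split("-")
            start, end = int(first), int(last)
        else:
            start = end = int(spec)
        values.update(range(start, end + 1, step))
    return sorted(values)

print(expand_field("0-23/4", 0, 23))  # hours for the 4-hourly jobs
print(expand_field("*/8", 0, 23))     # hours for the pool usage job
```

The hour field "0-23/4" expands to 0, 4, 8, 12, 16 and 20, and "*/8" expands to 0, 8 and 16, so with the minute field set to 0 those jobs run on the hour at those times.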

-- WenjingWu - 11 Dec 2008
Topic revision: r3 - 16 Oct 2009, TomRockwell