dCache Troubleshooting

Case 1: node c-104-2 has some broken files, meaning the md5sum of the local file does not match the md5sum of the source file at BNL. We decided to recopy these files from BNL.

HOWTO
1. get the pnfsid list on c-104-2 from the database 
#psql -U srmdcache companion <sql >pnfsid_list
#more sql
select pnfsid from cacheinfo where pool like '%c-104-2%';

2. for each pnfsid, get its logical name from PNFS (run this inside the mounted PNFS namespace, e.g. under /pnfs/aglt2.org/dq2):
# cat '.(nameof)(<pnfsid>)'

3. query the BNL LRC web interface to get the physical file name (PFN) of each file  (curl http://dms02.usatlas.bnl.gov:8000/dq2/lrc/PoolFileCatalog?lfns=$lfn 2>/dev/null);

4. query the AGLT2 LRC web interface to get the PFN of each file  (curl http://linat10.grid.umich.edu:8000/dq2/lrc/PoolFileCatalog?lfns=$lfn 2>/dev/null);

5. generate a list of commands to delete the original logical file names in PNFS,
6. generate a list of source/destination pairs for the srmcp command, one pair per line, like
srm://dcsrm.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/log02/mc08/log/mc08.005985.WW_Herwig.digit.log.e325_s404_d117_tid022104_sub01790316/log.022104._00254.job.log.tgz.1__DQ2-1210951747 srm://head01.aglt2.org/pnfs/aglt2.org/dq2/mc08.005985.WW_Herwig.digit.log.e325_s404_d117_tid022104_sub01790316/log.022104._00254.job.log.tgz.1

7. delete the original pnfs names in PNFS server.
8. copy files by using "srmcp -copyjob srm_copy.list" 

PS: a Perl script that automates steps 2-6:
#!/usr/bin/perl
# Read the list of pnfsids (step 1), look up each file's LFN in PNFS (step 2),
# query the BNL and AGLT2 LRCs for the PFNs (steps 3-4), and write out the
# lists used in steps 5-8.
use strict;
use warnings;

open(PN,  "pnfsid_list")    || die "cannot open pnfsid_list\n";
open(SRM, ">srm_copy.list") || die "cannot open srm_copy.list\n";
open(RM,  ">remove.list")   || die "cannot open remove.list\n";
open(LS,  ">ls.list")       || die "cannot open ls.list\n";

while (<PN>)
{
        # a pnfsid is a 24-character hex string; skip any line that has none
        next unless /([0-9A-Fa-f]{24})/;
        my $pnfsid = $1;

        # logical file name from the PNFS "nameof" magic file
        my $lfn = `cat '/pnfs/aglt2.org/dq2/.(nameof)($pnfsid)'`;
        chomp $lfn;

        # source PFN from the BNL LRC
        my $pfn_bnl = "";
        my $curl_tmp = `curl http://dms02.usatlas.bnl.gov:8000/dq2/lrc/PoolFileCatalog?lfns=$lfn 2>/dev/null | tr -d '\\n'`;
        $pfn_bnl = $1 if $curl_tmp =~ /\"(srm\S*)\"\/\>/;
        next if $pfn_bnl eq "";

        # destination PFN from the AGLT2 LRC
        my $pfn_aglt2 = "";
        $curl_tmp = `curl http://linat10.grid.umich.edu:8000/dq2/lrc/PoolFileCatalog?lfns=$lfn 2>/dev/null | tr -d '\\n'`;
        $pfn_aglt2 = $1 if $curl_tmp =~ /\"(srm\S*)\"\/\>/;
        next if $pfn_aglt2 eq "";

        # local PNFS path of the broken copy (strip the srm:// prefix)
        my $pnfs_path = $pfn_aglt2;
        $pnfs_path =~ s/srm:\/\/head01\.aglt2\.org//;

        print LS  "ls -l $pnfs_path\n";
        print RM  "rm -f $pnfs_path\n";           # step 7: delete the broken PNFS entry
        print SRM "$pfn_bnl $pfn_aglt2\n";        # step 8: srmcp copyjob pair
}

close(PN); close(SRM); close(RM); close(LS);
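Once the lists are generated, the rest of steps 7 and 8 is roughly the sequence below (a sketch: it assumes the lists above were produced on a node with the PNFS namespace mounted and an SRM client installed; check remove.list by hand before running it, since it deletes entries from PNFS):

sh ls.list                      # sanity check that the broken destination files still exist in PNFS
sh remove.list                  # step 7: delete the broken entries from the PNFS namespace
srmcp -copyjob srm_copy.list    # step 8: recopy the files from BNL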

CASE 2: several cells complain about memory heap problems, and on umfs07 (4 pools), dCache can only enable the first 3 pools; the 4th pool is disabled at startup.

Solution: give the Java virtual machine more memory. Each dCache cell service runs in a separate JVM, so determine the maximum memory you can assign to each service. On our file-server pool nodes we typically run 5 dCache services per node, so we give each of them 2 GB of memory.
#vi /opt/d-cache/config/dCacheSetup
java_options="-server -Xmx2048m -XX:MaxDirectMemorySize=2048m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Djava.net.preferIPv4Stack=true -Dorg.dcache.dcap.port=0 -Dorg.dcache.net.tcp.portrange=33115:33145 -Dlog4j.configuration=file:${ourHomeDir}/config/log4j.properties "

The default value is "-server -Xmx512m -XX:MaxDirectMemorySize=512m"; change both values to 2048.
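To confirm that each dCache JVM on a pool node actually picked up the new limit after a restart, something like the following works (a sketch using standard tools; adjust to your own process layout):

# list the heap limits of the running dCache Java processes on this node
ps -ef | grep [j]ava | grep -o '\-Xmx[0-9]*[mg]'
# and check that the total (heap + MaxDirectMemorySize per service) still fits in physical RAM
free -m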

CASE 3: DB trouble

srm Log Errors:
10/15 06:31:07 Cell(SRM-head01@srm-head01Domain) : Request id=-2141913243: getSrmStatusOfGetRequestResponse: StatusCode = SRM_REQUEST_INPROGRESS FileStatusCode = SRM_REQUEST_INPROGRESS
Sleep 3 s, try to get connection ... tries left: 648
createConnection(): Got exception org.postgresql.util.PSQLException, SQLState: 53300
solution
There was a moment when I queried the db with
select * from srmspace;

and nothing was in it. That table is the one that keeps our space token information. Without my doing anything, it came back a while later, but there were genuinely some errors in the db entries:
dcache=> select id,vogroup,state,usedspaceinbytes,allocatedspaceinbytes from srmspace where description='ATLASUSERDISK';
   id   |    vogroup     | state | usedspaceinbytes | allocatedspaceinbytes
--------+----------------+-------+------------------+-----------------------
 610964 | /atlas         |     0 |                0 |
 610963 | /atlas/usatlas |     0 |     115977252441 |             409875348

For the first entry, allocatedspaceinbytes is null. With this null, all write requests to USERDISK fail, complaining that the dst file is not a valid name. After I set it to 0, the transfers succeeded.

ATLASPRODDISK had the same error, so I corrected both entries.
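For reference, a minimal sketch of the correction (the database name dcache comes from the prompt above; the srmdcache role and the exact description string for the PRODDISK token are assumptions, so check the WHERE clause against your own srmspace rows first):

# set the missing allocation to 0, then verify
psql -U srmdcache dcache -c "UPDATE srmspace SET allocatedspaceinbytes=0 WHERE allocatedspaceinbytes IS NULL AND description IN ('ATLASUSERDISK','ATLASPRODDISK');"
psql -U srmdcache dcache -c "SELECT id,vogroup,state,usedspaceinbytes,allocatedspaceinbytes FROM srmspace WHERE description IN ('ATLASUSERDISK','ATLASPRODDISK');"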
I think these two entries are the ones I inserted manually some time ago. I don't know whether their values disappeared or I simply missed them in the first place, but it seems that write requests to these space tokens fail unless a valid value is there.

CASE 4: PostgreSQL DB transaction ID wrap-around (Jan 19th 2009)

The PNFS server went down because the PostgreSQL DB that hosts the PNFS server's data ran into the transaction ID wraparound problem. This usually happens after a large number of deletions or updates in the PostgreSQL db. For daily maintenance we have a cron job which runs every 12 hours to vacuum the db automatically, but if a large number of entries (more than 10,000 files) are deleted from the PNFS name space before the cron job runs, the transaction ID wraparound can still happen.

cron job on head02

0 */12 * * * psql -U pnfsserver admin -c "VACUUM ANALYZE VERBOSE;";psql -U pnfsserver data1 -c "VACUUM ANALYZE VERBOSE;";psql -U pnfsserver data2 -c "VACUUM ANALYZE VERBOSE;";psql -U pnfsserver data3 -c "VACUUM ANALYZE VERBOSE;";psql -U pnfsserver data4 -c "VACUUM ANALYZE VERBOSE;";psql -U pnfsserver data5 -c "VACUUM ANALYZE VERBOSE;";psql -U pnfsserver data6 -c "VACUUM ANALYZE VERBOSE;";psql -U pnfsserver data7 -c "VACUUM ANALYZE VERBOSE;";psql -U pnfsserver data8 -c "VACUUM ANALYZE VERBOSE;";psql -U pnfsserver data9 -c "VACUUM ANALYZE VERBOSE;";psql -U postgres template0 -c "VACUUM ANALYZE VERBOSE;";psql -U postgres template1 -c "VACUUM ANALYZE VERBOSE;";psql -U postgres prod -c "VACUUM ANALYZE VERBOSE;";psql -U postgres caldata -c "VACUUM ANALYZE VERBOSE;";psql -U postgres calibdata -c "VACUUM ANALYZE VERBOSE;"
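To see how close each database is to wraparound before it becomes an emergency, the standard PostgreSQL catalog query below can be run by hand or from cron (a sketch; PostgreSQL stops accepting writes as the age approaches roughly 2 billion transactions):

# age(datfrozenxid) = transactions since each database was last frozen by VACUUM
psql -U postgres template1 -c "SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;"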
to recover the system from this problem , please refer to the page Vacuum Postgresql DB

after the vacuum,

do these on head02

1) service postgresql start
2) service pnfs stop; service pnfs start
3) service dcache stop

4) examine system to see that there are no dcache processes running

5) service dcache start

But this time srm was still not working after these steps. Checking head01, the logfile (/opt/d-cache/libexec/apache-tomcat-5.5.20/logs/catalina.out) still complained about the PostgreSQL db on head01 (not the one we vacuumed, which is on head02). The cause was:

1. /var was full; this is the partition that also hosts the PostgreSQL db on head01. It filled up because I did not stop dCache on head01 during the PNFS downtime, and all the communication during the downtime produced a huge /var/log/dCacheDomain.log (51 GB). I rotated this log file to release space for the PostgreSQL db on head01.

Note: before releasing space on /var, stop the postgresql service; after the space is successfully released, start the postgresql service again and then restart the dcache service on head01.

CASE 5: Files lost their pool information (dCache metadata doesn't know which pool has the file)

This is from an email session:

Errors in the dashboard:

From AGLT2_PRODDISK to BNLPANDA is failing at high rate:
>Fail(137.0)/Success(442.0)

number of  errors with following message:  137
Error message from FTS:  [FTS] FTS State [Failed] FTS Retries [1] Reason
[SOURCE error during PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare
source file in 180 seconds] Source Host [head01.aglt2.org]

Things Shawn has found

Hi Wenjing,

I am looking into these failures.  I found the most recent one at:

http://dashb-atlas-data.cern.ch/dashboard/request.py/file-placement?site=BNLPANDA&fromDate=2009-02-12%2009:23:00&toDate=2009-02-12%2013:23:00&state=ATTEMPT_DONE&errorType=182

Using your file query page I get:

/pnfs/aglt2.org/atlasproddisk/mc08/HITS/mc08.105013.J4_pythia_jetjet.simul.HITS.e344_s479_tid039686_sub04198759/HITS.039686._20141.pool.root.1
pnfsid 00090000000000000238A268
size 59846145
chksum 639c4aa6
pools

putfilerequests history SUC usatlas1 msufs07.aglt2.org at Wed Feb 11 19:10:50 EST 2009 state Done : SrmPutDone called

getfilerequests history CANCEL at Thu Feb 12 07:12:42 EST 2009 state Canceled : set by setStatus to "Done"

srmspacefile record state=2 deleted=0 ATLASPRODDISK/atlas/production

Some quick questions: Should there be pool(s) listed for this file? Any idea what is wrong?

Reply from Wenjing

Apparently, these files are either lost forever, or they are physically in the pools but failed to be registered in PNFS.

Here is a page that compares the consistency, to see whether any pools have unregistered files:

http://head02.aglt2.org/dcache_pool_fileno.html

but it cannot tell whether this particular pnfsid failed or not. So to find the file, you need to go to each pool and physically look for it (for example with a loop like the one sketched below); if it is in some pool, do a pnfs register from that pool, otherwise it is just lost forever (sometimes the file may never have reached the pools in the first place).
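A rough sketch of that search (the host list is a placeholder and the /dcache*/pool/data layout is taken from the msufs09 example further down; adjust both to your own pool nodes):

PNFSID=00090000000000000238A268
for host in msufs07 msufs08 msufs09; do
    ssh $host "ls -l /dcache*/pool/data/$PNFSID 2>/dev/null" && echo "found on $host"
done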

Shawn's question again

It did show:

"
putfilerequests history
SUC usatlas1 msufs07.aglt2.org at Wed Feb 11 19:10:50 EST 2009 state Done :
SrmPutDone called"

Isn't this the registration time/node?

Reply from Wenjing

No, this is just the recent access history for this file:

"put" means the file was written at that time,
SUC means the put operation was successful,
msufs07 is the gridftp server used to transfer the file,
and what follows is the "errormessage".

And the 2 GET operations were actually cancelled.


Note: this file is not size 0, so it must have been written somewhere, but I find no information in the msufs07 gridftp log, either by pnfsid or by filename; usually there would be some entries like that.

And I found the file in the msufs09_4 pool:

[root@msufs09 ~]# ll -h  /dcache3/pool/data/00090000000000000238A268
-rw-r--r--  1 root root 58M Feb 11 19:10 
/dcache3/pool/data/00090000000000000238A268


So it is just a failure of registration. This is the information from the consistency page:

poolname   db_fno  cell   disk   used_space
msufs09_4  31921   31997  31990  8912         register 76 from cell msufs09_4

(the cell reports 31997 files while the db knows about 31921, a difference of 76, hence "register 76 from cell msufs09_4")

To restore consistency, log in to the admin interface from either head01 or head02 and re-register the pool's files:

ssh -l admin -c blowfish -p 22223 localhost
[head01.aglt2.org] (local) admin > cd msufs09_4
[head01.aglt2.org] (msufs09_4) admin > pnfs register
[head01.aglt2.org] (msufs09_4) admin > show pinboard


And then check the file count in the mysql companion db,

on head02:
mysql companion -e "select count(*) from cacheinfo where pool='msufs09_4'";


If the count is still not right, you can try deleting that pool's entries from the mysql companion db:

"delete from cacheinfo where pool='msufs09_4'"

and then repeat the pnfs register from the admin interface.

Now when you query the file, you get:

/pnfs/aglt2.org/atlasproddisk/mc08/HITS/mc08.105013.J4_pythia_jetjet.simul.HITS.e344_s479_tid039686_sub04198759/HITS.039686._20141.pool.root.1
pnfsid 00090000000000000238A268
size 59846145
chksum 639c4aa6
pools msufs09_4

putfilerequests history
SUC usatlas1 msufs07.aglt2.org at Wed Feb 11 19:10:50 EST 2009 state Done : SrmPutDone called

getfilerequests history
CANCEL at Thu Feb 12 07:12:42 EST 2009 state Canceled : set by setStatus to "Done"
CANCEL at Thu Feb 12 10:30:07 EST 2009 state Canceled : set by setStatus to "Done"

srmspacefile record
state=2 deleted=0 ATLASPRODDISK/atlas/production

But the reason why the pnfsid failed to register in the first place is still unknown. It happened to msufs08 before; that was because msufs08 had a zombie dcache process, but I checked msufs09 and its processes seem fine.

CASE 6: dCache pool is flooded with files regardless of the allocated size in the configuration file

Question to the dCache supporters:
Hi, Dear  Supporters

Several of our pools have been flooded with a tremendous amount of data these days. Thinking it would be useful to limit the usable space of each pool, I set the disk space in the pool's setup file like this:
 set max diskspace 9030g

but the actual used space for this pool is 9504213620 = 9062G, which is pretty much the total capacity of the whole partition, and then I see transfer errors like "No space left on the device". This is a disaster to me; it seems there is no way to tell dCache to stop flooding itself.

But I do remember that the last time I ran into a very similar situation and restarted the pool, the pool service refused to start because it said the overall file size in the pool exceeded the allocated space. So I think dCache checks the configured space at startup, but maybe not while running? If so, how can we best avoid this?
 

Reply from dcache supporters


dCache will fill the pool with files up to the limit you have configured. dCache only accounts for the logical size of the files and has no knowledge of file system overhead. It is your responsibility to leave sufficient room for partially filled disk blocks and other file system overheads. You must also leave sufficient room for the control or meta directories. You could even decide to put those on a different file system.

In Java 5 there is no way to determine the amount of free space on a 
file system. This functionality was added in Java 6, but as long as we 
still support Java 5 we cannot make use of this new feature.

Finally I have to admit that dCache 1.8 and earlier has suffered from 
over allocation bugs. Due to these bugs dCache would "miscalculate" how 
much data it had stored on a pool and regularly end up storing more than 
it was configured to store. 1.9.0 and 1.9.1 have been reported to be 
much better in this respect.

To know if you suffer from this over allocation bug, you should do a 
'rep ls' on the pool and compute the sum of the individual files and 
compare that to the totals in the 'info' command. Make sure the pool is 
idle while you do so (no transfers). Also compare this to an ls of the 
data directory and see if there is any mismatch.

Notice that if you use the du command, it defaults to reporting physical 
file size (i.e. disk usage). Use the --apparent-size option to obtain 
the size as dCache sees it. To my knowledge there is no portable way to 
obtain the disk usage of a file from within Java, so there is no way 
dCache can take the file system overhead into account.

Cheers,

/gerd
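A minimal sketch of the check Gerd describes, run on the pool node while the pool is idle (the /dcache3/pool/data path is just the example from the previous case; compare all three numbers against the replica count from 'rep ls' and the space totals from 'info' in the pool's admin cell):

ls /dcache3/pool/data | wc -l                             # number of replica files on disk
du -s --apparent-size --block-size=1 /dcache3/pool/data   # logical size, what dCache accounts for
du -s --block-size=1 /dcache3/pool/data                   # physical usage including file system overhead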


-- WenjingWu - 20 May 2008