What to do after losing the dCache partitions of a node

When a Rocks rebuild fails, the dCache partitions on the node can be wiped. The files are then physically lost, but dCache is not aware of the loss: the corresponding file names are still in the PNFS namespace, and the cache locations are still in the dCache companion database.

We also have a policy: if a node crashes and loses its dCache data, we retire that node from the dCache system permanently.

Follow the steps below.

Take node c-6-32 as an example.

1 retire it from dCache

1.1 remove it from site-info.def

1.1.1 delete the pool entries
cd /home/install/extras

In site-info.def, delete these entries:

DCACHE_POOLS="${DCACHE_POOLS} c-6-32.aglt2.org:all:/dcache"
DCACHE_POOLS="${DCACHE_POOLS} c-6-32.aglt2.org:all:/dcache1"
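
The deletion can also be scripted. A minimal sketch, assuming every pool entry for the node has the DCACHE_POOLS form shown above; `remove_node_pools` is a hypothetical helper name, and the default domain aglt2.org is taken from the entries above:

```sh
# Remove every DCACHE_POOLS line for one node from a site-info.def style file.
# Usage: remove_node_pools FILE NODE [DOMAIN]
# (hypothetical helper; a backup is kept as FILE.bak)
remove_node_pools() {
    file=$1
    node=$2
    domain=${3:-aglt2.org}
    # Escape dots so they match literally in the grep pattern.
    host=$(printf '%s.%s' "$node" "$domain" | sed 's/\./\\./g')
    cp -p "$file" "$file.bak" || return 1
    # grep -v exits 1 when it prints nothing; that is not an error here.
    grep -v "^DCACHE_POOLS=.* ${host}:" "$file.bak" > "$file" || [ $? -eq 1 ]
}
```

For example, `remove_node_pools /home/install/extras/site-info.def c-6-32` leaves the previous version in site-info.def.bak; review the result with `svn diff` before checking it in.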
 

1.1.2 check in the changes to svn

svn ci site-info.def -m "wuwj remove node c-6-32 from dcache"
Note: this node runs only the pool service. If a node also runs dcap or gsiftp doors, remove those entries as well.

1.2 remove the pools from PoolManager

cd /opt/d-cache/config

Delete these entries from the PoolManager configuration file (typically PoolManager.conf):

psu create pool c-6-32_1
psu create pool c-6-32_2
psu addto pgroup ResilientPools c-6-32_1
psu addto pgroup ResilientPools c-6-32_2
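
Before reloading, it can help to confirm that no psu line still mentions the retired pools. A minimal sketch, assuming the pool names follow the pattern shown above (node name, underscore, number) and that the operator passes in the file to check; `node_pool_lines` is a hypothetical helper, not part of dCache:

```sh
# Print (with line numbers) every psu line that ends in a pool of the given node.
# Exit status is 0 if such lines exist and 1 if the file is clean.
# Usage: node_pool_lines FILE NODE
node_pool_lines() {
    file=$1
    node=$2
    grep -n "^psu .* ${node}_[0-9][0-9]*[[:space:]]*\$" "$file"
}
```

An empty result for the retired node means the file is ready for the reload step.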

Then log in to the dCache admin interface on head01 and reload the PoolManager configuration:

root@head01 /opt/d-cache/config# !ssh
ssh -1 -l admin -c blowfish -p 22223 localhost

    dCache Admin (VII) (user=admin)


[head01.aglt2.org] (local) admin > cd PoolManager
[head01.aglt2.org] (PoolManager) admin > reload -yes


2 determine the logical file names of the lost files, and remove them from the PNFS namespace

2.1 determine which logical file names have their only copy on the crashed node
2.1.1 get the list of pnfsids
Some files have more than one copy. If only one of several copies is lost, keep the file name, because it still refers to the remaining copies.

The list of pnfsids whose only copy was on the crashed node can be obtained with the following SQL command on head02:

 mysql companion -e "select pnfsid from cacheinfo where pnfsid in (select pnfsid from cacheinfo where pool like '%c-6-32%') group by pnfsid having count(pnfsid)='1';"
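
The query above can be parameterised by node name so that its output goes straight into a file of pnfsids. A minimal sketch; `unique_copy_sql` is a hypothetical helper, and -N (no column header) and -B (batch output) are standard mysql client options:

```sh
# Print the SQL that lists pnfsids with exactly one replica overall,
# where that replica is on a pool of the given node.
# Usage: unique_copy_sql NODE
unique_copy_sql() {
    node=$1
    # %% prints a literal % for the LIKE pattern; %s is the node name.
    printf "SELECT pnfsid FROM cacheinfo WHERE pnfsid IN (SELECT pnfsid FROM cacheinfo WHERE pool LIKE '%%%s%%') GROUP BY pnfsid HAVING COUNT(pnfsid) = 1;\n" "$node"
}

# Example (run on head02, not executed here):
#   mysql -N -B companion -e "$(unique_copy_sql c-6-32)" > filelist
```

Note that a file whose only replicas sat on two pools of the same node (for example c-6-32_1 and c-6-32_2) has a count of 2 and is not returned by this query, even though every copy is gone; it may be worth checking for such files separately.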

2.1.2 get the list of logical file names
On head02:

root@head02 ~# cd dcache_adm_script/dCache/ls_pnfsfile_attribute/

Dump the list of pnfsids from step 2.1.1 to the file "filelist" in this directory, then run ls.pl:

root@head02 ~/dcache_adm_script/dCache/ls_pnfsfile_attribute# ./ls.pl

ls.pl generates a list called "File_attr.out". Use awk to extract the file names, then remove them from PNFS.
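
The awk step might look like the sketch below. The layout of File_attr.out is not described here, so this ASSUMES whitespace-separated columns with the pnfsid in column 1 and the full path in column 2, and that all paths start with /pnfs/; adjust the column number to the real output. `extract_paths` and `remove_paths` are hypothetical helpers, and removal is a dry run unless --force is given:

```sh
# Print the path column of a File_attr.out style listing.
# ASSUMPTION: column 1 = pnfsid, column 2 = path (no spaces in paths).
extract_paths() {
    awk 'NF >= 2 { print $2 }' "$1"
}

# Read paths on stdin. By default only print the rm commands;
# with --force, actually remove the files.
remove_paths() {
    mode=${1:-dry-run}
    while IFS= read -r path; do
        case $path in
            /pnfs/*) ;;    # only ever touch PNFS paths (assumed prefix)
            *) echo "skipping: $path" >&2; continue ;;
        esac
        if [ "$mode" = "--force" ]; then
            rm -- "$path"
        else
            echo "rm -- $path"
        fi
    done
}

# Example (not executed here):
#   extract_paths File_attr.out | remove_paths           # review the list
#   extract_paths File_attr.out | remove_paths --force   # really remove
```

Running the dry run first gives a chance to spot a wrong column or an unexpected path before anything is deleted from the namespace.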

3 remove the cache locations from the dCache companion db

On head02, execute the following SQL statement:
mysql companion -e "delete from cacheinfo where pool like '%c-6-32%';"



-- WenjingWu - 30 Jan 2009