AGLT2 SRM Hangs

Starting in late April 2009, AGLT2 experienced a growing number of dCache/SRM issues. One problem that became significantly more frequent was SRM failing to respond.

We monitor SRM via SRMwatch 1.1. Normally this works fine, but after the dCache upgrade SRMwatch stopped responding. I restarted SRMwatch on head01 so we can see the SRM status (if SRMwatch is not responding or gives a 404 error, just run /opt/d-cache/srmwatch-1.1/deploy_srmwatch). See AGLT2_srmwatch.
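The manual check described above could be automated along these lines. This is only a sketch: the deploy script path comes from the text, but the SRMwatch URL you pass in, the timeout, and the function names are assumptions made for illustration, not part of the AGLT2 setup.

```python
import subprocess
import urllib.error
import urllib.request

# Path taken from the text above.
DEPLOY_SCRIPT = "/opt/d-cache/srmwatch-1.1/deploy_srmwatch"


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if there is no response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when the SRMwatch webapp is not deployed
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure


def needs_redeploy(status):
    """SRMwatch needs redeploying if it is silent or answers 404."""
    return status is None or status == 404


def check_and_redeploy(url, fetch=fetch_status, deploy=None):
    """Run the deploy script when SRMwatch is unreachable; return True if it ran."""
    if deploy is None:
        def deploy():
            subprocess.run([DEPLOY_SCRIPT], check=True)
    if needs_redeploy(fetch(url)):
        deploy()
        return True
    return False
```

Calling check_and_redeploy() with the real SRMwatch URL on head01 (not given here) from a cron job would be one way to use it. The fetch and deploy arguments exist only so the logic can be exercised without touching the network or the real script.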

srm_fail_aglt2.png

You can see it stopped working at about 1:40 AM and stayed down until I restarted it this morning; then it failed again.

I looked into the log files on head01 and searched for "FATAL" in catalina.out (for SRM on head01):

2009-05-09 10:42:37.584 () [] org.dcache.srm.server.SRMServerV2.getFailedResponse(SRMServerV2.java:330) FATAL  - getFailedResponse invocation failed for SrmPingResponse.setReturnStatus

This is what the catalina.out file shows when SRM is down. Looking at the context I find:

ec/apache-tomcat-5.5.20/logs/catalina.out
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:619)
2009-05-09 10:42:37.584 () [] org.dcache.srm.server.SRMServerV2.getFailedResponse(SRMServerV2.java:330) FATAL  - getFailedResponse invocation failed for SrmPingResponse.setReturnStatus
2009-05-09 10:42:56.845 () [] org.dcache.srm.server.SRMServerV2.handleRequest(SRMServerV2.java:255) ERROR  - SRM Authorization failed
org.dcache.srm.SRMAuthorizationException: authRequestID 1544188991 Message to gPlazma timed out for authorization of /O=dutchgrid/O=users/O=nikhef/CN=Kors Bos
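A search like this can also be scripted. The sketch below collects the FATAL lines and counts gPlazma timeouts per certificate DN; the regular expression is written against the message format seen in the excerpt above and is an assumption about how stable that format is, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the gPlazma timeout message as it appears in the excerpt above.
TIMEOUT_RE = re.compile(r"Message to gPlazma timed out for \w+ of (?P<dn>/\S.*)$")


def scan_catalina(lines):
    """Return (fatal_lines, Counter of DNs whose gPlazma request timed out)."""
    fatal = []
    timeouts = Counter()
    for line in lines:
        line = line.rstrip()
        if " FATAL " in line:
            fatal.append(line)
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group("dn")] += 1
    return fatal, timeouts
```

To use it, open catalina.out on head01 and pass the file object to scan_catalina(); timeouts.most_common() then shows which users were hit most often.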

So this appears to be an authorization issue: something is wrong with gPlazma. I found this (see http://www.gridpp.ac.uk/wiki/Random_dCache_failures_in_SAM ):

gPlazma timed out
This is probably due to the fact that the gPlazma cell is being used, rather than the module. The difference here is that with the module, other dCache cells directly call the methods of gPlazma to do the authorisation. However, with the cell, there is a dedicated process which other cells must talk to. This can lead to time outs if there are problems with inter-cell communication. 
+ lcg-cp -v --vo ops lfn:SRM-put-heplnx204.pp.rl.ac.uk-1186472286 file:/home/samops/.same/SRM/nodes/heplnx204.pp.rl.ac.uk/testFile.txt
the server sent an error response: 530 530 Authorization Service failed: diskCacheV111.services.authorization.AuthorizationServiceException: authRequestID  761915796 Message to gPlazma timed out for authentification of /C=CH/O=CERN/OU=GRID/CN=Judit Novak 0973 - ops
lcg_cp: Invalid argument
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
VO name: ops
+ result=1

You can refer to https://srm.fnal.gov/twiki/bin/view/SrmProject/GPlazmaHowTo for information on "cell" vs. "module" modes. At AGLT2 we use the "Cell" mode exclusively and have "Module" set to false. This is the "preferred" mode, but apparently it can block. (Note that these are older docs, based on dCache 1.7.)
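To make the difference concrete, here is a toy model of the two modes as the GridPP note describes them. It is not dCache code: the class, the function names, and the simulated delay are all invented for illustration. In "module" mode the caller runs the authorization logic directly; in "cell" mode it sends a message to a dedicated worker and waits for a reply, which can time out.

```python
import queue
import threading
import time


def authorize(dn):
    """Stand-in for the real authorization logic (hypothetical)."""
    return "mapped:" + dn


def authorize_via_module(dn):
    """Module mode: the caller invokes the authorization code directly."""
    return authorize(dn)


class GPlazmaCellSketch:
    """Cell mode: a dedicated worker answers requests sent as messages."""

    def __init__(self, delay=0.0):
        self.requests = queue.Queue()
        self.delay = delay  # simulated inter-cell communication delay (seconds)
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            dn, reply_box = self.requests.get()
            time.sleep(self.delay)
            reply_box.put(authorize(dn))

    def authorize(self, dn, timeout=1.0):
        reply_box = queue.Queue(maxsize=1)
        self.requests.put((dn, reply_box))
        try:
            return reply_box.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("Message to cell timed out for " + dn) from None
```

With no delay both paths return the same answer. With GPlazmaCellSketch(delay=0.5) and a 0.05 second timeout, the cell call raises TimeoutError while the module call is unaffected, which mirrors the failure seen in the logs.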

Since we already have the newest dCache and GUMS 1.3.14, there are a number of other things to consider at this point to improve the situation. See the details in the following:

  • Upgrade Postgres. While we upgraded Postgres on the PNFS node during our dCache upgrade (8.1.14 to 8.3.7), we didn't do that on the dCache ADMIN node (head01).
  • Reconfigure gPlazma to use the new XACML option as well as provide a backup text file authorization for common accounts
  • Use both "Cell" and "Module" options for the dCacheConfig

We are trying each of these to attempt to increase the robustness of the setup at AGLT2. I will try to update the above links with anything we learn from these modifications.

-- ShawnMcKee - 09 May 2009
Topic revision: r1 - 09 May 2009, ShawnMcKee