Starting on the morning of May 4th, 2009, AGLT2 began upgrading from dCache 1.8.0-15p12 to a 1.9.2 release as well as migrating from PNFS to Chimera. This was motivated by an increasing number of issues, many of which were supposed to be addressed in the 1.9.2 version of dCache. One pressing problem was that SRM had been failing repeatedly over the prior weekend.
The initial dCache migration was quick, but the upgrade from PNFS to Chimera was VERY slow. Eventually the problem was traced to the version of Postgres we were using: 8.1.14. By late afternoon on Thursday, May 7th, we had migrated only 2 million of the 3 million PNFS entries to Chimera, and the update rate was only 3.3/second. We stopped the SQL insertion into the 'chimera' DB at that point and upgraded Postgres to version 8.3.7. The Postgres upgrade took approximately 2 hours. After it finished we were able to inject records into Chimera at an average of 140/second with bursts of up to 1200/second. The Chimera upgrade was completed around 10 PM on May 7th.
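As a rough check on those numbers, the sketch below estimates how long the remaining entries would have taken at each rate. It assumes about 1 million entries were still outstanding (3 million minus the 2 million already migrated) and that each rate was sustained; the function name is ours and is for illustration only.

```python
def hours_to_migrate(entries: int, rate_per_second: float) -> float:
    """Return the hours needed to insert `entries` records at a steady rate."""
    return entries / rate_per_second / 3600.0


# Assumption: roughly 1 million PNFS entries were still to be migrated.
remaining = 3_000_000 - 2_000_000

old_hours = hours_to_migrate(remaining, 3.3)   # rate seen with Postgres 8.1.14
new_hours = hours_to_migrate(remaining, 140)   # average rate with Postgres 8.3.7

print(f"Postgres 8.1.14: about {old_hours:.1f} hours")
print(f"Postgres 8.3.7:  about {new_hours:.1f} hours")
```

At 3.3/second the remainder would have needed about 84 hours (three and a half days), against about 2 hours at 140/second, which fits the roughly 10 PM completion after the 2 hour Postgres upgrade.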
At this point we needed to register all the pools with Chimera. This was started around 11 PM and we expected it to complete within a few hours. However, it took longer than expected and the node "hung" around 3 AM. We restarted the node around 8 AM on May 8th and completed the registration of files by around noon. We were able to begin file transfers during the afternoon but encountered problems. It turned out that the code that checks Adler32 values for file transfers depended on being able to get Adler32 checksums via the /pnfs file system, and this did not work the same way under Chimera. By 6 PM on May 8th we had reverted the DQ2 services to the non-checking (default) configuration and files were able to move.
TODO: Find a new way of getting Adler32 values from Chimera.
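Until such a lookup exists, one possible stopgap is to compute the Adler32 value directly from the file contents. The sketch below is only an illustration: it reads a file that is already accessible on local disk and does not query Chimera or dCache at all. The function name, the chunk size, and the 8-digit lowercase hex output format are our assumptions, not part of the DQ2 or dCache configuration.

```python
import zlib


def adler32_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the Adler32 checksum of a file, reading it in chunks.

    Returns the checksum as an 8-digit lowercase hex string (an assumed
    format; adjust it to whatever the comparing tool expects).
    """
    value = 1  # Adler32 starts from 1, not 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so the result covers the whole file.
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"
```

Reading in chunks keeps memory use flat for large data files. The result would still need to be compared against whatever checksum the transfer tool or catalog reports.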
Remaining problems are with SRM hanging or failing! See this topic for diagnosis and further upgrade/reconfiguration details.
- 09 May 2009