Rocks Issues

Please specify U-M or MSU in these logs. Thanks, Tom

Logs

Initial entries 9:15am 3 Jan 2006

o linat09 crashes, and sometimes does not let users log in, although root can still log in

o The following cluster nodes are down on 12/31/05: 10-100, 10-101, 10-102, 10-103, 10-104, 10-109, 10-120 and 9-64 -- Resolved, see below

o The following nodes are consuming cpu as if there were no limit on 12/31/05: 13-2 and 9-88 -- Resolved, see below

o TieSheng reports compute-12-34.local has file corruption problems (still) on 12/30/05. This copy was to data11, so that still points to linat11 as a problem machine.

Edits at 10:45am 3 Jan 2006

o Nodes 10-100/101/102/103/104/109 and 9-64 were all on the same power strip, and its circuit breaker tripped. 9-64 was moved to a different strip in rack 9, but will not come up (md0 problem?). The others are all back up now, although 101 and 103 required a manual reboot via reset.

o 13-2 and 9-88 were doing nothing, and condor was not running. Their load was stuck at 37 or so in all 3 fields. I rebooted them, and eventually they both came back up as well.

o linat10 crashed every day at 4am while running the slocate.cron file in /etc/cron.daily. The file has been moved to the /etc/cron.orig_daily directory until it can be determined why this occurs.

Edits at 1:15pm 4 Jan 2006

o The same power strip with 10-100/etc tripped again. Brought it up with only 5 nodes, not 6, leaving 10-109 off-air for the time being. This time 102 gave the problem that 101 and 103 gave last time, i.e., while booting, the remount of /root(?) with rw failed and the kernel panicked. Two resets later it booted fine.

o 9-77 was idle, with condor not running. Restarted condor, and later the node crashed. Pushed reset and it reloads from umrocks. However, it then crashed again, in CPU1, upon sleeping (or waking from sleep) while in the wrong kernel state. Now running memory diagnostics.

o The 10-120 cpu burned out yesterday when its fan failed. The board appears damaged as well. Richard is placing the 2469 MB in it now.

o 9-64 is back on after resetting a few BIOS settings and zeroing out the start and end of both disks, forcing it to reload from umrocks.

o Checked linat11 BIOS log for memory errors, and found none. However, the console log had both page allocation errors from the nfsd and a warning of a possible deadlock in kmem_alloc.

-- BobBall - 03 Jan 2006

Edits at 1:10pm 6 Jan 2006

Here is a comprehensive list of trouble nodes, or nodes not working in the cluster:

o Nodes that are powered, but not in the cluster (but in the racks): 3,5,9,10,13,25,28,30,32,38,54,56,63,66,72,76,116,117,121

o Nodes that are in the racks, but not powered: 26,69,60,70,105,108,127

o Nodes that are out of the racks, have a single cpu, and are ready to be tested in the cluster: 18,40,59

o Nodes that are out of the racks, with no cpu: 14,15,17,20,31,36,47,48

o Nodes that are out of the rack with no cpu or power: 4

o Nodes that are out of the rack with no cpu, board or power: 29

We have a few cpus and power supplies not being used, although the cpus are questionable, and a large bag of unused memory. The bottleneck is definitely processors, but I will begin going through the machines that are in the racks but not in the cluster. One thing that might be valuable would be a testbed for submitting Atlas jobs to a node outside of the cluster, so that I can test and verify nodes before adding them to the cluster and crashing important jobs. For example, a fully configured hard drive that I can swap from node to node; I could then be trained on how to run the jobs.

-- RichardFrench - 06 Jan 2006

The following nodes have SMART errors and will probably suffer corruption or disk failure soon:

o 10-122: unreadable sectors

o 11-61: unreadable sectors

o 13-23: unreadable sectors

o 13-6: unreadable sectors

o 9-65: unreadable sectors, uncorrectable sectors

o 9-67: unreadable sectors

o 9-71: unreadable sectors

o 9-67: unreadable sectors, uncorrectable sectors

o 9-82: unreadable sectors, uncorrectable sectors

o 12-39: SMARTD check failed

-- JeffGregory - 11 Jan 2006

Modified mail settings to allow mail to be sent from the client nodes. The following changes were made:

o Added an entry to /etc/sysconfig/iptables to allow access to port 25 from eth0

o Added local and local.grid.umich.edu to the 10.1.1.1 entry in /etc/hosts.allow

o Moved /etc/postfix/main.cf to /etc/postfix/main.cf.shawn and /etc/postfix/main.cf.default to /etc/postfix/main.cf

-- JeffGregory - 11 Jan 2006

Migrating ROCKS nodes on UMROCKS to new address space

We needed to put our UMROCKS cluster into a new address space for AFS reasons.

Prior to making changes I backed up the ROCKS DB on umrocks.grid.umich.edu using:
cd /var/db
mysqldump -u root -p --opt cluster > mysql-backup-cluster.before_10_subnet

I also prepared a simple sed script:
/IPADDR/s/10.255.255/10.1.1/
/NETMASK/s/255.0.0.0/255.255.255.0/
/GATEWAY/s/10.1.1.1/10.1.1.2/
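
As a rough illustration of what the three sed rules do, here is a minimal Python sketch that applies the same guarded substitutions to a few sample settings lines. The sample input and the name `apply_rules` are assumptions made for this example and are not taken from the cluster. One difference from the sed patterns above: an unescaped `.` in sed matches any character, whereas the sketch escapes the dots so that only literal addresses match.

```python
import re

# (guard, old, new) triples mirroring the three sed rules. re.escape makes
# each "." in the old value match only a literal dot.
RULES = [
    ("IPADDR", "10.255.255", "10.1.1"),
    ("NETMASK", "255.0.0.0", "255.255.255.0"),
    ("GATEWAY", "10.1.1.1", "10.1.1.2"),
]


def apply_rules(text):
    """Apply each rule to lines containing its guard, first match only."""
    out = []
    for line in text.splitlines():
        for guard, old, new in RULES:
            if guard in line:
                line = re.sub(re.escape(old), new, line, count=1)
        out.append(line)
    return "\n".join(out)


# Hypothetical sample input in the style of an interface settings file.
SAMPLE = "DEVICE=eth0\nIPADDR=10.255.255.254\nNETMASK=255.0.0.0\nGATEWAY=10.1.1.1"

if __name__ == "__main__":
    print(apply_rules(SAMPLE))
```

Lines without one of the three guard words, such as the DEVICE line in the sample, pass through unchanged, which is the same behaviour the `/PATTERN/` address gives each sed rule.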

The job is to change the addresses of the worker nodes from 10.255.255.x to 10.1.1.x. We will create a new gateway on the switch rather than using NAT on the headnode. Also the switch will be able to connect AFS requests on the 10.1.1.0/24 subnet to the appropriate AFS file server.

In the ROCKS phpMyAdmin web page I edited the app_globals table to change the following variables:

Component            Old Value        New Value
PrivateNetmask       255.0.0.0        255.255.254.0
PrivateGateway       10.1.1.1         10.1.1.2
PrivateNTPHost       10.1.1.1         10.1.1.2
PrivateNetwork       10.0.0.0         10.1.1.0
PrivateBroadcast     10.255.255.255   10.1.1.255
PrivateNetmaskCIDR   8                23
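
The netmask, CIDR prefix, network and broadcast values in this table are related arithmetically, and Python's standard `ipaddress` module offers one way to cross-check them. The sketch below is only a sanity check and was not part of the original procedure; the function name `describe` is made up for the example. It shows that the /23 (255.255.254.0) subnet containing 10.1.1.0 has network address 10.1.0.0, while 10.1.1.0 is a network address only under a /24 (255.255.255.0) mask. Both give the broadcast address 10.1.1.255.

```python
import ipaddress


def describe(address, prefix_len):
    """Return netmask, network and broadcast for the subnet holding address."""
    # strict=False accepts an address with host bits set, e.g. 10.1.1.0/23,
    # and rounds it down to the enclosing network.
    net = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
    return {
        "netmask": str(net.netmask),
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
    }


if __name__ == "__main__":
    for prefix_len in (23, 24):
        print(prefix_len, describe("10.1.1.0", prefix_len))
```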

I then "dumped" the DB via:

 mysqldump -u root -p --database cluster > mysql-backup-cluster.after_fix_ip

I ran sed to change 10.255.255 to 10.1.1 in this dump file and then reloaded it with:

 mysql --user=root -p --database cluster < mysql-backup-cluster.after_fix_ip
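
Before reloading an edited dump, it can be worth confirming that no addresses in the old range remain. The following is a minimal Python sketch of the same 10.255.255 to 10.1.1 rewrite together with such a check. The sample dump line, its table name, and the helper names `rewrite_prefix` and `leftover_addresses` are hypothetical; the actual edit was done with sed.

```python
import re

OLD_PREFIX = "10.255.255"
NEW_PREFIX = "10.1.1"

# A full dotted-quad address carrying the old prefix: not preceded by a
# digit or dot, and ending in one final octet of 1-3 digits.
_FULL_OLD_ADDR = re.compile(
    r"(?<![\d.])" + re.escape(OLD_PREFIX) + r"\.\d{1,3}(?![\d.])"
)


def rewrite_prefix(dump_text):
    """Swap the old three-octet prefix for the new one in every address."""
    def swap(match):
        last_octet = match.group(0)[len(OLD_PREFIX):]  # e.g. ".254"
        return NEW_PREFIX + last_octet
    return _FULL_OLD_ADDR.sub(swap, dump_text)


def leftover_addresses(dump_text):
    """List any addresses that still carry the old prefix."""
    return _FULL_OLD_ADDR.findall(dump_text)


# Hypothetical line in the style of a mysqldump INSERT statement.
SAMPLE = "INSERT INTO example VALUES (1,'10.255.255.254','compute-0-0');"

if __name__ == "__main__":
    fixed = rewrite_prefix(SAMPLE)
    print(fixed)
    print("leftover:", leftover_addresses(fixed))
```

An empty list from `leftover_addresses` on the rewritten text suggests the dump is ready to reload; a non-empty list points at addresses the substitution missed.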
Topic revision: r12 - 31 Oct 2007, TomRockwell