ClusterChanges < AGLT2


MSU 2008

May

BNX2 DKMS Ganglia

Found that the existing bnx2 network driver was the cause of the large spikes in the ganglia network plots. The driver intermittently puts bad data into /proc/net/dev. This gets fed through the ganglia system and recorded into the rrd databases as rates of about 3e17. The newest ganglia (trunk version on their SVN server) has a check for unreasonably large network rates. The ganglia developers also pointed out that the bnx2 network driver was the source of the trouble. This led us to upgrade to the latest bnx2 driver, version 1.7.1c.

http://sourceforge.net/mailarchive/message.php?msg_id=645587.59821.qm%40web32602.mail.mud.yahoo.com

Also for 3.0.8, I would like to drop in the trunk version of libmetrics/linux/metrics.c.
It [will soon] contain a fix for a nasty overflow problem in some Broadcom NICs (BCM5708,
bnx2 driver) that leads to spurious petabyte spikes in the network metrics. The problem is
fixed in later driver releases, but is present in some popular "enterprise" distros like
RHEL4. The risk is minimal and I am running it for more than a week, but it is definitely
not for 3.0.7.

Cheers
Martin
------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
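
The kind of check mentioned above can be sketched in bash. This is not ganglia's actual code (that lives in libmetrics/linux/metrics.c); it is a minimal illustration that assumes a ceiling of about 10 Gbit/s. The names MAX_BYTES_PER_SEC, read_rx_bytes and plausible_rate are hypothetical.

```bash
#!/bin/bash
# Sketch only: the ceiling and both function names are assumptions,
# not taken from ganglia. 1250000000 bytes/s is 10 Gbit/s.
MAX_BYTES_PER_SEC=1250000000

# read_rx_bytes IFACE [FILE]
# Print the receive byte counter for IFACE from /proc/net/dev (or FILE).
# The colon is turned into a space first because the counter can follow
# the interface name with no space in between ("eth0:12345").
read_rx_bytes() {
    local iface=$1 file=${2:-/proc/net/dev}
    awk -v ifc="$iface" '{ gsub(/:/, " ") } $1 == ifc { print $2 }' "$file"
}

# plausible_rate PREV CURR SECONDS
# Print the rate in bytes/s and return 0 when it is believable.
# Return 1 (printing nothing) for a bad interval, a counter that went
# backwards, or a rate above the ceiling.
# Note: bash arithmetic is signed 64-bit, so counters above 2^63 are
# not handled by this sketch.
plausible_rate() {
    local prev=$1 curr=$2 secs=$3
    (( secs > 0 )) || return 1
    local delta=$(( curr - prev ))
    (( delta >= 0 )) || return 1
    local rate=$(( delta / secs ))
    (( rate <= MAX_BYTES_PER_SEC )) || return 1
    echo "$rate"
}
```

A collector would sample read_rx_bytes eth0 twice, some seconds apart, and record the value only when plausible_rate succeeds. A jump that works out to about 3e17 bytes/s would be dropped instead of being written to the rrd database.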

Shawn put together a dkms-based RPM of the bnx2 driver. It was installed on all systems with this script:

bash-3.00# cat /home/install/site-tools/tmp/install-dkms-bnx2.sh
#!/bin/bash

# TDR
# Install the dkms rpm and the bnx2 rpm for dkms, then restart networking to load the new module

rpm -U /home/install/contrib/4.3/x86_64/RPMS/dkms-2.0.17-1.el4.noarch.rpm
rpm -U /home/install/contrib/4.3/x86_64/RPMS/bnx2-1.7.1c-1dkms.noarch.rpm

# Unload the old driver; bringing the network back up loads the new one
service network stop
rmmod bnx2
service network start

# Report the host and the driver now in use
hostname
ethtool -i eth0
echo "$0 done"
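
To spot nodes where the new module did not load, the `ethtool -i` output can be checked automatically. The following is a minimal sketch, assuming the usual "driver:" and "version:" lines in that output. The function names driver_info and check_bnx2 are hypothetical.

```bash
#!/bin/bash
# Sketch only: both function names are made up for illustration.

# driver_info
# Read `ethtool -i` output on stdin and print "<driver> <version>".
driver_info() {
    awk -F': *' '
        $1 == "driver"  { d = $2 }
        $1 == "version" { v = $2 }
        END { print d, v }'
}

# check_bnx2 WANTED_VERSION
# Read `ethtool -i` output on stdin; succeed only when the driver is
# bnx2 and its version matches WANTED_VERSION exactly.
check_bnx2() {
    local want=$1 info
    info=$(driver_info)
    [ "$info" = "bnx2 $want" ]
}
```

On a node this could be run as `ethtool -i eth0 | check_bnx2 1.7.1c || echo "$(hostname): old bnx2 driver"`.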

Found that ganglia monitoring completely stopped for some groups. Restarted gmond on group pole machines and also gmetad.

It would be possible to go through the rrd databases and remove these spikes. However, a large fraction (50%) of the data points are affected and unrecoverable, especially on the averaged plots.
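
If the cleanup were attempted, one way would be to filter the XML produced by `rrdtool dump` and load the result back with `rrdtool restore`. This is a rough sketch, not something that was run here. The function name despike_rrd_xml and the threshold in the usage note are assumptions.

```bash
#!/bin/bash
# Sketch only: the function name is hypothetical and the threshold is
# chosen by the caller.

# despike_rrd_xml MAX
# Read `rrdtool dump` XML on stdin and write it to stdout, replacing
# every <v>...</v> value greater than MAX with NaN (unknown).
# Everything else passes through unchanged.
despike_rrd_xml() {
    local max=$1
    awk -v max="$max" '
    {
        out = ""; rest = $0
        # Walk through each <v>...</v> element on the line.
        while (match(rest, /<v>[^<]*<\/v>/)) {
            val = substr(rest, RSTART + 3, RLENGTH - 7)
            out = out substr(rest, 1, RSTART - 1)
            if (val + 0 > max + 0)
                out = out "<v>NaN</v>"
            else
                out = out substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```

For example, with an illustrative file name and threshold: `rrdtool dump bytes_in.rrd | despike_rrd_xml 1e10 > clean.xml`, followed by `rrdtool restore clean.xml bytes_in.new.rrd`. This only blanks values above the threshold. Averaged rows that absorbed a spike but stay below the threshold are left as they are, which is why much of the averaged data remains unrecoverable.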

May 8: Fixed a couple of nodes at MSU that were missed.

-- TomRockwell - 06 May 2008

Topic revision: r4 - 16 Oct 2009, TomRockwell
