BNX2 DKMS Ganglia
Found that the existing bnx2 network driver was the cause of the large spikes in the ganglia network plots. It intermittently puts bad data into /proc/net/dev, which gets fed through the ganglia system and recorded into the rrd databases as rates of about 3e17. The newest ganglia (trunk version on their SVN server) has a check for unreasonably large network rates. The ganglia developers also pointed out that the bnx2 network driver was the source of the trouble, which led us to upgrade to the latest bnx2 driver, version 1.7.1c.
Also for 3.0.8, I would like to drop in the trunk version of libmetrics/linux/metrics.c.
It [will soon] contain a fix for a nasty overflow problem in some Broadcom NICs (BCM5708,
bnx2 driver) that leads to spurious petabyte spikes in the network metrics. The problem is
fixed in later driver releases, but is present in some popular "enterprise" distros like
RHEL4. The risk is minimal and I have been running it for more than a week, but it is definitely
not for 3.0.7.
email: k n o b i AT knobisoft DOT de
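The gist of the fix described above can be sketched as a plausibility check on the computed rate. This is not ganglia's actual metrics.c code, just an illustrative shell sketch; the 1 GB/s ceiling is an assumed value:

```shell
MAX_RATE=1000000000   # assumed per-NIC ceiling, 1e9 bytes/s

# Compute a byte rate from two counter samples and a time delta,
# discarding samples that would yield an implausible rate (the buggy
# bnx2 counters produce rates around 3e17 bytes/s).
sane_rate () {
    local prev=$1 cur=$2 dt=$3
    local rate=$(( (cur - prev) / dt ))
    if [ "$rate" -lt 0 ] || [ "$rate" -gt "$MAX_RATE" ]; then
        echo 0            # drop the bogus sample
    else
        echo "$rate"
    fi
}

sane_rate 0 10000000 10              # normal: 10 MB in 10 s -> 1000000
sane_rate 0 3000000000000000000 10   # corrupted counter -> 0
```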
Shawn put together a dkms RPM of the bnx2 driver. Installed it on all systems with this script:
bash-3.00# cat /home/install/site-tools/tmp/install-dkms-bnx2.sh
# Install dkms rpm, bnx2 rpm for dkms and restart to load module
rpm -U /home/install/contrib/4.3/x86_64/RPMS/dkms-2.0.17-1.el4.noarch.rpm
rpm -U /home/install/contrib/4.3/x86_64/RPMS/bnx2-1.7.1c-1dkms.noarch.rpm
service network stop
service network start
ethtool -i eth0
echo $0 done
Found that ganglia monitoring had completely stopped for some groups. Restarted gmond on the group pole machines and also gmetad.
It would be possible to go through the rrd databases and remove these spikes; however, especially on the averaged plots, a large fraction (50%) of the datapoints are affected and unrecoverable.
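For reference, the scrubbing approach considered above would work by dumping each RRD to XML ("rrdtool dump db.rrd > db.xml"), rewriting the implausible values, and restoring ("rrdtool restore db.xml db.rrd"). A hypothetical filter step (thresholds and file names are assumptions) might look like this sed expression, which replaces any value with exponent e+12 or larger with NaN:

```shell
# Replace spike values (e+12 and up) in rrdtool dump output with NaN.
filter='s/<v>[0-9.]\{1,\}e+1[2-9]<\/v>/<v>NaN<\/v>/g'

# Demonstration on sample dump lines, formatted as rrdtool dump emits them:
printf '<v>3.0000000000e+17</v>\n' | sed "$filter"   # spike -> NaN
printf '<v>1.2345678901e+06</v>\n' | sed "$filter"   # normal value kept
```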
Fixed a couple nodes that were missed at MSU.
- 06 May 2008