Cacti Setup for Dell Nodes

The Dell PE1950 and PE2950 nodes have a large number of fans and temperature probes which are not exposed via SNMP. This presents a problem for monitoring their status. The ipmitool utility can dump information on these components, so we need a way to pass SNMP requests through to IPMI, or do the equivalent.

Getting the Info

For our Dell nodes the command ipmitool sdr will show a lot of information:
root@c-1-29 ~# ipmitool sdr
Temp             | -39 degrees C     | ok
Temp             | -31 degrees C     | ok
Temp             | 40 degrees C      | ok
Temp             | 40 degrees C      | ok
Ambient Temp     | 20 degrees C      | ok
CMOS Battery     | 0x00              | ok
ROMB Battery     | 0x00              | ok
VCORE            | 0x01              | ok
VCORE            | 0x01              | ok
CPU VTT          | 0x01              | ok
1.5V PG          | 0x01              | ok
1.8V PG          | 0x01              | ok
3.3V PG          | 0x01              | ok
5V PG            | 0x01              | ok
1.5V PXH PG      | 0x01              | ok
5V Riser PG      | 0x01              | ok
Backplane PG     | 0x01              | ok
Linear PG        | 0x01              | ok
0.9V PG          | 0x01              | ok
0.9V Over Volt   | 0x01              | ok
CPU Power Fault  | 0x01              | ok
FAN MOD 1A RPM   | 7350 RPM          | ok
FAN MOD 1B RPM   | 7425 RPM          | ok
FAN MOD 1C RPM   | 4725 RPM          | ok
FAN MOD 1D RPM   | 4650 RPM          | ok
FAN MOD 2A RPM   | 7350 RPM          | ok
FAN MOD 2B RPM   | 7500 RPM          | ok
FAN MOD 2C RPM   | 4650 RPM          | ok
FAN MOD 2D RPM   | 4725 RPM          | ok
FAN MOD 3A RPM   | 8100 RPM          | ok
FAN MOD 3B RPM   | 7500 RPM          | ok
FAN MOD 3C RPM   | 4650 RPM          | ok
FAN MOD 3D RPM   | 4725 RPM          | ok
FAN MOD 4A RPM   | 7500 RPM          | ok
FAN MOD 4B RPM   | 7875 RPM          | ok
FAN MOD 4C RPM   | 4800 RPM          | ok
FAN MOD 4D RPM   | 4725 RPM          | ok
Presence         | 0x01              | ok
Presence         | 0x01              | ok
Presence         | 0x01              | ok
Presence         | 0x02              | ok
Presence         | 0x01              | ok
Presence         | 0x01              | ok
DRAC5 Conn 2 Cbl | 0x01              | ok
PFault Fail Safe | Not Readable      | ns
Status           | 0x80              | ok
Status           | 0x80              | ok
Status           | 0x01              | ok
Status           | Not Readable      | ns
Status           | 0x01              | ok
RAC Status       | 0x07              | ok
OS Watchdog      | 0x00              | ok
SEL              | Not Readable      | ns
Intrusion        | 0x00              | ok
PS Redundancy    | Not Readable      | ns
Fan Redundancy   | 0x01              | ok
CPU Temp Interf  | Not Readable      | ns
Drive            | 0x01              | ok
Cable SAS A      | 0x01              | ok
ECC Corr Err     | Not Readable      | ns
ECC Uncorr Err   | Not Readable      | ns
I/O Channel Chk  | Not Readable      | ns
PCI Parity Err   | Not Readable      | ns
PCI System Err   | Not Readable      | ns
SBE Log Disabled | Not Readable      | ns
Logging Disabled | Not Readable      | ns
Unknown          | 0xc0              | ok
CPU Protocol Err | Not Readable      | ns
CPU Bus PERR     | Not Readable      | ns
CPU Init Err     | Not Readable      | ns
CPU Machine Chk  | Not Readable      | ns
Memory Spared    | 0x00              | ok
Memory Mirrored  | 0x01              | ok
Memory RAID      | 0x01              | ok
Memory Added     | Not Readable      | ns
Memory Removed   | Not Readable      | ns
Memory Cfg Err   | 0x01              | ok
Mem Redun Gain   | 0x01              | ok
PCIE Fatal Err   | 0x01              | ok
Chipset Err      | 0x01              | ok
Err Reg Pointer  | 0x01              | ok
Mem ECC Warning  | 0x01              | ok
Mem CRC Err      | 0x01              | ok
USB Over-current | 0x01              | ok
POST Err         | Not Readable      | ns
Hdwr version err | Not Readable      | ns
Mem Overtemp     | 0x01              | ok
Mem Fatal SB CRC | 0x01              | ok
Mem Fatal NB CRC | 0x01              | ok

From this list we want to track the fan and temperature information.
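
As a quick check of which rows matter, the fan and temperature lines can be filtered out of the full listing. This is only a minimal sketch and not part of the original setup; the function name filter_fans_temps is made up for illustration. It uses the same case-sensitive patterns (FAN and Temp in the sensor name) as the Perl script later on this page, so it also lets the "CPU Temp Interf" row through.

```shell
# Keep only the rows whose sensor name (the first '|'-separated column)
# contains "FAN" or "Temp". Reads `ipmitool sdr` style text on stdin.
filter_fans_temps() {
    awk -F'|' '$1 ~ /FAN|Temp/'
}

# Typical use on a node (needs ipmitool and root access):
#   ipmitool sdr | filter_fans_temps
```

Rows such as "Fan Redundancy" or "Mem Overtemp" are dropped because the match is case-sensitive.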

Getting IPMI into SNMP

We want the ipmitool info accessible via SNMP; however, the tool takes a while to run:

root@c-3-20 /etc/snmp# time ipmitool sdr >/dev/null

real    0m3.869s
user    0m0.000s
sys     0m0.000s

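This can be sped up in two ways. One is to use the ipmitool sdr dump dell_sdr.txt command, which dumps the SDR info for the local node to a file that later runs can read with the -S option. This speeds up processing significantly (from about 3.9 s to about 2.4 s in the timings here):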

root@c-3-20 /etc/snmp# time ipmitool sdr dump ./dell_sdr.txt 
Dumping Sensor Data Repository to './dell_sdr.txt'

real    0m2.760s
user    0m0.000s
sys     0m0.000s

root@c-3-20 /etc/snmp# time ipmitool -S dell_sdr.txt sdr > /dev/null

real    0m2.382s
user    0m0.000s
sys     0m0.000s

We can also use the /dev/shm (shared memory) area to store the output:

root@c-3-20 /etc/snmp# time ipmitool -S dell_sdr.txt sdr > /dev/shm/dell.ipmi 

real    0m1.003s
user    0m0.000s
sys     0m0.000s

This is a light enough load to be able to run every minute.

Exposing Dell SDR Info via SNMP

The net-snmp package allows extensions to be added to the snmp host system. This is done by adding a line like:

extend .1.3.6.1.4.1.2021.8.5 1 /bin/cat /dev/shm/dell.ipmi

to the /etc/snmp/snmpd.conf file. This line defines a new OID (.1.3.6.1.4.1.2021.8.5) whose value is the output of the given command.

Since we only want the SDR info for the fans and temperatures of interest, I created a Perl script that runs the ipmitool command, parses the output, and prints a single line in a format Cacti will like:

#!/usr/bin/env perl
#
#  Uses ipmitool to "dump" Dell sensor info for PE1950
#
# Shawn McKee <smckee@umich.edu>
######################################################
use strict;
use warnings;

my $sdrfile  = "/etc/snmp/sdr.dmp";
my $ipmitool = "/usr/bin/ipmitool -S $sdrfile sdr";

# Create the cached SDR dump on the first run. ipmitool's status message
# is discarded so it cannot end up in this script's output.
if ( ! -e $sdrfile ) {
    system("/usr/bin/ipmitool sdr dump $sdrfile >/dev/null");
}

# Names assigned to the Temp and FAN lines, in the order ipmitool prints them
my @tempname = ("TempCPU1delta","TempCPU2delta","TempChassis1","TempChassis2","TempAmbient","CPUTempInterf");
my @fanname  = ("FanMod1A","FanMod1B","FanMod1C","FanMod1D","FanMod2A","FanMod2B","FanMod2C","FanMod2D","FanMod3A","FanMod3B","FanMod3C","FanMod3D","FanMod4A","FanMod4B","FanMod4C","FanMod4D");
my ($ntemp, $nfan) = (0, 0);
my %SDR;

# Parse ipmitool output for Dell SDR values
open(my $cs, "$ipmitool |") or die "Cannot run '$ipmitool': $!\n";
while (<$cs>) {
    my ($name, $value) = split(/\|/);
    next unless defined $value;    # not a "name | value | status" line
    $name  =~ s/\s//g;
    $value =~ s/\s//g;
    if ($name =~ /FAN/) {
        my $key = $fanname[$nfan++];
        # "7350RPM" -> 7350
        $SDR{$key} = $1 if defined $key && $value =~ /(\d+)/;
    } elsif ($name =~ /Temp/) {
        my $key = $tempname[$ntemp++];
        # "-39degreesC" -> -39; unreadable sensors are skipped
        $SDR{$key} = $1 if defined $key && $value =~ /([-+]?\d+)/;
    }
}
close($cs);

# Print all values on one line, leaving out the CPU Temp Interf entry
foreach my $key (sort keys %SDR) {
    next if $key =~ /Inter/;
    print "$key:$SDR{$key} ";
}
print "\n";

This script will automatically create the sdr.dmp file the first time it runs.

The output looks like:
root@c-3-20 /etc/snmp# perl dump_dell.pl
FanMod1A:7050 FanMod1B:7050 FanMod1C:4500 FanMod1D:4650 FanMod2A:7275 FanMod2B:7575 FanMod2C:4650 FanMod2D:4650 FanMod3A:7725 FanMod3B:7425 FanMod3C:4800 FanMod3D:4875 FanMod4A:7500 FanMod4B:7725 FanMod4C:4800 FanMod4D:4800 TempAmbient:19 TempCPU1delta:-43 TempCPU2delta:-34 TempChassis1:40 TempChassis2:40 
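
A simple sanity check on this line is to count how many well-formed fan and temperature fields it contains. The following is a sketch only; count_fields is a hypothetical helper, not something the original setup includes. It reads one line of name:value pairs on stdin.

```shell
# Count fan and temperature fields in a line of "name:value" pairs
# read from stdin. Only fields with an integer value are counted.
count_fields() {
    tr ' ' '\n' | awk -F: '
        $1 ~ /^FanMod/ && $2 ~ /^[0-9]+$/   { fans++ }
        $1 ~ /^Temp/   && $2 ~ /^-?[0-9]+$/ { temps++ }
        END { printf "fans=%d temps=%d\n", fans, temps }'
}

# Typical use on a node, once the cache file exists:
#   count_fields < /dev/shm/dell.ipmi
```

For the sample output above, which lists 16 fans and 5 temperatures, any lower count would point to a sensor that ipmitool could not read.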

To make this easy to update we create a dell.cron file:
#!/bin/bash

/etc/snmp/dump_dell.pl  > /dev/shm/dell.ipmi

This can be added to the 'root' cron to run every minute. The /dev/shm/dell.ipmi file will always contain the most recent measurements, and the snmp extension uses a lightweight 'cat' to expose the info.
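
One possible refinement, which is not part of the original setup: the plain redirect in dell.cron empties /dev/shm/dell.ipmi as soon as the job starts, so an SNMP query that arrives while ipmitool is still running would see an empty file. Writing to a temporary file in the same directory and renaming it into place avoids that. The sketch below assumes a hypothetical helper named update_cache; the crontab line in the comment is likewise only an illustration.

```shell
# update_cache OUTFILE COMMAND [ARGS...]
# Run COMMAND, capture its output in a temporary file next to OUTFILE,
# and rename it into place only if COMMAND succeeded.
update_cache() {
    out=$1
    shift
    tmp=$(mktemp "${out}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        chmod 644 "$tmp"      # mktemp creates the file readable by its owner only
        mv -f "$tmp" "$out"   # rename within one filesystem replaces the file in one step
    else
        rm -f "$tmp"
        return 1
    fi
}

# In dell.cron this would take the place of the plain redirect:
#   update_cache /dev/shm/dell.ipmi /etc/snmp/dump_dell.pl
#
# Example root crontab entry that runs dell.cron every minute:
#   * * * * * /etc/snmp/dell.cron
```

If the command fails, the previous contents of the output file are left untouched.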

Setup on Dell Nodes

The following should be done on each Dell node:
  • Copy the dell.cron and dump_dell.pl files to /etc/snmp/ on the node and make sure both are executable.
  • Edit the '/etc/snmp/snmpd.conf' file and add a line:
    • extend .1.3.6.1.4.1.2021.8.5 1 /bin/cat /dev/shm/dell.ipmi
  • Add a 'root' cron entry for dell.cron to run every minute.
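
When these steps are repeated on many nodes, it helps if the snmpd.conf edit can be run more than once without duplicating the line. A minimal sketch, assuming a hypothetical helper named add_extend_line; it is not part of the original procedure.

```shell
# add_extend_line CONF
# Append the Dell "extend" line to CONF unless an identical line is
# already present.
add_extend_line() {
    conf=$1
    line='extend .1.3.6.1.4.1.2021.8.5 1 /bin/cat /dev/shm/dell.ipmi'
    grep -qxF "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
}

# Typical use on a node (as root):
#   add_extend_line /etc/snmp/snmpd.conf
```

snmpd has to re-read its configuration before the new OID becomes visible.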

Testing

To test that things are working, try a "remote" snmp command. From another node do (replacing <node> with the name of the Dell node, e.g. c-3-20):
  • snmpwalk -v 2c -c usatlasgrid <node> .1.3.6.1.4.1.2021.8.5

You should get something like:
[umopt1:~]# snmpwalk -v 2c -c usatlasgrid c-3-20 .1.3.6.1.4.1.2021.8.5
UCD-SNMP-MIB::extTable.5.1.0 = INTEGER: 1
UCD-SNMP-MIB::extTable.5.2.1.2.1.49 = STRING: "/bin/cat"
UCD-SNMP-MIB::extTable.5.2.1.3.1.49 = STRING: "/dev/shm/dell.ipmi"
UCD-SNMP-MIB::extTable.5.2.1.4.1.49 = ""
UCD-SNMP-MIB::extTable.5.2.1.5.1.49 = INTEGER: 5
UCD-SNMP-MIB::extTable.5.2.1.6.1.49 = INTEGER: 1
UCD-SNMP-MIB::extTable.5.2.1.7.1.49 = INTEGER: 1
UCD-SNMP-MIB::extTable.5.2.1.20.1.49 = INTEGER: 4
UCD-SNMP-MIB::extTable.5.2.1.21.1.49 = INTEGER: 1
UCD-SNMP-MIB::extTable.5.3.1.1.1.49 = STRING: "FanMod1A:7125 FanMod1B:7050 FanMod1C:4500 FanMod1D:4650 FanMod2A:7275 FanMod2B:7575 FanMod2C:4650 FanMod2D:4650 FanMod3A:7725 FanMod3B:7425 FanMod3C:4875 FanMod3D:4875 FanMod4A:7500 FanMod4B:7725 FanMod4C:4875 FanMod4D:4875 TempAmbient:19 TempCPU1delta:-42 TempCPU2delta:-34 TempChassis1:40 TempChassis2:40 "
UCD-SNMP-MIB::extTable.5.3.1.2.1.49 = STRING: "FanMod1A:7125 FanMod1B:7050 FanMod1C:4500 FanMod1D:4650 FanMod2A:7275 FanMod2B:7575 FanMod2C:4650 FanMod2D:4650 FanMod3A:7725 FanMod3B:7425 FanMod3C:4875 FanMod3D:4875 FanMod4A:7500 FanMod4B:7725 FanMod4C:4875 FanMod4D:4875 TempAmbient:19 TempCPU1delta:-42 TempCPU2delta:-34 TempChassis1:40 TempChassis2:40 "
UCD-SNMP-MIB::extTable.5.3.1.3.1.49 = INTEGER: 1
UCD-SNMP-MIB::extTable.5.3.1.4.1.49 = INTEGER: 0
UCD-SNMP-MIB::extTable.5.4.1.2.1.49.1 = STRING: "FanMod1A:7125 FanMod1B:7050 FanMod1C:4500 FanMod1D:4650 FanMod2A:7275 FanMod2B:7575 FanMod2C:4650 FanMod2D:4650 FanMod3A:7725 FanMod3B:7425 FanMod3C:4875 FanMod3D:4875 FanMod4A:7500 FanMod4B:7725 FanMod4C:4875 FanMod4D:4875 TempAmbient:19 TempCPU1delta:-42 TempCPU2delta:-34 TempChassis1:40 TempChassis2:40 "
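
To pull a single reading out of this output, for example when checking one sensor by hand, the STRING lines can be split on spaces and quotes. This is a sketch under the assumption that the output has the shape shown above; sensor_value is a made-up helper name.

```shell
# sensor_value NAME
# Read snmpwalk output on stdin and print the value that follows
# "NAME:" in the first STRING line that contains it.
sensor_value() {
    awk -v name="$1" '
        /STRING:/ && !found {
            n = split($0, parts, /[ "]+/)
            for (i = 1; i <= n; i++) {
                if (index(parts[i], name ":") == 1) {
                    print substr(parts[i], length(name) + 2)
                    found = 1
                    break
                }
            }
        }'
}

# Typical use from another node:
#   snmpwalk -v 2c -c usatlasgrid c-3-20 .1.3.6.1.4.1.2021.8.5 | sensor_value TempAmbient
```

The same line appears under several OIDs, so only the first match is printed.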

Interpreting the Values

There are 5 temps reported:

root@c-1-29 ~# ipmitool sdr
Temp             | -39 degrees C     | ok
Temp             | -31 degrees C     | ok
Temp             | 40 degrees C      | ok
Temp             | 40 degrees C      | ok
Ambient Temp     | 20 degrees C      | ok

The first two are CPU temperatures reported from sensors on the CPUs. They are reported relative to the CPU critical temperature, which is why they are negative. The 3rd and 4th values (40 degrees C) seem to be unused? The Ambient Temp seems to be a chassis temperature measured near the front of the chassis (20 C = 68 F); is this believable?
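
To make the arithmetic concrete: if the CPU readings are offsets from a critical temperature, the absolute temperature is the critical temperature plus the (negative) reading. The critical temperature is not given on this page, so the 100 in the comment below is purely a placeholder, and both function names are invented for illustration.

```shell
# cpu_abs_temp TCRIT DELTA
# Absolute CPU temperature in degrees C, given a critical temperature
# and the relative reading reported by ipmitool.
cpu_abs_temp() {
    echo $(( $1 + $2 ))
}

# c_to_f CELSIUS
# Convert degrees Celsius to degrees Fahrenheit, one decimal place.
c_to_f() {
    awk -v c="$1" 'BEGIN { printf "%.1f\n", c * 9 / 5 + 32 }'
}

# With a placeholder critical temperature of 100 C (an assumption, not a Dell figure):
#   cpu_abs_temp 100 -39
# The ambient reading from the listing above:
#   c_to_f 20
```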

http://lists.us.dell.com/pipermail/linux-poweredge/2007-July/032172.html

-- ShawnMcKee - 24 Sep 2007
Topic revision: r5 - 16 Oct 2009 - 20:14:39 - TomRockwell
 
