Monitoring: AGLT2 Compute Summary Page

The initial idea was simply to extract and rearrange some lines of the HTML Ganglia page for the MSU site, reordering the load_one graphs into a compact table that mimics the physical rack layout.

This project evolved into an AGLT2-wide summary page that gathers in one place the Condor, Panda, and Ganglia information relating to the production and analysis compute load.

Page Location

A Python script running on a desktop computer collects information from the various sources and updates the page on the MSU dept web server:

AGL Compute Summary Page

This page may move to www.aglt2.org in the future.

Panda Info

The top section copies the job summary table for the production site AGLT2, and the job summary table for the analysis site ANALY_AGLT2; both are taken from panda.cern.ch.

The integration time of 6 hours was chosen for the tables as a compromise to show the current success/failure rate of the compute jobs with reasonable statistics (e.g. 3 hours seemed a bit short, and 24 hours averaged errors over too many jobs).

The title name of each table (e.g. "Panda Details for AGLT2 (6h)") links to the source Panda page.

Condor Info

This is the graph displayed on the Condor Job Status page at gate01.aglt2.org. It shows the number of running and queued jobs for production, analysis, and T3.

The graph links to the source Condor Status Page.

Panda Graphs

These graphs show the 24-hour plots of Panda jobs for the AGLT2 production and ANALY_AGLT2 analysis queues.

These graphs are taken from the site summaries at gridinfo.triumf.ca.

Each graph links to the source Panda page including the same graphs for Hour/Day/Month/Year.

Ganglia Info

The first implementation was just extracting and reordering the HTML data from the "MSU T2" ganglia page into a table following the physical arrangement of the nodes.

The next step was to realize that it would be rather easy to change the color coding ganglia uses to illustrate the degree of usage of each node. The range values ganglia uses (0-25-50-75-100-100+) are not very useful to us, as we would like more emphasis and discrimination around the 100% value. The color coding is also counter-intuitive from our perspective, as ganglia shows loaded nodes in orange and red, while we would consider this the "good" state and would rather see it green. The script thus replaces these ranges and colors with new ones:

| load_one range | color | label |
| 00.00 - 00.85 | white | idle |
| 00.85 - 04.00 | light blue | lightly loaded |
| 04.00 - 07.60 | light green | below potential |
| 07.60 - 08.40 | dark green | matched load |
| 08.40 - 12.00 | light orange | overloaded |
| 12.00 - 16.00 | dark orange | trouble? |
| 16.00+ | red | ouch! |
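The binning in the table above can be sketched as a small lookup. This is a minimal illustration, not the script's actual code: the function name, the hex color values, and the choice to put a value that sits exactly on a boundary into the upper bin are all assumptions.

```python
from bisect import bisect_right

# Upper edges of the load_one bins, taken from the table above.
BIN_EDGES = [0.85, 4.00, 7.60, 8.40, 12.00, 16.00]

# (label, background color) for each bin, in the same order as the table.
# The hex values are illustrative guesses; the source only names the colors.
BINS = [
    ("idle", "#ffffff"),             # white
    ("lightly loaded", "#add8e6"),   # light blue
    ("below potential", "#90ee90"),  # light green
    ("matched load", "#006400"),     # dark green
    ("overloaded", "#ffcc80"),       # light orange
    ("trouble?", "#ff8c00"),         # dark orange
    ("ouch!", "#ff0000"),            # red
]


def classify_load(load_one):
    """Return (label, color) for a load_one value.

    Assumption: a value exactly on a boundary falls into the upper bin.
    """
    return BINS[bisect_right(BIN_EDGES, load_one)]
```

For example, `classify_load(8.0)` falls in the "matched load" bin, and anything at or above 16.00 falls in "ouch!".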

Each node is represented by a cell of the corresponding background color and filled with the ganglia load_one graph of the same background color scaled down to the cell size. The shrunken graphs still give (once you know what you are looking at) some sense of time history of the node's activity.

Note: recoloring the ganglia graphs is possible, and quite simple, because both the load_one value and the graph color are passed to the ganglia server, which generates each graph.
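Because the value and the color travel with the request, recoloring a graph amounts to rewriting the request. The sketch below builds such a request as a URL query string. The parameter names (`c`, `h`, `m`, `v`, `l`), the function name, and the example host are assumptions for illustration; check the graph URLs your own Ganglia version emits before relying on them.

```python
from urllib.parse import urlencode


def graph_url(base_url, cluster, host, load_one, color_hex):
    """Build a per-node graph URL carrying our own value and color.

    All query parameter names here are assumed, not confirmed by the source.
    """
    params = {
        "c": cluster,              # cluster name (assumed parameter)
        "h": host,                 # node name (assumed parameter)
        "m": "load_one",           # metric to plot (assumed parameter)
        "v": f"{load_one:.2f}",    # current load_one value (assumed parameter)
        "l": color_hex,            # background color, no leading '#' (assumed)
    }
    return f"{base_url}?{urlencode(params)}"
```

A call such as `graph_url("http://ganglia.example.org/graph.php", "MSU T2", "node-1", 8.0, "006400")` would request the graph with a dark green background instead of the color Ganglia would have picked.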

When a node is flagged "down" in ganglia, the summary page shows the time since the node was last heard from, first in seconds and then in days.

The MSU rack arrangement of compute nodes is quite regular and uniform. The script expects to find nodes at particular addresses, so it can label the slots used by each switch as "Switch" and flag any expected node it does not find as "Missing?".
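A fixed layout makes the "Switch" and "Missing?" labels easy to derive. The following is a minimal sketch under assumed names: the `SWITCH` marker, the `rack_cells` function, and the host names in the usage note are hypothetical, not taken from the script.

```python
SWITCH = object()  # marker for a slot occupied by a network switch


def rack_cells(layout, reported_hosts):
    """Return one label per rack slot, top to bottom.

    layout: list of expected host names, with SWITCH for switch slots.
    reported_hosts: set of host names currently known to ganglia.
    """
    cells = []
    for slot in layout:
        if slot is SWITCH:
            cells.append("Switch")
        elif slot in reported_hosts:
            cells.append(slot)          # node found: the cell shows the node
        else:
            cells.append("Missing?")    # expected node that ganglia does not list
    return cells
```

With a layout of `["node-a", SWITCH, "node-b"]` and only `node-a` reported, the result is `["node-a", "Switch", "Missing?"]`.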

The UM rack arrangement of compute nodes is less uniform, and the script relies solely on the rocks naming convention xx-racknum-slotnum to determine the location of each node. It therefore cannot tell that a node is missing unless the node is flagged "down". The names of blade nodes do not follow this geographic nomenclature, so they are treated as a special case using explicit knowledge of their rack location. (Note: I know of no HTML directive to rotate an image, so the ganglia graphs for blade nodes are not very readable. I may explore using JavaScript in the future.)
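Deriving a position from a name of the form xx-racknum-slotnum can be sketched with a regular expression, with a lookup table as the fallback for blades. The function name, the exact pattern (including tolerating a trailing domain suffix), and the sample host names are assumptions made for illustration.

```python
import re

# prefix-racknum-slotnum, optionally followed by a domain suffix.
NODE_NAME = re.compile(r"^(?P<prefix>[^-.]+)-(?P<rack>\d+)-(?P<slot>\d+)(?:\..*)?$")


def rack_position(hostname, blade_locations=None):
    """Return (rack, slot) for a node, or None if it cannot be placed.

    blade_locations: optional dict mapping blade host names to (rack, slot),
    standing in for the special-case knowledge about blades.
    """
    match = NODE_NAME.match(hostname)
    if match:
        return int(match["rack"]), int(match["slot"])
    if blade_locations and hostname in blade_locations:
        return blade_locations[hostname]
    return None
```

For instance, `rack_position("cc-102-7")` gives `(102, 7)`, while a blade name is resolved only if it appears in the supplied table.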

Each compute node cell within a table links to the node's Ganglia page.

Each 24-hour UM or MSU ganglia graph links to the source Ganglia page.

Refresh Rate

The script updates the HTML file every 60 s.

The HTML file is set to auto-refresh in your browser every 5 min.
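The source does not say how the browser refresh is set up. One common way is a meta refresh tag in the page head, which the script could emit as sketched below; the function name and the surrounding markup are assumptions.

```python
from html import escape

REFRESH_SECONDS = 5 * 60  # browser refresh interval: 5 minutes


def html_head(title, refresh_seconds=REFRESH_SECONDS):
    """Return an HTML head that asks the browser to reload periodically."""
    return (
        "<head>\n"
        f'  <meta http-equiv="refresh" content="{int(refresh_seconds)}">\n'
        f"  <title>{escape(title)}</title>\n"
        "</head>"
    )
```

Since the file is regenerated every 60 s but reloaded only every 5 min, a browser left open may show data up to about six minutes old.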

Routing and Caching

The aglt2.org domain is not reachable from everywhere. This means that, for example, none of the Ganglia graphs can be loaded from a typical home network.

Even when aglt2.org is unreachable, the UM and MSU Ganglia Rack sections will still show the color coded load_one bin of each node, and the summary table will still show the relative number of nodes in each bin range. The main added value of this page over the raw ganglia information is thus still achieved even though the per-node ganglia graphs themselves are not present.

To make all summary graphs available when aglt2.org is not reachable, the Condor status graph and the MSU and UM Ganglia 24-hour summary graphs are copied to and served from the MSU dept web server.
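Copying a graph to the web server can be sketched as a fetch followed by an atomic file replacement, so that a browser never sees a half-written image. The function names, the timeout, and the file permissions are assumptions; the source does not describe how the copy is done.

```python
import os
import tempfile
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download a URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def mirror_graph(url, dest_path, fetch=fetch_bytes):
    """Copy the image at url to dest_path and return the number of bytes.

    The data goes to a temporary file in the destination directory first,
    then os.replace() swaps it into place in one step.
    """
    data = fetch(url)
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    with os.fdopen(fd, "wb") as tmp_file:
        tmp_file.write(data)
    os.chmod(tmp_path, 0o644)  # mkstemp creates private files; make it readable
    os.replace(tmp_path, dest_path)
    return len(data)
```

The `fetch` argument exists only so the download can be swapped out, for example when trying the function without network access.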

-- PhilippeLaurens - 04 Jun 2009