Using "Monit" for Monitoring and Repairing AGLT2 Services
NOTE: THIS PAGE IS NOW MOSTLY OBSOLETE, WITH MONIT INSTALLED VIA CFENGINE
The monit application monitors and "repairs" host and service problems. It is easy to deploy and configure. If you have the DAG
repo set up in yum, simply run:
yum install monit
monit has many built-in options for testing resources and services. Once it is installed, see man monit
for an overview.
Installing the package sets up 'monit' on the current system. By default it creates a 'monit' service (which is chkconfig'ed off), but it is more robust to remove this service and run monit from inittab instead:
chkconfig --del monit
Then edit /etc/inittab and append something like:
#+SPM monit daemon 09Apr2009
mo:2345:respawn:/usr/bin/monit -Ic /etc/monit.conf
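If you prefer to script the inittab edit, it might look something like the sketch below. The function name and the check for an existing "mo" entry are illustrative assumptions and not part of the original procedure; the two appended lines are the ones shown above.

```shell
#!/bin/sh
# Append the monit respawn entry to an inittab file, unless an entry
# with the "mo" id is already there (so running it twice is harmless).
# add_monit_inittab_entry is a hypothetical helper name.
add_monit_inittab_entry() {
    inittab="${1:-/etc/inittab}"
    if grep -q '^mo:' "$inittab"; then
        echo "monit entry already present in $inittab"
        return 0
    fi
    cat >> "$inittab" <<'EOF'
#+SPM monit daemon 09Apr2009
mo:2345:respawn:/usr/bin/monit -Ic /etc/monit.conf
EOF
}

# Example (as root):
#   add_monit_inittab_entry /etc/inittab
```

Taking the file as an argument makes it easy to try the helper on a copy of inittab before touching the real one.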
Before "starting" this we need to fix the default configuration.
The config file is /etc/monit.conf
and the following lines are the ones to make sure are present (suitably customized for your install):
set daemon 60
set logfile syslog facility log_daemon
set mailserver 10.10.1.3, umopt1.aglt2.org, umopt1.grid.umich.edu, localhost
set eventqueue basedir /var/monit slots 100
set alert firstname.lastname@example.org
set httpd port 2812 and use address linat02.grid.umich.edu
check system linat02.grid.umich.edu if loadavg (5min) > 4 then alert
Some quick comments on the options in the monit.conf file above. First, you need to point
set mailserver <Your_smtp_server>
at an appropriate mail server that is reachable from this host. As you can see, you are allowed to provide a list
of servers. The
set alert <email_address>
line should be configured to use an appropriate email destination.
The set httpd
line needs to be set up for this host. Put in your own password for the 'admin' user. NOTE: protect this file so only 'root' can read it! You can also specify the hosts/subnets which are allowed to connect. To enable "ssl" you add ssl enable
but NOTE this also requires a pemfile line. If you already have host certificates, you can create the 'monit.pem' file as follows:
- Copy the hostkey.pem: cp /etc/grid-security/hostkey.pem /etc/grid-security/monit.pem (NOTE: doing this gives the monit.pem file the right protection.)
- Append the hostcert.pem: cat /etc/grid-security/hostcert.pem >> /etc/grid-security/monit.pem
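The two steps above can also be done in one pass. The following is a minimal sketch; make_monit_pem is a hypothetical helper name, and the explicit umask and chmod are an assumption added so the result is owner-only regardless of the permissions on the source files.

```shell
#!/bin/sh
# Build a combined PEM file: private key first, then the certificate.
# make_monit_pem is a hypothetical helper, not an existing tool.
make_monit_pem() {
    key="$1"
    cert="$2"
    out="$3"
    # Create the file with a restrictive umask so the key is never
    # world-readable, even briefly.
    ( umask 077
      cat "$key" "$cert" > "$out" ) || return 1
    # Tighten permissions in case the output file already existed.
    chmod 600 "$out"
}

# Example (as root):
#   make_monit_pem /etc/grid-security/hostkey.pem \
#                  /etc/grid-security/hostcert.pem \
#                  /etc/grid-security/monit.pem
```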
The check system
line also needs to be customized with the host name of the install host. The last line "includes" whatever other configurations you want to apply to this host. This is nice and creates a "plugin" environment where you can supply common service, device or resource configurations that can be easily shared between monit installations.
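For orientation, the httpd and include pieces described in the comments above might look roughly like the output of the sketch below. Everything marked hypothetical is an assumption: the subnet, the placeholder password, the helper name, and the /etc/monit.d/* include path (chosen to match the /etc/monit.d/ directory referenced on this page). Treat it as a starting point and check it with monit -t before relying on it.

```shell
#!/bin/sh
# Print an example "set httpd" stanza plus an include line for a host.
# print_monit_httpd_and_include is a hypothetical helper name.
print_monit_httpd_and_include() {
    host="$1"
    cat <<EOF
set httpd port 2812 and use address $host
    ssl enable
    pemfile /etc/grid-security/monit.pem
    allow 10.10.0.0/255.255.0.0   # hypothetical subnet allowed to connect
    allow admin:CHANGE_ME         # hypothetical; put in your own password

include /etc/monit.d/*
EOF
}

# Example:
#   print_monit_httpd_and_include linat02.grid.umich.edu
```

Because the stanza contains a password, the file it ends up in should be readable by root only (for example, mode 600).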
To start monit via the inittab simply do
Managing Monit on AGLT2 Nodes
The 'monit' service is very persistent: if you turn off services it is monitoring, it will quickly restart them (and/or alert on that fact). If you need to change the state of a service "manually", be sure to reconfigure monit
to disable monitoring for that service first. This can be done via the web interface (see list below) or via the monit
command line interface.
Here are some useful monit commands:
monit -t # This tests the current configuration's syntax for validity
monit status # Gives information on monit's status (details of what it is monitoring and their status)
monit unmonitor <x> # Turns off monitoring for <x>
monit reload # Reload the (updated?) configuration
monit -h # Get list of commands possible
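For manual maintenance on a monitored service, the unmonitor step can be wrapped around the work so that monitoring is always turned back on afterwards. This is a minimal sketch; with_monit_unmonitored is a hypothetical helper name, and it assumes the service name given matches the name used in the monit configuration.

```shell
#!/bin/sh
# Run a command while monit is not watching the named service, then
# re-enable monitoring. with_monit_unmonitored is a hypothetical helper.
with_monit_unmonitored() {
    svc="$1"
    shift
    monit unmonitor "$svc" || return 1
    # Run the maintenance command and remember its exit status.
    "$@"
    rc=$?
    # Re-enable monitoring even if the command failed.
    monit monitor "$svc" || return 1
    return "$rc"
}

# Example (as root):
#   with_monit_unmonitored ntpd /sbin/service ntpd restart
```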
Also, if you update or change services that 'monit' is watching, you may
need to update the corresponding configuration in /etc/monit.d/.
If you don't, and something about the service configuration is different after your change, 'monit' may complain or fail to properly handle this service until you fix the config.
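After editing a configuration, one cautious pattern is to test the syntax first and reload only if the test passes. The sketch below uses only the monit -t and monit reload commands listed above; the helper name is hypothetical.

```shell
#!/bin/sh
# Reload monit only when the configuration passes the syntax check.
# monit_check_and_reload is a hypothetical helper name.
monit_check_and_reload() {
    if monit -t; then
        monit reload
    else
        echo "monit configuration has syntax errors; not reloading" >&2
        return 1
    fi
}

# Example (as root), after editing a file in /etc/monit.d/:
#   monit_check_and_reload
```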
Current "Monit" Service/Resource Configurations
For AGLT2 we are primarily monitoring the following services and resources:
- MySQL via a mysqld.conf configuration. Needs customization for the PID file, MySQL port and MySQL socket. Will restart the 'mysql' or 'mysqld' service as required if the service fails.
- ntpd via an ntpd.conf configuration. This one is fairly generic and shouldn't require customization. Checks the ntp service directly on udp port 123 as well. Will (re)start ntpd as required if it is not running or fails.
- Root filesystem via a filesystem.conf configuration. This is also generic and shouldn't require customization. Monitors the '/' filesystem and alerts if the flags change (e.g. changes to RDONLY) or if the disk usage goes over 98%.
- LFC via lfcdaemon.lfc configuration. Monitors the lfcdaemon process and the lfc log file. Can restart the lfcdaemon if either the CPU usage is > 80% or the log file has not been updated in 60 minutes. Alerts are sent if CPU usage > 60% or the log file isn't changing in 5 minutes.
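To give a feel for what one of these drop-in configurations contains, here is a sketch of an ntpd check along the lines described above. It is not the actual AGLT2 ntpd.conf: the pidfile and init script paths are assumptions for a Red Hat style host, and write_ntpd_check is a hypothetical helper name.

```shell
#!/bin/sh
# Write an example ntpd check into a monit drop-in directory.
# write_ntpd_check is a hypothetical helper; paths in the generated
# file are assumptions and may need adjusting for your host.
write_ntpd_check() {
    confdir="${1:-/etc/monit.d}"
    cat > "$confdir/ntpd.conf" <<'EOF'
# Restart ntpd if the process is gone or UDP port 123 stops answering.
check process ntpd with pidfile /var/run/ntpd.pid
    start program = "/etc/init.d/ntpd start"
    stop program  = "/etc/init.d/ntpd stop"
    if failed host 127.0.0.1 port 123 type udp then restart
EOF
}

# Example (as root):
#   write_ntpd_check /etc/monit.d && monit -t
```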
We need to create additional configurations for the following:
- httpd and/or apache
- dCache services --- there is a large list of possibilities here
List of "Monit" URLs for AGLT2
NOTE: These are only accessible from AGLT2 IPs!
- linat02 monit services (AFS DB server, GUMS server, NIS/KRB5 server)
- linat03 monit services (AFS DB server, GUMS server, NIS/KRB5 server)
- linat04 monit services (AFS DB server, GUMS server, NIS/KRB5 server)
- linat05 monit services (Web Server, CFengine server)
- linat06 monit services (AFS File server)
- linat07 monit services (AFS File server)
- linat08 monit services (AFS File server)
- gate02 monit services (Globus Gatekeeper)
- gate01 monit services (Globus Gatekeeper)
- lfc monit services (AGLT2 LFC server)
- dq2 monit services (AGLT2 DQ2 server)
- 09 Apr 2009