How to monitor Linux heartbeat with SNMP
Monitoring is one of the most vital parts of any online business. A server that fails to deliver its content to a client is a big problem, which is why service disruption and downtime are our worst enemies. Some downtime is impossible to predict, so monitoring your systems is the best thing you can do. Have you ever asked yourself what 99% availability means? About 7 hours of downtime per month, and 7 hours can be very frustrating for a client.
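To make the availability figure concrete, here is a small sketch that converts an availability percentage into hours of allowed downtime. It assumes a 30-day month unless told otherwise, and the function name downtime_hours is only illustrative.

```shell
# Hypothetical helper: hours of downtime allowed by an availability target.
# $1 = availability in percent
# $2 = length of the period in days (assumed default: 30)
downtime_hours () {
    awk -v avail="$1" -v days="${2:-30}" \
        'BEGIN { printf "%.1f\n", (100 - avail) / 100 * days * 24 }'
}

downtime_hours 99        # prints 7.2  (hours per 30-day month)
downtime_hours 99 365    # prints 87.6 (hours per year)
```

So 1% of a 30-day month (720 hours) is 7.2 hours, which is where the "7 hours" above comes from.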
In this article I will show you how to monitor your heartbeat (Linux HA) servers with SNMP and Nagios. Heartbeat ships with an agent (hbagent) that can send traps to the snmpd daemon when it detects an event on the running heartbeat. Knowing when a load balancer goes down helps you detect and fix the problem in your environment quickly and efficiently.
Loadbalancer (HA) Environment
* Debian GNU/Linux version 5.0 “Lenny” 64 bit
* heartbeat2 2.1.3-6lenny0
* ipvsadm 1.24-2.1
* ldirectord 2.1.3-6lenny0
* iproute 20080725-2
* snmpd 5.4.1~dfsg-12
Nagios Server Environment
* Debian GNU/Linux version 5.0 “Lenny” 64 bit
* snmp 5.4.1~dfsg-12
* nagios 3.1.0 (of course with all dependencies installed, like gd etc.)
* bash
Configuring SNMP on HA Server
You don’t need a read-write community to query the SNMP server, so we will use a read-only one.
Just edit /etc/snmp/snmpd.conf
com2sec readonly default community
Uncomment the master directive
master agentx
and add after it
trap2sink localhost
Now restart the snmpd server.
/etc/init.d/snmpd restart
Restarting network management services: snmpd.
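A quick way to confirm that the three directives used above are active (present and not commented out) is to grep for them. The sketch below is only an illustration: the function name check_snmpd_conf is made up, and it checks only that a line starts with each directive name, not that its arguments are correct.

```shell
# Hypothetical helper: report which of the directives used in this article
# are not active in an snmpd.conf file.
# $1 = path to the config file (assumed default: /etc/snmp/snmpd.conf)
check_snmpd_conf () {
    local conf="${1:-/etc/snmp/snmpd.conf}"
    local missing=0
    local directive
    for directive in com2sec master trap2sink; do
        # A commented-out line starts with "#", so it will not match here.
        if ! grep -Eq "^[[:space:]]*${directive}[[:space:]]" "$conf"; then
            echo "not active: $directive"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_snmpd_conf without arguments checks /etc/snmp/snmpd.conf and returns a non-zero status if any directive is missing.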
Configuring Heartbeat
Open /etc/ha.d/ha.cf and add
respawn root /usr/lib/heartbeat/hbagent
Now restart your heartbeat
/etc/init.d/heartbeat restart
Stopping High-Availability services:
Done.
Waiting to allow resource takeover to complete:
Done.
Starting High-Availability services:
Done.
If you have at least 2 HA nodes you should not have any problem with your service, but to be safe plan a short downtime window when you do this.
Now you can check whether SNMP exposes information about heartbeat:
snmpwalk -v2c -c community -mLINUX-HA-MIB localhost enterprises.4682
LINUX-HA-MIB::LHATotalNodeCount.0 = Counter32: 2
LINUX-HA-MIB::LHALiveNodeCount.0 = Counter32: 2
LINUX-HA-MIB::LHACurrentNodeID.0 = INTEGER: 1
LINUX-HA-MIB::LHAResourceGroupCount.0 = Counter32: 0
LINUX-HA-MIB::LHANodeName.1 = STRING: lb-2
LINUX-HA-MIB::LHANodeName.2 = STRING: lb-1
LINUX-HA-MIB::LHANodeType.1 = INTEGER: normal(1)
LINUX-HA-MIB::LHANodeType.2 = INTEGER: normal(1)
LINUX-HA-MIB::LHANodeStatus.1 = INTEGER: active(3)
LINUX-HA-MIB::LHANodeStatus.2 = INTEGER: active(3)
LINUX-HA-MIB::LHANodeUUID.1 = STRING: 52e7034b-c221-4aae-a23e-ac1b7f6ec638
LINUX-HA-MIB::LHANodeUUID.2 = STRING: 81f09ed-5f41-42c5-8c39-9ea9055ed1c5
[ ... snip ... ]
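When there are more than a couple of nodes, the raw walk is hard to read. As a sketch, the output can be condensed into one line per node with awk. This assumes symbolic (MIB-translated) output like the listing above, and the function name node_status_table is hypothetical.

```shell
# Hypothetical helper: read snmpwalk output on stdin and print one line per
# node in the form "<index> <name> <status>".
# Assumes symbolic output such as:
#   LINUX-HA-MIB::LHANodeName.1 = STRING: lb-2
node_status_table () {
    awk '
        /::LHANodeName[.]/ || /::LHANodeStatus[.]/ {
            split($1, part, "[.]")       # part[2] is the node index
            idx = part[2] + 0
            if (idx > max) max = idx
        }
        /::LHANodeName[.]/   { name[idx] = $NF }
        /::LHANodeStatus[.]/ { status[idx] = $NF }
        END {
            for (i = 1; i <= max; i++)
                if (i in name) print i, name[i], status[i]
        }
    '
}

# Example use (same community and host as in this article):
#   snmpwalk -v2c -c community -mLINUX-HA-MIB localhost enterprises.4682 | node_status_table
```

For the two nodes shown in the listing this prints "1 lb-2 active(3)" and "2 lb-1 active(3)".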
Now we can start monitoring with nagios.
SNMP information and our code
What information can we get from SNMP about heartbeat?
1. LINUX-HA-MIB::LHATotalNodeCount.0 – Number of nodes
2. LINUX-HA-MIB::LHALiveNodeCount.0 – Number of live nodes
3. LINUX-HA-MIB::LHANodeStatus.x – Status of the node x
Based on these SNMP variables, here is some pseudocode showing how my Nagios script works:
if LHATotalNodeCount.0 == 0 then exit(critical) // probably hbagent is not running
if LHATotalNodeCount.0 != LHALiveNodeCount.0 then // it is possible that some nodes are lost
    if LHALiveNodeCount.0 == 0 exit(critical) // yes, we have lost all nodes
    else
        exit(warning) // we lost just some nodes
fnode = 0
for each node x
    if LHANodeStatus.x != 3 then // 3 means active
        fnode++
if fnode == LHATotalNodeCount.0 then // no node active
    exit(critical)
exit(ok)
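The decision rules can also be tried without any SNMP agent by writing them as a pure function. The following is only a sketch: the function name heartbeat_state is hypothetical, and it additionally returns WARNING when some (but not all) nodes are not active, a rule taken from the full script rather than from the pseudocode.

```shell
# Sketch of the decision rules as a pure function (no SNMP involved).
# $1 = total node count, $2 = live node count,
# remaining arguments = one LHANodeStatus value per node (3 means active).
# Prints OK, WARNING or CRITICAL.
heartbeat_state () {
    local total="$1" live="$2" failed=0 status
    shift 2
    if [ -z "$total" ] || [ "$total" -eq 0 ]; then
        echo CRITICAL              # hbagent is probably not running
        return
    fi
    if [ "$total" -ne "$live" ]; then
        if [ "$live" -eq 0 ]; then
            echo CRITICAL          # all nodes lost
        else
            echo WARNING           # only some nodes lost
        fi
        return
    fi
    for status in "$@"; do
        if [ "$status" != 3 ]; then
            failed=$((failed + 1))
        fi
    done
    if [ "$failed" -eq "$total" ]; then
        echo CRITICAL              # no node is active
    elif [ "$failed" -gt 0 ]; then
        echo WARNING               # rule from the full script, not the pseudocode
    else
        echo OK
    fi
}

heartbeat_state 2 2 3 3    # prints OK
heartbeat_state 2 1 3 1    # prints WARNING
```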
The bash script code is
#!/bin/bash
# Author: Stanila Constantin Adrian
# Date: 20/03/2009
# Description: check the number of active heartbeats
# http://www.randombugs.com

# Get program path
REVISION=1.3
PROGNAME=`/bin/basename "$0"`
PROGPATH=`echo "$0" | /bin/sed -e 's,[\\/][^\\/][^\\/]*$,,'`

# Nagios error codes (STATE_OK, STATE_WARNING, ...) and helper functions
. "$PROGPATH/utils.sh"

usage () {
    echo "\
Nagios plugin to heartbeat.

Usage:
  $PROGNAME -H host -C community
  $PROGNAME [--help | -h]
  $PROGNAME [--version | -v]

Options:
  -H            Hostname for the SNMP query
  -C            Community for the SNMP query
  --help -h     Print this help information
  --version -v  Print version of plugin
"
}

help () {
    print_revision $PROGNAME $REVISION
    echo; usage; echo; support
}

# Verifies if check_snmp exists to ensure snmp utils are installed
# ... probably better we check for snmpwalk ... on the next version
if [ ! -x "${PROGPATH}/check_snmp" ]
then
    echo "UNKNOWN - ${PROGPATH}/check_snmp does not exist"
    exit $STATE_UNKNOWN
fi

while test -n "$1"
do
    case "$1" in
        --help | -h)
            help
            exit $STATE_OK;;
        --version | -v)
            print_revision $PROGNAME $REVISION
            exit $STATE_OK;;
        -H)
            shift
            HOST=$1;;
        -C)
            shift
            COMMUNITY=$1;;
        *)
            usage; exit $STATE_UNKNOWN;;
    esac
    shift
done

if [ "$HOST" == "" ]
then
    echo "Parameter -H is necessary"
    exit $STATE_UNKNOWN
fi

if [ "$COMMUNITY" == "" ]
then
    echo "Parameter -C is necessary"
    exit $STATE_UNKNOWN
fi

# Exec snmp query
OID=.1.3.6.1.4.1.4682
declare -i I=0

# Print the bare value of one OID, e.g. "2" for "... = Counter32: 2".
# -Oe prints enumerations numerically, so active(3) is reported as 3
# even when LINUX-HA-MIB is loaded on the Nagios server.
snmp_value () {
    snmpwalk -v 1 -One -c "${COMMUNITY}" "${HOST}" "$1" | cut -d"=" -f2 | cut -d":" -f2 | tr -d ' \n'
}

#LINUX-HA-MIB::LHATotalNodeCount.0
NODES=$(snmp_value ${OID}.1.1.0)
#LINUX-HA-MIB::LHALiveNodeCount.0
LNODES=$(snmp_value ${OID}.1.2.0)

# No answer at all: hbagent is probably not running
if [ -z "$NODES" ]; then
    echo -e "HEARTBEAT Agent is not running !"
    exit $STATE_CRITICAL
fi

for index in `seq 1 ${NODES}`
do
    #LINUX-HA-MIB::LHANodeStatus.x
    ACT=$(snmp_value ${OID}.2.1.4.${index})
    if [ "$ACT" != 3 ]; then
        let I=I+1
    fi
done

# If number of failures == number of nodes we have a big problem
if [ "$I" == "$NODES" ]; then
    echo -e "HEARTBEAT is running out of nodes !"
    exit $STATE_CRITICAL
fi

# If number of nodes != number of live nodes then we have a minor problem
if [ "$NODES" != "$LNODES" ]; then
    echo -e "HEARTBEAT lost some nodes !"
    exit $STATE_WARNING
fi

# If number of failures != 0 then we have a minor problem (we already checked if I==NODES)
if [ "$I" != 0 ]; then
    echo -e "HEARTBEAT lost some nodes !"
    exit $STATE_WARNING
fi

echo -e "All Heartbeats up and running !"
exit $STATE_OK
Upload the script to your scripts folder, don’t forget to make it executable (chmod +x), and add the following definition to your commands.cfg configuration file:
define command{
command_name check_heartbeat
command_line /path to your folder/check_snmp_heartbeat.sh -H $HOSTADDRESS$ -C $ARG1$
}
Now you are ready to configure your service to run the heartbeat checks.
define service{
use generic-service
hostgroup_name loadbalancers
servicegroups heartbeats
service_description Check Heartbeats
check_command check_heartbeat!community
}
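Before relying on the service definition, it can help to run the plugin by hand and look at its exit code, since that is what Nagios uses to decide the service state. The mapping below uses the standard Nagios plugin return codes; the function name nagios_state_name is made up for this sketch.

```shell
# Map a plugin exit code to the state Nagios will show.
# 0, 1 and 2 are the standard Nagios plugin return codes; 3 means UNKNOWN
# and any other value is reported the same way here.
nagios_state_name () {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Example use (plugin path, host and community are placeholders):
#   ./check_snmp_heartbeat.sh -H lb-1 -C community
#   nagios_state_name $?
```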
Now we will check that your Nagios configuration is OK:
nagios -v /etc/nagios3/nagios.cfg
[ ... snip ... ]
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
If you have 0 warnings and 0 errors you can restart nagios.
/etc/init.d/nagios restart
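The "restart only if the check is clean" rule can be scripted. The sketch below reads the pre-flight output on stdin and succeeds only when both totals are zero. The function name preflight_clean is hypothetical, and it assumes the "Total Warnings:" and "Total Errors:" lines look like the ones shown above.

```shell
# Hypothetical helper: read the output of "nagios -v" on stdin and return
# success only when both "Total Warnings" and "Total Errors" are 0.
preflight_clean () {
    awk '
        /^Total Warnings:/ { warnings = $3; seen++ }
        /^Total Errors:/   { errors = $3; seen++ }
        END { exit !(seen == 2 && warnings == 0 && errors == 0) }
    '
}

# Example use (same paths as in this article):
#   nagios -v /etc/nagios3/nagios.cfg | preflight_clean && /etc/init.d/nagios restart
```

If either total line is missing from the output, the function fails as well, so a broken check never triggers a restart.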
Good luck.
It looks like the above only works if you are using the new CRM style. There is a patch if CRM is off, but it is not in the Debian package.
My syslog has the following:
lha-snmpagent: [14971]: ERROR: oc_ev_activate error [1]
lha-snmpagent: [14971]: debug: Membership service currently not available. Will try again later. errno [2]
lha-snmpagent: [14971]: ERROR: CIB connection signon failed.
lha-snmpagent: [14971]: ERROR: init_cib() failed.
heartbeat: [14928]: WARN: Managed /usr/lib/heartbeat/hbagent process 14971 exited with return code 252.
heartbeat: [14928]: ERROR: Client /usr/lib/heartbeat/hbagent “respawning too fast”
Oh well, thanks anyway.