How to monitor Linux heartbeat with SNMP

20 March 2009

Monitoring Heartbeat

Monitoring is one of the most vital parts of any online business. A server that fails to deliver its content to a client is a big problem, so service disruption and downtime are our worst enemies. Some downtime is impossible to predict, and monitoring your systems is the best thing you can do about it. Have you ever asked yourself what 99% availability means? About 7 hours of downtime per month, and 7 hours can be very frustrating for a client.
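The 7-hour figure is simple arithmetic: a 30-day month has 720 hours, and 1% of that is 7.2 hours. Here is a small sketch of the calculation, assuming a 30-day month; the function name downtime_hours is only illustrative.

```shell
#!/bin/bash
# Allowed downtime, in hours per 30-day month, for an availability percentage.
# downtime_hours is a hypothetical helper name, not part of any tool.
downtime_hours () {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", (100 - pct) / 100 * 30 * 24 }'
}

downtime_hours 99      # prints 7.20
downtime_hours 99.9    # prints 0.72
```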

In this article I will try to show you how to monitor your heartbeat (Linux HA) servers with SNMP and Nagios. Heartbeat ships with its own agent (hbagent), which can send traps to the snmpd daemon when it detects an event in the running heartbeat. Knowing when a load balancer goes down helps you detect and fix problems in your environment quickly and efficiently.

Loadbalancer (HA) Environment
* Debian GNU/Linux version 5.0 “Lenny” 64 bit
* heartbeat2 2.1.3-6lenny0
* ipvsadm 1.24-2.1
* ldirectord 2.1.3-6lenny0
* iproute 20080725-2
* snmpd 5.4.1~dfsg-12

Nagios Server Environment
* Debian GNU/Linux version 5.0 “Lenny” 64 bit
* snmp 5.4.1~dfsg-12
* nagios 3.1.0 (of course, with all dependencies such as gd installed)
* bash 🙂

Configuring SNMP on HA Server

You don’t need a read-write community to query the SNMP server, so we will use a read-only one.

Just edit /etc/snmp/snmpd.conf:

com2sec readonly default community

Uncomment the master directive:

master agentx

and add after it:

trap2sink localhost
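Before restarting, it can be worth confirming that both directives are really active and not still commented out. This is a minimal sketch, assuming the Debian path used above; the function name check_snmpd_conf is made up for this example.

```shell
#!/bin/bash
# Succeeds only if the given snmpd.conf has an uncommented "master agentx"
# line and an uncommented "trap2sink <host>" line.
# check_snmpd_conf is a hypothetical helper, not part of Net-SNMP.
check_snmpd_conf () {
    local conf="$1"
    grep -Eq '^[[:space:]]*master[[:space:]]+agentx' "$conf" &&
    grep -Eq '^[[:space:]]*trap2sink[[:space:]]+[^[:space:]]+' "$conf"
}

# Only run the check if the file is present and readable on this machine.
if [ -r /etc/snmp/snmpd.conf ]; then
    if check_snmpd_conf /etc/snmp/snmpd.conf; then
        echo "snmpd.conf: master agentx and trap2sink are enabled"
    else
        echo "snmpd.conf: master agentx or trap2sink is missing"
    fi
fi
```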

Now restart the snmpd daemon:

/etc/init.d/snmpd restart
Restarting network management services: snmpd.

Configuring Heartbeat

Open /etc/ha.d/ha.cf and add:

respawn root /usr/lib/heartbeat/hbagent

Now restart heartbeat:

/etc/init.d/heartbeat restart
Stopping High-Availability services:
Done.

Waiting to allow resource takeover to complete:
Done.

Starting High-Availability services:
Done.

If you have at least two HA nodes your service should not be affected, but to be safe plan a short downtime window when you do this.

Now you can check whether SNMP returns information about heartbeat:

snmpwalk -c community -On localhost -v2c -mLINUX-HA-MIB enterprises.4682
LINUX-HA-MIB::LHATotalNodeCount.0 = Counter32: 2
LINUX-HA-MIB::LHALiveNodeCount.0 = Counter32: 2
LINUX-HA-MIB::LHACurrentNodeID.0 = INTEGER: 1
LINUX-HA-MIB::LHAResourceGroupCount.0 = Counter32: 0
LINUX-HA-MIB::LHANodeName.1 = STRING: lb-2
LINUX-HA-MIB::LHANodeName.2 = STRING: lb-1
LINUX-HA-MIB::LHANodeType.1 = INTEGER: normal(1)
LINUX-HA-MIB::LHANodeType.2 = INTEGER: normal(1)
LINUX-HA-MIB::LHANodeStatus.1 = INTEGER: active(3)
LINUX-HA-MIB::LHANodeStatus.2 = INTEGER: active(3)
LINUX-HA-MIB::LHANodeUUID.1 = STRING: 52e7034b-c221-4aae-a23e-ac1b7f6ec638
LINUX-HA-MIB::LHANodeUUID.2 = STRING: 81f09ed-5f41-42c5-8c39-9ea9055ed1c5
[ … snip … ]
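Every line of the walk has the shape "name = TYPE: value", and a monitoring script only needs the value. Here is a small sketch of extracting it, run on sample lines copied from the output above so it works without a live agent; snmp_value is a hypothetical helper name.

```shell
#!/bin/bash
# Print only the value part of "name = TYPE: value" lines read from stdin.
# First strip everything up to the "=", then the type label up to the ":".
snmp_value () {
    sed -e 's/^[^=]*=[[:space:]]*//' -e 's/^[^:]*:[[:space:]]*//'
}

echo 'LINUX-HA-MIB::LHALiveNodeCount.0 = Counter32: 2' | snmp_value
# prints: 2
echo 'LINUX-HA-MIB::LHANodeName.1 = STRING: lb-2' | snmp_value
# prints: lb-2
```

The Net-SNMP tools also have output options (-Oqv) that should print the bare value directly; the sed version is shown because it can be tried on plain text.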

Now we can start monitoring with nagios.

SNMP information and our code

What information can we get from SNMP about heartbeat?

1. LINUX-HA-MIB::LHATotalNodeCount.0 – Number of nodes
2. LINUX-HA-MIB::LHALiveNodeCount.0 – Number of live nodes
3. LINUX-HA-MIB::LHANodeStatus.x – Status of node x

Based on these SNMP variables, here is some pseudocode that shows how my Nagios script will work:

if LHATotalNodeCount.0 == 0 then
    exit(critical)            // probably hbagent is not running

if LHATotalNodeCount.0 != LHALiveNodeCount.0 then   // some nodes may be lost
    if LHALiveNodeCount.0 == 0 then
        exit(critical)        // yes, we have lost all nodes
    else
        exit(warning)         // we lost just some nodes

fnode = 0
for each node x do
    if LHANodeStatus.x != 3 then   // 3 means active
        fnode++

if fnode == LHATotalNodeCount.0 then   // no node is active
    exit(critical)

exit(ok)
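The same decision logic can be written as a small self-contained bash function, which makes it easy to try without a running cluster. This is only a sketch of the pseudocode, not the plugin itself: the name heartbeat_state and its argument order are assumptions made for the example.

```shell
#!/bin/bash
# Nagios plugin exit codes.
STATE_OK=0; STATE_WARNING=1; STATE_CRITICAL=2

# heartbeat_state TOTAL LIVE STATUS...
# TOTAL and LIVE are the two node counters; STATUS... is one LHANodeStatus
# value per node (3 means active). The Nagios state is the return status.
# heartbeat_state is a hypothetical helper name.
heartbeat_state () {
    local total="$1" live="$2" failed=0 status
    shift 2
    # No nodes reported at all: hbagent is probably not running.
    if [ "$total" -eq 0 ]; then return $STATE_CRITICAL; fi
    if [ "$total" -ne "$live" ]; then
        # All nodes lost is critical, some nodes lost is a warning.
        if [ "$live" -eq 0 ]; then return $STATE_CRITICAL; fi
        return $STATE_WARNING
    fi
    # Count the nodes whose status is not "active".
    for status in "$@"; do
        if [ "$status" -ne 3 ]; then failed=$((failed + 1)); fi
    done
    if [ "$failed" -eq "$total" ]; then return $STATE_CRITICAL; fi
    return $STATE_OK
}

heartbeat_state 2 2 3 3 && echo "two active nodes: OK"
heartbeat_state 2 1 3 1 || echo "one node lost: state $?"
```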

The bash script is:

#!/bin/bash                                      
# Author: Stanila Constantin Adrian
# Date: 20/03/2009                               
# Description: check the number of active heartbeats
# http://www.randombugs.com                               
 
# Get program path
REVISION=1.3      
PROGNAME=`/bin/basename $0`
PROGPATH=`echo $0 | /bin/sed -e 's,[\\/][^\\/][^\\/]*$,,'`
 
#nagios error codes
. $PROGPATH/utils.sh
usage () {
    echo "\
Nagios plugin to heartbeat.
 
Usage:
  $PROGNAME -H host -C community 
  $PROGNAME [--help | -h]        
  $PROGNAME [--version | -v]     
 
Options:
  -H    Hostname for the snmp heartbeat query
  -C    Community for the snmp heartbeat query
  --help -h     Print this help information
  --version -v  Print version of plugin    
"                                          
}                                          
 
help () {
    print_revision $PROGNAME $REVISION
    echo; usage; echo; support        
}                                     
 
 
# Verify that snmpwalk, which the queries below rely on, is installed
if ! command -v snmpwalk >/dev/null 2>&1
then
  echo "UNKNOWN - snmpwalk not found in PATH"
  exit $STATE_UNKNOWN
fi
 
while test -n "$1"
do                
  case "$1" in    
    --help | -h)  
      help        
      exit $STATE_OK;;
    --version | -v)   
      print_revision $PROGNAME $REVISION
      exit $STATE_OK;;                  
    -H)                                 
      shift                             
      HOST=$1;;                         
    -C)                                 
      shift                             
      COMMUNITY=$1;;                    
    *)                                  
      usage; exit $STATE_UNKNOWN;;      
  esac
  shift                                 
done                                    
 
if [ "$HOST" == "" ]
then        
  echo "Parameter -H is necessary"
  exit $STATE_UNKNOWN
fi
 
if [ "$COMMUNITY" == "" ]
then
  echo "Parameter -C is necessary"
  exit $STATE_UNKNOWN
fi
 
# Exec snmp query
OID=.1.3.6.1.4.1.4682
 
declare -i I=0
#LINUX-HA-MIB::LHATotalNodeCount.0
NODES=$(snmpwalk -v 1 -On -c ${COMMUNITY} ${HOST} ${OID}.1.1.0 | cut -d"=" -f2 | cut -d":" -f2 | sed 's/ //g' | tr '\n' ' ')
#LINUX-HA-MIB::LHALiveNodeCount.0
LNODES=$(snmpwalk -v 1 -On -c ${COMMUNITY} ${HOST} ${OID}.1.2.0 | cut -d"=" -f2 | cut -d":" -f2 | sed 's/ //g' | tr '\n' ' ')
 
# NODES is empty when the agent did not answer
if [ -z "$NODES" ]; then
        echo -e "HEARTBEAT Agent is not running !"
        exit $STATE_CRITICAL
fi
 
for index in `seq 1 ${NODES}`
do
    #LINUX-HA-MIB::LHANodeStatus.x
    ACT=$(snmpwalk -v 1 -On -c ${COMMUNITY} ${HOST} ${OID}.2.1.4.${index} | cut -d"=" -f2 | cut -d":" -f2 | sed 's/ //g' | tr '\n' ' ')
    # Strip the padding space; an empty answer also counts as a failed node
    if [ "${ACT// /}" != "3" ]; then
        let I=I+1
    fi
done
 
#if Number of failures == number of nodes we have a big problem
if [ $I == $NODES ]; then
        echo -e "HEARTBEAT is running out of nodes !"
        exit $STATE_CRITICAL
fi
# If the number of nodes != the number of live nodes then we have a minor problem
if [ $NODES != $LNODES ]; then
        echo -e "HEARTBEAT lost some nodes !"
        exit $STATE_WARNING
fi
# If the number of failures != 0 then we have a minor problem (we already checked I == NODES)
if [ $I != 0 ]; then
        echo -e "HEARTBEAT lost some nodes !"
        exit $STATE_WARNING
fi
 
echo -e "All Heartbeats up and running !"
exit $STATE_OK
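Nagios decides the service state from the plugin's exit status: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. These are the values that utils.sh provides as STATE_OK and friends. When you test a plugin by hand it is handy to see that status in words. The following is a sketch only: state_name and run_check are made-up helper names, and a stand-in command is used instead of the real plugin so the example runs anywhere.

```shell
#!/bin/bash
# Translate a plugin exit status into the Nagios state name.
# Anything outside 0-2 is reported as UNKNOWN in this sketch.
state_name () {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Run any plugin command line and report the state Nagios would record.
run_check () {
    local rc=0
    "$@" || rc=$?
    echo "state: $(state_name "$rc")"
}

# Stand-in for the real plugin: prints a message and exits with WARNING.
run_check sh -c 'echo "HEARTBEAT lost some nodes !"; exit 1'
```

With the plugin in place, the same wrapper could be called as run_check ./check_snmp_heartbeat.sh -H lb-1 -C community.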

Upload the script to your Nagios scripts folder (the one that contains utils.sh, which the script sources), make it executable (chmod +x), and add the following command definition to your commands.cfg configuration file:

define command{
command_name check_heartbeat
command_line /path to your folder/check_snmp_heartbeat.sh -H $HOSTADDRESS$ -C $ARG1$
}

Now configure a service that runs the heartbeat check:

define service{
use generic-service
hostgroup_name loadbalancers
servicegroups heartbeats
service_description Check Heartbeats
check_command check_heartbeat!community
}

Now check that your Nagios configuration is OK:

nagios -v /etc/nagios3/nagios.cfg
[ … snip … ]
Total Warnings: 0
Total Errors: 0

Things look okay – No serious problems were detected during the pre-flight check

If you have 0 warnings and 0 errors you can restart nagios.

/etc/init.d/nagios restart
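The check and the restart can also be chained, so that Nagios is only restarted when the pre-flight output is clean. Here is a sketch that parses the two "Total" lines shown above; preflight_clean is a hypothetical helper name, and the sample text stands in for real nagios -v output.

```shell
#!/bin/bash
# Succeeds only if the pre-flight output on stdin reports both
# "Total Warnings: 0" and "Total Errors: 0".
preflight_clean () {
    awk '/^Total Warnings:/ { w = $3; seen_w = 1 }
         /^Total Errors:/   { e = $3; seen_e = 1 }
         END { exit !(seen_w && seen_e && w == 0 && e == 0) }'
}

# Sample taken from the output shown above.
sample='Total Warnings: 0
Total Errors: 0'

if printf '%s\n' "$sample" | preflight_clean; then
    echo "configuration is clean, safe to restart"
fi
```

On a real server the pipeline would be nagios -v /etc/nagios3/nagios.cfg | preflight_clean && /etc/init.d/nagios restart.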

Good luck.

