Archive for the ‘ Nagios Plugins ’ Category

Haven’t updated in a while, so I’ll write a good post to make up for it.

Recently, I’ve encountered the need to set up fully redundant nagios servers in a typical pair setup. However, reading the documentation [here], the solutions seemed lacking. The official method is to simply run two different machines with the same configuration. The master should do everything a normal nagios server should do, but the slave should have its host & service checks and notifications disabled. Then, a separate mechanism is set up so that the slave can “watch” the master, and enable the aforementioned features if the master goes down. Then, still checking, the slave will disable those features again when the master comes back alive.

Well, this solution sounds just fine in theory, but in practice it really creates several more problems than it solves. For instance, acknowledgements, scheduled downtimes, and comments made do NOT synchronize with the official method. Their mechanism does not allow for this, as it uses the obsessive-compulsive service and host parameters, which can be executed after every single check is run. It therefore has no access to the comment/acknowledgement data, so it simply cannot synchronize.

So can this data be synchronized? The short answer is yes, but I’ll explain the game plan before we dive in. Nagios can (unknowingly) provide us with another synchronization method: its own internal retention.dat file. The nagios process rewrites this file periodically (every 90 minutes in my configuration), and it contains all of the information necessary to restore nagios to the state it was in when it exited. Sounds like exactly what we need! So we’re now going to stop thinking about running nagios, and start thinking about how we can take this blob of ASCII data, ensure it never gets corrupted, and keep it updated as frequently as possible. This is, after all, the true goal of the situation.

First and foremost, we will need a nagios installation! You can follow their documentation on this one and set one up for yourself. It’s ok, I can wait.

Second, we need another nagios installation on a second server! Hop to it.

Third, all of the nagios configuration files need to be kept synchronized between these two hosts. I use puppet to distribute config files to the servers I administer, so my implementation is unfortunately highly specific. You may need to find another way to synchronize your config files, but this should not be terribly difficult (perhaps a Makefile with a version-control repository and ssh keys?). The official nagios failover deployment needs this as well. One “gotcha” I faced was the need to change the configuration parameter “retention_update_interval=90” to “retention_update_interval=1” (“retain_state_information” is only the on/off switch and must stay enabled). Do not forget to do this, or else the retention file will only be refreshed once every 90 minutes.
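Because a forgotten interval setting fails silently, a quick check on both hosts may be worth having. The following is only a minimal sketch: `cfg_value` and `retention_interval_ok` are hypothetical helper names, the configuration path in the example is an assumption, and it relies on stock Nagios naming the interval directive (in minutes) `retention_update_interval`.

```shell
#!/bin/bash
# Hypothetical helper: print the value of a "key=value" directive from a
# Nagios main configuration file, ignoring commented-out lines.
cfg_value() {
    local file="$1" key="$2"
    # Take the last matching line, in case the directive appears more than once.
    grep -E "^[[:space:]]*${key}[[:space:]]*=" "$file" \
        | tail -n 1 | cut -d= -f2- | tr -d '[:space:]'
}

# Succeed only if the retention file is flushed every minute.
retention_interval_ok() {
    [ "$(cfg_value "$1" retention_update_interval)" = "1" ]
}

# Example (the path is an assumption; adjust it to your installation):
# retention_interval_ok /etc/nagios/nagios.cfg || echo "retention_update_interval is not 1"
```

Running the check from the same mechanism that distributes the config files keeps both hosts honest.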

Fourth, you will need to deploy this script on both hosts and configure its requirements. You will see embedded ERB syntax in this script. That is because puppet allows me to handle the differences between hosts inline: the final configuration files are generated on the fly, then pushed to the clients.

[root@puppet ~]# cat /usr/bin/nagios-watchdog.sh.erb
#!/bin/bash

# Executable variables. Useful.
RM="/bin/rm -f"
MV="/bin/mv"
ECHO="/bin/echo -e"
FIXFILES="/sbin/fixfiles"
MAILER=/usr/sbin/sendmail
SUBJECT="URGENT: nagios master process switch has taken place."
RECIPIENT="sysadmin@example.com"
SERVICE=/etc/init.d/nagios
RETENTIONFILE=/var/log/nagios/retention.dat

# This is where we point the servers at each other (configure this properly in your deployment!)…
<% if fqdn == "nagios1.example.com" %>
MASTERHOST=192.168.1.2
<% else %>
MASTERHOST=192.168.1.1
<% end %>

# Ensure only one copy can run at a time…
PIDFILE=/var/run/nagios-watchdog.pid
if [ -e "${PIDFILE}" ]; then
    exit 1
else
    touch "${PIDFILE}"
fi

# Checks the actual daemon status on the other host…
su nagios -c "ssh ${MASTERHOST} \"/etc/init.d/nagios status\" >/dev/null 2>&1"

# Is the other host doing all the work?
if [ $? -eq 0 ]; then
    # Stop what I'm doing…
    ${SERVICE} stop >/dev/null 2>&1

    # Copy the retention data from the other nagios process…
    su nagios -c "scp ${MASTERHOST}:${RETENTIONFILE} /tmp/"

    # Verify that we didn't get a corrupted copy (lines with "{" must match lines with "}")…
    if [ "$(grep -c '{' /tmp/retention.dat)" -eq "$(grep -c '}' /tmp/retention.dat)" ]; then
        ${MV} /tmp/retention.dat ${RETENTIONFILE}
    else
        ${RM} /tmp/retention.dat
    fi
    ${FIXFILES} restore /var/log/nagios
else
    # The other host is not running nagios, so make sure I am…
    ${SERVICE} status >/dev/null 2>&1
    if [ $? -ne 0 ]; then
        # The blank line (\n\n) separates the mail headers from the body.
        ${ECHO} "From: nagios-watchdog@$(hostname)\nSubject: ${SUBJECT}\nTo: ${RECIPIENT}\n\nNow running on host: $(hostname)" | ${MAILER} ${RECIPIENT}
        ${SERVICE} start >/dev/null 2>&1
    fi
fi

${RM} ${PIDFILE}

exit 0;
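The corruption test in the watchdog is the piece most worth trying in isolation. Here is a small sketch of it as a standalone function. The name `retention_looks_complete` is hypothetical, and the extra non-empty check is my own assumption rather than something the script above does (an empty file would otherwise pass, since zero equals zero). It is a heuristic against truncated copies, not a parser.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file is non-empty and has as many
# lines containing "{" as lines containing "}".
retention_looks_complete() {
    local file="$1" opens closes
    [ -s "$file" ] || return 1
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    opens=$(grep -c '{' "$file" || true)
    closes=$(grep -c '}' "$file" || true)
    [ "$opens" -eq "$closes" ]
}

# Example: only install a fetched copy if it passes the check.
# retention_looks_complete /tmp/retention.dat \
#     && mv /tmp/retention.dat /var/log/nagios/retention.dat
```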

The watchdog script has a single requirement: the nagios accounts on each host need passwordless ssh keys for one another. You can still use those keys securely by restricting them with the command= option in the authorized_keys file.
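One way to apply that restriction is a forced-command wrapper that only lets the peer do the two things the watchdog needs. This is a sketch under several assumptions: the wrapper path and function names are hypothetical, the key in the comment is a placeholder, and the `scp -f` string is what the legacy scp protocol sends. Newer OpenSSH releases default to SFTP for scp, so you may need `scp -O` on the client or a different allow-list entry.

```shell
#!/bin/bash
# Hypothetical forced-command wrapper, e.g. saved as
# /usr/local/bin/nagios-peer-only.sh and referenced from authorized_keys:
#   command="/usr/local/bin/nagios-peer-only.sh",no-port-forwarding,no-pty ssh-rsa AAAA... nagios@peer
# sshd places the command the client asked for in SSH_ORIGINAL_COMMAND.

# Succeed only for the two requests the watchdog script makes.
peer_command_allowed() {
    case "$1" in
        "/etc/init.d/nagios status") return 0 ;;
        "scp -f /var/log/nagios/retention.dat") return 0 ;;
        *) return 1 ;;
    esac
}

# Run the request if it is on the allow-list, refuse otherwise.
run_peer_command() {
    if peer_command_allowed "$1"; then
        # Word splitting is intended: the allowed strings are fixed above.
        exec $1
    else
        echo "command not permitted" >&2
        return 1
    fi
}

# In the real wrapper the last line would be:
# run_peer_command "$SSH_ORIGINAL_COMMAND"
```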

Fifth, and finally, we must implement a mutex operation around the running nagios processes. Recall that we are synchronizing copies of nagios’ internal state data, and having a running nagios process is just a luxury. If you look at the script above, it simply ensures that nagios is running on one server, but not both, and that the newest retention.dat file always has priority. The mutex operation doesn’t need to be infinitely accurate; I used the following relatively barbaric solution:

[root@nagios1 ~]# crontab -l
1,5,9,13,17,21,25,29,33,37,41,45,49,53,57 * * * * /usr/bin/nagios-watchdog.sh
0 6 * * * /usr/bin/nagios-reports.sh
0 12 * * * /usr/bin/nagios-reports.sh

[root@nagios2 ~]# crontab -l
3,7,11,15,19,23,27,31,35,39,43,47,51,55,59 * * * * /usr/bin/nagios-watchdog.sh
0 6 * * * /usr/bin/nagios-reports.sh
0 12 * * * /usr/bin/nagios-reports.sh
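The pattern in those minute fields is that each host runs the watchdog every four minutes, offset by two minutes from its peer, so the two never start in the same minute. A tiny sketch of how such lists can be generated rather than typed by hand (`cron_minutes` is a hypothetical helper name):

```shell
#!/bin/bash
# Hypothetical helper: print a comma-separated list of minutes starting at
# "offset" and repeating every "step" minutes within the hour.
cron_minutes() {
    local offset="$1" step="$2"
    seq "$offset" "$step" 59 | paste -sd, -
}

cron_minutes 1 4   # minute field for nagios1
cron_minutes 3 4   # minute field for nagios2
```

Many cron implementations also accept range/step syntax, in which case `1-59/4` and `3-59/4` should express the same schedules more compactly.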

Sixth, and optionally, as you can see above, I’ve also set up redundant reporting. We do a similar test to ensure that at a maximum, only one email report is dispatched for the given timeframe. In this solution, reports could be theoretically lost forever if a specific set of circumstances is met, but that was deemed acceptable in this deployment. To see the real magic behind that script:

[root@puppet ~]# cat /var/lib/puppet/files/nagios/usr/bin/nagios-reports.sh.erb
#!/bin/bash

/etc/init.d/nagios status >/dev/null 2>&1

if [ $? -eq 0 ]; then
    /usr/bin/nagios-reporter.pl --type=24 --embedcss --warnings
fi

And voilà! Of course, nagios-reporter.pl could be any report generation tool you wish; just be sure to call it in the way that suits your reporting needs.

Seventh, and convenient to have, I also wrote these two quick PHP scripts. Throw them in /var/www/html on each nagios box, and do not redirect straight to nagios. Then set up DNS in a round-robin, multiple-A-record fashion. That is,

[root@puppet ~]# dig +short nagios.example.com
192.168.1.1
192.168.1.2

Once you get that set up, insert these two files into the aforementioned directories:

[root@puppet ~]# cat /var/lib/puppet/files/nagios/var/www/html/index.php.erb | sed 's/</\&lt;/g'
<HTML>
<HEAD>
<TITLE>Internal Redirect</TITLE>
</HEAD>
<FRAMESET ROWS="30,100%" BORDER="1" STYLE="border-style:solid" noresize>
<FRAME SRC="switcher.php" NAME="switcher"/>
<?php

// This will set the $me and $you variables correctly…
$me = "<%= fqdn %>";
if($me == "nagios1.example.com")
{ $you = "nagios2.example.com"; }
else
{ $you = "nagios1.example.com"; }

# Test whether or not nagios is running locally.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://localhost/nagios/cgi-bin/status.cgi");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
$output = curl_exec($ch);
curl_close($ch);
$pos = strpos($output, "Whoops!");

if($pos === false)
{ echo("<FRAME SRC=\"https://$me/nagios/\" NAME=\"activenode\"/>"); }
else
{ echo("<FRAME SRC=\"https://$you/nagios/\" NAME=\"activenode\"/>"); }

?>

</FRAMESET>
</HTML>

[root@puppet ~]# cat /var/lib/puppet/files/nagios/var/www/html/switcher.php.erb | sed 's/</\&lt;/g'
<HTML>
<HEAD>
<TITLE>Switcher</TITLE>
</HEAD>
<BODY>
<CENTER>
<FONT SIZE="-1">

<?php
$me = "<%= fqdn %>";
if($me == "nagios1.example.com")
{ $you = "nagios2.example.com"; }
else
{ $you = "nagios1.example.com"; }

# Test whether or not nagios is running locally.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://localhost/nagios/cgi-bin/status.cgi");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
$output = curl_exec($ch);
curl_close($ch);
$pos = strpos($output, "Whoops!");

if($pos === false)
{ $current = $me; }
else
{ $current = $you; }

echo("Currently using: $current [<a href=\"javascript:parent.document.location.reload(true)\">Redetect active node</a>]");
?>

</FONT>
</CENTER>
</BODY>
</HTML>

So now you can simply visit http://nagios.example.com and nagios will always be displayed, unless a particularly bizarre set of circumstances has occurred. If that happens, don’t panic! Remember, we set our minds correctly in the beginning, and the integrity of the retention.dat files is not in question. The scripts may just take a minute or two to adjust themselves properly. For those who worry that the DNS failover wouldn’t work, I’ve verified that it does in some of the popular browsers. In all but the rarest circumstances there is no 90-second timeout delay, either. I verified that a timeout can occur if the first connect() call’s SYN packets are dropped completely, but that is the aforementioned rare circumstance. Most testing of this is done at the iptables level, so be sure to REJECT the SYN packets (not DROP!) if you want an accurate account of the speed of the failover in your web browser during a real-life outage. Also ensure that your router will send proper ICMP Host Unreachable responses if one of the addressed hosts is offline.
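For anyone wanting to reproduce the REJECT-versus-DROP test, the sketch below prints (rather than runs) a pair of iptables commands that add and later remove a rejecting rule. The function name is hypothetical, and the choice of port 443 and of a TCP reset as the reject type are assumptions on my part; a reset is roughly what a browser sees from a host that is up with no web server listening, so it should move on to the next A record right away.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) the iptables commands that add and
# later remove a rule rejecting new connections to the web port.
print_failover_test_rules() {
    local port="${1:-443}"
    local rule="INPUT -p tcp --dport ${port} --syn -j REJECT --reject-with tcp-reset"
    echo "iptables -I ${rule}"
    echo "iptables -D ${rule}"
}

print_failover_test_rules 443
```

Run the first printed command as root on the node you want to "fail", reload the page, then run the second to restore service.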

I think that’s pretty much all you need to get going. This deployment has been running for a little while now in a production environment, and has been rock-solid. It’s a bit more work than the official solution, but it solved my monitoring needs and extensive testing, both real-world and artificial, has not yet revealed any issue with this solution.

Helping to fix Wall Street

This combines two of my favorite things!! Ok, maybe not FAVORITE per se, but recently I’ve been writing a lot of nagios plugins. And recently I’ve been hearing a lot about how the economy has been doing rather poorly. It’s as if the people running the economy need something to monitor it and make sure it’s doing okay.

And thus it was born: [the nagios check_economy plugin].

Right now it just monitors the Dow Jones Industrial Average and reports back on a range that you provide to it. My range at the current moment is warning if <=9000, critical if <=8500. The script will break if the people I'm mooching the stock data from redesign their website. Actually, you could probably even modify this script such that it checks other stock symbols besides DJI. The possibilities are endless! Dependencies include the curl package, an internet connection, and some time to waste. @mstarr, you may wish to file this one under humor. 🙂
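The threshold part of a check like this is small enough to sketch. What follows is not the plugin's actual code, just an illustration of the "warning if <=9000, critical if <=8500" logic using the standard Nagios plugin exit codes; `index_status` is a hypothetical name, and fetching the index value is left out entirely.

```shell
#!/bin/bash
# Hypothetical threshold logic using the standard Nagios plugin exit codes:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
index_status() {
    local value="${1%.*}" warn="$2" crit="$3"   # drop any decimal part
    case "$value" in
        ''|*[!0-9]*) echo "UNKNOWN - could not parse index value"; return 3 ;;
    esac
    if [ "$value" -le "$crit" ]; then
        echo "CRITICAL - DJIA at $1"; return 2
    elif [ "$value" -le "$warn" ]; then
        echo "WARNING - DJIA at $1"; return 1
    fi
    echo "OK - DJIA at $1"; return 0
}

# Example: index_status 8765.43 9000 8500   (prints a WARNING line, returns 1)
```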

For the impatient: [Download check_categorized_updates now]

I’ve been writing a lot of nagios plugins lately, and here’s the newest of the group. After googling around, I wasn’t able to find any nagios plugins that would support checking if, in the list of available packages, there were any that fell under the category of “security updates”. You know, like how PackageKit organizes the security updates.

I also decided to take the script one step further. You can specify required packages with the “-r” flag, and if they are found in the possible updates list, even for a feature enhancement, the plugin will report the system as “critical”. Otherwise, it reports as Warning.

Please do note in using this plugin that I parse out the metadata that yum prints by seeing if the output is greater than three lines. This will most definitely change from place to place. Also, this utility requires yum to be installed and it’s been designed on a RHEL/Fedora system. Updating it to use apt-get or the sun updating mechanism shouldn’t be terribly difficult, though, just a matter of changing the grep patterns.

So if the packages “kernel.x86_64” and “libtiff.x86_64” are available to be updated, and libtiff is a security update, here’s what the various combinations of options will return:

# ./check_updates => Warning
# ./check_updates -s => Critical

# yum update libtiff

# ./check_updates => Warning
# ./check_updates -s => Warning
# ./check_updates -r kernel.x86_64 => Critical
# ./check_updates -s -r kernel.x86_64 => Critical

# yum update kernel

# ./check_updates => Ok
# ./check_updates -s => Ok
# ./check_updates -r kernel.x86_64 => Ok
# ./check_updates -s -r kernel.x86_64 => Ok
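Read as a decision table, the results above boil down to three rules: nothing pending is Ok, a pending security update (with -s) or a pending required package (with -r) is Critical, and anything else is Warning. The sketch below is a reconstruction from that table only, not the plugin's real code; `updates_status` and its count-based arguments are hypothetical, and parsing the yum output is assumed to have happened already.

```shell
#!/bin/bash
# Hypothetical reconstruction of the decision table above; the real plugin
# may be structured differently. Arguments are counts the caller has already
# derived from the yum output:
#   $1 = total pending updates
#   $2 = pending security updates, or 0 when -s was not given
#   $3 = pending updates matching a -r package, or 0 when -r was not given
updates_status() {
    local total="$1" security="$2" required="$3"
    if [ "$total" -eq 0 ]; then
        echo "OK"; return 0
    elif [ "$security" -gt 0 ] || [ "$required" -gt 0 ]; then
        echo "CRITICAL"; return 2
    fi
    echo "WARNING"; return 1
}
```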

A better Nagios SNMP plugin

The nagios plugin that you find in the package nagios-plugins-snmp was insufficient for my needs in a new nagios deployment. The biggest reason was that it fetches an integer value and can then only issue a warning or critical alert if that value is GREATER than the one you gave it. The deployment I’m setting up required that values be checked against a range, which is necessary when receiving SNMP data from a thermometer or hygrometer. This plugin accepts a range of values to check against, and then returns the appropriate exit code.
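To make the idea of a range check concrete, here is a minimal sketch. It is not the wrapper's actual code: `range_status` is a hypothetical name, the SNMP query itself is left out, and treating every out-of-range reading as critical is an assumption, since the plugin's argument order and severity handling are only documented in its usage statement.

```shell
#!/bin/bash
# Hypothetical range check for an integer reading (for example a temperature).
# Exit codes follow the Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).
range_status() {
    local value="$1" low="$2" high="$3"
    # Accept only digits with an optional leading minus sign.
    case "$value" in
        ''|-|*[!0-9-]*|?*-*) echo "UNKNOWN - non-numeric value: $value"; return 3 ;;
    esac
    if [ "$value" -lt "$low" ] || [ "$value" -gt "$high" ]; then
        echo "CRITICAL - $value outside $low..$high"; return 2
    fi
    echo "OK - $value within $low..$high"; return 0
}

# Example: range_status 21 18 27   (a reading of 21 against the range 18..27)
```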

I’ve written a tad bit of documentation in the top of the file, but here it is again, in block quotes! Download link is just below the block quote area.

# This script written with haste by Benjamin Rose, July 8th 2009 @ 11:45:42 AM
# It was written because the check_snmp plugin provided by the nagios package
# does not support range matching. It can check if the snmp value is greater than
# a given number but not less than, nor a range consisting of either a high value
# or a low value. Hence, this script, given a mode of 1 and a good range with
# which to work, will report back appropriately.
#
# Modes:
# 1 = Number comparison, reports on a given range. Argument order given in
# the usage statement.
# 2 = String comparison, which for now is just “Open” or otherwise.
#
# TODO:
# 1) Change the order of the arguments, putting mode in front of the
# variables, and then change the usage based on the given mode.
# 2) Allow the user to configure which strings are “good” and which are “bad”.

Plugin link: [snmp_plugin_wrapper]