Setting up fully redundant failover nagios servers
Posted by brose on October 4, 2010
Haven’t updated in a while, so I’ll write a good post to make up for it.
Recently, I’ve encountered the need to set up fully redundant nagios servers in a typical pair setup. However, after reading the documentation [here], the official solutions seemed lacking. The official method is simply to run two different machines with the same configuration. The master does everything a normal nagios server should do, while the slave has its host & service checks and notifications disabled. A separate mechanism is then set up so that the slave can “watch” the master and enable the aforementioned features if the master goes down. The slave, still watching, then disables those features again when the master comes back alive.
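To make that concrete, under the official scheme the slave ships the same object configuration but flips a few top-level switches in nagios.cfg, roughly like this (an illustrative fragment, not quoted from the official docs):

# Slave's nagios.cfg under the official failover scheme (illustrative)...
execute_host_checks=0
execute_service_checks=0
enable_notifications=0

The slave’s own watchdog is then expected to turn those back on (via external commands or a restart) when the master disappears.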
Well, this solution sounds just fine in theory, but in practice it creates more problems than it solves. For instance, acknowledgements, scheduled downtimes, and comments do NOT synchronize under the official method. Its mechanism cannot allow for this, as it relies on the obsessive-compulsive service and host processor commands, which are executed after every single check runs. Those commands have no access to the comment/acknowledgement data, so they simply cannot synchronize it.
So can this data be synchronized? The short answer is yes, but I’ll explain the game plan before we dive in. Nagios can (unknowingly) provide us with another synchronization method: its own internal retention.dat file. By default, this file is written out every 90 minutes by the nagios process, and it contains all of the information necessary to restore nagios to the state it was in when it exited. Sounds like exactly what we need! So we’re going to stop thinking about running nagios, and start thinking about how we can take this blob of ASCII data, ensure it never gets corrupted, and keep it updated as frequently as possible. That is, after all, the true goal here.
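If you have never looked inside it, retention.dat is just a series of brace-delimited blocks of key=value pairs, one block per host, service, comment, downtime, and so on. An abridged, made-up excerpt to give you the flavor (the real file carries many more fields per block):

host {
host_name=web01
current_state=0
problem_has_been_acknowledged=0
}
service {
host_name=web01
service_description=HTTP
current_state=2
}

That brace-per-block layout is worth remembering; it gives us a cheap sanity check later on.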
First and foremost, we will need a nagios installation! You can follow their documentation on this one and set one up for yourself. It’s ok, I can wait.
Second, we need another nagios installation on a second server! Hop to it.
Third, all of the nagios configuration files need to be constantly synchronized between these two hosts. I use puppet to synchronize config files across the servers I administer, so unfortunately my implementation is highly specific. You may need to find another way to synchronize your config files, but this should not be terribly difficult (perhaps a Makefile with a versioning repository and ssh keys?). This is needed in the official nagios failover deployment as well. Anyway, one “gotcha” I faced was the need to change the configuration parameter “retain_state_information=90” to “retain_state_information=1”. Do not forget to do this, or else synchronizations will only occur once every 90 minutes.
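If you are not a puppet user, a dumb cron-driven push works too. Here is a minimal sketch, assuming rsync over ssh and an rhel/epel-style /etc/nagios layout; the peer hostname and paths are placeholders, so adjust them to your install:

#!/bin/bash
# Hypothetical config-push helper (not part of the deployment described below).
# Run it from whichever node you edit configs on.
PEER=nagios2.example.com
CONFDIR=/etc/nagios/

# Refuse to ship a broken configuration...
nagios -v /etc/nagios/nagios.cfg >/dev/null || exit 1

# ...then mirror the configuration tree to the peer.
rsync -az --delete -e ssh ${CONFDIR} root@${PEER}:${CONFDIR}

However you do it, the only real requirement is that both installations always agree on the object configuration; remember that nagios only rereads its configuration on a restart or reload.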
Fourth, you will need to deploy this script on both hosts and configure its requirements. You will see embedded ERB syntax in this script; that is because puppet allows me to handle per-host differences inline, as the final configuration files are generated on the fly and then pushed to the clients.
[root@puppet ~]# cat /usr/bin/nagios-watchdog.sh.erb
#!/bin/bash

# Executable variables. Useful.
RM="/bin/rm -f"
MV="/bin/mv"
ECHO="/bin/echo -e"
FIXFILES="/sbin/fixfiles"
MAILER=/usr/sbin/sendmail
SUBJECT="URGENT: nagios master process switch has taken place."
RECIPIENT="sysadmin@example.com"
SERVICE=/etc/init.d/nagios
RETENTIONFILE=/var/log/nagios/retention.dat

# This is where we point the servers at each other (configure this properly in your deployment!)...
<% if fqdn == "nagios1.example.com" %>
MASTERHOST=192.168.1.2
<% else %>
MASTERHOST=192.168.1.1
<% end %>

# Ensure only one copy can run at a time...
PIDFILE=/var/run/nagios-watchdog.pid
if [ -e ${PIDFILE} ]; then
    exit 1;
else
    touch ${PIDFILE};
fi

# Checks the actual daemon status on the other host...
su nagios -c "ssh ${MASTERHOST} \"/etc/init.d/nagios status\" >/dev/null 2>&1"

# Is the other host doing all the work?
if [ $? -eq 0 ]; then
    # Stop what I'm doing...
    ${SERVICE} stop >/dev/null 2>&1

    # Copy the retention data from the other nagios process...
    su nagios -c "scp ${MASTERHOST}:${RETENTIONFILE} /tmp/";

    # Verify that we didn't get a corrupted copy...
    if [ `grep "{" /tmp/retention.dat | wc -l` -eq `grep "}" /tmp/retention.dat | wc -l` ]; then
        ${MV} /tmp/retention.dat ${RETENTIONFILE};
    else
        ${RM} /tmp/retention.dat;
    fi
    ${FIXFILES} restore /var/log/nagios
else
    ${SERVICE} status >/dev/null 2>&1
    if [ $? -ne 0 ]; then
        ${ECHO} "From: nagios-watchdog@`hostname`\nSubject: ${SUBJECT}\nTo: ${RECIPIENT}\nNow running on host: `hostname`" | ${MAILER} ${RECIPIENT};
        ${SERVICE} start >/dev/null 2>&1;
    fi
fi

${RM} ${PIDFILE}
exit 0;
There is a single requirement for this script: you must give passwordless ssh keys to the nagios accounts on each host. You can still use those keys securely, though, by way of the forced-command (“command=”) option in the authorized_keys file.
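For instance, something along these lines restricts the key to exactly the two remote operations the watchdog performs. This is only a sketch: the wrapper name is made up, and the scp case assumes the classic scp transfer rather than the newer sftp-based one:

[root@nagios1 ~]# cat ~nagios/.ssh/authorized_keys
command="/usr/local/bin/watchdog-gate.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAA...rest-of-public-key... nagios@peer

[root@nagios1 ~]# cat /usr/local/bin/watchdog-gate.sh
#!/bin/bash
# Only permit the two commands the watchdog script issues; refuse everything else.
case "$SSH_ORIGINAL_COMMAND" in
    "/etc/init.d/nagios status")
        exec /etc/init.d/nagios status
        ;;
    "scp -f /var/log/nagios/retention.dat")
        exec scp -f /var/log/nagios/retention.dat
        ;;
    *)
        echo "rejected: $SSH_ORIGINAL_COMMAND" >&2
        exit 1
        ;;
esac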
Fifth, we must implement a mutex operation around the running nagios processes. Recall that we are synchronizing copies of nagios’ internal state data, and having a running nagios process is just a luxury. If you look at the script above, it simply ensures that nagios is running on one, but not both, of the servers, and that the newest retention.dat file always has priority. The mutex operation doesn’t need to be infinitely precise; I used the following relatively barbaric solution:
[root@nagios1 ~]# crontab -l
1,5,9,13,17,21,25,29,33,37,41,45,49,53,57 * * * * /usr/bin/nagios-watchdog.sh
0 6 * * * /usr/bin/nagios-reports.sh
0 12 * * * /usr/bin/nagios-reports.sh

[root@nagios2 ~]# crontab -l
3,7,11,15,19,23,27,31,35,39,43,47,51,55,59 * * * * /usr/bin/nagios-watchdog.sh
0 6 * * * /usr/bin/nagios-reports.sh
0 12 * * * /usr/bin/nagios-reports.sh
Sixth, and optionally: as you can see above, I’ve also set up redundant reporting. We do a similar test to ensure that, at most, one email report is dispatched for a given timeframe. In this solution, reports could theoretically be lost forever if a specific set of circumstances is met, but that was deemed acceptable in this deployment. Here is the real magic behind that script:
[root@puppet ~]# cat /var/lib/puppet/files/nagios/usr/bin/nagios-reports.sh.erb
#!/bin/bash

/etc/init.d/nagios status >/dev/null 2>&1
if [ $? -eq 0 ]; then
    /usr/bin/nagios-reporter.pl --type=24 --embedcss --warnings
fi
And voilà. Of course, nagios-reporter.pl could be any report-generation tool you wish; just be sure to call it in whatever way suits your reporting needs.
Seventh, and convenient to have: I also wrote these two quick PHP scripts. Throw them in /var/www/html on each nagios box, and do not redirect straight to nagios. Then set up DNS in a round-robin, multiple-A-record fashion. That is,
[root@puppet ~]# dig +short nagios.example.com
192.168.1.1
192.168.1.2
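On the DNS side this is nothing more than two A records for the same name. A minimal fragment, assuming a BIND-style zone whose origin is example.com (addresses mirror the dig output above):

; round-robin pair for the monitoring UI
nagios    IN    A    192.168.1.1
nagios    IN    A    192.168.1.2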
Once you get that set up, insert these two files into the aforementioned directories:
[root@puppet ~]# cat /var/lib/puppet/files/nagios/var/www/html/index.php.erb | sed 's/</\</g'
<HTML>
<HEAD>
<TITLE>Internal Redirect</TITLE>
</HEAD>
<FRAMESET ROWS="30,100%" BORDER="1" STYLE="border-style:solid" noresize>
<FRAME SRC="switcher.php" NAME="switcher"/>
<?php
// This will set the $me and $you variables correctly...
$me = "<%= fqdn %>";
if($me == "nagios1.example.com")
{ $you = "nagios2.example.com"; }
else
{ $you = "nagios1.example.com"; }

# Test whether or not nagios is running locally.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://localhost/nagios/cgi-bin/status.cgi");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
$output = curl_exec($ch);
curl_close($ch);

$pos = strpos($output, "Whoops!");
if($pos === false)
{ echo("<FRAME SRC=\"https://$me/nagios/\" NAME=\"activenode\"/>"); }
else
{ echo("<FRAME SRC=\"https://$you/nagios/\" NAME=\"activenode\"/>"); }
?>
</FRAMESET>
</HTML>
[root@puppet ~]# cat /var/lib/puppet/files/nagios/var/www/html/switcher.php.erb | sed 's/</\</g'
<HTML>
<HEAD>
<TITLE>Switcher</TITLE>
</HEAD>
<BODY>
<CENTER>
<FONT SIZE="-1">
<?php
$me = "<%= fqdn %>";
if($me == "nagios1.example.com")
{ $you = "nagios2.example.com"; }
else
{ $you = "nagios1.example.com"; }

# Test whether or not nagios is running locally.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://localhost/nagios/cgi-bin/status.cgi");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
$output = curl_exec($ch);
curl_close($ch);

$pos = strpos($output, "Whoops!");
if($pos === false)
{ $current = $me; }
else
{ $current = $you; }

echo("Currently using: $current [<a href=\"javascript:parent.document.location.reload(true)\">Redetect active node</a>]");
?>
</FONT>
</CENTER>
</BODY>
</HTML>
So now, you can simply visit http://nagios.example.com and nagios will always be displayed, unless a particularly bizarre set of circumstances has occurred. If that happens, don’t panic! Remember, we set our minds correctly at the beginning: the integrity of the retention.dat files is not in question, and the scripts may just take a minute or two to sort themselves out. For those who worry that the DNS failover wouldn’t work, I’ve verified that it does on some of the popular browsers, and there is no 90-second timeout delay in all but the rarest circumstances. I verified that a timeout can occur if the first connect() call’s SYN packets are dropped completely, but that is the aforementioned rare circumstance. Most of my testing was done at the iptables level; be sure to REJECT the SYN packets (not DROP!) if you want an accurate account of how fast the failover feels in your web browser during a real-life outage, and ensure that your router will send proper ICMP Host Unreachable responses if one of the addressed hosts is offline.
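If you want to reproduce that test, the idea is simply to make one node actively refuse connections instead of silently eating them. A rough sketch, run on the node you are “failing” and assuming the web UI listens on port 443 (remember to remove the rule afterwards):

# Simulate a dead node: answer incoming HTTPS SYNs with host-unreachable instead of dropping them...
iptables -I INPUT -p tcp --dport 443 --syn -j REJECT --reject-with icmp-host-unreachable
# ...browse to https://nagios.example.com, watch the failover, then take the rule back out.
iptables -D INPUT -p tcp --dport 443 --syn -j REJECT --reject-with icmp-host-unreachable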
I think that’s pretty much all you need to get going. This deployment has been running for a little while now in a production environment and has been rock solid. It’s a bit more work than the official solution, but it solved my monitoring needs, and extensive testing, both real-world and artificial, has yet to reveal any issues with it.
17 comments
Comment by Brian Epstein on October 5, 2010 at 11:12 am
This is great. Definitely makes the most sense for an HA pair of nagios servers. I definitely recommend this to anyone running Nagios.
Thanks brose,
ep
Comment by Mary Starr on October 6, 2010 at 4:22 pm
Hi Ben, I just posted your howto to the Nagios Community site. Thanks for your hard work…we appreciate it and so does the rest of the community!
Pingback by Barracuda Networks Barracuda Load Balancer 640 with 1yr Energize Updates | barracuda spam appliance on October 10, 2010 at 11:21 pm
[…] Setting up fully redundant failover nagios servers :All My Base […]
Comment by Bryce on October 12, 2010 at 2:29 pm
Wow Ben, that’s some great work. I love the testing that you did, really sounds like it’s the proper way to do it.
Thanks
Comment by Dale Stubblefield on November 21, 2010 at 12:55 pm
Excellent ideas.
Comment by Dave on December 20, 2010 at 1:24 pm
Brose, I am using rsync on a cron job to sync my config files…
When I try to use the watchdog script, running it as a shell script I get:
./nagios-watchdog.sh.erb
./nagios-watchdog.sh.erb: line 15: syntax error near unexpected token `newline’
./nagios-watchdog.sh.erb: line 15: `’
If I call the script with ERB, I get:
erb ./nagios-watchdog.sh.erb
./nagios-watchdog.sh.erb:15: undefined local variable or method `fqdn’ for main:Object (NameError)
Is there something I am not seeing here?
This is CentOS 5.4 and Ruby 1.8.5 and erb 2.0.4
Thanks
Comment by brose on December 20, 2010 at 1:41 pm
Dave,
The embedded ERB syntax is parsed by Puppet in my environment, then the parsed files are distributed to the hosts. If you are not using Puppet, you will need to manually fill in the sections with the appropriate IP addresses. That is, this script on nagios1 should have MASTERHOST=nagios2, and nagios2 should have MASTERHOST=nagios1.
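For example, the top of a hand-maintained copy on nagios1 would collapse to something like this (and the mirror image on nagios2):

# nagios1's copy, filled in by hand (no ERB)...
MASTERHOST=192.168.1.2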
Comment by Jason on April 14, 2011 at 4:32 pm
Is this just for Core, or will this work for Nagios XI using databases for configs and NDOutils?
Comment by brose on April 14, 2011 at 4:50 pm
Jason,
All of my testing and work has been with Nagios Core, the free open-source version available from rhel/epel and the like. However, as long as Nagios XI still uses the retention.dat mechanism for restoring its exit state on startup, this should work. In fact, it may be even easier, as it sounds like you can point them both at the same database for config, and don’t need to worry about syncing with puppet or some such. I cannot provide any information on NDOutils or any plugins; it all depends on whether or not they store their retention information in retention.dat. There is no reason why my script cannot be modified to scp multiple files, though. For example, I have extended it in our environment to also copy over the logs/ directory. This syncs historical data, making end-of-year reports reliable.
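Roughly, the extra copy step sits next to the existing retention.dat copy; here is a sketch using rsync for the directory (the archives path assumes the rhel/epel layout, so adjust it for yours):

# Also pull over the rotated log archives so historical reports survive a switch...
su nagios -c "rsync -a ${MASTERHOST}:/var/log/nagios/archives/ /var/log/nagios/archives/";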
Comment by Aaron on May 2, 2011 at 5:57 pm
This guide was very helpful!
One addition I made was to check for running nagios by using exec ps -C nagios3 instead of the curl operation in your script. Both ways seem to work well.
Comment by brose on May 3, 2011 at 9:32 am
Aaron,
Thanks! Glad it was of use. I ended up going with the curl route as it did not require me to compile any custom SELinux modules. If you do the ps method, you will need to somehow enable httpd access to read the process table. You are correct, however – both ways will work well.
Comment by Michael Edwards on July 28, 2011 at 1:05 pm
As of 3.2.3 there is a separate option called "retention_update_interval" in addition to the retain_state_information option mentioned in this blog posting.
This is what I ended up with after making some changes for greater compatibility with my somewhat-default nagios install, as well as RHEL3/5 compatibility.
#!/bin/bash
# Executable variables. Useful.
RM="/bin/rm -f"
MV="/bin/mv"
ECHO="/bin/echo -e"
FQDN="/bin/hostname --fqdn"
FIXFILES=”/sbin/fixfiles”
MAILER=/usr/sbin/sendmail
SUBJECT="URGENT: nagios master process switch has taken place."
RECIPIENT="isadmin@vtls.com"
SERVICE=/etc/init.d/nagios
RETENTIONFILE=/usr/local/nagios/var/retention.dat
# This is where we point the servers at each-other (configure this properly in your deployment!)
#This should be for the other server of the pair
MASTERHOST=10.0.0.2
# Ensure only one copy can run at a time
PIDFILE=/var/run/nagios-watchdog.pid
if [ -e ${PIDFILE} ]; then
exit 1;
else
touch ${PIDFILE};
fi
# Checks the actual daemon status on the other host
#echo "su - nagios -c \"ssh ${MASTERHOST} '/etc/init.d/nagios status'\""
su - nagios -c "ssh ${MASTERHOST} /etc/init.d/nagios status"
#>/dev/null 2>&1
# Is the other host doing all the work?
if [ $? -eq 0 ]; then
# Service running on MASTERHOST. Stop my service so there is only one.
#echo “Nagios running on MASTERHOST”
#echo ” ${SERVICE} stop ”
${SERVICE} stop >/dev/null 2>&1
# Copy the retention data from the other nagios process
#echo "su nagios -c \"scp ${MASTERHOST}:${RETENTIONFILE} /tmp/\""
su - nagios -c "scp ${MASTERHOST}:${RETENTIONFILE} /tmp/";
# Verify that we didn't get a corrupted copy
if [ `grep "{" /tmp/retention.dat | wc -l` -eq `grep "}" /tmp/retention.dat | wc -l` ]; then
${MV} /tmp/retention.dat ${RETENTIONFILE};
else
${RM} /tmp/retention.dat;
fi
#${FIXFILES} restore /var/log/nagios
else
# echo “Service not running on MASTERHOST”
${SERVICE} status >/dev/null 2>&1
if [ $? -ne 0 ]; then
# echo “Service not running here either. Sending notification.”
${ECHO} "From: nagios-watchdog@`hostname`\nSubject: ${SUBJECT}\nTo: ${RECIPIENT}\nNow running on host: `hostname`" | ${MAILER} ${RECIPIENT};
# echo “Starting nagios on localhost.”
${SERVICE} start >/dev/null 2>&1;
fi
fi
${RM} ${PIDFILE}
exit 0;
Comment by Pedro Albuquerque on October 14, 2011 at 12:52 pm
Hi,
I think "retain_state_information=1" just tells Nagios to retain information across a shutdown, not to sync every minute.
In Nagios Core 3.2.3, there is the variable "retention_update_interval", which determines how often (in minutes) Nagios will automatically save retention data during normal operation.
Has anyone figured out yet which variable should be configured to retain information every minute?
Cheers.
Comment by Praveen Diwakar on August 24, 2016 at 5:07 am
hi
I have been trying this HA solution for quite a while but I am getting the following error.
I have configured the two nagios servers as instructed above, and I have been able to sync the retention.dat file. My problem is that I am using it on a private network, meaning private IPs [192.168.x.x], and when I run the nagios-watchdog.sh.erb script it gives me a "host key verification error".
I have removed the known_hosts files but still get the same error. I am able to ssh to both servers without a password.
Please give me a solution to this!
Thanks
Comment by ilie dumitru on January 9, 2017 at 3:36 am
I did this some years ago, in the following way:
Behind nagios sit httpd and mysql, and that is the key to the solution.
A virtual ip address is positioned on the master.
The mysql slave receives data every 2 seconds from the master's bin-log (mysql replication).
A perl script watches the master to know the state of its services (ip up, httpd, mysqld, etc.) and, if there are problems, connects to the slave, restarts the services, raises the virtual ip address, changes my.cnf, and restarts the mysql server as master. The Nagios services are restarted afterwards.
The perl script, positioned on a third server, can connect with a public key declared in .ssh/authorized_keys.
The nrpe conf files of all watched machines must accept the 2 addresses of the nagios servers (master and slave).
It works fine.
Comment by bish on February 1, 2017 at 3:27 pm
‘cat | sed ‘ ?
Comment by Jorge on June 5, 2017 at 4:23 pm
Worked great!!
Thank you =)