<p>This morning, before getting ready for the day, I took a few moments as usual to check the status of the monitoring, expecting DZ-ARN-01 to have reached “green” Availability and Reliability. This was indeed the case, but I noticed two things when poking about:</p>
- MA-01-CNRST had 11 alarms against it - this looks like some kind of drastic failure at the site, perhaps an electrical outage over the weekend.
- The nagios messages were not being sent to Slack, even though I had spent some time last week configuring this.
Somewhat miffed, I went to take a look at the nagios probes running on the AAROC nagios instance, muttering unspeakable epithets under my breath. Lo, I was greeted with
Damn right I believe this to be in error !</figcaption>
Horrified, I immediately thought that there had been a break-in. However, after a quick check it was clear that everything was in order, except for the fact that
Filesystem Size Used Avail Use% Mounted on
xxx G xxx G 0 100% /
/dev/hda1 xxx M xxx M xxx M xxx % /boot
tmpfs xxx G xxx xxx G xxx % /dev/shm
Yep, that'll do it...
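For anyone who would rather catch this before nagios falls over, here is a minimal sketch of the same check in Python. It is not what runs on the AAROC instance; the function names and the 90% threshold are my own assumptions, chosen only for illustration.

```python
import shutil


def percent_used(path="/"):
    """Return the in-use percentage of the filesystem holding `path`.

    Computed the way df reports Use%: used / (used + available),
    so blocks reserved for root are not counted as available.
    """
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    if denominator == 0:
        # Pseudo-filesystems can report zero sizes; treat them as empty.
        return 0.0
    return 100.0 * usage.used / denominator


def is_nearly_full(path="/", threshold=90.0):
    """True when usage is at or above `threshold` percent (hypothetical default)."""
    return percent_used(path) >= threshold


if __name__ == "__main__":
    print("/ is {:.0f}% full".format(percent_used("/")))
```

Run from cron or wrapped as a probe, something like this would have flagged the root filesystem well before it hit 100%.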
Hulk smash full disk ! Actually wait, Hulk diligently clean full disk
A quick check showed that it was the rejected messages directory in
/var/spool/nagios2metricstore that was the offending culprit. I cleaned this out and restarted stuff. We're still not getting messages into Slack, but at least the monitoring is back up.
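The "quick check" can be sketched roughly as below, assuming all you want is to see which subdirectory of a spool is eating the space. The helper names `dir_size` and `largest_children` are hypothetical, and the script only reports sizes; it deliberately deletes nothing.

```python
import os


def dir_size(path):
    """Sum the sizes in bytes of all files under `path`, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # A file may vanish while we walk a busy spool; skip it.
                pass
    return total


def largest_children(parent, top=5):
    """Return up to `top` (size_in_bytes, name) pairs for subdirectories, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size(entry.path), entry.name))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    # The spool path is the one mentioned in the post.
    for size, name in largest_children("/var/spool/nagios2metricstore"):
        print("{:>15d}  {}".format(size, name))
```

Pointing it at the spool directory should make the offending subdirectory obvious at a glance; what to do with the contents is still a judgment call for whoever is on duty.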
Great way to start a Monday
Tagged with operations • nagios • monitoring • status • downtime
This is a companion discussion topic for the original entry at http://www.africa-grid.org//blog/2015/04/27/nagios-failures/