nanog mailing list archives

Re: recommendations for external monitoring services?


From: Mark Gauvin <MGauvin () dryden ca>
Date: Tue, 13 Dec 2011 19:43:35 -0600

SolarWinds: you send in the specific MIB you need to monitor, and a
week later it's in general release.


Sent from my iPhone

On 2011-12-13, at 7:11 PM, "Robert Brockway" <robert () timetraveller org> wrote:

On Mon, 12 Dec 2011, Eric J Esslinger wrote:

I'm not looking to monitor a massive infrastructure: 3 web sites, 2 mail
servers (pop, imap, submission port, https webmail), 4 dns servers
(including lookups to ensure they're not listening but not talking), and
one inbound mx. A few network points to ping to ensure connectivity
throughout my system. Scheduled notification windows (for example,
during work hours I don't want my phone pinged unless it's everything
going offline. Off hours I do. Secondary notifications if problem
persists to other users, or in the event of many triggers. That sort of
thing). Sensitivity settings (If web server 1 shows down for 5 min,
that's not a big deal. Another one if it doesn't respond to repeated
queries within 1 minute is a big deal) A Weekly summary of issues would
be nice. (especially the 'well it was down for a short bit but we didn't
notify as per settings') I don't have a lot of money to throw at this. I

Hi Eric.  The feature set you are describing should be in any monitoring
system worthy of the name.  I've used Nagios to good effect for the best
part of the last 12 years or so.  Before that I used Big Brother, which
sucked in various ways.
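
A minimal sketch of the notification rules Eric describes (a per-service
grace period plus quiet work hours), in Python.  This is not how Nagios
or any other tool implements it; the names Rule, should_page and the
09:00-17:00 weekday window are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class Rule:
    """Hypothetical per-service alerting rule (names are illustrative)."""
    service: str
    grace: timedelta          # how long the service may be down before paging
    page_in_work_hours: bool  # False = stay quiet during the work day


# Assumed work hours; the original post does not specify them.
WORK_START, WORK_END = time(9, 0), time(17, 0)


def in_work_hours(now: datetime) -> bool:
    # weekday(): Monday is 0, Friday is 4
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def should_page(rule: Rule, down_since: datetime, now: datetime,
                everything_down: bool = False) -> bool:
    """Decide whether to page for a service that is currently failing checks."""
    if now - down_since < rule.grace:
        return False   # still inside the tolerated outage window
    if everything_down:
        return True    # a total outage pages regardless of the time of day
    if in_work_hours(now) and not rule.page_in_work_hours:
        return False   # suppressed; record it for the weekly summary instead
    return True
```

With a 5 minute grace period for "web server 1" and a 1 minute grace
period for a more critical server, the same function covers both of the
sensitivity cases in the quoted message.  Suppressed events would still
need to be logged somewhere for the weekly summary.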

I did an evaluation on a wide variety of FOSS monitoring systems 2-3
years ago and Nagios won at the time (again).  Generally I found the
alternatives had problems that I considered to be quite serious (such as
being overly complicated or doing checks so frequently that they loaded
the systems they were supposed to be monitoring[1]).

I'm currently trialing Icinga, a fork of Nagios.

Puppet can be set up to manage Nagios/Icinga config which cuts down on
the admin overhead.

Nagios/Icinga can be hooked up to Collectd to provide performance data
as well as alert monitoring.

One concern about external monitoring services is the level of
visibility they need to have into your network to monitor it adequately.

My recommendation is to do a proper risk assessment on the available
options.

DO have detailed internal monitoring of our systems but sometimes that
is not entirely useful, due to the fact that there are a few 'single
points of failure' within our network/notification system, not to
mention if the monitor itself goes offline it's not exactly going to be
able to tell me about it. (and that happened once, right before the mail
server decided to stop receiving mail).

There are a couple of ways to deal with this.  Some monitoring
applications can fail-over to a standby server if the primary fails.
But this isn't even really necessary.  You will arguably gain higher
reliability by running multiple _independent_ monitors and having them
monitor each other[2].  I have often used this approach.

The principal aim here is to guarantee that you are alerted to any
single failure (a production service, system or a monitor).  Multiple
simultaneous failures could still produce a blackspot.  It is possible
to design a system that will discover multiple simultaneous failures,
but it takes more effort and resources.
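
One way to reason about this is to write down who watches whom and check
which failure combinations would go unreported.  The Python sketch below
does that under stated assumptions: the `watchers` mapping, the node
names and the function blind_spots are all hypothetical, and a surviving
monitor is assumed to always be able to deliver its alert.

```python
from itertools import combinations


def blind_spots(watchers, k=1):
    """List every set of k simultaneous failures that leaves a failure unreported.

    `watchers` maps each node (service or monitor) to the monitors that
    check it.  A failed node is unreported when every monitor watching
    it has failed too (or when nothing watches it at all).
    Returns a list of (failed_nodes, unreported_nodes) pairs.
    """
    nodes = set(watchers)
    for monitors in watchers.values():
        nodes |= set(monitors)

    result = []
    for failed in combinations(sorted(nodes), k):
        failed_set = set(failed)
        unreported = sorted(
            node for node in failed
            if not (set(watchers.get(node, ())) - failed_set)
        )
        if unreported:
            result.append((failed, unreported))
    return result


# Hypothetical layout: mon_a watches the services, and the two
# independent monitors watch each other.
watchers = {
    "web": {"mon_a"},
    "mail": {"mon_a"},
    "mon_a": {"mon_b"},
    "mon_b": {"mon_a"},
}

single = blind_spots(watchers, k=1)   # empty: every single failure is seen
double = blind_spots(watchers, k=2)   # includes mon_a failing along with mail
```

In this layout no single failure goes unnoticed, but mon_a failing
together with the mail server leaves the mail outage unreported, which
is the kind of blackspot described above.  Having mon_b also watch the
services directly would close that gap, at the cost of more checks.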


[1] Sometimes I wonder if the people developing certain systems have any
operational experience at all.

[2] A system designed to fail-over on certain conditions may fail to
fail-over, ah, so to speak.

Cheers,

Rob

-- 
Email: robert () timetraveller org        Linux counter ID #16440
IRC: Solver (OFTC & Freenode)
Web: http://www.practicalsysadmin.com
Director, Software in the Public Interest (http://spi-inc.org/)
Free & Open Source: The revolution that quietly changed the world
"One ought not to believe anything, save that which can be proven by
nature and the force of reason" -- Frederick II (26 December 1194 – 13
December 1250)
