nanog mailing list archives

Re: Monitoring service that has a human component?


From: Heath Jones <hj1980 () gmail com>
Date: Fri, 7 Dec 2018 10:05:00 +1100

Hi David - Just a bit of insight from my own experience:

Common issues I see when monitoring (and the associated escalation process)
isn't working, with symptoms similar to what you described:
- Inconsistent HTTP response codes across services and service layers
(nginx vs the backend Tomcat), which means you can't rely on them for
alerting.
- Monitoring on arbitrary metrics (90% of something) as opposed to metrics
linked to an actual outcome (response times for example).
- No runbook in place (e.g. which setting an engineer changes to switch
maintenance mode on/off).
- No central view of which engineer is doing what to which systems.

A fairly simple example of where I've seen things work pretty well:
- Organisation uses HTTP code monitoring, alerting on 5xx but not 503.
- Services are configured (and tested!) to return other, specific 5xx
errors, keeping 503 for a 'known and expected maintenance' mode.
- Runbook in place: let other engineers know what's happening (Slack
message for example), then put up the maintenance page on the reverse proxy.
- Monitor and report on the common 90% metrics (disk space, memory), but
don't alert on them.
- Don't fill up the disk with logs, only to delete them and let it fill up
again.. :)
- Remove all non-actionable alerts.
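
To make the "alert on 5xx but not 503" rule concrete, here is a rough sketch
in Python. The names (Action, classify_status) are made up for illustration
and are not from any particular monitoring product. Treating everything
outside 5xx as OK is also my simplifying assumption; a real check would
likely look at response times and content too.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    MAINTENANCE = "maintenance"  # known and expected: record it, don't page
    ALERT = "alert"              # unexpected server-side failure: page someone


def classify_status(status_code: int) -> Action:
    """Map an HTTP status code to a monitoring action.

    503 is reserved for planned maintenance, so it is checked first.
    Any other 5xx is treated as a real failure.
    """
    if status_code == 503:
        return Action.MAINTENANCE
    if 500 <= status_code <= 599:
        return Action.ALERT
    # Assumption: non-5xx responses are not alert-worthy in this sketch.
    return Action.OK


if __name__ == "__main__":
    for code in (200, 404, 500, 502, 503):
        print(code, classify_status(code).value)
```

This only works if the services really do keep 503 for maintenance and
nothing else, which is why the "(and tested!)" part matters.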

Of course, a good solution could be to implement a rolling-upgrade / HA
maintenance strategy, but in reality (depending on how ancient the app is)
this can be quite hard.

ps. This is a really good read:
https://landing.google.com/sre/sre-book/toc/index.html


Cheers
Heath




On Thu, Dec 6, 2018 at 9:03 AM David H <ispcolohost () gmail com> wrote:

Hey all, was curious if anyone knows of a website monitoring service that
has the option to incorporate a human component into the decision and
escalation tree?  I’m trying to help a customer find a way around false
positives bogging down their NOC staff, by having a human determine the
difference between a real error, desired (but different) content, or
something in between like “Hey it’s 3am and we’ve taken our website offline
for maintenance, we’ll be back up by 6am.”  Automated systems tend to only
know if test A, or steps A through C, are failing, then this is ‘down’ and
do my preconfigured thing, but that ends up needlessly taking NOC time if
the customer themselves is performing work on their own site, or just
changed it and whatever content was being watched, is now gone.  So, the
goal would be to have the end user be the first point of contact if it
looks like more of a customer-side issue.  If they can’t be reached to
confirm, THEN contact NOC, and unlike email alerts, keep contacting until a
human acknowledges receipt of the alert.



Thanks

