nanog mailing list archives

Re: Operations task management software?


From: Lee <ler762 () gmail com>
Date: Wed, 27 Jul 2016 19:19:45 -0400

On 7/27/16, David Hubbard <dhubbard () dino hostasaurus com> wrote:
> Hi all, curious if anyone has recommendations on software that helps manage
> routine duties assigned to operations staff?

Have computers do the routine scut work - not people.

> For example, let’s say we have a P&P that says someone from the netops group
> must check that Rancid is successfully backing up all router configs
> bi-weekly.

You've got the source code for rancid, so change rancid-run to set the
log file name once and export it, something like
  LOGFILE=$LOGDIR/$GROUP.`date +%Y%m%d.%H%M%S`; export LOGFILE
then change the
  ) >$LOGDIR/$GROUP.`date +%Y%m%d.%H%M%S` 2>&1
to
  ) >$LOGFILE 2>&1

and then in control_rancid do something like
  grep "clogin error:" "$LOGFILE" | sort | uniq -c >"$TMP.fail"
  if [ -s "$TMP.fail" ]; then
     # got some output, mail the report
     ...
  fi
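
Pulled out into a standalone function, that check could look roughly like
the sketch below. The function name report_clogin_failures, the report
wording, and the mail address in the closing comment are all assumptions
for illustration; none of it ships with rancid.

```shell
#!/bin/sh
# Sketch: summarize "clogin error:" lines from a rancid log file.
# report_clogin_failures is a hypothetical helper, not part of rancid.

report_clogin_failures() {
    logfile=$1

    # A missing log is itself a failure, so don't report "all clear".
    if [ ! -r "$logfile" ]; then
        echo "OhNoes! cannot read $logfile"
        return 2
    fi

    failfile=$(mktemp) || return 2

    # Collapse identical error lines into one line with a count.
    grep "clogin error:" "$logfile" | sort | uniq -c >"$failfile"

    if [ -s "$failfile" ]; then
        # got some output, print the report
        echo "OhNoes! clogin failures in $logfile:"
        cat "$failfile"
        status=1
    else
        echo "all clear: no clogin errors in $logfile"
        status=0
    fi
    rm -f "$failfile"
    return $status
}

# Possible use from cron (address is a placeholder):
#   report_clogin_failures "$LOGFILE" | mail -s "rancid check" netops@example.com
```

Treating an unreadable log as a failure matters here: if rancid never ran,
an empty grep result would otherwise look like a clean run.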

Do the same type of thing for checking on backup failures, backup
internet circuit status, out-of-band interfaces, etc.

Automate the checks, put the scripts in crontab, and mail out an
"OhNoes!" or "all clear" msg at the end. At that point you're left
with the problem of making sure the managers are looking at the emails
and that whatever problems are found actually get fixed :)
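
A wrapper along these lines could tie the individual checks together. It
is only a sketch: run_checks, the script path, and the mail address are
made-up names, and it assumes each check is a command that exits 0 on
success.

```shell
#!/bin/sh
# Sketch: run each check command, collect the failures, print one summary.
# run_checks and the commands passed to it are hypothetical.

run_checks() {
    failed=""
    for check in "$@"; do
        # A check passes when its command exits 0.
        if ! sh -c "$check" >/dev/null 2>&1; then
            failed="$failed
  FAILED: $check"
        fi
    done

    if [ -n "$failed" ]; then
        printf 'OhNoes!%s\n' "$failed"
        return 1
    fi
    echo "all clear"
    return 0
}

# Possible crontab entry (path and address are placeholders):
#   0 6 * * 1  /usr/local/bin/netops-checks.sh 2>&1 | mail -s "netops checks" netops@example.com
```

Sending the "all clear" every time, not just the failures, means a
missing email is itself a signal that the cron job stopped running.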

Regards,
Lee



> Ideally, it would send an email reminder to this pre-defined
> group of people saying hey, it’s Monday, someone needs to check this and
> come acknowledge the task as having been completed.  If that doesn’t occur,
> pre-defined manager X is notified on Tuesday.  If manager X doesn’t get
> someone to complete the task, director Y is notified, so on and so forth.
> Then, perhaps periodically it emails manager X anyway and says hey, it’s
> been three months, you need to audit netops to ensure they’re actually doing
> the Rancid audit and not just checking that it was done.  This could be
> applied to the staff who check on backup failures, backup internet circuit
> status, out of band interfaces, etc.
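
The escalation chain described there could be sketched as a small lookup.
The role names and the one-day steps are assumptions read loosely from the
description, and escalation_target is a hypothetical name, not something
from an existing product.

```shell
#!/bin/sh
# Sketch: pick who gets notified, given days since the task came due.
# Roles and thresholds are illustrative assumptions.

escalation_target() {
    days_overdue=$1
    acked=$2            # "yes" once someone has acknowledged the task

    if [ "$acked" = "yes" ]; then
        echo "nobody"
    elif [ "$days_overdue" -le 0 ]; then
        echo "netops"       # due day (Monday): remind the group
    elif [ "$days_overdue" -eq 1 ]; then
        echo "manager"      # next day (Tuesday): manager X
    else
        echo "director"     # after that: director Y
    fi
}
```

For example, `escalation_target 1 no` prints `manager`, while any call with
an acknowledgement of `yes` prints `nobody`.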

> A data center I looked at recently had QR code stickers on all of their
> infrastructure stuff and there were staff assigned to check and log certain
> displayed values each day.  The software would at least ensure they actually
> visited the equipment by requiring they scan the relevant QR code when in
> front of it.  So I figure something that does what I’m looking for properly
> already exists.
>
> Thanks,
>
> David


