nanog mailing list archives

Re: Mitigating human error in the SP


From: Chadwick Sorrell <mirotrem () gmail com>
Date: Tue, 2 Feb 2010 10:14:10 -0500

On Tue, Feb 2, 2010 at 9:09 AM, Paul Corrao <pcorrao () voxeo com> wrote:
Humans make errors.

For your upper management to think they can build a foundation of reliability on the theory that humans won't make 
errors is self-deceiving.

But that isn't where the story ends.  That's where it begins.  Your infrastructure, processes and tools should all be 
designed with that in mind so as to reduce or eliminate the impact that human error will have on the reliability of 
the service you provide to your customers.

So, for the example you gave there are a few things that could be put in place.  The first one, already mentioned by 
Chad, is that mission critical services should not be designed with single points of failure - that situation should 
be remediated.

Agreed.

Another question to be asked - since this was provisioning work being done, and it was apparently being done on 
production equipment, could the work have been done at a time of day (or night) when an error would not have been as 
much of a problem?

As it stands now, businesses want to turn their services up when they
are in the office.  We do all new turn-ups during the day; anything
requiring a roll or maintenance window is scheduled in the middle of
the night.

You don't say how long the outage lasted, but given the reaction by your upper management, I would infer that it 
lasted for a while.  That raises the next question.  Who besides the engineer making the mistake was aware of the 
fact that work on production equipment was occurring?  This is important because having the NOC know 
that work is occurring would give them a leg up on locating where the problem is once they get the trouble 
notification.

The actual error happened when someone was troubleshooting a turn-up,
where in the past the customer in question has had their ethertype set
wrong.  It wasn't a provisioning problem so much as someone
troubleshooting why the service didn't come up for the customer.  Ironically,
the NOC was on the phone when it happened, and the switch was rebooted
almost immediately and the outage lasted 5 minutes.

Chad

