nanog mailing list archives

Re: Mitigating human error in the SP


From: Paul Corrao <pcorrao () voxeo com>
Date: Tue, 2 Feb 2010 09:09:30 -0500

Humans make errors.  

For your upper management to think they can build a foundation of reliability on the theory that humans won't make 
errors is self-deceiving.

But that isn't where the story ends.  That's where it begins.  Your infrastructure, processes, and tools should all be 
designed with that in mind, so as to reduce or eliminate the impact that human error will have on the reliability of the 
service you provide to your customers.
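
To make that concrete, here is a minimal sketch (Python, with made-up device, port, and ticket names) of the kind of guardrail I mean - the tooling forces the engineer to re-confirm the exact target before anything touches production, rather than relying on memory:

# confirm_change.py - minimal sketch of a "right command, wrong port" guardrail.
# All device, port, ticket, and command names here are hypothetical.

def confirm_change(device, port, command, ticket):
    """Show exactly what is about to change and require the target port
    to be re-typed before anything touches production."""
    print(f"Ticket  : {ticket}")
    print(f"Device  : {device}")
    print(f"Port    : {port}")
    print(f"Command : {command}")
    answer = input(f"Re-type the port name ({port}) to confirm: ")
    return answer.strip() == port

if __name__ == "__main__":
    # Re-typing the target port catches the "right command, wrong port" mistake.
    if confirm_change("edge-sw-01", "ge-0/0/12", "set ethertype 0x8100", "TKT-4711"):
        print("Confirmed - proceed with the change.")
    else:
        print("Port mismatch - aborting.")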

So, for the example you gave, there are a few things that could be put in place.  The first one, already mentioned by 
Chad, is that mission-critical services should not be designed with single points of failure - that situation should be 
remediated.  
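
As a rough sketch of how you might even find those situations before they bite you - the inventory below is entirely hypothetical, and in real life you'd pull it from your provisioning database:

# spof_check.py - minimal sketch: flag mission-critical customers riding a single circuit.
# The inventory and customer names below are hypothetical.

INVENTORY = {
    "acme-corp":   ["edge-sw-01:ge-0/0/12"],
    "bigbank":     ["edge-sw-01:ge-0/0/3", "edge-sw-02:ge-0/0/3"],
    "widgets-inc": ["edge-sw-03:ge-0/1/0"],
}
MISSION_CRITICAL = {"acme-corp", "bigbank"}

def single_points_of_failure(inventory, critical):
    """Return the critical customers that have fewer than two circuits."""
    return sorted(c for c in critical if len(inventory.get(c, [])) < 2)

if __name__ == "__main__":
    for customer in single_points_of_failure(INVENTORY, MISSION_CRITICAL):
        print(f"WARNING: {customer} is mission-critical but single-homed")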

Another question to ask: since this was provisioning work, apparently being done on production equipment, could the 
work have been done at a time of day (or night) when an error would not have caused as much of a problem?
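
A trivial example of what I mean - the window hours below are made up, and the real values belong in your change policy, not in the engineer's head on a busy afternoon:

# window_check.py - minimal sketch: refuse risky work outside the agreed window.
# The 00:00-06:00 window is hypothetical.

from datetime import datetime, time

WINDOW_START = time(0, 0)   # hypothetical low-impact window start
WINDOW_END   = time(6, 0)   # hypothetical low-impact window end

def in_maintenance_window(now=None):
    """True if the current local time falls inside the approved window."""
    current = (now or datetime.now()).time()
    return WINDOW_START <= current < WINDOW_END

if __name__ == "__main__":
    if in_maintenance_window():
        print("Inside the maintenance window - proceed per the change plan.")
    else:
        print("Outside the maintenance window - production changes need an exception.")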

You don't say how long the outage lasted, but given the reaction from your upper management, I would infer that it lasted 
for a while.  That raises the next question: who besides the engineer who made the mistake knew that work on production 
equipment was occurring?  This matters because a NOC that knows work is in progress has a leg up on locating the problem 
once the trouble notification comes in.
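
Even something as simple as the sketch below helps - here the "notification" is just an entry in a shared change-log file (the path is hypothetical); a real shop would more likely use its ticketing system, a chat channel, or a change calendar:

# notify_noc.py - minimal sketch: tell the NOC before touching production.
# The change-log path, ticket, and device names are hypothetical.

import getpass
from datetime import datetime
from pathlib import Path

CHANGE_LOG = Path("/var/log/noc/change-log.txt")   # hypothetical shared location

def notify_noc(ticket, device, summary):
    """Record who is working on what, so the NOC can correlate alarms with
    in-progress work as soon as a trouble report comes in."""
    entry = (f"{datetime.now():%Y-%m-%d %H:%M} {getpass.getuser()} "
             f"{ticket} {device}: {summary}\n")
    CHANGE_LOG.parent.mkdir(parents=True, exist_ok=True)
    with CHANGE_LOG.open("a") as fh:
        fh.write(entry)

if __name__ == "__main__":
    notify_noc("TKT-4711", "edge-sw-01", "provisioning new customer turn-up")
    print("NOC notified - proceed with the work.")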

Paul


On Feb 2, 2010, at 8:16 AM, Mark Smith wrote:

On Mon, 1 Feb 2010 21:21:52 -0500
Chadwick Sorrell <mirotrem () gmail com> wrote:

Hello NANOG,

Long time listener, first time caller.

A recent organizational change at my company has put someone in charge
who is determined to make things perfect.  We are a service provider,
not an enterprise company, and our business is doing provisioning work
during the day.  We recently experienced an outage when an engineer,
troubleshooting a failed turn-up, changed the ethertype on the wrong
port, losing both management and customer data on said device.  This
isn't a common occurrence, and the engineer in question has a pristine
track record.


Why didn't the customer have a backup link if their service was so
important to them and, indirectly, to your upper management? If your
upper management are taking this problem that seriously, then your
*sales people* didn't do their job properly - they should be ensuring
that customers with high availability requirements have a backup link,
or aren't led to believe that the single-point-of-failure service will
be highly available.


This outage, of a high profile customer, triggered upper management to
react by calling a meeting just days after.  Put bluntly, we've been
told "Human errors are unacceptable, and they will be completely
eliminated.  One is too many."


If upper management don't understand that human error is a risk factor
that can't be completely eliminated, then I suggest "self-eliminating"
and finding yourself a job somewhere else. The only way you'll avoid
human error having any impact on production services is to not change
anything - which pretty much means not having a job anyway ...


I am asking the respectable NANOG engineers....

What measures have you taken to mitigate human mistakes?

Have they been successful?

Any other comments on the subject would be appreciated; we would like
to come to our next meeting armed and dangerous.

Thanks!
Chad




