nanog mailing list archives

Re: Mitigating human error in the SP


From: Chadwick Sorrell <mirotrem () gmail com>
Date: Tue, 2 Feb 2010 20:28:44 -0500

Thanks for all the comments!

On Tue, Feb 2, 2010 at 1:01 PM, JC Dill <jcdill.lists () gmail com> wrote:
Chadwick Sorrell wrote:

This outage, of a high profile customer, triggered upper management to
react by calling a meeting just days after.  Put bluntly, we've been
told "Human errors are unacceptable, and they will be completely
eliminated.  One is too many."

Good, Fast, Cheap - pick any two.  No, you can't have all three.

Here, Good is defined by your pointy-haired bosses as an
impossible-to-achieve zero error rate.[1]  Attempting to achieve this is
either going to cost $$$, or your operations speed (how long it takes people
to do things) is going to drop like a rock.  Your first action should be to
make sure upper management understands this so they can set the appropriate
priorities on Good, Fast, and Cheap, and make the appropriate budget
changes.

It's going to cost $$$ to hire enough people to have the staff necessary to
double-check things in a timely manner, OR things are going to slow way down
as the existing staff is burdened by necessary double-checking of everything
and triple-checking of some things required to try to achieve a zero error
rate.  They will also need to spend $$$ on software (to automate as much as
possible) and testing equipment.  They will also never actually achieve a
zero error rate as this is an impossible task that no organization has ever
achieved, no matter how much emphasis or money they pour into it (e.g.
Windows vulnerabilities) or how important the work is (see the Challenger,
Columbia, and Mars Climate Orbiter incidents).

When you put a $$$ cost on trying to achieve a zero error rate,
pointy-haired bosses are usually willing to accept a normal error rate.  Of
course, they want you to try to avoid errors, and there are a lot of simple
steps you can take in that effort (basic checklists, automation, testing)
which have been mentioned elsewhere in this thread that will cost some money
but not the $$$ that is required to try to achieve a zero error rate.  Make
sure they understand that the budget they allocate for these changes will be
strongly correlated with how Good (zero error rate) and Fast (quick
operational responses to turn-ups and problems) the outcome of this
initiative is.

jc

[1]  http://www.godlessgeeks.com/LINKS/DilbertQuotes.htm

2. "What I need is a list of specific unknown problems we will encounter."
(Lykes Lines Shipping)

6. "Doing it right is no excuse for not meeting the schedule." (R&D
Supervisor, Minnesota Mining & Manufacturing/3M Corp.)





