nanog mailing list archives

Re: Mitigating human error in the SP


From: JC Dill <jcdill.lists () gmail com>
Date: Tue, 02 Feb 2010 10:01:11 -0800

Chadwick Sorrell wrote:
This outage, of a high profile customer, triggered upper management to
react by calling a meeting just days after.  Put bluntly, we've been
told "Human errors are unacceptable, and they will be completely
eliminated.  One is too many."

Good, Fast, Cheap - pick any two.  No, you can't have all three.

Here, Good is defined by your pointy-haired bosses as an impossible-to-achieve zero error rate.[1] Attempting to achieve this is either going to cost $$$, or your operations speed (how long it takes people to do things) is going to drop like a rock. Your first action should be to make sure upper management understands this so they can set the appropriate priorities on Good, Fast, and Cheap, and make the appropriate budget changes.

It's going to cost $$$ to hire enough people to double-check things in a timely manner, OR things are going to slow way down as the existing staff is burdened with double-checking everything, and triple-checking some things, to try to achieve a zero error rate. They will also need to spend $$$ on software (to automate as much as possible) and testing equipment. And they will still never actually achieve a zero error rate: this is an impossible task that no organization has ever accomplished, no matter how much emphasis or money it pours into the effort (e.g. Windows vulnerabilities) or how important the outcome is (see the Challenger, Columbia, and Mars Climate Orbiter incidents).

When you put a $$$ cost on trying to achieve a zero error rate, pointy-haired bosses are usually willing to accept a normal error rate. Of course, they want you to try to avoid errors, and there are a lot of simple steps you can take in that effort (basic checklists, automation, testing), which have been mentioned elsewhere in this thread. These will cost some money, but not the $$$ required to try to achieve a zero error rate. Make sure they understand that the budget they allocate for these changes will be strongly correlated with how Good (zero error rate) and Fast (quick operational responses to turn-ups and problems) the outcome of this initiative is.

jc

[1]  http://www.godlessgeeks.com/LINKS/DilbertQuotes.htm

2. "What I need is a list of specific unknown problems we will encounter." (Lykes Lines Shipping)

6. "Doing it right is no excuse for not meeting the schedule." (R&D Supervisor, Minnesota Mining & Manufacturing/3M Corp.)



