nanog mailing list archives
Re: Mitigating human error in the SP
From: Chadwick Sorrell <mirotrem () gmail com>
Date: Tue, 2 Feb 2010 10:14:10 -0500
On Tue, Feb 2, 2010 at 9:09 AM, Paul Corrao <pcorrao () voxeo com> wrote:
Humans make errors. For your upper management to think they can build a foundation of reliability on the theory that humans won't make errors is self-deceiving. But that isn't where the story ends. That's where it begins. Your infrastructure, processes and tools should all be designed with that in mind so as to reduce or eliminate the impact that human error will have on the reliability of the service you provide to your customers. So, for the example you gave, there are a few things that could be put in place. The first one, already mentioned by Chad, is that mission-critical services should not be designed with single points of failure - that situation should be remediated.
Agreed.
Another question to be asked - since this was provisioning work being done, and it was apparently being done on production equipment, could the work have been done at a time of day (or night) when an error would not have been as much of a problem?
As it stands now, businesses want to turn their services up when they are in the office. We do all new turn-ups during the day; anything requiring a roll or maintenance window is scheduled in the middle of the night.
You don't say how long the outage lasted, but given the reaction by your upper management, I would infer that it lasted for a while. That raises the next question. Who besides the engineer making the mistake was aware of the fact that work on production equipment was occurring? This matters because having the NOC know that work is occurring would give them a leg up on locating the problem once they get the trouble notification.
The actual error happened when someone was troubleshooting a turn-up, where in the past the customer in question has had their ethertype set wrong. It wasn't a provisioning problem as much as someone troubleshooting why it didn't come up with the customer. Ironically, the NOC was on the phone when it happened, and the switch was rebooted almost immediately and the outage lasted 5 minutes.

Chad