nanog mailing list archives
Re: Mitigating human error in the SP
From: Mark Smith <nanog () 85d5b20a518b8f6864949bd940457dc124746ddc nosense org>
Date: Tue, 2 Feb 2010 23:46:29 +1030
On Mon, 1 Feb 2010 21:21:52 -0500 Chadwick Sorrell <mirotrem () gmail com> wrote:
> Hello NANOG, Long time listener, first time caller. A recent organizational change at my company has put someone in charge who is determined to make things perfect. We are a service provider, not an enterprise company, and our business is doing provisioning work during the day. We recently experienced an outage when an engineer, troubleshooting a failed turn-up, changed the ethertype on the wrong port losing both management and customer data on said device. This isn't a common occurrence, and the engineer in question has a pristine track record.
Why didn't the customer have a backup link if their service was so important to them and indirectly your upper management? If your upper management are taking this problem that seriously, then your *sales people* didn't do their job properly - they should be ensuring that customers with high availability requirements have a backup link, or aren't led to believe that the single-point-of-failure service will be highly available.
> This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."
If upper management don't understand that human error is a risk factor that can't be completely eliminated, then I suggest "self-eliminating" and finding yourself a job somewhere else. The only way you'll avoid human error having any impact on production services is to not change anything - which pretty much means not having a job anyway ...
> I am asking the respectable NANOG engineers.... What measures have you taken to mitigate human mistakes? Have they been successful? Any other comments on the subject would be appreciated, we would like to come to our next meeting armed and dangerous.
>
> Thanks!
> Chad