nanog mailing list archives

RE: Followup British Telecom outage reason


From: Sean Donelan <sean () donelan com>
Date: Mon, 26 Nov 2001 06:28:22 -0500 (EST)




On Mon, 26 Nov 2001, Christian Kuhtz wrote:
Now, if lack of infrastructure reliability can harm human life you may feel
differently, but that isn't the case for most of us at the present time.

I've designed software and networks used for public safety and
emergencies.  And yes, people have died on my watch. It is a somewhat
different mindset, but not that different.  A lot of "good engineering
practice" applies to any engineering activity, including software
engineering.

It's not even a matter of cost.  A typical hospital spends less on
their emergency power system than an Internet/telco hotel.  The major
difference is the hospital staff knows (more or less) what to do when
the generators don't work.

The big secret is most "life safety" systems fail regularly.  Most of
the time it doesn't matter because the "big one" doesn't coincide with
the failure.


Faults will happen.  And nothing matters as much as how you prepare for
when they do.

Mean Time To Repair is a bigger contributor to Availability calculations
than the Mean Time To Failure.  It would be great if things never failed.
But some people are making their systems so complicated chasing the Holy
Grail of 100% uptime that they can't figure out what happened when one does
fail.
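
A rough sketch of the arithmetic behind that point, assuming the usual
steady-state definition Availability = MTTF / (MTTF + MTTR), which the
post does not spell out.  Under that definition, halving repair time buys
exactly as much availability as doubling the time between failures, and a
design that fails half as often but takes three times as long to diagnose
ends up worse off.  The function names and all the hour figures below are
hypothetical, chosen only for illustration.

```python
HOURS_PER_YEAR = 8760.0


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up.

    Assumes the textbook definition A = MTTF / (MTTF + MTTR).
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime in a year for the given MTTF and MTTR."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical figures, not measurements from any real system.
baseline = availability(mttf_hours=1000.0, mttr_hours=4.0)
faster_repair = availability(mttf_hours=1000.0, mttr_hours=2.0)   # MTTR halved
rarer_failure = availability(mttf_hours=2000.0, mttr_hours=4.0)   # MTTF doubled
# Fails half as often, but is complicated enough that repair takes 3x longer.
complex_system = availability(mttf_hours=2000.0, mttr_hours=12.0)

for label, value in [
    ("baseline", baseline),
    ("faster repair", faster_repair),
    ("rarer failure", rarer_failure),
    ("complex system", complex_system),
]:
    print(f"{label:15s} availability = {value:.6f}")
```

In this sketch the "complex system" case comes out below the baseline
even though it fails half as often, which is one way to read the warning
about systems too complicated to troubleshoot.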

Murphy's revenge: The more reliable you make a system, the longer it will
take you to figure out what's wrong when it breaks.


