nanog mailing list archives

Re: Followup British Telecom outage reason


From: Ian Duncan <Ian.Duncan () sympatico ca>
Date: Mon, 26 Nov 2001 10:46:49 -0500


Wandering off the subject of BT's misfortune ...

Sean Donelan wrote:

On Mon, 26 Nov 2001, Christian Kuhtz wrote:

[...]


Faults will happen.  And nothing matters as much as how you prepare for
when they do.

Mean Time To Repair is a bigger contributor to Availability calculations
than the Mean Time To Failure.  It would be great if things never failed.

And Mean Time To Fault Detected (Accurately) is usually the biggest
sub-contributor within Repair but that's kinda your point.
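
To put rough numbers on that, here is a minimal sketch assuming the usual
steady-state formula, availability = MTTF / (MTTF + MTTR), and assuming
repair time splits into detect + diagnose + fix. The function names and the
hour figures are made up for illustration; they are not from anyone's
network.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def repair_time(detect_hours: float, diagnose_hours: float, fix_hours: float) -> float:
    """Assumed breakdown of MTTR into detection, diagnosis, and the actual fix."""
    return detect_hours + diagnose_hours + fix_hours


# Hypothetical figures: same failure rate, same fix, different detection time.
mttf = 1000.0
slow_detect = repair_time(detect_hours=3.0, diagnose_hours=0.5, fix_hours=0.5)
fast_detect = repair_time(detect_hours=0.25, diagnose_hours=0.5, fix_hours=0.5)

print(f"slow detection: MTTR={slow_detect}h, availability={availability(mttf, slow_detect):.5f}")
print(f"fast detection: MTTR={fast_detect}h, availability={availability(mttf, fast_detect):.5f}")
```

With these assumed inputs, detection is most of the repair time, so
shortening it moves availability more than shaving the fix itself would.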


But some people are making their systems so complicated chasing the Holy
Grail of 100% uptime, they can't figure out what happened when it does
fail.

Similar people pursue the creation of a perpetuum mobile. A strange and somewhat
congruent example I stumbled into recently is:
http://www.sce.carleton.ca/netmanage/perpetum.shtml.

Overall simplicity of the system, including failure detection mechanisms, and real
redundancy are the most reliable tools for availability. Of course, popping just a
few layers out, profit and politics are elements of most systems.

Murphy's revenge: The more reliable you make a system, the longer it will
take you to figure out what's wrong when it breaks.

Hmm.


