nanog mailing list archives

Limits of reliability or is 99.999999999% realistic


From: Sean Donelan <sean () donelan com>
Date: 25 Nov 2000 20:24:48 -0800


On Fri, 24 November 2000, Roeland Meyer wrote:
> After all the discussion that we had on the Datacenter list, I am surprised
> at this. You'd think that they'd have redundant PS's with redundant UPS's.

There are interesting electrical faults which will kill redundant UPSes and
redundant power supplies.  However, I don't know if Sprint uses redundant
power supplies, or had a failure affecting multiple power supplies on the
STP.  They use Compaq/Tandem NonStop hardware for their SCPs.
 
After previous power problems at Sprint COs (i.e. Kansas City) which
affected my service, I've asked my Sprint sales person what happened.
He was never able to get anyone to call me back with an answer.

> For the internet, I see an amazing number of systems with no redundancy
> whatsoever. Of course, the first hardware failure usually corrects the
> problem, at the cost of substantial down-time. But many second-tier ISPs and
> dot-coms are still operating on brand-new equipment that hasn't started
> hitting its MTBF specs yet and they don't even have a clue on their MTTR
> ratings. In the next few years, I expect to see a lot more failures, as the
> equipment starts to age.

I'm not sure that is true.  Brand new electronic equipment tends to have a
period of infant mortality.  If it survives, it tends to be reliable for a
fairly long period of time.  I had customers still using Proteon routers 10
years after Proteon discontinued the model.  However, scaling requires most
dot-coms to replace/upgrade their equipment every few months, so they are
always dealing with infant mortality.
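For what it's worth, the usual back-of-the-envelope relationship is
steady-state availability = MTBF / (MTBF + MTTR), which is why not knowing
your MTTR makes any availability claim guesswork.  A minimal sketch in
Python, with MTBF and MTTR figures that are purely illustrative:

    # Steady-state availability from MTBF and MTTR.
    # Both figures below are invented solely for illustration.
    mtbf_hours = 50000.0   # hypothetical mean time between failures
    mttr_hours = 4.0       # hypothetical mean time to repair

    availability = mtbf_hours / (mtbf_hours + mttr_hours)
    print(f"availability = {availability:.6f}")  # ~0.999920, about four nines

Even a generous MTBF only buys "five nines" if repairs take minutes, not hours.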

But back to my question.  What is the real requirement?  Amazon.COM had
system problems on Friday, and their site was unusable for 30 minutes,
definitely not 99.999%.  But what did that really mean?  The FAA loses
its radar for several hours in various parts of the country.  What did
that really mean?  Essentially every system I've looked at that is given as
an example of "high-availability, high-reliability" fails to hold up under
close examination.
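
For scale, the arithmetic on annual downtime budgets takes only a few lines
(a quick sketch; nothing below is specific to Amazon or the FAA):

    # Allowed downtime per year at a few availability targets,
    # and what a single 30-minute outage does to the budget.
    minutes_per_year = 365.25 * 24 * 60

    for target in (0.999, 0.9999, 0.99999):
        budget = (1.0 - target) * minutes_per_year
        print(f"{target:.3%} allows {budget:7.1f} minutes of downtime per year")

    # 99.999% allows roughly 5.3 minutes per year, so one 30-minute outage
    # spends several years' worth of a five-nines budget in one afternoon.

A single 30-minute outage is survivable at 99.9%, marginal at 99.99%, and
puts five nines out of reach for years.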

Is 99.999% just F.U.D. created by consultants?

Instead of pretending we can build systems which will never fail, should
we work on a realistic understanding of what can be delivered?





