nanog mailing list archives

Re: Quick question.


From: "Robert E. Seastrom" <rs () seastrom com>
Date: 01 Aug 2004 18:37:36 -0400



"Michel Py" <michel () arneill-py sacramento ca us> writes:

> The dead processor still has to be replaced, but this is scheduled
> maintenance, not outage. A little extra ammo when you have to hunt five
> or six nines.

MTTR on a single box is irrelevant when you are off playing Ponce de
Leon, hunting the Fountain of Five or Six Nines.  Even when your
architecture doesn't depend on any one particular machine (or even whole
big sets of machines) being available, you don't get to "five or six
nines"... just ask Google, Akamai, or Microsoft - there are other
things beyond your control that spoil the picnic first.

As has been observed time and time again, the tried and true way to
make five or six nines of reliability in a system of more than trivial
complexity is to take a lesson from the telcos (the progenitors of the
"five nines" lie) and build a framework and evaluation methodology
that excludes broad classes of unavailability-causing events or
prorates them in such a way as to make them non-reportable.  Add to
that list incrementally, until the remaining time listed shows your
target number of nines of reliability.  Presto, five nines.
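
To put rough numbers on the paragraph above, here is a minimal sketch
of the arithmetic in Python.  The outage categories, the durations,
and the function names are all made up for illustration; nothing here
comes from a real operator's reporting rules.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_budget_seconds(nines):
    """Downtime allowed per year at N nines (5 means 99.999%)."""
    return SECONDS_PER_YEAR * 10 ** (-nines)


def availability(outages, excluded=frozenset()):
    """Fraction of the year reported as 'up'.

    outages:  iterable of (category, seconds) pairs.
    excluded: categories the methodology declares non-reportable.
    """
    counted = sum(sec for cat, sec in outages if cat not in excluded)
    return 1 - counted / SECONDS_PER_YEAR


# Hypothetical outage log for one year (illustrative values only).
OUTAGES = [
    ("scheduled maintenance", 4 * 3600),
    ("upstream fiber cut", 2 * 3600),
    ("software bug", 1800),
    ("hardware failure", 200),
]

if __name__ == "__main__":
    print(f"five nines budget: {downtime_budget_seconds(5):.2f} s/year")
    print(f"everything counted: {availability(OUTAGES):.6%}")

    # Grow the exclusion list one class at a time and watch the number rise.
    excluded = set()
    for category in ("scheduled maintenance", "upstream fiber cut",
                     "software bug"):
        excluded.add(category)
        print(f"excluding {sorted(excluded)}: "
              f"{availability(OUTAGES, excluded):.6%}")
```

Five nines leaves a budget of roughly five minutes a year, so in this
made-up log the reported figure only clears 99.999% once everything
except the 200-second hardware failure has been defined away.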

                                        ---Rob


