nanog mailing list archives

Re: Tornados in Ashburn (Equinix affected)


From: Deepak Jain <deepak () ai net>
Date: Sat, 18 Sep 2004 23:29:09 -0400


Despite marketing departments, engineers know there will be failures.
An N+1 design means two faults will result in an interruption.  An N+2
design means three faults will result in an interruption.  And so on.

The only caveats I want to add are these:

1) No matter what the company, no matter what the design, N+x doesn't necessarily mean that more than x failures have to occur before there is an interruption, or that they have to occur simultaneously.

2) Just because a design is believed to be N+x or yN doesn't mean all single points of failure are really eliminated. N+x or yN only implies that it takes more than x (for N+x) or more than (y-1)N (for yN) of the failures they planned for to cause an interruption. It doesn't mean that they have planned for every possible failure mode. For example, static transfer switches can and do fail. Even when they are in pairs, the coupling and paralleling mechanisms often don't work and aren't easy to repair or bypass in an emergency.

3) Many new systems [say datacenters built/upgraded in the last 5 years] haven't been around long enough to really test 99.999% and above levels of availability... many new systems won't start showing problems for 5-10 years.
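
The arithmetic behind the three points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, "faults" means only the failure modes the designer planned for (see point 2), and a year is taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def faults_to_interrupt_n_plus_x(x: int) -> int:
    """Nominal count of planned-for faults that interrupts an N+x design.

    N+x carries x spare units, so on paper it takes x + 1 faults.
    Unplanned failure modes (e.g. a transfer switch) are not counted.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    return x + 1


def faults_to_interrupt_y_n(y: int, n: int) -> int:
    """Nominal count of planned-for unit faults that interrupts a yN design.

    yN installs y * n units where n are needed, leaving (y - 1) * n spares,
    so on paper it takes (y - 1) * n + 1 unit faults.
    """
    if y < 1 or n < 1:
        raise ValueError("y and n must be at least 1")
    return (y - 1) * n + 1


def downtime_budget_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year allowed at a given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    print("N+1 faults to interrupt:", faults_to_interrupt_n_plus_x(1))
    print("N+2 faults to interrupt:", faults_to_interrupt_n_plus_x(2))
    print("2N (N=4) unit faults to interrupt:", faults_to_interrupt_y_n(2, 4))
    print("99.999% budget, min/year:", downtime_budget_minutes_per_year(0.99999))
```

At 99.999% the budget comes to a little over five minutes a year, so a single hour-long incident uses up more than ten years of allowance. That is why a facility that is only a few years old can't really demonstrate that level yet.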

Specifically in Equinix's case:

1) Good that they [seem to] have maintained partial power.

2) Good that they restored cooling [power to the blowers?] relatively quickly. Going by the graph someone posted and their message, it looks like their chillers were on an unaffected system, but their blowers were on an affected one.

3) Good that they seem to have brought together enough knowledgeable folks to resolve the problems that did occur relatively quickly.

4) SLA credits. Depending on your contract, there may even be a breach unless they can prove that more than x or more than (y-1)N failures occurred in their physical plant. The latter (a breach claim) is only useful if you want to get out of Equinix/Ash or reduce your commits to it.

Deepak Jain
AiNET

