nanog mailing list archives

Re: "They all suck!" Re: UPS failure modes (was: fire at NAC)


From: Sean Donelan <sean () donelan com>
Date: Thu, 29 May 2003 16:53:43 -0400 (EDT)


On Thu, 29 May 2003, Alex Rubenstein wrote:
Even in instances where 'High availability' is designed, in the case where
one of the units has a failure that causes a fire and FM200 dump, either
the FM200 will still trigger an EPO, or the fire department will.

Why do you think most telephone central offices don't have EPOs?  It is
possible to meet code without an EPO, if you have a smart PE on the
project.


So, the second 'high available' unit will generally not prevent you from
dropping the critical load, but instead, will help you get back on line
quicker.

That's why you have geographic diversity: if one node goes down, the other
location may be unaffected.


A much cheaper and easier to implement external maintenance
make-before-break bypass will accomplish the same thing.

Pick two out of three.  The "Internet philosophy" has tended to be
lots of cheap equipment connected by diverse paths.  Designing for
failure also means defining "failure" in terms of the service, not
particular pieces of equipment.  I don't care how many 9's your switch
is, I just care if my packets get through.

I've heard many a story of the paralleling gear causing the problem in the
first place, as well...

Yep, tying together "redundant" systems with paralleling gear turns two
independent systems into one "co-dependent" system.  In a failure
situation, you want to compartmentalize the failure.  Losing half your
systems may be better than losing all your systems.
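
A back-of-the-envelope sketch of that trade-off, for anyone who wants to
play with the numbers.  Everything here is an assumption for illustration:
the function names are hypothetical, the failure probabilities are made up
rather than measured, and the model treats the two UPS systems as failing
independently, with the paralleling gear as a single shared point of failure.

```python
def split_systems(p_unit):
    """Two independent UPS systems, each carrying half the load.

    p_unit is an assumed probability that one system fails in some period.
    """
    return {
        "lose_half": 2 * p_unit * (1 - p_unit),  # exactly one system fails
        "lose_all": p_unit * p_unit,             # both fail independently
    }


def paralleled_systems(p_unit, p_tie):
    """The same two systems tied together through paralleling gear.

    p_tie is an assumed probability that the paralleling gear itself fails
    and drops the whole load. A single unit failure is masked by the other
    unit, so in this simple model there is no "lose half" outcome.
    """
    return {
        "lose_half": 0.0,
        # The tie fails, or the tie survives and both units fail.
        "lose_all": p_tie + (1 - p_tie) * p_unit * p_unit,
    }


# Illustrative values only, not field data.
split = split_systems(0.01)
tied = paralleled_systems(0.01, 0.005)
print("split:", split)
print("tied: ", tied)
```

With these made-up inputs the tied design never loses just half the load,
but its chance of losing everything is dominated by the paralleling gear
rather than by the two units failing together.  Whether that is a good
trade depends on how reliable the tie really is.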


