nanog mailing list archives

[fwd] Rats take down Stanford ...


From: Paul Ferguson <pferguso () cisco com>
Date: Tue, 22 Oct 1996 09:30:12 -0400

A follow-up thought on redundancy issues.

- paul

[snip]


Date: Mon, 21 Oct 1996 12:54:05 -0700 (PDT)
From: risks () csl sri com
Subject: RISKS DIGEST 18.54

[snip]


Date: Fri, 18 Oct 96 11:03 EST
From: William Hugh Murray <0003158580 () mcimail com>
Subject: Re: Rats take down Stanford ... (RISKS-18.53)

PGN's request for redundancy brings to mind the story of the infrastructure
computer center in Trumbull, Connecticut.  It is an old story but bears
repeating.

Seems that a squirrel got into a transformer and brought down the external
power supply.  The UPS kicked in, engine generators came on line, and the
center operated in this mode for about an hour and a half.  At the end of
that time the external power was restored.  The external power, the UPS, and
the engine generators went into a deadly embrace.  The whole thing came down
and would not come back up.

I take two lessons from this.  First, redundancy adds some complexity and a
lot of redundancy adds a lot of complexity.  At some point the redundancy
begins to introduce failure modes and failure events that would not have
existed in its absence.  There is an upper bound to such redundancy.

Second, test redundant systems through to resumption of normal operations.
In this case, the operators had tested to ensure that the redundant systems
would come online in the event of a failure of the primary system.  They had
not tested to see what would happen when the primary system was restored to
normal operation.

Who would have even thought about it?  I confess that I would not have.

William Hugh Murray, New Canaan, Connecticut


[snip]
