nanog mailing list archives

Re: Data Center testing


From: James Hess <mysidia () gmail com>
Date: Tue, 25 Aug 2009 23:46:42 -0500

On Tue, Aug 25, 2009 at 7:53 AM, Jeff Aitken<jaitken () aitken com> wrote:
> [..] Periodically inducing failures to catch [...] them is sorta like using your smoke detector as an oven timer.
> [..]
> machine-parsable format, but the benefit is that you know in pseudo-realtime
> when something is wrong, as opposed to finding out the next time a device
> fails.

Config checking can't say much about silent hardware failures.
Unanticipated problems are likely to arise in failover systems,
especially complicated ones.  A failover system that has not been
periodically verified may not work as designed.

Simulations, config review, and change controls are not substitutes
for testing; they address overlapping but different problems.
Testing detects unanticipated errors; config review is a preventive
measure that helps avoid and correct apparent configuration issues.

Config checking (of both software and hardware choices) also helps to
keep out unnecessary complexity.

A human still has to write the script and review its output. Sooner or
later an operator error will occur in which something is accidentally
omitted from both the current state and the "desired" state; the two
then agree, and there is a chance that the erroneous entry escapes
detection.
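
A minimal sketch of that blind spot, assuming the checker is nothing more
than a set comparison between a "desired" template and the running config.
The function name and the config entries are made up for illustration; they
do not come from any particular tool.

```python
# Hypothetical config entries, for illustration only.
def config_drift(desired, current):
    """Compare two sets of config lines and report the differences."""
    return {
        "missing": desired - current,      # in the template, not on the device
        "unexpected": current - desired,   # on the device, not in the template
    }

# What the network actually needs in order to behave correctly.
required = {"acl permit mgmt", "acl deny any", "vrrp group 10"}

# The operator forgot "acl deny any" when writing the template, and the
# device was provisioned from that same template.
desired = {"acl permit mgmt", "vrrp group 10"}
current = {"acl permit mgmt", "vrrp group 10"}

drift = config_drift(desired, current)

# The omission is invisible to the checker: it is absent from both sides.
overlooked = required - desired - current

print(drift)
print(overlooked)
```

The checker reports no drift because both inputs share the same omission.
Only something outside the comparison, such as a test of actual behavior,
has a chance of noticing the missing entry.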

There can be other types of errors:
Possibly there is a damaged patch cable, dying port, failing power
supply, or other hardware on the warm spare that has silently degraded,
and its poor condition won't be detected (until it actually tries
to take a heavy workload, blows a fuse, eats a transceiver, and
everything just falls apart).
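
A toy model of that situation, assuming a standby unit whose real capacity
is a hidden property that no config file describes. The `Unit` class, the
`max_load` field, and the load figures are all invented for this sketch; a
real drill would move production or synthetic traffic onto the spare.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    config: dict
    # Fraction of full load the hardware can really carry. This is hidden
    # state: nothing in the config reflects it. (Hypothetical field.)
    max_load: float

    def carry(self, load):
        """Return True if the unit survives the offered load."""
        return load <= self.max_load


def config_check(unit, desired):
    """Passes when the unit's config matches the desired config."""
    return unit.config == desired


def failover_drill(standby, load):
    """Actually push load onto the standby and see whether it holds."""
    return standby.carry(load)


desired = {"vrrp_priority": 90, "uplink": "enabled"}

# Warm spare with a correct config but degraded hardware, e.g. a failing
# power supply that only shows up under heavy load.
standby = Unit("standby", dict(desired), max_load=0.2)

config_ok = config_check(standby, desired)
drill_ok = failover_drill(standby, load=0.8)

print("config check passed:", config_ok)
print("failover drill passed:", drill_ok)
```

In this model the config check passes while the drill fails, which is the
gap described above: the check only sees what is written down, and the
degradation is not written down anywhere.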


Perhaps you upgraded a hardware module or software image X months ago,
to fix bug Y on the secondary unit, and the upgrade caused completely
unanticipated side effect Z.


Config checking can't say much about silent hardware problems.

--
-Mysid

