nanog mailing list archives

Re: Data Center testing


From: Jack Bates <jbates () brightok net>
Date: Wed, 26 Aug 2009 09:22:12 -0500

James Hess wrote:
Config checking can't say much about silent hardware failures.
Unanticipated problems are likely to arise in failover systems,
especially complicated ones.  A failover system that has not been
periodically verified may not work as designed.


I've seen 3-4 failover failures in the last year alone on the SONET transport gear. In almost every case, the backup card was already dead when the primary either died or induced errors that caused the telco to switch to the backup card. I have no doubt that they haven't been testing. While it didn't affect most of my network, I have a few customers that aren't multihomed, and it wiped them out in the middle of the day for up to 3 hours.

There can be other types of errors:
Possibly there is a damaged patch cable, dying port, failing power
supply, or other hardware on the warm spare that has silently degraded
and its poor condition won't be detected (until it actually tries
to take a heavy workload, blows a fuse, eats a transceiver, and
everything just falls apart).


Lots of weird things to test for. I remember once rebooting a c5500 that had been cruising along for 3 years, and the bootup diag detected half of a linecard as bad, even though it had been running decently right up until the reload. Over the years, I think I've seen or detected everything you mentioned, either during routine testing or in production "oh crap" events.
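
As a rough sketch of the bookkeeping side of this (not something anyone in this thread actually runs), one could track when each standby component was last exercised for real and flag the ones that are overdue. The component names, the dates, and the 90-day interval below are all made up for illustration.

```python
from datetime import datetime, timedelta


def overdue_standbys(last_tested, now, max_age=timedelta(days=90)):
    """Return standby components not exercised within max_age, oldest first.

    last_tested maps a (hypothetical) component name to the datetime of its
    last real failover test or reload, not just its last config check.
    """
    stale = [(when, name) for name, when in last_tested.items()
             if now - when > max_age]
    return [name for when, name in sorted(stale)]


# Illustrative inventory only; none of these entries come from a real network.
inventory = {
    "sonet-backup-card": datetime(2008, 1, 15),
    "standby-supervisor": datetime(2009, 7, 1),
    "warm-spare-psu": datetime(2006, 8, 1),
}

for name in overdue_standbys(inventory, now=datetime(2009, 8, 26)):
    print("overdue for a failover test:", name)
```

This only tells you what hasn't been tested lately. It won't find a dead card by itself; somebody still has to fail over to the spare or reload the box during a maintenance window.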

Jack



