nanog mailing list archives

RE: Amazon diagnosis


From: Robert Bonomi <bonomi () mail r-bonomi com>
Date: Sun, 1 May 2011 16:35:29 -0500 (CDT)


Subject: RE: Amazon diagnosis
Date: Sun, 1 May 2011 12:50:37 -0700
From: George Bonser <gbonser () seven com>

They apparently had a redundant primary network and, on top of that, a
secondary network.  The secondary network, however, did not have the
capacity of the primary network.

Rather than failing over from the active portion of the primary network
to the standby portion of the primary network, they inadvertently failed
the entire primary network to the secondary.  This resulted in the
secondary network reaching saturation and becoming unusable.
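
As a rough illustration of the capacity mismatch described above, here is a
minimal sketch of a pre-shift sanity check.  Everything in it is hypothetical:
the function name `can_absorb`, the 80% headroom figure, and the Gbps numbers
are assumptions made for the example, not details of Amazon's network or
tooling.

```python
def can_absorb(shifted_load_gbps, target_load_gbps, target_capacity_gbps,
               headroom=0.8):
    """Return True if the target network could carry its current load plus
    the shifted load while staying under `headroom` of its capacity.

    All parameters are illustrative assumptions, not real measurements.
    """
    if target_capacity_gbps <= 0:
        return False
    projected = target_load_gbps + shifted_load_gbps
    return projected <= headroom * target_capacity_gbps


# Hypothetical figures: the primary carries 60 Gbps; its standby half is
# sized like the active half, while the secondary network is much smaller.
primary_load = 60.0
ok_standby = can_absorb(primary_load, 0.0, 100.0)   # shift to standby primary
ok_secondary = can_absorb(primary_load, 2.0, 10.0)  # shift to secondary
```

With these made-up numbers the shift to the standby primary passes and the
shift to the secondary does not, which is the saturation case.  A check like
this only helps if the procedure actually consults it before traffic moves.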

There isn't anything that can be done to mitigate against human error.
You can TRY, but as history shows us, it all boils down to the human that
implements the procedure.  All the redundancy in the world will not do
you an iota of good if someone explicitly does the wrong thing.  ...

This looks like it was a procedural error and not an architectural
problem.  

A sage sayeth sooth: 

      "For any 'fool-proof' system, there exists 
       a *sufficiently*determined* fool capable of
       breaking it."

It would seem that the validity of that has just been re-confirmed.  <wry grin>


It is worthy of note that it is considerably harder to protect against
accidental stupidity than it is to protect against intentional malice.
('malice' is _much_ more predictable, in general.  <wry grin>)



