nanog mailing list archives

Re: FYI Netflix is down


From: Todd Underwood <toddunder () gmail com>
Date: Sat, 30 Jun 2012 16:24:41 -0400

scott,


This was not a cascading failure.  It was a simple power outage.

Cascading failures involve interdependencies among components.


Not always.  Cascading failures can also occur when there is zero dependency
between components.  The simplest form of this is where one environment
fails over to another, but the target environment is not capable of handling
the additional load and then "fails" itself as a result (in some form or
other, but frequently different to the mode of the original failure).

indeed.  and that is an interdependency among components.  in
particular, it is a capacity interdependency.
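
A minimal sketch of the capacity interdependency described above, assuming a toy model in which each site has a fixed capacity and a failed site's load is split evenly among the surviving sites. The site names, the numbers, and the function name `simulate_failover` are hypothetical and purely illustrative; they do not describe Amazon's or anyone else's actual setup.

```python
def simulate_failover(capacity, load, failed):
    """Return the set of sites that end up down after failover.

    capacity: dict of site -> maximum load it can carry (hypothetical units)
    load:     dict of site -> load it carries before the failure
    failed:   list of sites that fail initially (e.g. a power outage)
    """
    load = dict(load)          # work on a copy, leave the caller's data alone
    down = set(failed)
    pending = list(failed)     # failed sites whose load still has to move

    while pending:
        site = pending.pop()
        survivors = [s for s in capacity if s not in down]
        if not survivors:
            break              # nothing left to fail over to

        # Toy assumption: the failed site's load is split evenly.
        share = load[site] / len(survivors)
        load[site] = 0
        for s in survivors:
            load[s] += share

        # Any survivor pushed past its capacity fails in turn.
        for s in survivors:
            if load[s] > capacity[s]:
                down.add(s)
                pending.append(s)

    return down


capacity = {"east": 100, "west": 100, "eu": 100}

# Sites running at 70% have no headroom to absorb a peer: everything goes down.
busy = simulate_failover(capacity, {"east": 70, "west": 70, "eu": 70}, ["east"])

# Sites running at 40% absorb the failed site's load: only "east" is down.
idle = simulate_failover(capacity, {"east": 40, "west": 40, "eu": 40}, ["east"])
```

In this model the sites share no code path or hardware, yet one failure takes out the rest whenever the spare capacity is smaller than the load being moved.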

Whilst the Amazon outage might have been a "simple" power outage, it's
likely that at least some of the website outages caused were a combination
of not just the direct Amazon outage, but also the flow-on effect of their
redundancy attempting (but failing) to kick in - potentially making the
problem worse than the Amazon outage alone would have caused.

i think you over-estimate these websites.  most of them simply have no
redundancy (and obviously have no tested, effective redundancy) and
were simply hoping that amazon didn't really go down that much.

hope is not the best strategy, as it turns out.

i suspect that randy is right though:  many of these businesses do not
promise perfect uptime and can survive these kinds of failures with
little loss to business or reputation.  twitter has branded its early
failures with a whale that not only didn't hurt it but helped endear
the service to millions.  when your service fits these criteria, why
would you bother doing the complicated systems and application
engineering necessary to actually have functional redundancy?

it simply isn't worth it.

t


  Scott

