nanog mailing list archives

RE: FYI Netflix is down


From: "Dan Golding" <dgolding () ragingwire com>
Date: Mon, 2 Jul 2012 08:01:59 -0700

-----Original Message-----
From: Todd Underwood [mailto:toddunder () gmail com]

scott,


This was not a cascading failure.  It was a simple power outage

Actually, it was a very complex power outage. I'm going to assume that what happened this weekend was similar to the 
event that happened at the same facility approximately two weeks ago (it's immaterial - the details are probably 
different, but it illustrates the complexity of a data center failure):

Utility Power Failed
First Backup Generator Failed (shut down due to a faulty fan)
Second Backup Generator Failed (breaker coordination problem resulting in faulty trip of a breaker)

In this case, it was clearly a cascading failure, although only limited in scope. The failure in this case also 
clearly involved people. There was one material failure (the fan), but the system should have been resilient enough to 
deal with it. The system should also have been resilient enough to deal with the breaker coordination issue (which 
should not have occurred), but was not. Data centers are not commodities. There is a way to engineer these facilities 
to be much more resilient. Not everyone's business model supports it.

- Dan



Cascading failures involve interdependencies among components.


Not always.  Cascading failures can also occur when there is zero
dependency between components.  The simplest form of this is where one
environment fails over to another, but the target environment is not
capable of handling the additional load and then "fails" itself as a
result (in some form or other, but frequently different to the mode
of the original failure).

indeed.  and that is an interdependency among components.  in
particular, it is a capacity interdependency.
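
The capacity interdependency described above can be sketched as a toy
simulation. This is only an illustration under stated assumptions, not a
model of the actual outage: the function name simulate_failover, the site
names, and the rule that orphaned load is split evenly among the surviving
sites are all hypothetical.

```python
def simulate_failover(loads, capacities, first_failure):
    """Return the order in which sites fail after an initial failure.

    Assumption (hypothetical): load from failed sites is split evenly
    across all sites that are still up, and a site fails as soon as its
    load exceeds its capacity.
    """
    loads = dict(loads)          # copy so the caller's dict is untouched
    failed = []
    to_fail = [first_failure]
    while to_fail:
        # Remove the failing sites and collect the load they were carrying.
        orphaned = sum(loads.pop(site) for site in to_fail)
        failed.extend(to_fail)
        if not loads:
            break                # nothing left to fail over to
        # Fail over: every survivor takes an equal share of the orphaned load.
        share = orphaned / len(loads)
        for site in loads:
            loads[site] += share
        # Any survivor pushed past its capacity fails in the next round.
        to_fail = [s for s in loads if loads[s] > capacities[s]]
    return failed


# Three sites with enough headroom: the failure stays contained.
print(simulate_failover({"a": 60, "b": 60, "c": 60},
                        {"a": 100, "b": 100, "c": 100}, "a"))

# Two sites without enough headroom: the failover target fails as well.
print(simulate_failover({"a": 70, "b": 70},
                        {"a": 100, "b": 100}, "a"))
```

In the second call the two sites share no components at all; the only
thing linking them is that neither has the spare capacity to absorb the
other's load.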

Whilst the Amazon outage might have been a "simple" power outage, it's
likely that at least some of the website outages caused were a
combination of not just the direct Amazon outage, but also the flow-on
effect of their redundancy attempting (but failing) to kick in -
potentially making the problem worse than just the Amazon outage
caused.

i think you over-estimate these websites.  most of them simply have no
redundancy (and obviously have no tested, effective redundancy) and
were simply hoping that amazon didn't really go down that much.

hope is not the best strategy, as it turns out.

i suspect that randy is right though:  many of these businesses do not
promise perfect uptime and can survive these kinds of failures with
little loss to business or reputation.  twitter has branded its early
failures with a whale that not only didn't hurt it but helped endear the
service to millions.  when your service fits these criteria, why would
you bother doing the complicated systems and application engineering
necessary to actually have functional redundancy?

it simply isn't worth it.

t


  Scott

