nanog mailing list archives

Re: FYI Netflix is down


From: AP NANOG <nanog () armoredpackets com>
Date: Mon, 02 Jul 2012 11:41:00 -0400

While I was working for a wireless telecom company, our primary datacenter was knocked off the power grid due to weather. The generators kicked on and everything was fine, until one generator was struck by lightning and that same strike fried the control panel on the second one. With no control panel on the second generator, we had no means of monitoring it for temperature, fuel, input voltage (when it came back), output voltage, surge protection, or ultimately whether the generator would spike to full voltage due to a regulator failure. Needless to say, we had to shut the second generator down for safety reasons.

While in the military I saw many generators struck by lightning as well.

I'm not saying Amazon was not at fault here, but I can see how this is possible, and it happens more frequently than one might think.

I hate to play devil's advocate here, but you as the customer should always have backups to your backups, and practice these fail-overs on a regular basis. Otherwise you are at fault here, no one else...

--

Thank you,

Robert Miller
http://www.armoredpackets.com

Twitter: @arch3angel

On 7/2/12 11:01 AM, Dan Golding wrote:
-----Original Message-----
From: Todd Underwood [mailto:toddunder () gmail com]

scott,

This was not a cascading failure.  It was a simple power outage

Actually, it was a very complex power outage. I'm going to assume that what happened this weekend was similar to the
event that happened at the same facility approximately two weeks ago (it's immaterial - the details are probably different,
but it illustrates the complexity of a data center failure).

Utility Power Failed
First Backup Generator Failed (shut down due to a faulty fan)
Second Backup Generator Failed (breaker coordination problem resulting in faulty trip of a breaker)

In this case, it was clearly a cascading failure, although only limited in scope. The failure in this case also clearly
involved people. There was one material failure (the fan), but the system should have been resilient enough to deal with it.
The system should also have been resilient enough to deal with the breaker coordination issue (which should not have 
occurred), but was not. Data centers are not commodities. There is a way to engineer these facilities to be much more 
resilient. Not everyone's business model supports it.

- Dan


Cascading failures involve interdependencies among components.

Not always.  Cascading failures can also occur when there is zero
dependency between components.  The simplest form of this is where one
environment fails over to another, but the target environment is not
capable of handling the additional load and then "fails" itself as a
result (in some form or other, but frequently different to the mode
of the original failure).

indeed.  and that is an interdependency among components.  in
particular, it is a capacity interdependency.

Whilst the Amazon outage might have been a "simple" power outage, it's
likely that at least some of the website outages caused were a
combination of not just the direct Amazon outage, but also the flow-on
effect of their redundancy attempting (but failing) to kick in -
potentially making the problem worse than just the Amazon outage
caused.

i think you over-estimate these websites.  most of them simply have no
redundancy (and obviously have no tested, effective redundancy) and
were simply hoping that amazon didn't really go down that much.

hope is not the best strategy, as it turns out.

i suspect that randy is right though:  many of these businesses do not
promise perfect uptime and can survive these kinds of failures with
little loss to business or reputation.  twitter has branded its early
failures with a whale that not only didn't hurt it but helped endear the
service to millions.  when your service fits these criteria, why would
you bother doing the complicated systems and application engineering
necessary to actually have functional redundancy?

it simply isn't worth it.

t

   Scott

