nanog mailing list archives

RE: FYI Netflix is down


From: "Dan Golding" <dgolding () ragingwire com>
Date: Mon, 2 Jul 2012 12:25:54 -0700



-----Original Message-----
From: Leo Bicknell [mailto:bicknell () ufp org]



I want to emphasize _and test_.
[snip]

I used to work with a guy who had a simple test for these things, and
if I was a VP at Amazon, Netflix, or any other large company I would do
the same.  About once a month he would walk out on the floor of the
data center and break something.  Pull out an ethernet.  Unplug a
server.  Flip a breaker.


*DING DING* - we have a winner! In a previous life, I used to spend a
lot of time in other people's data centers. The key question to ask was
how often they pulled the plug - i.e. disconnected utility power without
having backup generators running. Simulating an actual failure. That
goes for pulling out an Ethernet cord or unplugging a server, or
flipping a breaker. It's all the same. The problem is that if you don't
do this for a while, you get SCARED of doing it, and you stop doing it.
The longer you go without, the scarier it gets, to the point where you
will never do it, because you have no idea what will happen, other than
that you'll probably get fired. This is called "horrible engineering
management", and is very common.

The other problem, of course, is that people design under the assumption
that everything will always work, and that failure modes, when they
occur, are predictable and fall into a narrow set. Multiple failure
modes? Not tested. Failure modes including operator error? Never tested.


When was the last time you had a drill?

- Dan


Then he would wait, to see how long before a technician came to fix it.

If these activities were service impacting to customers the engineering
or implementation was faulty, and remediation was performed.  Assuming
they acted as designed and the customers saw no faults the team was
graded on how quickly they detected and corrected the outage.
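
A rough sketch of that drill, for anyone who wants to keep score in a
script rather than on a clipboard. Everything named here (FAULTS,
DrillResult, pick_fault, grade) is made up for illustration, and the
fault list is just the three examples above. It assumes someone records
by hand whether customers saw a fault and how many minutes detection and
correction took. Treating an outage that was never detected as a
failure is my own assumption, not part of the original test.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Hypothetical fault list: only the three examples from the message.
FAULTS = ["pull an Ethernet cable", "unplug a server", "flip a breaker"]


@dataclass
class DrillResult:
    fault: str
    customer_impact: bool                # did customers see a fault?
    minutes_to_detect: Optional[float]   # None if nobody noticed
    minutes_to_correct: Optional[float]  # None if it was never fixed


def pick_fault(rng: random.Random) -> str:
    """Choose this month's fault at random, so nobody can plan around it."""
    return rng.choice(FAULTS)


def grade(result: DrillResult) -> str:
    """Apply the two-step rule: customer impact first, then response time."""
    if result.customer_impact:
        # Customers were affected: the design or implementation is faulty.
        return "REMEDIATE: engineering or implementation was faulty"
    if result.minutes_to_detect is None or result.minutes_to_correct is None:
        # Assumption: an outage that is never noticed or fixed is a failure.
        return "FAIL: outage was not detected or not corrected"
    # Redundancy worked as designed: grade the team on response time.
    return (
        f"OK: detected in {result.minutes_to_detect:g} min, "
        f"corrected in {result.minutes_to_correct:g} min"
    )


if __name__ == "__main__":
    fault = pick_fault(random.Random())
    print("This month's drill:", fault)
    print(grade(DrillResult(fault, False, 4, 30)))
```

The point of the random choice is the same as in the story: the team
should not know in advance which fault is coming.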

I've seen too many companies whose "test" is planned months in advance,
and who exclude the parts they think aren't up to scratch from the test.
Then an event occurs, and they fail, and take down customers.

TL;DR If you're not confident your operation could withstand someone
walking into your data center and randomly doing something, you are NOT
redundant.

--
       Leo Bicknell - bicknell () ufp org - CCIE 3440
        PGP keys at http://www.ufp.org/~bicknell/

