nanog mailing list archives

Re: FYI Netflix is down


From: "Greg D. Moore" <mooregr () greenms com>
Date: Mon, 02 Jul 2012 15:43:29 -0400

At 03:08 PM 7/2/2012, George Herbert wrote:

If folks have not read it, I would suggest reading Normal Accidents by Charles Perrow.

The "it can't happen" is almost guaranteed to happen. ;-) And when it does, it'll often interact in ways we can't predict or sometimes even understand.

As for pulling the plug to test stuff: I recall a demo at NetApp in the early 00's. They were talking about their fault tolerance and how great it was. So I walked up to their demo array and said, "So, it shouldn't be a problem if I pulled this drive right here?" Before I could, the salesperson or tech guy (I can't remember which) told me to stop. He didn't want to risk it.

That right there said loads about their confidence in their own system.



Late reply, but:

On Sat, Jun 30, 2012 at 12:30 AM, Lynda <shrdlu () deaddrop org> wrote:
>...
> Second, and more important. I *was* a "computer science guy" in a past life,
> and this is nonsense. You can have astonishingly large software projects
> that just continue to run smoothly, day in, day out, and they don't hit the
> news, because they don't break. There are data centers that don't hit the
> news, in precisely the same way.

I really need to write the book on IT reliability I keep meaning to.

There's reliability - a backwards-looking statistic, which can be 100%
for a given service or datacenter - and then there's dependability,
the forward-predicted outage risk, which people often *assert* equals
the prior reliability record. In reality you often have a number
of latent failures (and latent cascade paths) that you do not
understand, did not identify previously, and are not aware of.
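
To put a rough number on that gap, here is a minimal sketch (not from the original post) of how little a clean record proves. It assumes each period, say a month, is an independent trial with the same unknown failure probability. That assumption is itself optimistic given latent cascade paths, so treat the result as a floor on the uncertainty. The function name and the 36-month example are hypothetical.

```python
def failure_rate_upper_bound(clean_periods: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-period failure probability after observing
    zero failures in `clean_periods` periods.

    Assumes independent, identically distributed periods (an assumption,
    not something a real datacenter guarantees). Solves
    (1 - p) ** clean_periods = 1 - confidence for p.
    """
    if clean_periods <= 0:
        raise ValueError("clean_periods must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_periods)


if __name__ == "__main__":
    # Three years of monthly records with no outage at all.
    bound = failure_rate_upper_bound(36)
    print(f"95% upper bound on monthly outage probability: {bound:.3f}")
```

Under these assumptions, three spotless years are still consistent with a monthly outage probability of roughly 8%, which is why a 100% backwards-looking record cannot simply be asserted as the forward risk.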

I've had or had to respond to over a billion dollars of cumulative IT
disaster loss over my consulting career so far; I have NEVER seen
anyone who did it perfectly, even the best pros.  And I include myself
in that list.

Looking at other fields like aerospace and nuclear engineering, what
is done in IT is not anywhere close to the same level of QA and
engineering analysis and testing.  We cannot assert better results
with less work.

"Oh, that never happens", except I've had my stuff in three locations
that had catastrophic generator failures.  "Oh, that never happens"
when you're doing power maintenance and the best-rated electrical
company in California, in conjunction with the generator vendor and a
couple of independent power EEs, mis-balance the maintenance generator
loads between legs and blow the generators and datacenter.  "Oh, that
never happens" that the datacenter burns (or starts to burn and then
gets flooded).  "Oh, that never happens" that the FM-200 goes off or
preaction breaks and water leaks.  "Oh, that never happens" that well
maintained and monitored and triple-redundant AC units all trip
offline due to a common mode failure over the course of a weekend and
the room gets up to 106 degrees.  Oh thank god the next thing didn't
go wrong in THAT situation, because the spot temperature meters
indicated that the temperature at ceiling height in that particular
room peaked at 1 degree short of the temp at which the sprinkler heads
are supposed to discharge, so we nearly lost that room to flooding
rather than just a 10% disk and 15% power supply attrition over the
next year...

Don't be so confident in the infrastructure.  It's not engineered or
built or maintained well enough to actually support that assertion.
The same can be said of the application software and application
architecture and integration.


--
-george william herbert
george.herbert () gmail com

Greg D. Moore http://greenmountainsoftware.wordpress.com/
CEO QuiCR: Quick, Crowdsourced Responses. http://www.quicr.net




