nanog mailing list archives

Re: Amazon diagnosis

From: Ryan Malayter <malayter () gmail com>
Date: Fri, 6 May 2011 07:35:46 -0700 (PDT)



On May 5, 3:51 pm, Jay Ashworth <j... () baylink com> wrote:
> ----- Original Message -----
> From: "Ryan Malayter" <malay... () gmail com>
>> I like to bag on my developers for not knowing anything about the
>> infrastructure, but sometimes you just can't do it right because of
>> physics. Or you can't do it right without writing your own OS,
>> networking stacks, file systems, etc., which means it is essentially
>> "impossible" in the real world.
>
> "Physics"?
>
> Isn't that an entirely inadequate substitute for "desire"?

Not really. For some applications, it is physics:
   1) You need two or more locations separated by, say, 500 km for
disaster protection (think Katrina, or the Japan tsunami).
   2) Those two locations need to be 100% consistent, with in-order
"serializable" ACID semantics for a particular database entity. An
example would be some sort of financial account - the order of
transactions against that account must be such that an account cannot
go below a certain value, and debits and credits across different
accounts must always happen together or not at all.
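
To make requirement 2 concrete, here is a minimal single-node sketch in
Python. It is only an illustration: the names Ledger, transfer, and floor
are hypothetical, and a single lock stands in for whatever serialization
a real database would provide. The point is the invariant itself: a
transfer either debits and credits together, or changes nothing.

```python
import threading


class InsufficientFunds(Exception):
    """Raised when a transfer would push an account below its floor."""


class Ledger:
    """Toy in-memory ledger; one lock serializes every transfer."""

    def __init__(self, balances, floor=0):
        self._balances = dict(balances)
        self._floor = floor  # hypothetical minimum allowed balance
        self._lock = threading.Lock()

    def balance(self, account):
        with self._lock:
            return self._balances[account]

    def transfer(self, src, dst, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        with self._lock:
            # Validate everything before mutating anything, so a failed
            # transfer leaves both accounts untouched.
            if dst not in self._balances:
                raise KeyError(dst)
            if self._balances[src] - amount < self._floor:
                raise InsufficientFunds(src)
            self._balances[src] -= amount
            self._balances[dst] += amount
```

Keeping this invariant is easy on one machine. The hard part is keeping
it when the copies of the ledger sit 500 km apart.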

The above implies a two-phase commit protocol. This, in turn, implies
*at least* two network round-trips. Given a perfect dedicated fiber
network and no switch/router/CPU/disk latency, this means at least
10.8 ms per transaction, or at most 92 transactions per second per
affected database entity. The reality of real networks, disks,
databases, and servers makes this perfect scenario unachievable -
often by an order of magnitude.
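
The arithmetic behind that bound can be sketched as below. The function
name and parameters are hypothetical, and the propagation speed is an
assumption: 200,000 km/s is a common rule of thumb for light in fiber,
while an effective speed of roughly 185,000 km/s approximately
reproduces the 10.8 ms and 92 transactions per second quoted above.

```python
def two_phase_commit_bound(distance_km, fiber_speed_km_s, round_trips=2):
    """Return (latency_ms, max_tps) for an idealized commit.

    Counts propagation delay only: no switch, router, CPU, or disk
    latency. Each round trip crosses the distance twice.
    """
    one_way_s = distance_km / fiber_speed_km_s
    latency_s = 2 * round_trips * one_way_s
    return latency_s * 1000.0, 1.0 / latency_s


if __name__ == "__main__":
    # Assumed effective speeds in km/s; neither comes from a measurement.
    for speed in (200_000, 185_000):
        ms, tps = two_phase_commit_bound(500, speed)
        print(f"{speed} km/s: {ms:.1f} ms per commit, {tps:.1f} tps max")
```

Because transactions against the same entity must serialize, the bound
applies per entity. Adding hardware at either site does not raise it.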

I don't have inside knowledge, but I suspect this is why Wall Street
firms have DR sites across the river in New Jersey, rather than
somewhere "safer".

Amazon's EBS service is network-based block storage, with semantics
similar to the financial account scenario: data writes to the volume
must happen in-order at all replicas. Which is why EBS volumes cannot
have a replica a great distance away from the primary. So any
application that used the EBS abstraction for keeping consistent
state was screwed during this Amazon outage. The fact that Amazon's
availability zones were not, in fact, very isolated from each other
for this particular failure scenario compounded the problem.
