nanog mailing list archives
Re: Amazon diagnosis
From: Ryan Malayter <malayter () gmail com>
Date: Fri, 6 May 2011 07:35:46 -0700 (PDT)
On May 5, 3:51 pm, Jay Ashworth <j... () baylink com> wrote:
> ----- Original Message -----
> From: "Ryan Malayter" <malay... () gmail com>
>
> > I like to bag on my developers for not knowing anything about the
> > infrastructure, but sometimes you just can't do it right because of
> > physics. Or you can't do it right without writing your own OS,
> > networking stacks, file systems, etc., which means it is essentially
> > "impossible" in the real world.
>
> "Physics"? Isn't that an entirely inadequate substitute for "desire"?
Not really. For some applications, it is physics:

1) You need two or more locations separated by, say, 500 km for disaster protection (think Katrina, or the Japan tsunami).
2) Those two locations need to be 100% consistent, with in-order "serializable" ACID semantics for a particular database entity. An example would be some sort of financial account - the order of transactions against that account must be such that the account cannot go below a certain value, and debits and credits to different accounts must always happen together or not at all.

The above implies a two-phase commit protocol. This, in turn, implies *at least* two network round-trips. Given a perfect dedicated fiber network and no switch/router/CPU/disk latency, this means at least 10.8 ms per transaction, or at most 92 transactions per second per affected database entity. Real networks, disks, databases, and servers make this perfect scenario unachievable - often by an order of magnitude.

I don't have inside knowledge, but I suspect this is why Wall Street firms have DR sites across the river in New Jersey, rather than somewhere "safer".

Amazon's EBS service is network-based block storage, with semantics similar to the financial account scenario: data writes to the volume must happen in-order at all replicas. That is why EBS volumes cannot have a replica a great distance away from the primary. So any application that used the EBS abstraction for keeping consistent state was screwed during this Amazon outage. The fact that Amazon's availability zones were not, in fact, very isolated from each other for this particular failure scenario compounded the problem.
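As a rough check on the round-trip arithmetic earlier in this message (500 km separation, two round trips), here is a minimal Python sketch. The function names are made up for illustration, and the velocity factor of 0.62 is an assumption chosen because it roughly reproduces the 10.8 ms figure; the message does not say which propagation speed was used, and typical fiber is often quoted closer to two-thirds of c.

```python
# Back-of-the-envelope bound on synchronous commit rate between two sites.
# Only propagation delay is counted; switches, disks, and CPUs are ignored.

C_VACUUM_KM_PER_S = 299_792.458  # speed of light in vacuum


def min_commit_latency_s(distance_km, round_trips=2, velocity_factor=0.62):
    """Lower bound on commit latency in seconds.

    velocity_factor is the assumed signal speed in fiber as a fraction of c
    (an assumption for this sketch, not a figure from the message).
    """
    one_way_s = distance_km / (C_VACUUM_KM_PER_S * velocity_factor)
    # Each round trip is two one-way traversals of the link.
    return 2 * round_trips * one_way_s


def max_serial_tps(distance_km, round_trips=2, velocity_factor=0.62):
    """Upper bound on strictly ordered transactions per second per entity."""
    return 1.0 / min_commit_latency_s(distance_km, round_trips, velocity_factor)


if __name__ == "__main__":
    latency_ms = min_commit_latency_s(500) * 1000
    print(f"latency floor: {latency_ms:.1f} ms")
    print(f"serial commit ceiling: {max_serial_tps(500):.0f} per second")
```

Shrinking distance_km to a few tens of km (the "across the river" case) raises the ceiling by more than an order of magnitude, which is the trade-off described above.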