nanog mailing list archives

Re: OT: Question/Netflix issues?


From: Paul Graydon <paul () paulgraydon co uk>
Date: Wed, 23 Mar 2011 15:42:04 -1000

On 03/23/2011 09:41 AM, sillywizard () rs4668 com wrote:
"Lyndon Nerenberg (VE6BBM/VE7TFX)"<lyndon () orthanc ca>  wrote:

Guess that move to Amazon EC2 wasn't such a good idea. First reddit,
now netflix.
http://techblog.netflix.com/2010/12/four-reasons-we-choose-amazons-cloud-as.html
FWIW, at $DAYJOB we haven't been able to run out a pool of a couple of
dozen EC2 instances for more than two weeks (since last June) without
at least one of them going down.  The same number of hardware servers
we ran ourselves in Peer1 ran for a couple of years with no unplanned
outages.

Amortized over five years, Peer1 colo + hardware is also cheaper than
the equivalent EC2 cost.

Hey everyone! Join the cloud, and stand in the pissing rain.

--lyndon

Interesting, because we run 120 with almost no issues whatsoever (3 failures over the past 12 months, none of which caused downtime). 
I've never had an EBS volume fail in the 18 months we've used them. IMHO, the "issues" with the cloud are almost always 
at a layer above the infrastructure.

--L

Reddit has routinely had EBS volumes either fail outright (2 major outages in the last month to month and a half, both caused by several EBS volumes vanishing) or show noticeable performance degradation, and barely a month goes by when I don't hear someone on Twitter describing similar problems with their own infrastructure. Most of the problems I've heard about do seem to revolve around EBS, however, rather than Amazon's other services. It may just be human nature to pick on and shout about the biggest target, but I'm reasonably sure almost all the problems I hear about with cloud services involve Amazon and rarely their competitors.

http://highscalability.com/blog/2010/12/20/netflix-use-less-chatty-protocols-in-the-cloud-plus-26-fixes.html
When it comes to other layers in the infrastructure, probably one of the most talked-about problems is network latency between instances. Netflix had to specifically re-engineer their platform because of it (and other major users talk of similar changes). There is almost certainly an argument to be made that the outcome of the forced re-engineering is a good thing, as it's generally boosting resilience, but the fact that it was forced on them in this way should surely also be some cause for concern. Reddit seem to be working hard to make their platform as resilient as possible to the routine problems caused by the infrastructure. One of their outgoing devs gave a pretty interesting read on the problems they'd experienced with Amazon: http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l6ykx
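
As a rough illustration of why a chatty protocol hurts when latency between instances is high, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the FakeRemoteStore class, its get/get_many methods, and the 2 ms round-trip figure. It is not Netflix's code or anyone's real API; it only counts simulated time on the wire.

```python
# Hypothetical sketch: compare simulated network time for chatty vs batched access.
from dataclasses import dataclass, field


@dataclass
class FakeRemoteStore:
    """Stand-in for a service on another instance (not a real API)."""
    round_trip_ms: float                       # assumed per-request latency
    data: dict = field(default_factory=dict)
    elapsed_ms: float = 0.0                    # simulated time spent on the wire
    requests: int = 0

    def get(self, key):
        # One round trip per key: the "chatty" pattern.
        self.requests += 1
        self.elapsed_ms += self.round_trip_ms
        return self.data.get(key)

    def get_many(self, keys):
        # One round trip for the whole batch.
        self.requests += 1
        self.elapsed_ms += self.round_trip_ms
        return {k: self.data.get(k) for k in keys}


def fetch_chatty(store, keys):
    """Fetch each key with its own request."""
    return {k: store.get(k) for k in keys}


def fetch_batched(store, keys):
    """Fetch all keys with a single request."""
    return store.get_many(keys)


if __name__ == "__main__":
    keys = [f"title:{i}" for i in range(50)]
    data = {k: i for i, k in enumerate(keys)}
    chatty = FakeRemoteStore(round_trip_ms=2.0, data=dict(data))
    batched = FakeRemoteStore(round_trip_ms=2.0, data=dict(data))
    # Both approaches return the same answers; only the wire cost differs.
    assert fetch_chatty(chatty, keys) == fetch_batched(batched, keys)
    print("chatty: ", chatty.requests, "requests,", chatty.elapsed_ms, "ms")
    print("batched:", batched.requests, "requests,", batched.elapsed_ms, "ms")
```

The chatty version pays the round trip once per key, so its cost grows with both the number of keys and the latency, while the batched version pays it once.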

I absolutely do think cloud hosting / virtual servers have value and use, and shouldn't be underestimated or written off as a fad, but I'm also not entirely convinced at the moment that Amazon is a vendor to particularly trust with such services. I'd probably also argue that anyone keeping their eggs in one basket and relying on a single vendor for such services is taking a significant risk. There are plenty of tools and libraries out there to help provide a standard API for rolling out servers on different platforms. It seems crazy not to take advantage of the flexibility the cloud offers to remove as many SPOFs as possible.
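
To make the "standard API across vendors" idea concrete, here is a minimal Python sketch. The Provider interface, the FakeProvider class, and the deploy_spread function are all hypothetical names invented for this example; real libraries of this kind (Apache Libcloud is one) have their own, different APIs. The fake provider only records servers in memory where a real driver would call the vendor.

```python
# Hypothetical sketch: spread servers across vendors behind one interface.
from abc import ABC, abstractmethod
from itertools import cycle


class Provider(ABC):
    """Minimal provider-neutral interface (illustrative, not a real library)."""
    name: str

    @abstractmethod
    def create_server(self, hostname: str) -> str:
        """Start a server and return a provider-specific ID."""


class FakeProvider(Provider):
    """In-memory stand-in; a real driver would call the vendor's API here."""

    def __init__(self, name: str):
        self.name = name
        self.servers = []

    def create_server(self, hostname: str) -> str:
        server_id = f"{self.name}-{len(self.servers) + 1}"
        self.servers.append((server_id, hostname))
        return server_id


def deploy_spread(providers, hostnames):
    """Assign hosts round-robin so no single vendor holds every server."""
    placement = {}
    for provider, hostname in zip(cycle(providers), hostnames):
        placement[hostname] = provider.create_server(hostname)
    return placement


if __name__ == "__main__":
    vendors = [FakeProvider("vendor-a"), FakeProvider("vendor-b")]
    for host, server_id in deploy_spread(vendors, ["web1", "web2", "web3"]).items():
        print(host, "->", server_id)
```

Because the calling code only sees the Provider interface, losing one vendor means dropping it from the list and redeploying, rather than rewriting the rollout tooling.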

Paul

