Interesting People mailing list archives

Full statement by Amazon regarding AWS S3 outage and actions


From: "Dave Farber" <farber () gmail com>
Date: Thu, 2 Mar 2017 16:28:07 -0500




Begin forwarded message:

From: Lauren Weinstein <lauren () vortex com>
Date: March 2, 2017 at 2:11:02 PM EST
To: nnsquad () nnsquad org
Subject: [ NNSquad ] Full statement by Amazon regarding AWS S3 outage and actions


Full statement by Amazon regarding AWS S3 outage and actions

https://aws.amazon.com/message/41926/

   Summary of the Amazon S3 Service Disruption in the Northern
   Virginia (US-EAST-1) Region

   We'd like to give you some additional information about the
   service disruption that occurred in the Northern Virginia
   (US-EAST-1) Region on the morning of February 28th. The Amazon
   Simple Storage Service
   (S3) team was debugging an issue causing the S3 billing system
   to progress more slowly than expected. At 9:37AM PST, an
   authorized S3 team member using an established playbook
   executed a command which was intended to remove a small number
   of servers for one of the S3 subsystems that is used by the S3
   billing process.  Unfortunately, one of the inputs to the
   command was entered incorrectly and a larger set of servers
   was removed than intended. The servers that were inadvertently
   removed supported two other S3 subsystems.  One of these
   subsystems, the index subsystem, manages the metadata and
   location information of all S3 objects in the region.  This
   subsystem is necessary to serve all GET, LIST, PUT, and DELETE
   requests. The second subsystem, the placement subsystem,
   manages allocation of new storage and requires a properly
   functioning index subsystem in order to operate correctly. The
   placement subsystem is used during PUT requests to allocate
   storage for new objects. Removing a significant portion of the
   capacity caused each of these systems to require a full
   restart. While these subsystems were being restarted, S3 was
   unable to service requests. Other AWS services in the
   US-EAST-1 Region that rely on S3 for storage, including the S3
   console, Amazon Elastic Compute Cloud (EC2) new instance
   launches, Amazon Elastic Block Store (EBS) volumes (when data
   was needed from an S3 snapshot), and AWS Lambda were also
   impacted while the S3 APIs were unavailable.
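
   To make the failure mode concrete, here is a minimal, hypothetical
   Python sketch of a removal tool that trusts an operator-supplied
   count as-is. The statement does not describe Amazon's internal
   tooling; every name below is invented for illustration.

      # Hypothetical sketch: a capacity-removal routine with no
      # sanity check on its count input.

      def remove_capacity(fleet, count):
          """Return (removed, remaining) after taking count servers."""
          return fleet[:count], fleet[count:]

      fleet = ["server-%03d" % i for i in range(100)]

      # The operator intends 5 but enters 50: ten times the intended
      # capacity is removed, and nothing catches the mistake.
      intended, typed = 5, 50
      removed, remaining = remove_capacity(fleet, typed)
      print("removed %d servers (intended %d); %d remain"
            % (len(removed), intended, len(remaining)))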

   S3 subsystems are designed to support the removal or failure
   of significant capacity with little or no customer impact. We
   build our systems with the assumption that things will
   occasionally fail, and we rely on the ability to remove and
   replace capacity as one of our core operational processes.
   While this is an operation that we have relied on to maintain
   our systems since the launch of S3, we have not completely
   restarted the index subsystem or the placement subsystem in
   our larger regions for many years. S3 has experienced massive
   growth over the last several years and the process of
   restarting these services and running the necessary safety
   checks to validate the integrity of the metadata took longer
   than expected. The index subsystem was the first of the two
   affected subsystems that needed to be restarted. By 12:26PM
   PST, the index subsystem had activated enough capacity to
   begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM
   PST, the index subsystem was fully recovered and GET, LIST,
   and DELETE APIs were functioning normally.  The S3 PUT API
   also required the placement subsystem. The placement subsystem
   began recovery when the index subsystem was functional and
   finished recovery at 1:54PM PST. At this point, S3 was
   operating normally. Other AWS services that were impacted by
   this event began recovering. Some of these services had
   accumulated a backlog of work during the S3 disruption and
   required additional time to fully recover.
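
   The slow restart follows naturally if the safety checks validate
   metadata record by record: the work grows linearly with the number
   of objects, so a subsystem that last restarted when the region was
   far smaller will take far longer today. The sketch below assumes a
   per-record validation pass; the statement does not say what S3's
   checks actually do.

      # Illustrative only: restart-time validation that touches every
      # metadata record takes time proportional to the record count.

      def validate_integrity(record):
          # Placeholder check; a real system might verify checksums,
          # replication state, and index consistency.
          return record.get("checksum") == "ok"

      def restart_index_subsystem(metadata_records):
          # Count records that pass validation before serving traffic.
          return sum(1 for r in metadata_records
                     if validate_integrity(r))

      records = [{"key": "obj-%d" % i, "checksum": "ok"}
                 for i in range(100000)]
      print("validated %d records before accepting requests"
            % restart_index_subsystem(records))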

   We are making several changes as a result of this operational
   event. While removal of capacity is a key operational
   practice, in this instance, the tool used allowed too much
   capacity to be removed too quickly. We have modified this tool
   to remove capacity more slowly and added safeguards to prevent
   capacity from being removed when it will take any subsystem
   below its minimum required capacity level. This will prevent
   an incorrect input from triggering a similar event in the
   future. We are also auditing our other operational tools to
   ensure we have similar safety checks. We will also make
   changes to improve the recovery time of key S3 subsystems. We
   employ multiple techniques to allow our services to recover
   from any failure quickly. One of the most important involves
   breaking services into small partitions which we call cells.
   By factoring services into cells, engineering teams can assess
   and thoroughly test recovery processes of even the largest
   service or subsystem. As S3 has scaled, the team has done
   considerable work to refactor parts of the service into
   smaller cells to reduce blast radius and improve recovery.
   During this event, the recovery time of the index subsystem
   still took longer than we expected. The S3 team had planned
   further partitioning of the index subsystem later this year.
   We are reprioritizing that work to begin immediately.
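
   The two safeguards described above, slower removal and a minimum
   capacity floor, are straightforward to picture in code. This is a
   minimal sketch with invented names and thresholds, not Amazon's
   actual tool:

      import time

      MIN_REQUIRED = 80    # hypothetical minimum capacity level
      BATCH_SIZE = 2       # remove at most this many servers per step
      BATCH_DELAY_S = 0.1  # pause between batches: removal is gradual

      def remove_capacity_safely(fleet, count):
          # Safeguard 1: refuse any removal that would drop the fleet
          # below its minimum required capacity level.
          if len(fleet) - count < MIN_REQUIRED:
              raise ValueError(
                  "refusing removal: %d - %d is below the minimum of %d"
                  % (len(fleet), count, MIN_REQUIRED))
          # Safeguard 2: remove capacity slowly, in small batches.
          removed = []
          while count > 0:
              step = min(BATCH_SIZE, count)
              removed, fleet = removed + fleet[:step], fleet[step:]
              count -= step
              time.sleep(BATCH_DELAY_S)
          return removed, fleet

      fleet = ["server-%03d" % i for i in range(100)]
      try:
          remove_capacity_safely(fleet, 50)  # mistyped input: blocked
      except ValueError as err:
          print(err)
      removed, fleet = remove_capacity_safely(fleet, 5)  # intended input
      print("removed %d servers; %d remain"
            % (len(removed), len(fleet)))

   With the floor check in place, the same mistyped input that caused
   the outage fails loudly instead of executing.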

   From the beginning of this event until 11:37AM PST, we were
   unable to update the individual services' status on the AWS
   Service Health Dashboard (SHD) because of a dependency the SHD
   administration console has on Amazon S3. Instead, we used the
   AWS Twitter feed (@AWSCloud) and SHD banner text to
   communicate status until we were able to update the individual
   services' status on the SHD.  We understand that the SHD
   provides important visibility to our customers during
   operational events and we have changed the SHD administration
   console to run across multiple AWS regions.
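
   Running the SHD administration console across multiple regions
   amounts to removing a single-region dependency from the publishing
   path. The sketch below assumes an injected, region-scoped publish
   callable; the statement gives no implementation detail, so the
   interface here is invented.

      REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

      def post_status(update, publish):
          """Try each region in turn until one accepts the update."""
          errors = {}
          for region in REGIONS:
              try:
                  publish(region, update)
                  return region
              except Exception as exc:  # a real system would narrow this
                  errors[region] = exc
          raise RuntimeError("all regions failed: %r" % errors)

      # Demo: us-east-1 is unavailable (as on February 28th), so the
      # update lands in the next healthy region instead of failing.
      def fake_publish(region, update):
          if region == "us-east-1":
              raise ConnectionError("S3 dependency unavailable")
          print("published %r via %s" % (update, region))

      print("served by:",
            post_status("Elevated S3 error rates", fake_publish))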

   Finally, we want to apologize for the impact this event caused
   for our customers. While we are proud of our long track record
   of availability with Amazon S3, we know how critical this
   service is to our customers, their applications and end users,
   and their businesses. We will do everything we can to learn
   from this event and use it to improve our availability even
   further.




