It’s been a rough week for service providers. First Sony’s PlayStation Network (PSN) was broken into, and the service has been down ever since. Then part of Amazon’s EC2 cloud service went down without warning. Although everyone knew about these outages, the details surrounding them remained vague, leaving users wondering whether any specific information would ever be provided.
Amazon has now released a very detailed post-mortem explaining the issue. As with many IT outages, this one was caused by an improperly executed change. The team was attempting to expand network capacity but shifted traffic onto the wrong network path beforehand. This left part of the EBS cluster with no functioning network path at all, which ultimately led to data corruption.
“During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn’t handle the traffic level it was receiving.”
After connectivity was restored, the data corruption had to be addressed. By April 24th, the company had restored nearly 99% of its customers’ data, at which point Amazon “began forensics on the remaining volumes.” Unfortunately, it was unable to recover 0.07% of the affected volumes, meaning that data is gone forever unless those customers kept a backup outside of the cloud.
It doesn’t appear that the people who wrote the detailed explanation were the same ones drafting the email messages to customers. According to Business Insider, the email tells customers that a snapshot of their partially recovered data has been made available and asks them to delete it, if they have no use for it, to avoid incurring storage charges:
A few days ago we sent you an email letting you know that we were working on recovering an inconsistent data snapshot of one or more of your Amazon EBS volumes. We are very sorry, but ultimately our efforts to manually recover your volume were unsuccessful. The hardware failed in such a way that we could not forensically restore the data.
What we were able to recover has been made available via a snapshot, although the data is in such a state that it may have little to no utility...
If you have no need for this snapshot, please delete it to avoid incurring storage charges.
We apologize for this volume loss and any impact to your business.
Amazon Web Services, EBS Support
Amazon is offering ten days’ worth of credits to every customer using the affected US East Region, regardless of whether they were actually impacted. In addition, the company issued an apology for the outage.
“Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services.”
With more and more of our data being pushed into the cloud, incidents like these are bound to become more frequent. While a failure of a personal PC is frustrating for one individual, an outage at a cloud provider can affect millions of customers at once. Users want their data available from anywhere, but there is an inherent risk in not controlling your own information.