Amazon offers details on last weekend's cloud server outage

Last weekend, one of Amazon's cloud data centers shut down for several hours, due to a massive thunderstorm that made its way through the northern Virginia and Washington, DC areas. As a result, a number of popular online services that used Amazon's services, including Netflix, Instagram, and Pinterest, were affected by the outage.

This week, Amazon posted a lengthy and highly detailed explanation of what happened on Friday night to its servers in northern Virginia. The servers themselves were based in 10 datacenters, broken down into a number of Availability Zones. As one might expect, it was a combination of issues that caused the outages.

Even though the company had plenty of warning about the incoming storm, and even brought in extra people on Friday night to help out, two of the datacenters still got hit with an electrical surge caused by the storm. Normally, backup generators are supposed to kick in. Amazon stated:

In one of the datacenters, the transfer completed without incident. In the other, the generators started successfully, but each generator independently failed to provide stable voltage as they were brought into service. As a result, the generators did not pick up the load and servers operated without interruption during this period on the Uninterruptable Power Supply (“UPS”) units.

Shortly after that happened, power was cut off for all 10 datacenters. The backup generators again failed to kick in for that one datacenter and the UPS units started to lose their charge. Amazon team members worked to restore power. Finally, after a few minutes, Amazon said, "... the backup generator power was stabilized, the UPSs were restarted, and power started to be restored by 8:14pm PDT. At 8:24pm PDT, the full facility had power to all racks."

Even with power restored, it took Amazon several more hours to completely restore its datacenters. In addition, the company encountered bugs in its Elastic Load Balancing and RDS software. Amazon said it will be making changes to its servers and its software so these problems won't come up again. It added:

We apologize for the inconvenience and trouble this caused for affected customers. We know how critical our services are to our customers’ businesses. If you’ve followed the history of AWS, the customer focus we have, and the pace with which we iterate, we think you know that we will do everything we can to learn from this event and use it to drive improvement across our services. We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes.

Source: Amazon.com

13 Comments


I do think it's pretty awful that they've got backup generators that they obviously haven't ever tested or trained staff to use...
That's like having an awkward fire escape route in a large building, not even doing any fire drills or showing people how to use it, then wondering why no-one escaped when a fire broke out

Once more, this kind of thing just highlights why cloud storage is a terrible idea and anyone relying on it is monumentally dumb. There are just too many points of failure between you and your data.

FloatingFatMan said,
Once more, this kind of thing just highlights why cloud storage is a terrible idea and anyone relying on it is monumentally dumb. There are just too many points of failure between you and your data.
Sounds like a daft, sweeping statement. Do you have a better suggestion for the industry for digital distribution? I doubt you do.
I rely on cloud storage, but like most it uses local copies too, so an outage at a data centre is an inconvenience, not a disaster, unlike a local drive failure, which normally requires restoration from backups etc.

FloatingFatMan said,
There's nothing wrong with it as a backup solution, but anyone using it for primary storage needs their heads examining.

No, I still disagree with you. You need to look at the type of service you want to provide and the amount of money you want to throw at it. The technology should never be more costly than the value of the service and what it generates. If you compare, 3-5 hours of downtime in a year that has about 8760 hours is still less than 0.06% downtime. That still falls in the range of 99.95% guaranteed uptime. To get that type of uptime in your own business, you would have spent at least 5 times the amount you're paying Amazon. Some of our services are in the cloud, and I am happy. We need to provide services across the Caribbean, and there you have lousy utility and data services. So the cloud was the way to go. We're planning on moving even more services to it, as cost-cutting is the buzzword these days. As long as those in charge are honest about the expectations of IaaS infrastructure, I think it is the best way going forward.

Edited by vhaakmat, Jul 4 2012, 2:35pm :
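The downtime arithmetic in the comment above can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and it assumes uptime is measured over a plain 8760-hour year, which may not match how Amazon's SLA actually defines the measurement period or what counts as downtime.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, the figure used in the comment


def downtime_percent(hours_down, hours_in_period=HOURS_PER_YEAR):
    """Share of the period spent down, as a percentage."""
    return 100.0 * hours_down / hours_in_period


def allowed_downtime_hours(uptime_percent, hours_in_period=HOURS_PER_YEAR):
    """Hours of downtime that a given uptime percentage leaves room for."""
    return hours_in_period * (100.0 - uptime_percent) / 100.0


for hours in (3, 5):
    print(f"{hours} h down = {downtime_percent(hours):.3f}% of the year")

print(f"99.95% uptime allows {allowed_downtime_hours(99.95):.2f} h per year")
```

Under these assumptions, 3 hours is about 0.034% of the year and 5 hours is about 0.057%, so both are below 0.06% as the comment says. A 99.95% target leaves roughly 4.38 hours per year, so a 3-hour outage fits inside it while a 5-hour outage would slightly exceed it.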

Respectfully, I disagree. The company I work for sells cloud services in one of its departments (huge multinational). Within the industry it's a running joke. You have NO guarantee on the security of your data, you have no guarantee that the guy installing updates has tested them properly, or even knows what the hell he's doing, and you have no guarantee that when you need it, you'll be able to access it.

Cases in point, Amazon LOST some customer data last year due to an office monkey, and then there's the whole MegaUpload debacle. What if the feds decide that too many twits are uploading pirated content to Amazon's cloud servers and take them ALL down? Bye bye data.

There are too many failure points.

FloatingFatMan said,
Respectfully, I disagree. The company I work for sells cloud services in one of its departments (huge multinational). Within the industry it's a running joke. You have NO guarantee on the security of your data, you have no guarantee that the guy installing updates has tested them properly, or even knows what the hell he's doing, and you have no guarantee that when you need it, you'll be able to access it.

This is where SLAs come in. In the end there is a contract between two businesses, each required to follow certain rules or guidelines. On breach of contract, of course, the lawyers will get happy. There is however nothing stopping you from making sure you also have a backup of your data. The same service monkey can be in your company and screw up your data. Who do you hold responsible then? You can't get any monetary compensation from an employee, but you can get it from a business you have a contract with. Amazon had to pay for that error. Sorry was not enough. If you're scared of the feds, host it with a European company. European law forbids European companies from hosting client data on non-European hosts. So there are many ways to skin a cat... choose one.

Thing is, sure, you can wave contracts and the law around and feel safe; but at the end of the day, you still have no idea who might be accessing your data.

People can't honestly expect anyone to keep servers up through the craziest of natural disasters. Amazon seemed to handle this really well.

It's like expecting a grocery store to still sell you bananas while the entire parking lot just collapsed in an earthquake.

andrewbares said,
People can't honestly expect anyone to keep servers up through the craziest of natural disasters. Amazon seemed to handle this really well.

umm it's supposed to be cloud, so if it's not going to stay up for, and I quote:

"99.95% during the Service Year"

then businesses have some real issues...

Couldn't come at a worse time, now that a week or so ago Google took aim at Amazon with the announcement of its IaaS cloud services... It also would have been nice if the providers of the major services that failed had geographical separation options in place so that other data centres could have picked up the load...

I'm having all kinds of issues with their mp3 downloader failing and taking forever to finish downloads. They must still be dealing with it.