Last weekend, one of Amazon's cloud datacenters shut down for several hours due to a massive thunderstorm that swept through northern Virginia and the Washington, DC area. As a result, a number of popular online services that rely on Amazon's cloud, including Netflix, Instagram, and Pinterest, were affected by the outage.
This week, Amazon posted a lengthy and highly detailed explanation of what happened on Friday night to its servers in northern Virginia. The servers are housed in 10 datacenters, which are grouped into a number of Availability Zones. As one might expect, it was a combination of issues that caused the outages.
Even though the company had plenty of warning about the incoming storm, and even brought in extra people on Friday night to help out, two of the datacenters still got hit with an electrical surge caused by the storm. Normally, backup generators are supposed to kick in. Amazon stated:
In one of the datacenters, the transfer completed without incident. In the other, the generators started successfully, but each generator independently failed to provide stable voltage as they were brought into service. As a result, the generators did not pick up the load and servers operated without interruption during this period on the Uninterruptable Power Supply (“UPS”) units.
Shortly afterward, power was cut off for all 10 datacenters. The backup generators at the one affected datacenter failed to kick in, and the UPS units started to lose their charge. Amazon team members worked to restore power. Finally, after a few minutes, Amazon said, "... the backup generator power was stabilized, the UPSs were restarted, and power started to be restored by 8:14pm PDT. At 8:24pm PDT, the full facility had power to all racks."
Even with power restored, it took Amazon several more hours to completely restore its datacenters. In addition, the company encountered bugs in its Elastic Load Balancing and RDS software. Amazon said it will be making changes to its servers and its software so these problems won't come up again. It added:
We apologize for the inconvenience and trouble this caused for affected customers. We know how critical our services are to our customers’ businesses. If you’ve followed the history of AWS, the customer focus we have, and the pace with which we iterate, we think you know that we will do everything we can to learn from this event and use it to drive improvement across our services. We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes.