Amazon: human error caused Netflix's Christmas Eve outage

On Christmas Eve, as millions of US residents were either at home or traveling to get there before Christmas Day, many found that their Netflix-streaming devices could not reach the service. The cause was a problem in the Northern Virginia server cluster operated by Amazon Web Services, Netflix's hosting partner. The issue was not fully resolved until well into Christmas Day.

Late on Monday, Amazon issued a statement explaining the outage. Simply put, someone at Amazon Web Services did something they shouldn't have done. Amazon said that on Christmas Eve, part of the state data used by its East Coast Elastic Load Balancing (ELB) system was deleted by a maintenance process that an unnamed developer on the company's team ran by mistake.

Amazon's statement said:

The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time. After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers.

The statement goes into detail on how the deleted data caused about 6.8 percent of the running ELB load balancers to be affected. That was apparently enough to cut off access to Netflix's service for many smartphones and other hardware. Amazon's workers were able to set things right by mid-day on Christmas Day.

Amazon says it will take steps to make sure something like what happened on Christmas Eve does not happen again. That will include requiring per-incident change management (CM) approval before a developer on the team can access the production ELB data. Amazon also apologized for the incident, saying:

We know how critical our services are to our customers’ businesses, and we know this disruption came at an inopportune time for some of our customers. We will do everything we can to learn from this event and use it to drive further improvement in the ELB service.

Source: Amazon Web Services | Image via Amazon Web Services

7 Comments


Just 5 minutes before running this command, the developer was trying to think of a way to get back at Amazon for making him work Christmas Eve, when suddenly inspiration hit...

A warning message before running the task would be nice. Something like this:


Warning!
You are about to run a task against PRODUCTION data. This will negatively impact the state of the services. Do you wish to continue?

Now they could add a check that requires CM identification before it runs.
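A minimal sketch of how the two suggestions in this thread (a warning prompt and a CM check) might be combined. Everything here is hypothetical: the `may_run` function, the `CM-1234` style ticket format, and the prompt wording are assumptions for illustration, and nothing in Amazon's statement describes how its own tooling works.

```python
import re

# Hypothetical ticket format; Amazon's real CM system is not described in the article.
CM_TICKET_PATTERN = re.compile(r"CM-\d+")

WARNING = (
    "Warning!\n"
    "You are about to run a task against PRODUCTION data. "
    "Do you wish to continue? [yes/no] "
)


def may_run(environment, cm_ticket=None, confirm=input):
    """Return True only if the task is allowed to run in `environment`."""
    if environment != "production":
        return True  # non-production runs need no extra checks
    # Require a well-formed change-management ticket for this specific run.
    if cm_ticket is None or not CM_TICKET_PATTERN.fullmatch(cm_ticket):
        return False
    # Ask the operator to confirm explicitly; anything but "yes" aborts.
    return confirm(WARNING).strip().lower() == "yes"
```

A maintenance script could call `may_run(env, ticket)` before touching any data and exit if it returns `False`. Passing `confirm` as a parameter lets the prompt be replaced in automated tests.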

Only now will they require CM approval per incident?

Wow - basic stuff Amazon! Every 2nd and 3rd line team I have ever worked on has needed CM signoff to change anything (even remove old tables) on a live system!