Microsoft reveals reasons for late December Azure outage

In late December, between Christmas and New Year's, a Windows Azure server cluster in the South Central US region went down for a few days. One side effect of the outage was that some Xbox 360 owners were unable to access their cloud save files.

This week, Microsoft finally revealed the reasons behind the Windows Azure outage on its official blog. In short, the disruption was caused by three issues. First, "human error" left node protection turned off on some storage nodes. Second, the monitoring system for detecting configuration errors "had a defect which resulted in failure of alarm and escalation."
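
To make that failure chain concrete, here is a minimal Python sketch of the kind of configuration audit and alarm escalation the post describes. The names (StorageNode, audit_node_protection, page_on_call) are hypothetical, not Azure's actual internals; the point is that detection can work in principle while a defect in the alarm/escalation path leaves nobody notified.

```python
# Hypothetical sketch of a configuration audit with alarm escalation.
# StorageNode, audit_node_protection and page_on_call are illustrative names,
# not Azure's actual internals.
from dataclasses import dataclass

@dataclass
class StorageNode:
    name: str
    protection_enabled: bool  # the flag that "human error" left turned off

def audit_node_protection(nodes, escalate):
    """Detect nodes whose protection is off and escalate an alarm for each."""
    unprotected = [n for n in nodes if not n.protection_enabled]
    for node in unprotected:
        # Per the incident report, this alarm/escalation step is what failed,
        # so the misconfiguration went unnoticed until data was lost.
        escalate(f"node protection disabled on {node.name}")
    return unprotected

def page_on_call(message):
    print(f"ALERT: {message}")  # stand-in for a real paging/escalation system

nodes = [StorageNode("node-01", True), StorageNode("node-02", False)]
audit_node_protection(nodes, page_on_call)  # should page about node-02
```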

The final nail in the coffin came when a "normal" transition to a new primary node "incorrectly triggered a ‘prepare’ action against the unprotected storage nodes." Microsoft said:

Within a storage stamp we keep 3 copies of data spread across 3 separate fault domains (on separate power supplies, networking, and racks). Normally, this would allow us to survive the simultaneous failure of 2 nodes with your data within a stamp. However, the reformatted nodes were spread across all fault domains, which, in some cases, led to all 3 copies of data becoming unavailable.
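
Microsoft's description of stamps and fault domains can be illustrated with a small toy model. This is not Azure code; all names are hypothetical. It shows why three replicas in three fault domains survive the loss of any two domains, yet can all vanish at once when the nodes that get reformatted happen to span every fault domain.

```python
# Toy model of fault-domain-aware replica placement, based on the description
# in Microsoft's post; all names here are hypothetical.
FAULT_DOMAINS = ["fd-0", "fd-1", "fd-2"]  # separate power, networking, racks

def place_replicas(blob_id, nodes_by_domain):
    """Keep 3 copies of each blob, one in each of the 3 fault domains."""
    return {fd: nodes_by_domain[fd][hash((blob_id, fd)) % len(nodes_by_domain[fd])]
            for fd in FAULT_DOMAINS}

def is_available(placement, failed_nodes):
    """The data stays readable as long as at least one replica survives."""
    return any(node not in failed_nodes for node in placement.values())

nodes_by_domain = {fd: [f"{fd}-node-{i}" for i in range(4)] for fd in FAULT_DOMAINS}
placement = place_replicas("blob-42", nodes_by_domain)

# Losing two entire fault domains still leaves one live copy ...
two_domains_down = set(nodes_by_domain["fd-0"]) | set(nodes_by_domain["fd-1"])
print(is_available(placement, two_domains_down))          # True

# ... but the incorrectly "prepared" (reformatted) nodes were spread across
# all fault domains, so for some data every single copy became unavailable.
print(is_available(placement, set(placement.values())))   # False
```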

The incident caused a total of 62 hours of downtime for that one server cluster. Microsoft has announced it has taken steps to prevent similar incidents and added that it will issue a "100% service credit" to the customers that were affected by the downtime.

Source: Windows Azure blog | Image via Microsoft

10 Comments

Three mistakes at the same time? Sheesh, that sounds really bad.

Problems happen (it's part of the business), but most of the time it's one problem at a time. Two problems in a chain is rare, but it can happen. Three problems at once is bad. And really it's worse than that: it wasn't just three problems but four, because they also took more than a day to fix it.

Most IT businesses follow a specific project management specification or standard (PMO, ITIL, PRINCE2 and such). Those certifications and procedures are not just extra, annoying paperwork; they are tools that help at different stages of a process, especially (as in this case) when a change is requested (an RFC in ITIL). An RFC is simple: what will be done, who will do it, when it will be done, is there an impact, and, if the procedure fails, HOW CAN WE REVERT ALL THE CHANGES?
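
The fields the commenter lists map onto something like the record below. This is a generic illustration in Python, not ITIL tooling or any particular change-management product; the field names are made up for the example.

```python
# Generic illustration of the RFC fields the commenter lists; the field names
# are made up for this example, not taken from ITIL or any specific tool.
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    what: str           # what will be done
    who: str            # who will do it
    when: str           # when it will be done
    impact: str         # what impact (if any) is expected
    rollback_plan: str  # how to revert everything if the change fails

    def is_approvable(self) -> bool:
        # A change with no documented rollback plan should be sent back.
        return bool(self.rollback_plan.strip())

rfc = ChangeRequest(
    what="Enable node protection on newly provisioned storage nodes",
    who="Storage operations team",
    when="Next maintenance window",
    impact="No customer-visible impact expected",
    rollback_plan="",  # missing, so this request should not be approved
)
print(rfc.is_approvable())  # False
```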

It's amazing how many people hold countless certifications yet are clueless in a real-world scenario.

It's good that they give a clear explanation of what happened, but I'm sorry guys, my take on this is less positive: you outsource your computing to companies like Microsoft because you believe they know how to do it right. Downtime of any kind is never a good thing, and of course we all make mistakes, but for something so visible and so critical, I don't believe a company like Microsoft should be experiencing 62 hours of downtime. That's sitting between two and three nines (99% and 99.9%) of availability, which is pretty bad (see the quick calculation after this comment).

I know everyone around here likes to defend Microsoft and leap on those who dare to criticize them, but look at the bigger picture: take it out of the context of which companies you like and which you dislike. Think of a business that puts their stuff into Azure because they feel it's the sensible choice and as it's a cloud offering it should be super resilient. It's not good.

Just my 2 cents.
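
For what it's worth, here is a quick check of that nines arithmetic, assuming the 62 hours of downtime is counted against a full calendar year for the affected cluster:

```python
# Back-of-the-envelope check of the "between 2 and 3 nines" claim above,
# assuming the 62 hours of downtime is counted against a full calendar year.
downtime_hours = 62
hours_per_year = 365 * 24  # 8760

availability = 1 - downtime_hours / hours_per_year
print(f"{availability:.4%}")   # ~99.29%
print(availability > 0.99)     # True  -> better than two nines
print(availability > 0.999)    # False -> worse than three nines
```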

> Think of a business that puts their stuff into Azure because they feel it's the sensible choice
> and as it's a cloud offering it should be super resilient. It's not good.

Yeah, it's much cheaper instead for every single last company to hire its own hardware / networking experts and purchase / set up / maintain redundant hardware in multiple locations worldwide, etc. That automatically means better uptime for cheaper than what MS can offer, right? Is that what you're suggesting?

Er, no, I'm not saying that at all; I'm suggesting the exact opposite. It's incredibly expensive to hire your own team of engineers, buy loads of servers in multiple locations and put in enterprise-class storage. The point of taking out a cloud hosting solution with a 3rd party like Microsoft is that you pay for their economy of scale, and by choosing someone like Microsoft you hope you're getting the very best.

What was implied is that even the very best will fail - and the cost (to cloud users) of such a failure is lower than doing it all yourself.

**** happens. To all of them. Without exception.

Got to love this about Microsoft, direct and to the point. No excuses and silly explanations. Human error and a flawed design, we have patched the design and hope that humans won't stuff up again. Apologies all round and your account is credited.

How can ya not be happy with that!

Auzeras said,
Got to love this about Microsoft, direct and to the point. No excuses and silly explanations. Human error and a flawed design,

It's all well and good to blame human error, but time and again I've seen this used as an excuse, and NOT A THING is done to prevent a recurrence.

Well, Microsoft, have you fixed the problem? Or is it really down to cost cutting, and nothing will be done to ensure it doesn't happen again?

Auzeras said,
Got to love this about Microsoft, direct and to the point. No excuses and silly explanations. Human error and a flawed design, we have patched the design and hope that humans won't stuff up again. Apologies all round and your account is credited.

How can ya not be happy with that!


Exactly. I hate it when a company owns up to its screw-ups and some people just say "oh they really screwed up bad, company xyz sucks!!11!!", while their favorite companies would make any excuse to blame someone else for their mistakes.