In late December, between Christmas and New Year's, one Windows Azure server cluster in the south central portion of the US went down for a few days. One of the side effects of the outage was that some Xbox 360 owners were unable to access their cloud save files.
This week, Microsoft finally revealed the reasons behind the Windows Azure server outage on its official blog. The disruption was caused by three issues. First, "human error" left some storage nodes without node protection turned on. Second, the monitoring system for detecting configuration errors "had a defect which resulted in failure of alarm and escalation."
The final blow came when a "normal" transition to a new primary node "incorrectly triggered a ‘prepare’ action against the unprotected storage nodes," which reformatted them. Microsoft said:
Within a storage stamp we keep 3 copies of data spread across 3 separate fault domains (on separate power supplies, networking, and racks). Normally, this would allow us to survive the simultaneous failure of 2 nodes with your data within a stamp. However, the reformatted nodes were spread across all fault domains, which, in some cases, lead to all 3 copies of data becoming unavailable.
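To make that reasoning concrete, here is a minimal Python sketch of the availability rule Microsoft describes. It is an illustration only, not Azure's actual placement logic: the `Node` class, the `is_available` function, and the stamp layout of three fault domains with two nodes each are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    fault_domain: int
    name: str


def is_available(replicas, failed):
    """Data stays readable while at least one copy sits on a healthy node."""
    return any(node not in failed for node in replicas)


# Hypothetical stamp: 3 fault domains with 2 nodes each (sizes are assumed).
nodes = {(d, i): Node(d, f"fd{d}-node{i}") for d in range(3) for i in range(2)}

# One copy of a piece of data in each of the 3 fault domains.
replicas = [nodes[(0, 0)], nodes[(1, 0)], nodes[(2, 0)]]

# Case 1: two of the nodes holding the data fail; the third copy survives.
two_failed = {nodes[(0, 0)], nodes[(1, 0)]}

# Case 2: an entire fault domain fails; the other two domains still hold copies.
domain_failed = {node for node in nodes.values() if node.fault_domain == 0}

# Case 3: affected nodes are spread across all fault domains and happen to
# include every node holding a copy, so no copy is left.
spread_failed = {nodes[(0, 0)], nodes[(1, 0)], nodes[(2, 0)]}

survives_two = is_available(replicas, two_failed)
survives_domain = is_available(replicas, domain_failed)
survives_spread = is_available(replicas, spread_failed)

print(survives_two, survives_domain, survives_spread)
```

In this sketch the first two cases leave the data readable, while the third does not, even though it involves only one node per fault domain. That matches Microsoft's point: fault domains protect against failures confined to one domain, not against a fault that reaches into all three at once.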
The incident caused a total of 62 hours of downtime for that one server cluster. Microsoft said it has taken steps to prevent similar incidents and added that it will issue a "100% service credit" to customers who were affected by the downtime.
Source: Windows Azure blog | Image via Microsoft