In July, Microsoft's Windows Azure cloud-based service experienced an extended period of downtime in Western Europe. The company apologized for the issue, which caused servers in Dublin, Ireland, and Amsterdam to shut down for just over 2 1/2 hours.
This week, Microsoft posted a more detailed explanation of what caused the Azure outage. In a post on the official Azure blog, the company said the problem was related to a "safety valve mechanism" that, under normal operation, is supposed to limit the number of connections the Azure server hardware will accept if any "potential cascading networking failures" occur.
"Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity. Because of a rapid increase in usage in this cluster, the threshold was exceeded, resulting in a sizeable amount of network management messages."
The end result was that bugs in the server cluster surfaced, driving the hardware to 100 percent CPU utilization. Microsoft said it solved the problem by increasing the threshold limits in that cluster and across all of its Windows Azure datacenters. The company also said it has taken steps to improve the monitoring of its network so these kinds of situations don't happen again.
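The failure mode Microsoft describes can be sketched in a few lines. The following is a hypothetical illustration only (the class, names, and numbers are invented, not Microsoft's actual code): a cluster enforces a fixed connection threshold as its "safety valve," and when capacity is added without raising that threshold, otherwise-legitimate traffic trips the valve.

```python
# Hypothetical sketch of the "safety valve" failure mode described above.
# A per-cluster connection threshold guards against cascading failures, but
# if capacity grows without the threshold being raised to match, legitimate
# demand exceeds the limit and connections start getting rejected.

class Cluster:
    def __init__(self, capacity, safety_threshold):
        self.capacity = capacity                  # connections the hardware can actually serve
        self.safety_threshold = safety_threshold  # "safety valve": max connections accepted
        self.rejected = 0                         # connections turned away by the valve

    def add_capacity(self, extra, adjust_threshold=False):
        self.capacity += extra
        if adjust_threshold:
            # This is the adjustment step that, per the blog post, was missed
            # during validation when new West Europe capacity was added.
            self.safety_threshold += extra

    def accept(self, connections):
        if connections > self.safety_threshold:
            # Valve trips: excess connections are rejected, generating
            # network management messages in the process.
            self.rejected += connections - self.safety_threshold
            return self.safety_threshold
        return connections

cluster = Cluster(capacity=1000, safety_threshold=1000)
cluster.add_capacity(500)          # capacity grows, threshold is forgotten
accepted = cluster.accept(1200)    # demand is within the new capacity...
print(accepted, cluster.rejected)  # ...but the valve still trips: 1000 accepted, 200 rejected
```

Had `add_capacity` been called with `adjust_threshold=True`, the same 1,200 connections would have fit comfortably under both the capacity and the threshold.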
Source: Windows Azure blog