In late February, Microsoft's Windows Azure cloud service for businesses suffered an outage. Customers around the world were unable to access Windows Azure for a number of hours before service was restored.
Now Microsoft has updated its Windows Azure blog with an official explanation of what happened to the service a couple of weeks ago. More importantly for its customers, Microsoft is also offering some compensation for the downtime. The blog states:
Due to the extraordinary nature of this event, we have decided to provide a 33 (percent) credit to all customers of Windows Azure Compute, Access Control, Service Bus and Caching for the entire affected billing month(s) for these services, regardless of whether their service was impacted. These credits will be applied proactively and will be reflected on a billing period subsequent to the affected billing period.
As for the downtime itself, Microsoft offers a highly detailed explanation of what happened. In summary, the virtual machines that run applications in Windows Azure use a guest agent (GA). The guest agent sometimes sends what is called a "transfer certificate" to the host agent running on the host OS server.
When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UTC of the current day as the valid-from date and one year from that date as the valid-to date. The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.
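To illustrate the kind of date arithmetic described above, here is a minimal Python sketch. It is not Microsoft's code; the function names `naive_valid_to` and `safe_valid_to` are hypothetical, and clamping to February 28 is just one possible fix, assumed here for illustration.

```python
from datetime import date


def naive_valid_to(valid_from: date) -> date:
    """Add one to the year and keep month and day, as the bug is described.

    Raises ValueError when valid_from is February 29, because the
    following year has no such date.
    """
    return valid_from.replace(year=valid_from.year + 1)


def safe_valid_to(valid_from: date) -> date:
    """Add one year, falling back to February 28 when the result is invalid.

    This clamping rule is an assumption for the example, not the fix
    Microsoft shipped.
    """
    try:
        return valid_from.replace(year=valid_from.year + 1)
    except ValueError:
        # Only February 29 can fail here, so day 28 is always valid.
        return valid_from.replace(year=valid_from.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_valid_to(leap_day)
    naive_failed = False
except ValueError:
    # February 29, 2013 does not exist, so the naive calculation fails.
    naive_failed = True

fixed = safe_valid_to(leap_day)
```

On any other day of the year both functions return the same result; they differ only when the starting date is February 29.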
This leap day bug cascaded through a number of the virtual machines in the Windows Azure network, causing the servers to go into a Human Investigate phase where they would shut down. Microsoft fixed those issues but then discovered that a number of updated servers running the older host agent were using the networking plugin that was written for the newer host agent. Those servers also had to be fixed.
Microsoft said it is evaluating its response to the server downtime and is taking steps to make sure similar incidents are not repeated.