On April 1, Microsoft experienced a massive outage across most of its services, rooted in DNS issues with Azure. Now, the company has released a more detailed status update explaining what caused Azure to fail to respond to queries (via ZDNet).
According to the update on the Azure status history page, there was an unusual surge in Azure DNS queries from all over the world, the kind of spike Azure's systems are designed to absorb through "layers of caches and traffic shaping". However, a specific sequence of events exposed a defect in the code for Azure's DNS service that made it less efficient.
Things got worse from there: as clients hit errors, their subsequent DNS retries piled even more traffic onto the service. Microsoft has systems in place that would normally drop illegitimate DNS queries causing volumetric spikes like this, but because many of the queries were retries, they were treated as legitimate. As a result, the DNS service eventually became unavailable.
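The pile-up described here is the classic retry-storm pattern: when every failing client retries immediately, each round of failures multiplies the load on an already struggling service. A minimal sketch of the arithmetic (hypothetical numbers, not Azure's actual client behavior), alongside the standard countermeasure of exponential backoff with jitter, which spreads retries out in time instead of synchronizing them:

```python
import random

def retry_storm_load(clients: int, retries_each: int) -> int:
    """Total queries hitting the service when every failed client
    retries: the original query plus every retry per client."""
    return clients * (1 + retries_each)

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: the wait before retry
    number `attempt` is drawn uniformly from [0, base * 2^attempt],
    capped so delays do not grow without bound."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# 1,000 clients each retrying 3 times without backoff quadruple the load.
print(retry_storm_load(1000, 3))  # 4000 queries instead of 1000
```

With jittered backoff, retries from different clients land at different times, so a transient failure does not turn into the kind of self-reinforcing traffic spike the outage report describes.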
Microsoft says the issues started at 9:21PM UTC, and by 10:00PM the Azure services themselves had been fixed, with additional capacity also being prepared in case further mitigation was needed. Even so, this exceeded Microsoft's goal for resolving problems of this kind. Many Microsoft services depend on Azure, and recovery times varied between them, but by 10:30PM most services were back online.
Following the outage, Microsoft says it has updated the logic in its mitigation systems to prevent excessive retries, and it will continue working to improve the detection and mitigation of volumetric traffic spikes. Naturally, the underlying DNS code defect has also been fixed, so hopefully, incidents like this won't happen again anytime soon.
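Microsoft has not published the details of its updated mitigation logic, but server-side throttling of excessive retries is commonly built on a per-client token bucket: each client gets a sustained query budget, and once it is exhausted, further queries (retries included) are dropped before they reach the backend. A hypothetical sketch of that general technique, not Azure's actual system:

```python
import time

class TokenBucket:
    """Per-client rate limiter: allows short bursts up to `burst`
    queries, then admits queries only at `rate` per second.
    Anything beyond the budget is rejected instead of being
    forwarded to the DNS backend."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate                 # tokens refilled per second
        self.burst = burst               # maximum stored tokens
        self.tokens = burst              # current allowance
        self.last = time.monotonic()     # time of last refill

    def allow(self) -> bool:
        """Spend one token if available; otherwise reject the query."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The design choice that matters for this incident: a budget-based limiter drops a flood of retries even though each individual retry looks like a legitimate query, which is exactly the case the original spike-detection logic missed.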