Microsoft explains most recent SkyDrive and Hotmail outage

On September 8, users of Microsoft's Hotmail and SkyDrive online services were unable to access them for a few hours. At the time, Microsoft didn't offer much in the way of a reason for the outage, but this week the Windows Live blog provided a more detailed explanation. According to Arthur de Haan, Microsoft's Vice President of Windows Live Test and Service Engineering, "A tool that helps balance network traffic was being updated and the update did not work correctly. As a result, configuration settings were corrupted, which caused a service disruption."

That corruption affected Microsoft's DNS service, according to de Haan. He said that two conditions occurring at the same time combined to corrupt a file in the DNS service:

The first condition is related to how the load balancing devices in the DNS service respond to a malformed input string (i.e., the software was unable to parse an incorrectly constructed line in the configuration file). The second condition was related to how the configuration is synchronized across the DNS service to ensure all client requests return the same response regardless of the connection location of the client. Each of these conditions was tracked to the networking device firmware used in the Microsoft DNS service.
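To make the two conditions concrete, here is a minimal Python sketch of one common defensive pattern: parse and validate the whole configuration first, and synchronize it to the servers only if every line is well formed. This is an illustration only, not Microsoft's firmware or tooling. The record format, the `parse_config` and `push_update` names, and the use of plain dictionaries as stand-ins for DNS servers are all assumptions made for the example.

```python
import re

# Hypothetical record format: "<hostname> <IPv4 address>", one per line.
# The pattern checks shape only; it does not verify that octets are <= 255.
RECORD = re.compile(r"^(?P<host>[a-z0-9.-]+)\s+(?P<ip>\d{1,3}(?:\.\d{1,3}){3})$")


def parse_config(text):
    """Parse every line; raise ValueError on the first malformed one."""
    records = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = RECORD.match(line)
        if match is None:
            raise ValueError(f"line {number}: cannot parse {line!r}")
        records[match["host"]] = match["ip"]
    return records


def push_update(nodes, new_text):
    """Validate once, then replace the configuration on every node.

    nodes is a list of dicts standing in for DNS servers (an assumption).
    If validation fails, no node is touched, so every node keeps the
    last configuration that parsed cleanly and all nodes stay in agreement.
    """
    try:
        records = parse_config(new_text)
    except ValueError:
        return False
    for node in nodes:
        node.clear()
        node.update(records)
    return True
```

In this sketch a malformed line stops the update before synchronization begins, so a bad file cannot be copied to every server at once.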

De Haan said that Microsoft has taken several steps to prevent such an outage from happening again, including "further hardening the DNS service to improve its overall redundancy and fail-over capability" and "developing an additional recovery process that will allow a specific property the ability to fail over to restore service and then fail back when the DNS service is restored."


7 Comments


MFH said,
hmm… is it possible that I'm never affected by one of those outages?^^

I guess your account wasn't in the lucky datacenter chosen to go first when the update was rolled out.

Excuses, excuses. Something like an incorrectly formed configuration file putting down an entire datacenter? That's unacceptable and whoever approved that update should be fired!

Given the complexities involved, I give them a pass. I think it's good to get these types of glitches out of the way and learn. Now you tell me that ALL MY DATA has been blown away? Then I'll have something to complain about. THAT will be the real cloud storage story for any cloud company the first time it happens.

jimmyfal said,
Now you tell me that ALL MY DATA has been blown away? Then I'll have something to complain about. THAT will be the real cloud storage story for any cloud company the first time it happens.

It will probably be Apple, but they'll call it a feature and name it something like "iDelete for outdated files you shouldn't need anymore"

Just Kidding! Heck, I'm posting this from Firefox on a Mac right now.