Amazon apologizes for cloud outage and permanent data loss

It’s been a rough week for service providers. First, Sony’s PlayStation Network (PSN) was broken into, and the service has been down ever since. Then part of Amazon’s EC2 cloud service went down without warning. Although everyone knew about these outages, the details surrounding them have been very vague, and users were left wondering whether any specific information would ever be provided.

Amazon has now released a very detailed post-mortem explaining the issue. As with many IT outages, this one was caused by an improperly executed change. The team was attempting to upgrade network capacity, but a traffic shift was executed incorrectly, leaving part of the EBS cluster with no functioning network path. That loss of connectivity is what ultimately led to the data corruption.

“During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn’t handle the traffic level it was receiving.”

After connectivity was restored, the data corruption had to be addressed. By April 24th, the company had restored nearly 99% of its customers’ data, at which point Amazon “began forensics on the remaining volumes.” Unfortunately, it was unable to recover 0.07% of the volumes, meaning that data is gone forever unless the affected customers have a backup outside of the cloud.

It doesn’t appear that the people who wrote the detailed explanation were the same ones emailing customers. According to Business Insider, the email tells customers that the partially recovered (and likely corrupted) data has been made available as a snapshot, and asks them to delete it to avoid incurring future storage charges.

"Hello,
 
A few days ago we sent you an email letting you know that we were working on recovering an inconsistent data snapshot of one or more of your Amazon EBS volumes.  We are very sorry, but ultimately our efforts to manually recover your volume were unsuccessful.  The hardware failed in such a way that we could not forensically restore the data.
 
What we were able to recover has been made available via a snapshot, although the data is in such a state that it may have little to no utility...

If you have no need for this snapshot, please delete it to avoid incurring storage charges.

We apologize for this volume loss and any impact to your business.

Sincerely,
Amazon Web Services, EBS Support
"

Amazon is offering ten days’ worth of credits to every customer with service in the affected US East region, regardless of whether they were actually impacted. In addition, the company issued an apology for the outage.

“Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services.”

With more and more of our data being pushed into the cloud, these kinds of incidents are bound to become more frequent. While a failure on a personal PC frustrates a single user, a cloud provider’s failure can impact millions of customers. Users want their data available from anywhere, but there is an inherent risk in not controlling your own information.

Comments

And this is exactly why the Cloud is a dreadful idea. You have no control whatsoever over what happens to your data, or the hardware your data is stored on; and no recourse if something like this happens.

You're entirely at the mercy of anonymous people who may or may not know what the hell they're doing.

A network outage causing data corruption - that's not how you do it. Even if the upstream server didn't get the rollback, you still have to verify the data made it there for good.

Thanks for the explanation and the apology, but if it had been my data that was lost, the reasoning would have meant nothing.

I can't honestly rely on cloud backup.

Orange Battery said,
Thanks for the explanation and the apology, but if it had been my data that was lost, the reasoning would have meant nothing.

I can't honestly rely on cloud backup.

And nor should you. You should also not rely solely on local backup. Your most important data (anything you'd care not to lose) should be kept in three places, and one of them should be off-site.

Orange Battery said,
Thanks for the explanation and the apology, but if it had been my data that was lost, the reasoning would have meant nothing.

I can't honestly rely on cloud backup.

You're still more likely to have a backup disk die on you than to lose data stored on the cloud.

this is why i doubted cloud storage. now there's gonna be a backup of the backup of the cloud's backup.
still sticking to burning DVDs and keeping HDs for storage, thank you.

3rd impact said,
this is why i doubted cloud storage. now there's gonna be a backup of the backup of the cloud's backup.
still sticking to burning DVDs and keeping HDs for storage, thank you.

Best practice for backups is three copies: the original, a local backup of it (such as a NAS or USB drive - a NAS being preferred, since the disk won't be sitting right next to the computer), and an offsite backup, so that a failure in any two places still leaves you covered.

This practice saved my critical school work. I had all of my code and course notes on a desktop, a laptop, and on Windows Live SkyDrive. Everything else was only backed up to externals; I wish I'd had the space to put those in cloud backup as well (not that they were lost, but they came close).

My basement flooded, where my tower resides. My tower was 17" tall, and the flood was 18-19 inches. The tower was on at the time. I was a bit lucky that the water only got as high as it did, as my laptop and backup drives sit on a desk next to the tower, but the laptop was out of service for two days as I could not recover it before leaving for class that day, and on the next I found out that the power brick was fried (as it was on the floor) and needed replacing. During that time, I was able to continue my work because of cloud backup.

Cloud backup is not a waste of time, but if you put all your eggs into one basket, you will get stung. Remember: data should be in three places.
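A minimal sketch of that three-copies rule, for illustration only - the paths, and the assumption that the NAS and offsite locations are reachable as mounted directories, are entirely hypothetical:

    import shutil
    from pathlib import Path

    # Copy 1 is the working data; copies 2 and 3 are the backups.
    # All paths here are illustrative - substitute your own.
    ORIGINAL = Path.home() / "documents" / "coursework"
    LOCAL_BACKUP = Path("/mnt/nas/backups/coursework")        # local NAS
    OFFSITE_BACKUP = Path("/mnt/offsite/backups/coursework")  # offsite/cloud mount

    def back_up(source: Path, destinations: list) -> None:
        """Mirror the source tree to each destination, for three copies total."""
        for dest in destinations:
            shutil.copytree(source, dest, dirs_exist_ok=True)
            print(f"Backed up {source} -> {dest}")

    back_up(ORIGINAL, [LOCAL_BACKUP, OFFSITE_BACKUP])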

3rd impact said,
this is why i doubted cloud storage. now there's gonna a be a backup of the backup of the cloud's backup.
still sticking to burning dvd's and keeping hd's for storage thank you.

99.93% durability seems pretty good to me. Besides, I would use S3 for backup rather than EBS, and I don't think S3 was affected at all (though I may be wrong).
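For what it's worth, pushing a backup to S3 takes only a few lines; a rough sketch using Python's boto3 library, with a made-up bucket name and assuming AWS credentials are already configured:

    import boto3

    s3 = boto3.client("s3")

    # Bucket and key names are hypothetical - use your own.
    s3.upload_file("backup-2011-04-29.tar.gz",
                   "my-backup-bucket",
                   "backups/backup-2011-04-29.tar.gz")

S3 stores each object redundantly across multiple facilities, which is why it is generally a better fit for backups than a single EBS volume.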

rev23dev said,
10 days worth of credits for permanent loss? Weak.

Exactly my thoughts as well. 14 days would have sounded a lot better.

Could always create a par2 recovery set of what's on the cloud, so if anything's damaged you could recover it. Obviously more useful for backups than for rapidly updating sites.
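That can be scripted easily enough; a sketch assuming the open-source par2 command-line tool is installed, with the data living under an illustrative backup/ directory:

    import glob
    import os
    import subprocess

    # Collect the files to protect (directory name is illustrative).
    files = [f for f in glob.glob("backup/**/*", recursive=True)
             if os.path.isfile(f)]

    # Create a recovery set with 10% redundancy alongside the data.
    subprocess.run(["par2", "create", "-r10", "backup.par2", *files], check=True)

    # After pulling the data back down, verify it and repair if anything changed.
    if subprocess.run(["par2", "verify", "backup.par2"]).returncode != 0:
        subprocess.run(["par2", "repair", "backup.par2"], check=True)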

I'd also like to add that it's exactly the same with online backup services that charge exorbitant prices to let you back up your data over the net. One idiot in the north of England who quite obviously didn't know much about the services he resold was giving a speech on how great and wonderful it all was, and how it's all encrypted - when basically all it would take is a SINGLE BYTE of ANY of the data to change and your backup is... you guessed it, permanently useless data you can't do anything with.

He didn't go into security much, but from what he was saying there was basically no AV on the servers and only an ISP's firewall, so guess how easily people could hack into that.

n_K said,
He didn't go into security much, but from what he was saying there was basically no AV on the servers and only an ISP's firewall, so guess how easily people could hack into that.
They don't need AV unless they're executing code. Simply storing bad code doesn't mean you crash. Again, bad code doesn't mean you're easily hacked either. You have to actually execute the code for it to work and for that you need some sort of fundamental system flaw. I was under the impression a lot of their services were simply data storage, like a SAN. Then again I admit I know little about EC2 services.

Tim Dawg said,
They don't need AV unless they're executing code. Simply storing bad code doesn't mean you crash. Again, bad code doesn't mean you're easily hacked either. You have to actually execute the code for it to work and for that you need some sort of fundamental system flaw. I was under the impression a lot of their services were simply data storage, like a SAN. Then again I admit I know little about EC2 services.

They use commodity hardware and scale it out. They're totally against enterprise systems, and when I interviewed with them, they basically laughed at my experience with enterprise storage/SAN/NAS systems. Look who's laughing now... ha ha ha

Tim Dawg said,
They don't need AV unless they're executing code. Simply storing bad code doesn't mean you crash. Again, bad code doesn't mean you're easily hacked either. You have to actually execute the code for it to work and for that you need some sort of fundamental system flaw. I was under the impression a lot of their services were simply data storage, like a SAN. Then again I admit I know little about EC2 services.

Well, for example, this one company (it's in Newcastle, if anyone using them wants to start thinking about changing now) has a wireless access point with WPA1 encryption, no less, but that's it.

True, there's not really much point in having AV, because if hackers got in, chances are they'd be using custom tools that wouldn't register in AV databases - but it still has a chance of keeping more people out.

And THIS, ladies and gentlemen, is why the Cloud is the worst idea for any business; integrity: NONE!

n_K said,
And THIS, ladies and gentlemen, is why the Cloud is the worst idea for any business; integrity: NONE!

You get what you pay for. There are better cloud platforms built on enterprise hardware that will offer SLAs. Amazon just isn't one of them.

blahism said,
You get what you pay for. There are better cloud platforms built on enterprise hardware that will offer SLAs. Amazon just isn't one of them.

Amazon offered a 99.9% uptime SLA.
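As a quick back-of-the-envelope check, a 99.9% uptime SLA still permits a surprising amount of downtime per year:

    # Downtime permitted by a 99.9% uptime SLA over one year.
    hours_per_year = 365 * 24
    allowed_downtime = (1 - 0.999) * hours_per_year
    print(f"{allowed_downtime:.2f} hours/year")  # ~8.76 hours - this outage ran for days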

That's why I am not going to be using cloud rubbish. I have a Live account, but all my emails come down to a nice PST in Outlook, which is then backed up by Shadow Copy from my server.

SHoTTa35 said,
How could it be permanent, don't they have backups of their backups?
You can't back up traffic that can't get between the server and the storage... the network was too congested. The routers and switches likely had to start dropping packets as their buffers overflowed, and that's the data that was lost. Also, data that was mid-write - where the first part had been written but the last part hadn't been received - can't be recovered.

There could be several different reasons for it, one of them being pricing (do they offer different contracts with different levels of service?) and another being "security" reasons ("we promise not to store your data!").

cybertimber2008 said,
Also, data that was mid-write - where the first part had been written but the last part hadn't been received - can't be recovered.
That's called a transaction, which would be rolled back the minute it wasn't committed. That could cause data loss if the sending system couldn't be told to resend the data, but it wouldn't result in data corruption.

Tim Dawg said,
That's called a transaction, which would be rolled back the minute it wasn't committed. That could cause data loss if the sending system couldn't be told to resend the data, but it wouldn't result in data corruption.
I stand corrected. So it's not that part.
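A minimal illustration of that commit-or-rollback behaviour, using Python's built-in sqlite3 module purely as a stand-in for whatever storage layer is actually involved:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blocks (id INTEGER PRIMARY KEY, payload TEXT)")

    try:
        with conn:  # opens a transaction: commits on success, rolls back on error
            conn.execute("INSERT INTO blocks (payload) VALUES (?)", ("first half",))
            raise ConnectionError("network dropped before the second half arrived")
    except ConnectionError:
        pass

    # The partial write was rolled back, so the store is consistent, if incomplete.
    print(conn.execute("SELECT COUNT(*) FROM blocks").fetchone()[0])  # prints 0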