Storing tweets requires four petabytes of data a year

Twitter analytics lead, Kevin Weil, spoke recently at the Web 2.0 Expo in New York and revealed some interesting information about Twitter and its infrastructure.

Weil said that all tweets, which are limited to just 140 characters, add up to 12 terabytes of storage every day. "That would translate to four petabytes a year, if we weren't growing," he said.
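
As a quick check of Weil's arithmetic, the sketch below converts the quoted daily figure into a yearly one. It assumes decimal units (1 petabyte = 1,000 terabytes) and a 365-day year with no growth; the function name is purely illustrative.

```python
# Figure quoted in the article: 12 terabytes of new tweet data per day.
TB_PER_DAY = 12
DAYS_PER_YEAR = 365   # assumption: flat volume, no growth
TB_PER_PB = 1000      # assumption: decimal (SI) units


def yearly_petabytes(tb_per_day, days=DAYS_PER_YEAR):
    """Convert a daily volume in terabytes to a yearly volume in petabytes."""
    return tb_per_day * days / TB_PER_PB


print(yearly_petabytes(TB_PER_DAY))  # 4.38
```

At a constant 12 terabytes a day, that works out to a little under 4.4 petabytes, in line with the "four petabytes a year" Weil cited.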

Weil and his team are analyzing all that data for information that could help make Twitter a profitable business. Twitter has been working hard to become profitable; it recently made major changes to its website in an attempt to increase user engagement.

Twitter is using user data to determine whether their changes are successful. They track users who have been inactive for some time and suddenly become active again. They match the time the user becomes active with changes they have made to determine the success of those changes.

Twitter is also analyzing what influences a retweet and which tweets are most successful. The team is using "machine learning techniques" to figure out which tweets resonate most with users.

According to The Technology Review Blog, "Twitter benefits from a variety of open-source software developed by companies such as Google, Yahoo, and Facebook. These tools are designed to deal with storing and processing data that's too voluminous to manage on even the largest single machine."

Gathering and storing all of this data has been a challenge for Twitter. The company currently operates a 100-machine cluster that cannot completely handle the load, and it plans to move to a new data center later this year with three to four times the capacity of its current one.

64 Comments

Sure someone would kill to get their hands on it! You know, to blackmail some peeps.

What a racist term that is.

this is valuable marketing and statistical data, if someone can figure out how to sort it..

twitter has a hard time even storing it!

If they only stored the valuable data however, all tweets since the beginning of mankind would all fit on a low-density 5.25" floppy disk.

With plenty of space left to cover the next million years.

They mentioned there are over 75 million users of Twitter...so if you just take one user on an average day probably sending and receiving around 10-20 tweets a day..in this case..let's say 20 tweets a day...multiply that by 75 million users...and that comes out to about 1.5 billion tweets in a single day.

And that is just the average user...that doesn't include Twitter accounts from famous people or companies, who probably get way more than 20 tweets a day.

So...yeah...4 petabytes of storage seems about right...if not 4 petabytes..it might be more.
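
The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch under the commenter's own assumptions (75 million users, 20 tweets per user per day, which mixes sent and received tweets) combined with the 12 terabytes a day quoted in the article, taken here as decimal terabytes. None of these are confirmed Twitter figures, and the variable names are illustrative.

```python
# Commenter's assumptions, not official Twitter numbers.
USERS = 75_000_000
TWEETS_PER_USER_PER_DAY = 20

# Daily storage quoted in the article, assuming decimal terabytes.
BYTES_PER_DAY = 12 * 10**12

tweets_per_day = USERS * TWEETS_PER_USER_PER_DAY
# Average storage per tweet implied by the two figures together.
bytes_per_tweet = BYTES_PER_DAY / tweets_per_day

print(tweets_per_day)   # 1500000000
print(bytes_per_tweet)  # 8000.0
```

Under those assumptions each tweet would account for roughly 8 KB of stored data, far more than 140 characters of text.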

doesn't each tweet store about 4KB of data in their system? They have a LOT of metadata, and indexes add to the amount... it's a lot more than just 140 characters there

Right-click the folder the tweets are in, click Properties -> Advanced... -> check Compress contents to save disk space -> OK

feelgood13 said,
Right-click the folder the tweets are in, click Properties -> Advanced... -> check Compress contents to save disk space -> OK

and since this is probably stored on a database server where the database is broken into multi-terabyte storage groups, that would do absolutely nothing but make the system slow as heck or unresponsive

i wonder what the figures would be if they had no spam.. i would say it's a priority for twitter to try and tackle, i get added by lots of spam accounts every day and most of the tweets are pointless.
I'm also getting @replies which is annoying.

Don't forget that they are not going to use cheap off-the-shelf drives. They will be hot-swappable 7200 RPM SATA disks at the very least. At 2TB, one of those costs around $400 at least, and you don't have just 2000 of them: they are running some form of RAID, and I would assume not RAID 5, due to the massive amount of storage and the issues with recovering a RAID 5 array with even three 1TB drives in it. I would assume they are running sets of RAID 10 or so, which knocks out half the disks.

A 2TB SATA drive does not cost $400...perhaps you are talking about SAS drives, and no, they won't be 7200RPM at a minimum...Nothing less than maximum capacity and speeds. We are talking data center level equipment. RAID 10 would be extremely inefficient in regards to capacity. You vastly underestimate the technology they use.

not that bad 2000 x 2TB hard drive at average of about $100 or so in major tech world isn't much.

Most of it probably goes on amazon's S3 anyway so who cares

Digitalx said,
not that bad 2000 x 2TB hard drive at average of about $100 or so in major tech world isn't much.

Most of it probably goes on amazon's S3 anyway so who cares

If you are getting 2000 drives I'm sure they are cheaper than $100 a pop to twitter. I'm sure they would get a bulk discount. In addition something tells me that the drives themselves hold more than 2TB each or are seriously raided.

chadlachlanross said,
Why store them past a few days or say 30 days at the most?

Good question. Is anyone really going back to try to retrieve a tweet from 6 months ago?

chaos_disorder said,

Good question. Is anyone really going back to try to retrieve a tweet from 6 months ago?

You never know ;-)

People may do some sort of research on them in the future, like what stupid pointless crap were people fitting into tiny messages. Lol.

Mr aldo said,
You never know ;-)

People may do some sort of research on them in the future, like what stupid pointless crap were people fitting into tiny messages. Lol.

What about compressing tweets older than 30 days. This way they would still be available but there would also be a space saving. The higher cpu load (as someone already mentioned was possible) would only be incurred if the system had to access one older than 30 days.

Shadrack said,
Quite frankly, I don't see how they are going to be profitable w/o putting adverts all over the page...

They have several things in place I believe, the most notable is "Sponsored Tweets" or something to that extent.

Shadrack said,
Quite frankly, I don't see how they are going to be profitable w/o putting adverts all over the page...
do people actually use the site? i just use the API via some application of my choice..

BGM said,
do people actually use the site? i just use the API via some application of my choice..

That's what the new update is trying to stop. Twitter want more visitors to their website and not through the use of applications.

jbrooksuk said,

That's what the new update is trying to stop. Twitter want more visitors to their website and not through the use of applications.

I wish them luck on that. I never use their site. I've used a handful of Mac Twitter clients (Tweetie, Echofon, Itsy, and a few others), plus the iPhone client, and all of them are superior to the website.

BGM said,
do people actually use the site? i just use the API via some application of my choice..

Yea. Some people do. When the tweet says it was done "via web" then that is their site. They are making improvements that might encourage users to use the site but the tweets go through so fast it is hard to keep up unless you are using a client. Even then it is difficult to mull through all the incoming tweets before you get dumped with a bunch more.

jbrooksuk said,

That's what the new update is trying to stop. Twitter want more visitors to their website and not through the use of applications.

If they limit the use of apps, then people will stop tweeting. No one is going to the browser on their phone (or even computer) when the Twitter experience is so much better via an app on your phone. I have 5 Twitter apps on both my iPhone and iPad and 2 on my BlackBerry. I even have 5 on my Macs. Without them I wouldn't tweet. That could be a bad move for Twitter.

BigCheese said,
I think a lot of metadata is stored with each tweet. So it's a fair bit more than 140 bytes.

Yes, here's a tweet:
http://mehack.com/map-of-a-twitter-status-object

It should be added that Twitter could have made this object much, much smaller by making it binary instead of going all XML on it, but maybe this is just what's generated by their API, and not what's actually stored on their servers? If I saw Petabyte figures, I'd definitely not store things in XML, to hell with open standards in that case.

I don't believe it. Either this is BS or the people that are running it are just ignorant. Plain text messages can compress better than anything out there. Add to that, storage vendors like NetApp have deduplication built into their primary storage. Text messages would be deduped to hell and back.

What is alarming is the amount of energy used to power that amount of storage. Storage of totally useless information. I would love to see Twitter fail, it's nothing but digital diarrhea of the mouth.

rrode74 said,
I don't believe it. Either this is BS or the people that are running it are just ignorant. Plain text messages can compress better than anything out there. Add to that, storage vendors like NetApp have deduplication built into their primary storage. Text messages would be deduped to hell and back.

What is alarming is the amount of energy used to power that amount of storage. Storage of totally useless information. I would love to see Twitter fail, it's nothing but digital diarrhea of the mouth.

I'm sure they store more than just the 140 character tweets.

And like someone said above, decompressing takes processing power.

LaserWraith said,

I'm sure they store more than just the 140 character tweets.

And like someone said above, decompressing takes processing power.

Deduplication on storage takes zero processing power on the server side. It's all done on the SAN side. Even if it did take more CPU on the server side, is it cheaper to buy faster CPUs or more servers than it is to buy more SAN?

While it is a large amount, it really doesn't surprise me after that article showing the amount of data each tweet stores (date, time, user, and tons of other stuff that it doesn't really need to store).

Bet they could trim some of that stuff and cut the required space down by 1/3rd or so.

Ci7 said,
4PB of 99% useless data

Amen.
Could donate all those servers and storage to something even half-worthwhile.

acnpt said,
So how much space would they save if tweets were only 139 characters?
About 30 terabytes I think.

Or store them compressed. They're mostly text.

GreyWolf said,

Or store them compressed. They're mostly text.

That would save on storage, but then CPU usage would shoot up quite a bit to decompress the data as it's retrieved. Since storage is cheaper than processing (in terms of ROI), it probably wouldn't make sense to do that.

Meconio said,

It's bigger than Wikipedia but is it bigger than Facebook? I wonder about that.

Facebook is much larger in terms of user base. But Twitter is constantly bombarded with tweets. That requires huge amounts of queries at almost every given moment.

arukun14 said,

Facebook is much larger in terms of user base. But Twitter is constantly bombarded with tweets. That requires huge amounts of queries at almost every given moment.

Same with Facebook and Status updates, group and page posts. Added to the fact that Facebook also serves images, videos, applications, IM sessions, and Personal Messages i'd wager their data usage would be even higher

Subject Delta said,

Same with Facebook and Status updates, group and page posts. Added to the fact that Facebook also serves images, videos, applications, IM sessions, and Personal Messages i'd wager their data usage would be even higher

Yeah I was just thinking the same thing. Facebook's data usage must be 3 or 4 times (if not more) greater than Twitter's daily usage.

I estimate about 32 trillion tweets based on that number. That is probably off target, but it gives you an idea.
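
A figure of about 32 trillion is what falls out if the four petabytes are read as binary petabytes and every tweet is assumed to take exactly 140 bytes with no metadata. Both are assumptions made here to reconstruct the estimate, not facts from the article, and the helper name is illustrative.

```python
TWEET_BYTES = 140       # assumption: one byte per character, no metadata
PB_BINARY = 2**50       # binary petabyte (pebibyte)
PB_DECIMAL = 10**15     # decimal (SI) petabyte


def tweets_in(total_bytes, tweet_bytes=TWEET_BYTES):
    """Number of whole fixed-size tweets that fit in total_bytes."""
    return total_bytes // tweet_bytes


binary_estimate = tweets_in(4 * PB_BINARY)
decimal_estimate = tweets_in(4 * PB_DECIMAL)

print(f"{binary_estimate / 1e12:.1f} trillion")   # 32.2 trillion
print(f"{decimal_estimate / 1e12:.1f} trillion")  # 28.6 trillion
```

Since other commenters point out that each tweet carries substantial metadata, the real number of tweets behind four petabytes would be much lower than either result.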

xTdub said,
I estimate about 32 trillion tweets based on that number. That is probably off target, but it gives you an idea.

seems like many don't have a life