Storing tweets requires four petabytes of data a year

Twitter analytics lead Kevin Weil spoke recently at the Web 2.0 Expo in New York and revealed some interesting details about Twitter and its infrastructure.

Weil said that all of Twitter's tweets, each limited to just 140 characters, add up to 12 terabytes of storage every day. He added that this "would translate to four petabytes a year, if we weren't growing."
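The arithmetic behind that figure is easy to check. The short Python sketch below is only an illustration, not anything Twitter published: it takes the quoted 12 terabytes a day and assumes a 365-day year with no growth. The function name and the choice between 1,000 and 1,024 terabytes per petabyte are assumptions made for the example.

```python
# Rough check of the "four petabytes a year" figure.
# Assumptions (not from the talk): 365-day year, constant daily volume.

TB_PER_DAY = 12        # daily volume quoted by Weil
DAYS_PER_YEAR = 365    # assumption: non-leap year, no growth


def yearly_petabytes(tb_per_day: float, days: int = DAYS_PER_YEAR,
                     tb_per_pb: int = 1000) -> float:
    """Convert a daily volume in terabytes to a yearly volume in petabytes."""
    return tb_per_day * days / tb_per_pb


decimal_pb = yearly_petabytes(TB_PER_DAY)                 # 1 PB = 1000 TB
binary_pb = yearly_petabytes(TB_PER_DAY, tb_per_pb=1024)  # 1 PB = 1024 TB

print(f"Decimal units: {decimal_pb:.2f} PB per year")
print(f"Binary units:  {binary_pb:.2f} PB per year")
```

Either convention gives a little over four petabytes, which is consistent with the rounded number Weil quoted.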

Weil and his team are analyzing all that data for information that could help make Twitter a profitable business. Twitter has been working hard to become profitable; it recently made major changes to its website in an attempt to increase user engagement.

Twitter is using user data to determine whether its changes are successful. It tracks users who have been inactive for some time and then suddenly become active again, and it compares the time each user returns with the dates of its changes to gauge whether those changes worked.
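The talk did not describe how this matching is implemented. As a purely hypothetical sketch of the idea, the Python below counts users who return after a long gap within a few days of a change going live. The function name, the 30-day inactivity threshold, the 7-day window, and all dates are invented for illustration.

```python
from datetime import date


def attribute_reactivations(users, changes, inactive_days=30, window_days=7):
    """Count reactivated users per change (illustrative only).

    users:   list of (previous_activity, returned) date pairs
    changes: dict mapping a change name to its launch date
    A user counts toward a change if they were inactive for at least
    `inactive_days` and returned within `window_days` after the launch.
    """
    counts = {name: 0 for name in changes}
    for previous, returned in users:
        if (returned - previous).days < inactive_days:
            continue  # gap too short: the user was never really inactive
        for name, launched in changes.items():
            if 0 <= (returned - launched).days <= window_days:
                counts[name] += 1
    return counts


# Hypothetical data, invented for this example.
changes = {"site_redesign": date(2010, 9, 14)}
users = [
    (date(2010, 6, 1), date(2010, 9, 16)),   # long gap, back 2 days after launch
    (date(2010, 9, 10), date(2010, 9, 15)),  # short gap, ignored
    (date(2010, 5, 1), date(2010, 8, 1)),    # long gap, but back before launch
]

result = attribute_reactivations(users, changes)
print(result)
```

A count like this shows only that returns coincided with a change, not that the change caused them, so a real analysis would need a baseline for comparison.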

Twitter is also analyzing what influences a retweet and which tweets are most successful. It is using "machine learning techniques" to figure out which tweets resonate most with users.

According to The Technology Review Blog, "Twitter benefits from a variety of open-source software developed by companies such as Google, Yahoo, and Facebook. These tools are designed to deal with storing and processing data that's too voluminous to manage on even the largest single machine."

Gathering and storing all of this data has been a challenge for Twitter. It currently operates a 100-machine cluster that cannot completely handle the load, and it plans to move later this year to a new data center with three to four times the capacity of its current one.