Google introduces new search indexing system - Caffeine

Google announced on Tuesday the completion of a new web indexing system called Caffeine.

Google claims that Caffeine provides 50 percent fresher results for web searches than its previous index. "It's the largest collection of web content we've offered. Whether it's a news story, a blog or a forum post, you can now find links to relevant content much sooner after it is published than was possible ever before," said Carrie Grimes, a software engineer at Google.

Google built the new indexing system to keep up with web content that is growing in size and becoming more diverse, with video, images, news and real-time updates. The old index had several layers (see image above), some of which were refreshed faster than others. To refresh a layer of the old index, Google would analyze the entire web, which meant there was a significant delay between when Google found a page and when it became available in search results.

With Caffeine, Google analyzes small portions of the web and updates the index on a continuous basis. Every second, Caffeine processes hundreds of thousands of pages in parallel. Caffeine takes up nearly 100 million gigabytes of storage in one database and adds information at a rate of hundreds of thousands of gigabytes per day. Google claims you'd need 625,000 of the largest iPods to store that much information.
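
As a quick sanity check on the iPod comparison, the two quoted figures can be divided to see what capacity per device they imply. This is only a sketch: the function names are illustrative, and decimal units (1 PB = 1,000,000 GB) are an assumption, since the announcement does not say which convention it uses.

```python
# Figures quoted in the article.
INDEX_SIZE_GB = 100_000_000  # "nearly 100 million gigabytes"
IPOD_COUNT = 625_000         # "625,000 of the largest iPods"


def implied_ipod_capacity_gb(total_gb: float, ipods: int) -> float:
    """Capacity each device would need for the comparison to hold."""
    return total_gb / ipods


def to_petabytes(gb: float) -> float:
    """Convert gigabytes to petabytes (assumes decimal units)."""
    return gb / 1_000_000


print(implied_ipod_capacity_gb(INDEX_SIZE_GB, IPOD_COUNT))  # 160.0 GB per iPod
print(to_petabytes(INDEX_SIZE_GB))                          # 100.0 PB
```

The comparison is therefore internally consistent if "the largest iPod" is taken to hold 160 GB.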

Users who search on Google.com should see improvements in the months to come.

39 Comments

Did they beta test this? I remember months ago asking a question on 4chan, and it occurred to me to search for the question on Google about a minute later. The second result was my own post on 4chan.

P1R4T3 said,
How is it possible to have so much space? What's the size of their storage building (farm?)?

They have over 1 million servers...go figure.

P1R4T3 said,
How is it possible to have so much space? What's the size of their storage building (farm?)?

I know the one near me (in the Columbia River Gorge in Oregon) is two buildings, each larger than a football field, loaded with their custom-built servers. They are connected to at least 4 separate power grids, 2 from Oregon and 2 from Washington. The joy of cheap hydro-electric power.

Masterp said,
I can't wait for the next version of Google's indexing system. Code name: "Crack".

Anything with Caffeine in it has got to be good ;-)

Is it just me or do those numbers not add up? If it is already 100 million gigabytes and it adds data at a rate of 'hundreds of thousands of gigabytes' a day that just doesn't make sense. Consider the very pessimistic approach of 100,000 gigabytes a day.

100,000,000 / 100,000 = 1000 days to fully populate the 100 million gigabytes of data.

It just doesn't seem to add up?
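
The division in the comment above can be reproduced for a range of daily rates. The sketch below assumes a constant rate from an empty index, which is the comment's own simplification; 500,000 and 900,000 GB/day are assumed readings of "hundreds of thousands" and are not figures given by Google.

```python
INDEX_SIZE_GB = 100_000_000  # "nearly 100 million gigabytes"


def days_to_fill(total_gb: float, rate_gb_per_day: float) -> float:
    """Days needed to accumulate total_gb at a constant daily rate."""
    if rate_gb_per_day <= 0:
        raise ValueError("rate must be positive")
    return total_gb / rate_gb_per_day


# 100,000 is the commenter's pessimistic figure; the other two are assumptions.
for rate in (100_000, 500_000, 900_000):
    days = days_to_fill(INDEX_SIZE_GB, rate)
    print(f"{rate:>7,} GB/day -> {days:,.0f} days (~{days / 365:.1f} years)")
```

At the pessimistic rate this gives the 1000 days quoted in the comment; higher readings of the rate shorten that to a few months.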

ascendant123 said,
Is it just me or do those numbers not add up? If it is already 100 million gigabytes and it adds data at a rate of 'hundreds of thousands of gigabytes' a day that just doesn't make sense. Consider the very pessimistic approach of 100,000 gigabytes a day.

100,000,000 / 100,000 = 1000 days to fully populate the 100 million gigabytes of data.

It just doesn't seem to add up?

They've been around a good few years, and when they started off they did not collect nearly as much data as they do now (for obvious reasons). They have that amount of data, and the amount of data they collect each day just increases - so really, it makes perfect sense.

ascendant123 said,
Is it just me or do those numbers not add up? If it is already 100 million gigabytes and it adds data at a rate of 'hundreds of thousands of gigabytes' a day that just doesn't make sense. Consider the very pessimistic approach of 100,000 gigabytes a day.

100,000,000 / 100,000 = 1000 days to fully populate the 100 million gigabytes of data.

It just doesn't seem to add up?


They are not saying they are reindexing the web every minute.

JustinN said,

They've been around a good few years, and when they started off they did not collect nearly as much data as they do now (for obvious reasons). They have that amount of data, and the amount of data they collect each day just increases - so really, it makes perfect sense.

Google has been around for over 10 years now, and while I agree they obviously did not collect so much data initially, that is still a long period of time and I doubt it was a spike. Even if you consider a linear growth from 50,000GB a day to 60,000 2 years ago to 70,000 last year to 100,000 this year, the numbers just don't add up. It sounds like a lot of marketing to me (even ignoring the fact that using GB to represent database size is pretty much worthless; maybe records, keys or UIDs would be more appropriate).

ascendant123 said,
Google has been around for over 10 years now, and while I agree they obviously did not collect so much data initially, that is still a long period of time and I doubt it was a spike. Even if you consider a linear growth from 50,000GB a day to 60,000 2 years ago to 70,000 last year to 100,000 this year, the numbers just don't add up. It sounds like a lot of marketing to me (even ignoring the fact that using GB to represent database size is pretty much worthless; maybe records, keys or UIDs would be more appropriate).
Then don't consider a linear growth. The internet has exploded in size recently. You now have mobile versions of sites that can get indexed, as well as multiple versions of video feeds ranging from mobile to SD to HD, and often a few in-between; Youtube is in their index, and I bet they get thousands of GBs per day there. Not to mention the number of high quality, user submitted photos (Facebook recently opened the floodgates to indexing) submitted to the various depths of the internet. Then consider the number of duplicates you find, such as the image in this story posted, and rehosted, who-knows how many times.

pickypg said,
Then don't consider a linear growth. The internet has exploded in size recently. You now have mobile versions of sites that can get indexed, as well as multiple versions of video feeds ranging from mobile to SD to HD, and often a few in-between; Youtube is in their index, and I bet they get thousands of GBs per day there. Not to mention the number of high quality, user submitted photos (Facebook recently opened the floodgates to indexing) submitted to the various depths of the internet. Then consider the number of duplicates you find, such as the image in this story posted, and rehosted, who-knows how many times.

They don't store the actual images (or actual paths as separate cached objects). That is an interesting point on Youtube, though, I wonder if they are including that in their growth calculation or if this is purely their index service.

Top that off with the fact that despite popular (re: media and marketing) opinion the internet doesn't actually grow in chunks or "omg explosions". Look at Facebook's growth as a search term:

http://www.google.com/trends?q=Facebook

It's clearly linear (with a burst from its creation, obviously). A more interesting one is the trend for Google as a keyword:

http://www.google.com/trends?q=Google

This one always makes me smile, so here it is.

http://www.google.com/trends?q=year

Edited by ascendant123, Jun 9 2010, 1:46pm :

ascendant123 said,
They don't store the actual images (or actual paths as separate cached objects). That is an interesting point on Youtube, though, I wonder if they are including that in their growth calculation or if this is purely their index service.
They link to the actual image, but I wonder if they don't actually store it locally for processing. While I agree that they would still generate metadata associated with any image (for their image search), I wonder if they persist the image. Considering they kept WiFi data, even if it was supposedly accidental, I bet they keep the images.

ascendant123 said,
Top that off with the fact that despite popular (re: media and marketing) opinion the internet doesn't actually grow in chunks or "omg explosions". Look at Facebook's growth as a search term:

http://www.google.com/trends?q=Facebook

It's clearly linear (with a burst from its creation, obviously). A more interesting one is the trend for Google as a keyword...

I agree with you, but I put it in the wrong context and meant explosions in terms of Google. So, when it was new to Google, it was an explosion. The mobile development has been an explosion to Google (as it adds more mobile-facing services) due to the immediate replication of so much information, but simply for a different device.

Considering they were going for such a large number, I bet they were including YouTube (since it must be one of the largest repositories of data).

ascendant123 said,
Is it just me or do those numbers not add up? If it is already 100 million gigabytes and it adds data at a rate of 'hundreds of thousands of gigabytes' a day that just doesn't make sense. Consider the very pessimistic approach of 100,000 gigabytes a day.

100,000,000 / 100,000 = 1000 days to fully populate the 100 million gigabytes of data.

It just doesn't seem to add up?


Other than the points mentioned below, you also forget that the rate of hundreds of thousands of gigabytes doesn't indicate that this is the growth rate of the size of the database, as some of the data being added (or maybe most, dare I say?) replace older content.
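
The distinction drawn in this comment, between data added per day and net growth of the database, can be put as a one-line model. This is a minimal sketch under an assumption: the replaced fractions below are purely hypothetical, since Google has not published how much of the daily additions overwrite older content.

```python
def net_growth_gb_per_day(added_gb_per_day: float, replaced_fraction: float) -> float:
    """Net daily growth when part of the added data replaces older content."""
    if not 0.0 <= replaced_fraction <= 1.0:
        raise ValueError("replaced_fraction must be between 0 and 1")
    return added_gb_per_day * (1.0 - replaced_fraction)


ADDED_GB_PER_DAY = 100_000  # the pessimistic figure used earlier in the thread

# Hypothetical shares of daily additions that overwrite existing entries.
for fraction in (0.0, 0.5, 0.75):
    net = net_growth_gb_per_day(ADDED_GB_PER_DAY, fraction)
    print(f"replaced {fraction:.0%}: net growth {net:,.0f} GB/day")
```

Under this model the database grows more slowly than the quoted rate suggests, so dividing total size by the daily rate only gives a lower bound on how long the index took to build.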

thatguyandrew1992 said,
Think it's here to combat Bing?

As if Bing were a threat? LOL.

Caffeine has been in the works since long before Bing was ever launched; it's just Google doing what Google does best... moving forward. MS is the one playing catchup here, and they have been for almost a decade.

vaximily said,

Caffeine has been in the works since long before Bing was ever launched; it's just Google doing what Google does best... moving forward. MS is the one playing catchup here, and they have been for almost a decade.

Meh, Google could have done this before. It was competition that led to this investment. The stronger Bing and others are as competitors, the better all search products will have to become.

I guess you can dismiss a bear chasing you through the woods as "playing catchup". The problem is when you stop running.

Neb Okla said,
Meh, Google could have done this before. It was competition that led to this investment. The stronger Bing and others are as competitors, the better all search products will have to become.

I guess you can dismiss a bear chasing you through the woods as "playing catchup". The problem is when you stop running.

While I don't discount what you're saying, my statement still stands... Google has never stopped running, they've always been a leader when it comes to moving up and forward.

I would disagree that "Google could have done this before.", however... this is the natural evolution for their service, based on the way the internet has evolved... their technology has always been on par with or in advance of the infrastructure that needed to be supported.

I wish I could use the internet connections they are using to distribute data to their servers all over the world!! :-D

Not to spy on other people or anything, just for the insane speeds!

'Caffeine takes up nearly 100 million gigabytes of storage in one database and adds information at a rate of hundreds of thousands of gigabytes per day'.. man that's insane

stevember said,
Hopefully this stops aggressive indexing, which can be a nightmare.

The biggest part of their profits comes from 'aggressive indexing' and they'll never stop it (or even admit it). Google sucks big time, but unfortunately the others suck even more.

stevember said,
Hopefully this stops aggressive indexing, which can be a nightmare.

If you are running a medium or large website, you need to be careful how you release and link your content, how you set up robots.txt, how you configure indexing of your site on Google, etc. If you are not careful, it can kill your website, true. Been there, done that.
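
For readers unfamiliar with robots.txt, here is a minimal sketch of how a rule set can be checked before it goes live, using Python's standard-library `urllib.robotparser`. The rules, the paths and the `example.com` host are all hypothetical, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep crawlers out of drafts and internal search pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for path in ("/articles/caffeine", "/drafts/unreleased", "/search?q=test"):
    url = "https://example.com" + path
    print(f"{path}: allowed={parser.can_fetch('Googlebot', url)}")
```

With these rules the article URL is allowed and the other two are blocked. Checking a draft rule set this way can catch an overly broad `Disallow` before a crawler sees it.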