
Google Turns Away Robots From Its Front Door

Sleeper on 28 March 2002 - 15:45

Citing the need to protect its server resources, Google has prevented competing search engines from indexing much of its own Web site, the company confirmed Wednesday.
While Google has amassed an index of over 2 billion Web pages by automatically "spidering" or "crawling" sites all over the Web, the popular search portal has effectively walled off numerous sections of its own site from other search "bots."

By placing a special "robots.txt" file on its server, Google has prohibited other crawlers from indexing 19 areas of its site, including one that offers searches of the company's archive of Usenet newsgroup discussions, as well as an area for exploring its index of graphic images on the Web. Also blocked are a special index of mail-order catalogs, and a section that allows searches of news articles at other sites.

As a result, a search for the phrase "LL Bean" performed through a link to Google set up at AltaVista.com produces no results. Running the same search directly through Google.com produces 20 pages that include the LL Bean name.

News source: TechNews.com


According to Google spokesman Nate Tyler, the company is merely trying to prevent other search crawlers from hurting the performance of its site.

"We have blocked other spiders from fully crawling our site because it would require too many server resources on our end to support that," said Tyler, who noted that Google takes great care when indexing other sites so as not to overload them.

According to a database maintained by The Web Robots Pages, there are 284 robots that actively crawl the Web under names including Googlebot, Inktomi Slurp, and AltaVista Scooter.

Most crawlers honor the Robots Exclusion Protocol, which specifies that robots should check a site for a file named robots.txt for a list of "disallowed" directories and adhere to it when indexing the site.
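As a rough illustration of the check described above, the following Python sketch uses the standard library's urllib.robotparser module to decide whether a crawler may fetch a URL. The robots.txt contents, the "ExampleBot" user agent, and the example.com host are all invented for this sketch; they are not taken from Google's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, made up for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /images
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/search", "/about.html"):
        url = "http://www.example.com" + path
        print(path, is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", url))
```

A polite crawler would run a check like this before every request. Nothing in the protocol enforces it, so the result only matters to robots that choose to ask.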

A comparison today by Newsbytes of robots files at other major Web destinations revealed sharp contrasts in the sites' attitudes toward Web crawlers.

Some leading portals, such as those operated by Yahoo, AOL, Microsoft, Amazon and Lycos, have no robots file at all and apparently give search spiders free rein to index all of their pages.

Others, including AltaVista, only disallow bots from crawling in directories that contain program files.

Some big sites, however, attempt to block search crawlers completely. The robots file for the New York Times site, for example, appears to disallow search bots from accessing any of its archived content.

eBay, which successfully sued a specialized search site that was trawling its online auction listings, uses a prohibitive robots file that begins with a simple comment: "Go away."

Similarly, CNN.com's file shoos away search spiders with a comment that states, "Robots, scram."

According to Danny Sullivan, editor of SearchEngineWatch.com, Google may be attempting to prevent other search sites from "harvesting" its proprietary newsgroup archives and other content.

"Having a robots file doesn't just protect your resources. It can also protect your intellectual property," said Sullivan, who added that a site's failure to restrict access to proprietary content through a robots file could limit its legal position in copyright infringement cases.

Sullivan noted, however, that respecting a site's robots file is entirely voluntary, and that "rogue" search bots are likely to ignore it altogether. For that reason, some sites may eschew robots files and instead block certain Internet protocol addresses that are associated with "impolite" spiders, he said.
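The address-based blocking Sullivan describes could look something like the Python sketch below, which tests a client address against a list of blocked networks using the standard ipaddress module. The blocklist is hypothetical and uses address ranges reserved for documentation; a real site would more likely enforce this in its Web server or firewall than in application code.

```python
import ipaddress

# Hypothetical blocklist. These ranges are reserved for documentation
# and stand in for addresses tied to "impolite" spiders.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),      # an entire range
    ipaddress.ip_network("198.51.100.17/32"),  # a single address
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.55", "198.51.100.18"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Unlike a robots file, this check does not depend on the crawler's cooperation, although a determined robot can still switch to another address.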

In fact, some experts have argued that robots files provide snoops with clear instructions on how to find the most sensitive areas of a Web site.

Bertrand Meyer, a software expert who developed a computer language called Eiffel, observed in a 1998 message to the RISKS mailing list that itemizing disallowed directories in a robots file is akin to telling someone, "Here is what I am not telling you."

"I think there are some basic flaws in this mechanism. If I'm a bad guy, the robots file is the first place I'm going to start looking," Meyer said in an interview Wednesday.
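Meyer's concern is easy to demonstrate. The short Python sketch below pulls every "Disallow" entry out of a robots file, which is all a snoop would need to do to get a list of the directories a site would rather keep out of view. The sample file and its directory names are invented for illustration and do not come from any real site.

```python
# Hypothetical robots file, made up for illustration only.
SAMPLE_ROBOTS_TXT = """\
# Example robots file
User-agent: *
Disallow: /private-reports/
Disallow: /cgi-bin/   # program files
Disallow:
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow path listed in robots_txt."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop comments, which start with '#', and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        # An empty Disallow value means "nothing is disallowed", so skip it.
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE_ROBOTS_TXT):
        print(path)
```

The file is public by design, so anything listed in it should be protected by some other means, such as a login requirement, if it is truly sensitive.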

A link to Meyer's posting is tucked into the beginning of the robots file at Sun Microsystems' site. The comments section of the file at Sun.com explains in considerable detail the company's justification for disallowing bots in seven directories.

According to Sun's file, which begins with the words, "A note to those who'd bother to look at this file," the company is not trying to hide proprietary content. Instead, the file states, its purpose is to prevent users from downloading the pages without first registering with Sun.

