Perplexity's stealth crawlers exposed in new Cloudflare report

Perplexity is allegedly using stealth, undeclared crawlers to bypass no-crawl directives, according to a Cloudflare investigation launched after its customers complained that Perplexity was still accessing their content despite being blocked. Cloudflare says that Perplexity modifies its user agent and changes its source autonomous system numbers (ASNs) to hide its activity. Cloudflare also says Perplexity’s crawlers are failing to fetch robots.txt files and therefore do not respect the rules in those files.

Because of these practices, Cloudflare has decided to de-list Perplexity as a verified bot, which will affect its interactions with websites, especially those using Cloudflare services for protection.

By default, Perplexity behaves itself and uses its declared PerplexityBot user agent. Whenever a website blocks it, however, it switches to a generic browser user agent (Chrome/124.0.0.0 Safari/537.36). The stealth crawler uses multiple IPs not listed in Perplexity’s official range and rotates through different ASNs. This behavior was not isolated either: Cloudflare observed the pattern across tens of thousands of domains, involving millions of requests daily.
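
To make the IP-range point concrete, here is a minimal Python sketch of how a site operator might compare a request's source address against a crawler's published ranges. It is an illustration only, not Cloudflare's method: the ranges are reserved documentation addresses rather than Perplexity's real ones, and the names DECLARED_RANGES, in_declared_ranges and classify are hypothetical.

```python
import ipaddress

# Hypothetical published ranges. These are RFC 5737 documentation
# addresses, NOT Perplexity's real ranges.
DECLARED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_declared_ranges(source_ip: str, ranges=DECLARED_RANGES) -> bool:
    """Return True if source_ip falls inside any published range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ranges)


def classify(user_agent: str, source_ip: str) -> str:
    """Very rough labelling based on the user agent and source address."""
    declares_bot = "perplexitybot" in user_agent.lower()
    from_declared_ip = in_declared_ranges(source_ip)
    if declares_bot and from_declared_ip:
        return "declared crawler"
    if declares_bot:
        return "declared name, unlisted source"
    return "undeclared client"


print(classify("PerplexityBot/1.0", "192.0.2.10"))
print(classify("Chrome/124.0.0.0 Safari/537.36", "203.0.113.5"))
```

A request that arrives with a generic browser user agent from an unlisted address simply looks like any other undeclared client here, which is why a check like this cannot attribute stealth traffic on its own.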

Companies like OpenAI, which also crawl the web, clearly identify their crawlers and respect robots.txt directives and network blocks. Cloudflare tested ChatGPT’s crawlers and found that they stopped crawling when a disallow directive was present or when a block page was presented.
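
For readers unfamiliar with how a disallow directive is honored, the following Python sketch shows the kind of check a well-behaved crawler can run before fetching a page, using the standard library's urllib.robotparser. The robots.txt content, the ExampleBot name and the example.com URL are made up for illustration, and the helper name may_fetch is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site might publish to keep one crawler out.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/article"
print(may_fetch(ROBOTS_TXT, "PerplexityBot", url))  # blocked by name
print(may_fetch(ROBOTS_TXT, "ExampleBot", url))     # falls under the * rule
```

Because the rules are keyed on the name a crawler reports for itself, a crawler that presents a different user agent is matched against the wildcard entry instead, so the check only works when operators identify themselves honestly.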

To address the issue, Cloudflare has added heuristics to its managed rules to block the stealth crawling. Cloudflare customers that have bot management or challenge rules set up are already protected by these heuristics, which are available to all customers, even those using Cloudflare’s services for free.

With heuristic blocking, Cloudflare doesn’t hardcode specific crawlers to block. Instead, it looks for certain behaviors and blocks crawlers that it suspects of exhibiting them. As Perplexity’s tactics adjust, these heuristics should be able to continue countering the behavior.

Cloudflare also said that it’s actively working with technical and policy experts around the world, including on IETF efforts to standardize extensions to robots.txt. This will help to establish measurable principles that well-meaning bot operators should abide by.
