Content scraping to train AI models has become pervasive as LLM chatbots continue to grow more popular. One problem is that companies are hoovering up data without permission or compensation, putting a strain on the people who host the content.
This affects both big players like Reddit, which has sued Anthropic for training Claude on its posts, and small players like Neowin reader Gerowen, who's had to fend off lots of requests from AI bots hammering his personal Kiwix mirror running on a home server in the backwoods of Kentucky.
Last month, Cloudflare announced a feature that would let website owners charge AI agents for crawling a website. The feature was in private beta and used the HTTP response code 402, which means "Payment Required".
Now the company is back, rebranding its AI Audit tool, which provided insights on AI bot traffic, as AI Crawl Control and making it generally available to all paying customers. The new tool expands from monitoring to giving website owners direct control over how AI systems access their content.
Here's how the feature works: As long as you're a paying Cloudflare customer, you can access the AI Crawl Control dashboard. From there, you can block individual bots from a list of known AI crawlers.
When you block a bot, you can now choose to respond with a custom message using the 402 Payment Required status code. So the bot will see a message like "To license this content for AI training, please email our partnerships team at partnerships@yourdomain.com."
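To make the idea concrete, here is a minimal Python sketch of the behavior described above: a blocked AI crawler gets a 402 Payment Required status with a licensing message, while everyone else gets the page. This is not Cloudflare's implementation. The `respond_to` function and the short `BLOCKED_AI_BOTS` list are assumptions made for illustration, and matching on the User-Agent string is a simplification of however the real service identifies bots.

```python
from http import HTTPStatus

# Illustrative user-agent substrings only; the actual list of known AI
# crawlers lives in the Cloudflare dashboard and is not reproduced here.
BLOCKED_AI_BOTS = ("GPTBot", "ClaudeBot", "CCBot")

# The example message quoted in the article.
LICENSING_MESSAGE = (
    "To license this content for AI training, please email our "
    "partnerships team at partnerships@yourdomain.com."
)


def respond_to(user_agent: str) -> tuple[int, str]:
    """Return (status code, body) for a request with the given User-Agent.

    Hypothetical helper: blocked bots get 402 plus the licensing message,
    all other clients get a normal 200 response.
    """
    ua = user_agent.lower()
    if any(bot.lower() in ua for bot in BLOCKED_AI_BOTS):
        return HTTPStatus.PAYMENT_REQUIRED.value, LICENSING_MESSAGE
    return HTTPStatus.OK.value, "regular page content"


if __name__ == "__main__":
    for agent in ("Mozilla/5.0 (compatible; GPTBot/1.0)", "Mozilla/5.0 Firefox/128.0"):
        status, body = respond_to(agent)
        print(status, body)
```

A crawler that understands the 402 status can read the message body and follow up with the site owner instead of simply being turned away.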
Cloudflare says it is releasing this feature because customers have different needs when it comes to AI. The company claims some want to license their content for revenue but find that crawlers do not know how to reach them, while others just want to block the bots entirely. For those who want to negotiate, simply blocking traffic closes the door on potential deals.
In the future, the company plans to add more parameters to the 402 response. These parameters could allow crawlers to see structured data about the content's value, how often it is updated, and specific licensing terms directly in the response.