
So I've hosted my own personal/family server for years now with services like Jellyfin, Nextcloud, etc. In light of the recent retroactive editing of information I thought it might be a good idea to set up a Kiwix mirror as a "just in case" for certain resources that might come under attack in the current political climate. It has done fine and is running without issue ( https://kiwix.marcusadams.me if you're curious ), but lately I have noticed a humongous influx of AI scraping bots trying to index the contents of that mirror.

On one hand, I don't mind search engines indexing it and directing people to it if the original source of that material is somehow unavailable to someone. I can see in my log file that in all of yesterday "Googlebot" made 188 requests to it. However, in that same period of time, "ClaudeBot" made over 800,000 requests. I've been seeing these requests in the logs for a couple of weeks now, but started to get a little curious because it's been going 24 hours a day for at least a week or two. Upon investigating a little further I noticed that several AI scrapers from Claude, Meta and even Bytedance have been hammering my poor little home server non-stop for a while now.

It would be one thing if they just got a directory listing and then used that to inform a search engine, but no, they're trying to fully index the entire contents of the site, which comprises nearly a terabyte of data, at a rate of about 100 kB/s, I'm guessing to stay under the radar with regard to rate limiting and such. The problem is that this is a personal server and the RAID array sits on spinning rust hard drives, so even if it's only at 100 kB/s, the end result is that they're keeping my read heads moving around constantly to scrape data to train their AI with.
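For anyone who wants the same kind of per-bot request counts out of their own logs, here is a minimal Python sketch. It assumes the Apache "combined" log format, where the user agent is the last quoted field on each line; the keyword list and the `count_bot_requests` name are illustrative choices, not something taken from the original setup.

```python
import re
from collections import Counter

# Assumed list of user agent substrings to tally; adjust to taste.
BOT_KEYWORDS = ["Googlebot", "ClaudeBot", "meta-externalagent", "Bytespider", "GPTBot"]

# In the combined log format the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_bot_requests(lines, keywords=BOT_KEYWORDS):
    """Return a Counter mapping each keyword to its number of matching requests."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1).lower()
        for keyword in keywords:
            # Case-insensitive substring match; count each line at most once.
            if keyword.lower() in user_agent:
                counts[keyword] += 1
                break
    return counts


if __name__ == "__main__":
    # Made-up sample lines purely for demonstration.
    sample = [
        '203.0.113.5 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
        '203.0.113.6 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    ]
    for name, total in count_bot_requests(sample).most_common():
        print(f"{name}: {total}")
```

To run it against a real log, pass an open file instead of the sample list, for example `count_bot_requests(open("/var/log/apache2/access.log", errors="replace"))`.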

So I ultimately decided that I need to just block abusive AI scrapers, both because they steal data en masse to train an AI so you never have to visit an original source, and because I don't think any one of those billion dollar companies is gonna send me a dollar to help pay for my electricity, internet or replacement drives when they fail, despite them benefiting from all those things.


First I tried creating a custom filter for Fail2Ban.  That works, but my inbox has exploded.  Initially I had the limit set to 5 connections within a 10 minute window before triggering a ban.  As of writing I have right at 1,900 emails in my system inbox; all notifications from Fail2Ban of unique IP addresses that have been banned.  It seems like as soon as one IP gets banned they just move to another one and keep going.


It's been almost 24 hours and Fail2Ban has slowed them down considerably.  Claude dropped from over 800,000 requests in yesterday's log to just over 75,000 today.  That's probably mostly up to the time it takes them to realize they've been blocked and switch to a new IP address.  You would think they'd take the blatant blocking as a sign that their behavior was not welcome.  Nope.  They're all still going.  So this evening I've made two more changes.

First, I reduced the number of requests they have to issue to trigger a ban from 5 to 1.  I had originally figured allowing 5 requests within 10 minutes would be plenty on the off chance an image or something in a search result was pulled from something I'm hosting.

Second, I've modified the Apache site config for the archive to give a "403 Forbidden" response to AI bots matching the same user agents that I also have blocked in Fail2Ban.

It's been about an hour and the requests still haven't slowed down, even though now not only are they having to switch IPs more often to get around firewall bans after every single request, but they aren't even getting the one file they requested, they're just getting a 403 error.


It seems to me like the recent release of DeepSeek has instigated some kind of AI arms race where any data that is publicly accessible is fair game for training your models.  It doesn't matter if it's straight from reputable news sites, or some random dude's home NAS running on an old PC tower in the back woods of Kentucky.


I just wanted to share this little anecdote about what's going on right now; and maybe give a heads up to those of you who host similar services, especially if you have bandwidth caps or anything on your personal stuff or if you're hosting it on a VPS that may charge you based on bandwidth.


Put your site behind Cloudflare, it's free and they have a new anti-AI bot feature to send bots into an AI hell spiral of ###### data so they stop indexing your site(s).

On 24/03/2025 at 00:43, binaryzero said:

Restrict your firewall to only allow connections from a known list of IPs (i.e. the people using the media server); welcome to having your infrastructure open to the world... 

The dreaded 'Any' rule strikes again.

I would, except I do occasionally use my Nextcloud instance to share files with friends and family members.  My wife and I don't have Facebook so whenever there's a birthday party or something I'll make an album on Nextcloud and then share a link to it with grandparents and other interested parties.  I do have a few things tightened down that way, such as SSH access not being forwarded and only accepting connections from the LAN/VPN IP ranges, but Apache is one that needs to remain open.


I just saw an article the other day where iFixit said their website was hit over a million times in a 24 hour period by the ClaudeBot.  Mine racked up over 800,000.  It's getting wild; they're just hoovering up anything they can find to try to stay competitive with Chinese companies.


The suggestion of putting your public mirrors behind Cloudflare by Mud W1ggle is a good shot. The data will be cached, so less load on your server. You get the protection of Cloudflare's web application firewall, plus they are pretty good at blocking bots / unwanted traffic too.

Something you could do is keep any publicly accessible data like the Wikipedia mirror on an NVMe drive, then only keep your media library and family photos on your array. That should mean the array would be spun down most of the time unless family are accessing this media.

I have various Docker containers running from an NVMe drive, including a game server. Yet my array is idle most of the time unless someone starts playing something via Plex or accesses family photos via an SMB share.

This results in very low idle power usage, despite having quite a few self hosted services running:

[Screenshot: idle power usage]

On 25/03/2025 at 06:32, hornett said:

I've got the same issue, if you get a chance, would you mind sharing your fail2ban filter definition & the jail config? 

Thanks

Contents of /etc/fail2ban/filter.d/aibots.conf (Rename it whatever you want):

#Fail2Ban filter for misbehaving AI scrapers and bots
#that don't respect robots.txt
#Marcus Dean Adams

[Definition]
failregex = ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*ClaudeBot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*meta-externalagent.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*meta-externalfetcher.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*Bytespider.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*GPTBot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*anthropic-ai.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*FacebookBot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*Diffbot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*PerplexityBot.*$

Contents of /etc/fail2ban/jail.d/aibots.local:

[aibots]
enabled = true
port = 80,443
filter = aibots
maxretry = 1
bantime = 168h
findtime = 10m
logpath = /var/log/apache2/access.log
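To see why dropping `maxretry` from 5 to 1 matters against scrapers that rotate addresses, here is a simplified Python model of how `maxretry` and `findtime` interact. It is only a sketch under stated assumptions: it treats a ban as triggered once an IP has `maxretry` matching log lines within `findtime` seconds, it ignores `bantime` expiry entirely, and `simulate_bans` is a hypothetical helper, not part of Fail2Ban.

```python
from collections import defaultdict, deque


def simulate_bans(events, maxretry, findtime):
    """Simplified model of a Fail2Ban jail.

    events   -- iterable of (timestamp_in_seconds, ip), sorted by time,
                one entry per log line that matched the failregex
    maxretry -- matches within the window that trigger a ban
    findtime -- window length in seconds

    Returns a list of (timestamp, ip) pairs at which a ban would be issued.
    Ban expiry (bantime) is deliberately not modelled.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    banned = set()
    bans = []
    for timestamp, ip in events:
        if ip in banned:
            continue  # the firewall would already be dropping this address
        window = recent[ip]
        window.append(timestamp)
        # Forget matches that have fallen out of the findtime window.
        while window and timestamp - window[0] > findtime:
            window.popleft()
        if len(window) >= maxretry:
            banned.add(ip)
            bans.append((timestamp, ip))
    return bans


if __name__ == "__main__":
    # Five requests, each from a different (documentation-range) address.
    rotating = [(second, f"198.51.100.{second}") for second in range(5)]
    print("maxretry=5:", simulate_bans(rotating, maxretry=5, findtime=600))
    print("maxretry=1:", simulate_bans(rotating, maxretry=1, findtime=600))
```

In this model a scraper that switches address before reaching the threshold is never banned at `maxretry = 5`, while at `maxretry = 1` every address is banned on its first matching request.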

There are other bots out there with different user agent strings you may want to add to your filter, but Google and the few others I've seen haven't been spamming the living daylights out of me so I've left them alone.
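Before adding a new user agent to the filter, it can help to check the pattern against a few log lines. Fail2Ban ships a `fail2ban-regex` tool that is the authoritative way to do that; the Python sketch below is only a rough approximation for quick experiments. The `(?P<host>\S+)` group is an assumed stand-in for Fail2Ban's `<HOST>` tag (which really expands to a stricter address pattern), and `build_patterns` and `matched_host` are made-up helper names.

```python
import re

# User agent substrings from the filter above.
BOT_NAMES = [
    "ClaudeBot", "meta-externalagent", "meta-externalfetcher", "Bytespider",
    "GPTBot", "anthropic-ai", "FacebookBot", "Diffbot", "PerplexityBot",
]
METHODS = "GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT"


def build_patterns(bot_names=BOT_NAMES):
    """Compile one regex per bot name, mirroring the failregex lines."""
    return [
        # (?P<host>\S+) approximates <HOST>; it is not Fail2Ban's real expansion.
        re.compile(rf'^(?P<host>\S+) -.*"({METHODS}).*{re.escape(name)}.*$')
        for name in bot_names
    ]


def matched_host(line, patterns):
    """Return the client address if any pattern matches the line, else None."""
    for pattern in patterns:
        match = pattern.match(line)
        if match:
            return match.group("host")
    return None


if __name__ == "__main__":
    # Made-up log line purely for demonstration.
    line = ('203.0.113.7 - - [01/Jan/2025:00:00:00 +0000] "GET /x HTTP/1.1" '
            '200 10 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"')
    print(matched_host(line, build_patterns()))
```

Passing a different list, such as `build_patterns(["SomeNewBot"])`, lets you try out a candidate name before touching the real filter file.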

I also made a change to the Apache site configuration file and added this so that anything with one of the specified user agents gets a "403 Forbidden" error instead of actually getting the file they requested.

                #Block AI Bots
                RewriteEngine on

                RewriteCond %{HTTP_USER_AGENT}  ^.*Bytespider.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*ClaudeBot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*meta-externalagent.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*meta-externalfetcher.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*GPTBot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*anthropic-ai.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*FacebookBot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*Diffbot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*PerplexityBot.*$
                RewriteRule . - [R=403,L]
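Since the Fail2Ban filter and the Apache rules share the same list of user agents, one way to keep them in sync is to generate both fragments from a single list. The Python sketch below does that; the function names are hypothetical, the output simply mirrors the snippets posted above, and it assumes the names contain no regex metacharacters (true for the current list).

```python
# Single source of truth for the user agent substrings to block.
BOT_NAMES = [
    "ClaudeBot", "meta-externalagent", "meta-externalfetcher", "Bytespider",
    "GPTBot", "anthropic-ai", "FacebookBot", "Diffbot", "PerplexityBot",
]
METHODS = "GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT"


def fail2ban_failregex(bot_names):
    """Build the multi-line failregex value for the Fail2Ban filter."""
    lines = [f'^<HOST> -.*"({METHODS}).*{name}.*$' for name in bot_names]
    # Continuation lines must be indented so they stay part of the same option.
    indent = " " * len("failregex = ")
    return "failregex = " + ("\n" + indent).join(lines)


def apache_rewrite_block(bot_names):
    """Build the matching mod_rewrite rules that answer with a 403."""
    rules = [
        f"RewriteCond %{{HTTP_USER_AGENT}}  ^.*{name}.*$\n"
        "RewriteRule . - [R=403,L]"
        for name in bot_names
    ]
    return "#Block AI Bots\nRewriteEngine on\n\n" + "\n\n".join(rules)


if __name__ == "__main__":
    print(fail2ban_failregex(BOT_NAMES))
    print()
    print(apache_rewrite_block(BOT_NAMES))
```

The printed text still has to be pasted into the respective config files by hand, and both services reloaded, for the change to take effect.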

 

On 25/03/2025 at 06:32, hornett said:

I've got the same issue, if you get a chance, would you mind sharing your fail2ban filter definition & the jail config? 

Thanks

I got tired of being blown up with the emails from this jail (4,500+ unique IP addresses banned since turning this jail on a couple days ago), partly because of the notifications, partly because it was drowning out legitimate emails from the server, so I slightly modified the jail file to specify an action that doesn't include sending the email.  I also bumped up the ban time to 4 weeks.

New contents of /etc/fail2ban/jail.d/aibots.local

[aibots]
enabled = true
port = 80,443
filter = aibots
maxretry = 1
bantime = 672h
findtime = 10m
logpath = /var/log/apache2/access.log
action = %(action_)s

 

The spam has slowed down considerably.  I still get a couple new banned IPs every hour, but after I made this initial post where I thought things were slowing down they picked right back up because Bytedance seemed to be picking up the slack after Claude started slowing down; just hammering me non-stop.  Before instituting the block I was getting over a million automated bot queries a day (Predominantly Claude at first) and since implementing the block it's slowed them down considerably due to having to switch addresses constantly, but I've still racked up 4.5k unique IP addresses on the block list since Sunday.  I've bumped the ban time from 1 to 4 weeks and where I was getting 1 or more banned IPs every minute; this morning that slowed down to one every 3 or 4 minutes and it's now down to one IP every 6-10 minutes, so they're either turning their attention away from me or just straight up running out of IP addresses to swap to.

As long as other folks like Google keep their traffic reasonable I hopefully won't have to add anybody else to the list.


Edited by Gerowen

Anything that is public on my home server I tunnel through Cloudflare using Cloudflared https://github.com/cloudflare/cloudflared. I'd recommend it, you don't need to open any ports and you also get a lot of security features, caching etc for free.

On 27/03/2025 at 03:04, SuperKid said:

Anything that is public on my home server I tunnel through Cloudflare using Cloudflared https://github.com/cloudflare/cloudflared. I'd recommend it, you don't need to open any ports and you also get a lot of security features, caching etc for free.

I'll definitely check it out; you're the 2nd person who has mentioned Cloudflare.  I've just been busy with other stuff and haven't taken the time to sit down and take a look at it.

  • 2 months later...
On 27/03/2025 at 08:04, SuperKid said:

Anything that is public on my home server I tunnel through Cloudflare using Cloudflared https://github.com/cloudflare/cloudflared. I'd recommend it, you don't need to open any ports and you also get a lot of security features, caching etc for free.

Is this like Tailscale?

Minor update.  All is going well for the most part.  Picked up some new information today though from my server logs.

Just discovered a new #ByteDance scraper's useragent; imageSpider.  More specifically:

"Mozilla/5.0 (compatible; imageSpider; [email protected])"

Added to my list.

I'm also adding #Alibaba IP ranges to a "drop" rule on my firewall because they've been aggressively scraping me (hundreds of thousands of requests per day), but they're not using accurate user agents.  They're ignoring robots.txt (big surprise) and pretending to be everything from Chrome 114 to Internet Explorer 6 and not self-identifying as a bot.
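When the user agent can't be trusted, the client address is the only thing left to filter on. As a rough illustration, this Python sketch uses the standard `ipaddress` module to count how many log lines come from a set of CIDR ranges, which can help decide whether a range is worth dropping at the firewall. The ranges shown are placeholders from the documentation-reserved blocks, not Alibaba's real allocations, and the function names are made up for the example.

```python
import ipaddress

# Placeholder CIDR blocks (reserved for documentation), NOT real provider ranges.
BLOCKED_RANGES = ["198.51.100.0/24", "203.0.113.0/24"]


def parse_networks(cidrs):
    """Turn CIDR strings into ip_network objects."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_blocked(ip, networks):
    """True if the address falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


def count_requests_from_ranges(lines, networks):
    """Count log lines whose first field is an address inside the networks."""
    total = 0
    for line in lines:
        client = line.split(" ", 1)[0]
        try:
            if is_blocked(client, networks):
                total += 1
        except ValueError:
            continue  # first field was not an IP address (e.g. a hostname)
    return total


if __name__ == "__main__":
    networks = parse_networks(BLOCKED_RANGES)
    # Made-up log lines purely for demonstration.
    sample = [
        '198.51.100.20 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 1 "-" "Mozilla/5.0"',
        '192.0.2.1 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 1 "-" "Mozilla/5.0"',
    ]
    print(count_requests_from_ranges(sample, networks))
```

The real ranges would have to come from the provider's published allocations or a WHOIS lookup on the offending addresses.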

I've put it behind a login for the time being.  I had something like 600,000 requests just from Alibaba IP addresses that didn't identify themselves as bots or scrapers, and so weren't easy to block using user agent filtering.  I didn't have any issues with bandwidth or accessibility, but that's 600,000 requests just from one cloud provider made to my spinning rust hard drives, which I have to personally pay for when they die, by bots being run by corrupt mega corporations ignoring my polite requests that they not scrape me and that the information only be accessed by real humans.

If any of y'all here were actually using my Kiwix mirror, I have no issue whatsoever creating a username and password for you, just hit me up using one of the methods listed on my personal site and I'll make one for you.

https://marcusadams.me

Added an extra filter to Fail2Ban.  I thought about just adding this to my existing aibots filter, but for the time being I'm keeping it separate because it's "possible" real humans may trigger this one so as long as it doesn't start filling my inbox I'd like to get notified about these so I can adjust it as necessary in the future.

I'm still holding close to 10k unique IP addresses at any given time that have been banned via the "aibots" filter that looks for certain user agent strings of known AI scrapers.  However, I've been getting an increasing amount of traffic trying to scrape the site with sanitized user agent strings that just look like normal web browsers.

Because I enabled authentication I can now see that they're racking up lots of 401 (unauthorized) responses in the Apache "access.log" file, but they're not triggering anything in the Apache "error.log" file, which is where failed attempts to log in would appear.  Basically, if an actual human tries to log in with an invalid username and password, the attempt doesn't immediately go into "access.log" as a 401, it goes into "error.log" with a status message such as "user FOO not found".  The only way to trigger a 401 simply by visiting the site, as far as I'm aware, is to hit "Cancel" on the login prompt, or otherwise try to access files directly without properly authenticating.

So, given the fact I'm getting a few thousand 401 errors a day from sanitized user agent strings that don't show up in "error.log", which means no attempt at logging in properly, I added another jail/filter set to Fail2Ban to immediately ban anybody who triggers a 401.  This feels a bit nuclear so I may need to adjust it in the future, but as far as I'm aware so far no real humans are being inconvenienced so all I'm doing is wasting the time of some AI scraper bots.

Example log entry

61.170.149.70 - - [25/Jun/2025:20:01:04 -0400] "GET /content/mdwiki_en_all_maxi_2024-06/A/Neuroregeneration HTTP/1.1" 401 3287 "https://kiwix.marcusadams.me/content/mdwiki_en_all_maxi_2024-06/A/Neuroregeneration" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43"
	

Contents of /etc/fail2ban/filter.d/apache-401repeat.conf

#Fail2Ban filter for bots and scrapers that try to access
#files directly without entering credentials for apache2-auth
#and therefore trigger lots of 401 errors without triggering
#the apache-auth jail.
#
#Marcus Dean Adams

[Definition]
failregex = ^<HOST> .+\" 401 \d+ .*$
	

Contents of /etc/fail2ban/jail.d/apache-401repeat.local

[apache-401repeat]
enabled = true
ignoreip = 10.1.1.1
port = 80,443
filter = apache-401repeat
maxretry = 1
bantime = 672h
findtime = 10m
logpath = /var/log/apache2/access.log
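A quick way to sanity-check the 401 filter is to run its pattern against the example log entry posted above. This is a minimal Python sketch, not how Fail2Ban itself evaluates filters: `(?P<host>\S+)` is an assumed stand-in for the `<HOST>` tag, and `host_if_401` is a made-up helper name.

```python
import re

# Same pattern as the failregex above, with <HOST> approximated by a named group.
PATTERN_401 = re.compile(r'^(?P<host>\S+) .+\" 401 \d+ .*$')

# The example log entry from the post.
EXAMPLE_ENTRY = (
    '61.170.149.70 - - [25/Jun/2025:20:01:04 -0400] '
    '"GET /content/mdwiki_en_all_maxi_2024-06/A/Neuroregeneration HTTP/1.1" 401 3287 '
    '"https://kiwix.marcusadams.me/content/mdwiki_en_all_maxi_2024-06/A/Neuroregeneration" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43"'
)


def host_if_401(line):
    """Return the client address if the line records a 401 response, else None."""
    match = PATTERN_401.match(line)
    return match.group("host") if match else None


if __name__ == "__main__":
    print(host_if_401(EXAMPLE_ENTRY))
```

One thing worth testing with a real browser before relying on `maxretry = 1`: with HTTP Basic authentication, a browser's first request normally arrives without credentials and is itself answered with a 401, so a legitimate visitor could match this pattern before the login prompt even appears.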
	

Oh, and all this traffic is AFTER I explicitly banned Alibaba's IP ranges that were absolutely blowing me up day and night.


Observation; two of the IP addresses that have triggered this jail in the 30 or so minutes since I turned it on were owned by Microsoft.  Wonder if they're doing their own AI scraping/probing, or if that's just an Azure VM owned by somebody else.

