
So I've hosted my own personal/family server for years now, with services like Jellyfin, Nextcloud, etc.  In light of the recent retroactive editing of information, I thought it might be a good idea to set up a Kiwix mirror as a "just in case" for certain resources that might come under attack in the current political climate.  It has done fine and is running without issue ( https://kiwix.marcusadams.me if you're curious ), but lately I have noticed a humongous influx of AI scraping bots trying to index the contents of that mirror.

On one hand, I don't mind search engines indexing it and directing people to it if the original source of that material is somehow unavailable to someone.  I can see in my log file that in all of yesterday "Googlebot" made 188 requests to it.  In that same period of time, however, "ClaudeBot" made over 800,000 requests.  I've been seeing these requests in the logs for a couple of weeks now, and started to get a little curious because it's been going 24 hours a day for at least a week or two.  Upon investigating a little further I noticed that several AI scrapers from Anthropic, Meta and even ByteDance have been hammering my poor little home server non-stop for a while now.

It would be one thing if they just grabbed a directory listing and used that to inform a search engine, but no, they're trying to fully index the entire contents of the site, which comprises nearly a terabyte of data, at a rate of about 100 kB/s; I'm guessing to try to stay under the radar with regard to rate limiting and such.  The problem is that this is a personal server and the RAID array sits on spinning-rust hard drives, so even at only 100 kB/s, the end result is that they're keeping my read heads moving around constantly to scrape data to train their AI with.

So I ultimately decided that I need to just block abusive AI scrapers; partly because they steal data en masse to train an AI so you never have to visit the original source, but also because I don't think any of those billion-dollar companies are going to send me a dollar to help pay for my electricity, internet or replacement drives when they fail, despite benefiting from all of those things.


First I tried creating a custom filter for Fail2Ban.  That works, but my inbox has exploded.  Initially I had the limit set to 5 connections within a 10-minute window before triggering a ban.  As of writing I have right at 1,900 emails in my system inbox, all notifications from Fail2Ban of unique IP addresses that have been banned.  It seems like as soon as one IP gets banned they just move to another one and keep going.


It's been almost 24 hours and Fail2Ban has slowed them down considerably.  Claude dropped from over 800,000 requests in yesterday's log to just over 75,000 today.  That's probably mostly down to the time it takes them to realize they've been blocked and switch to a new IP address.  You would think they'd take the blatant blocking as a sign that their behavior was not welcome.  Nope.  They're all still going.  So this evening I've made two more changes.

First, I reduced the number of requests it takes to trigger a ban from 5 down to 1.  I had originally figured that allowing 5 requests within 10 minutes would be plenty on the off chance an image or something in a search result was pulled from something I'm hosting.

Second, I've modified the Apache site config for the archive to give a "403 Forbidden" response to AI bots matching the same user agents that I also have blocked in Fail2Ban.

It's been about an hour and the requests still haven't slowed down, even though they now have to switch IPs after every single request to get around the firewall bans, and they aren't even getting the one file they requested; they just get a 403 error.


It seems to me like the recent release of DeepSeek has instigated some kind of AI arms race where any data that is publicly accessible is fair game for training your models.  It doesn't matter if it's straight from reputable news sites, or some random dude's home NAS running on an old PC tower in the back woods of Kentucky.


I just wanted to share this little anecdote about what's going on right now, and maybe give a heads-up to those of you who host similar services, especially if you have bandwidth caps on your personal stuff or you're hosting on a VPS that may charge you based on bandwidth.


Put your site behind Cloudflare; it's free, and they have a new anti-AI-bot feature that sends bots into an AI hell spiral of junk data so they stop indexing your site(s).

  On 24/03/2025 at 04:43, binaryzero said:

Restrict your firewall to only allow connections from a known list of IPs (i.e. the people using the media server); welcome to having your infrastructure open to the world... 

The dreaded 'Any' rule strikes again.


I would, except I do occasionally use my Nextcloud instance to share files with friends and family members.  My wife and I don't have Facebook, so whenever there's a birthday party or something I'll make an album on Nextcloud and then share a link to it with grandparents and other interested parties.  I do have a few things tightened down that way, such as SSH not being forwarded and only accepting connections from the LAN/VPN IP ranges, but Apache is one that needs to remain open.
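For reference, that kind of restriction can also be expressed on the host itself with iptables; a rough sketch, with placeholder LAN/VPN ranges:

#Allow SSH only from the LAN and VPN ranges (placeholder CIDRs), drop it from everywhere else
iptables -A INPUT -p tcp --dport 22 -s 192.168.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -s 10.8.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j DROP
#Apache stays reachable from anywhere
iptables -A INPUT -p tcp -m multiport --dports 80,443 -j ACCEPT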


I just saw an article the other day where iFixit said their website was hit over a million times in a 24-hour period by ClaudeBot.  Mine racked up over 800,000.  It's getting wild; they're just hoovering up anything they can find to try to stay competitive with Chinese companies.


Mud W1ggle's suggestion of putting your public mirrors behind Cloudflare is a good shout.  The data will be cached, so there's less load on your server.  You get the protection of Cloudflare's web application firewall, plus they're pretty good at blocking bots and unwanted traffic too.

Something else you could do is keep any publicly accessible data, like the Wikipedia mirror, on an NVMe drive, and only keep your media library and family photos on your array.  That should mean the array is spun down most of the time unless family are accessing that media.
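If the data drives honor standby timers, you can nudge the spindown along with hdparm; a sketch, with an illustrative device name and timeout:

#-S sets the standby (spin-down) timeout; 242 means 1 hour (values 241-251 are 30-minute steps)
hdparm -S 242 /dev/sda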

I have various Docker containers running from an NVMe drive, including a game server, yet my array is idle most of the time unless someone starts playing something via Plex or accesses family photos via an SMB share.

This results in very low idle power usage, despite having quite a few self-hosted services running.


  On 25/03/2025 at 10:32, hornett said:

I've got the same issue, if you get a chance, would you mind sharing your fail2ban filter definition & the jail config? 

Thanks


Contents of /etc/fail2ban/filter.d/aibots.conf (Rename it whatever you want):

#Fail2Ban filter for misbehaving AI scrapers and bots
#that don't respect robots.txt
#Marcus Dean Adams

[Definition]
failregex = ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*ClaudeBot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*meta-externalagent.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*meta-externalfetcher.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*Bytespider.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*GPTBot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*anthropic-ai.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*FacebookBot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*Diffbot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*PerplexityBot.*$

Contents of /etc/fail2ban/jail.d/aibots.local:

[aibots]
enabled = true
port = 80,443
filter = aibots
maxretry = 1
bantime = 168h
findtime = 10m
logpath = /var/log/apache2/access.log

There are other bots out there with different user-agent strings you may want to add to your filter, but Google and the few others I've seen haven't been spamming the living daylights out of me, so I've left them alone.
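If you adapt the filter, it's worth sanity-checking it against your own access log before reloading Fail2Ban, e.g.:

#Test the filter against the live Apache access log (paths assume the stock Debian/Ubuntu layout)
fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/aibots.conf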

I also made a change to the Apache site configuration file and added the following so that anything with one of the specified user agents gets a 403 Forbidden error instead of actually getting the file it requested.

                #Block AI Bots
                RewriteEngine on

                RewriteCond %{HTTP_USER_AGENT}  ^.*Bytespider.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*ClaudeBot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*meta-externalagent.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*meta-externalfetcher.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*GPTBot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*anthropic-ai.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*FacebookBot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*Diffbot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*PerplexityBot.*$
                RewriteRule . - [R=403,L]
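If you prefer, the same block list can be collapsed into a single case-insensitive condition; a sketch covering the same user agents:

                #Equivalent compact form: one case-insensitive condition for all of the bots listed above
                RewriteEngine on
                RewriteCond %{HTTP_USER_AGENT} (Bytespider|ClaudeBot|meta-externalagent|meta-externalfetcher|GPTBot|anthropic-ai|FacebookBot|Diffbot|PerplexityBot) [NC]
                RewriteRule . - [F]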

 

  On 25/03/2025 at 10:32, hornett said:

I've got the same issue, if you get a chance, would you mind sharing your fail2ban filter definition & the jail config? 

Thanks


I got tired of being blown up with emails from this jail (4,500+ unique IP addresses banned since turning it on a couple of days ago), partly because of the sheer volume of notifications and partly because they were drowning out legitimate emails from the server, so I slightly modified the jail file to specify an action that doesn't send the email.  I also bumped the ban time up to 4 weeks.

New contents of /etc/fail2ban/jail.d/aibots.local

[aibots]
enabled = true
port = 80,443
filter = aibots
maxretry = 1
bantime = 672h
findtime = 10m
logpath = /var/log/apache2/access.log
action = %(action_)s
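#%(action_)s just bans; the %(action_mw)s / %(action_mwl)s variants also send the notification email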

 


The spam has slowed down considerably.  I still get a couple of new banned IPs every hour, but after my initial post, where I thought things were slowing down, they picked right back up; ByteDance seemed to pick up the slack once Claude slowed down and just hammered me non-stop.  Before instituting the block I was getting over a million automated bot requests a day (predominantly Claude at first).  Since implementing it, things have slowed considerably because they have to switch addresses constantly, but I've still racked up 4.5k unique IP addresses on the block list since Sunday.

I've bumped the ban time from 1 week to 4 weeks, and where I was getting one or more banned IPs every minute, this morning that slowed to one every 3 or 4 minutes, and it's now down to one every 6-10 minutes, so they're either turning their attention away from me or just straight up running out of IP addresses to swap to.

As long as other folks like Google keep their traffic reasonable I hopefully won't have to add anybody else to the list.


Anything that is public on my home server I tunnel through Cloudflare using Cloudflared https://github.com/cloudflare/cloudflared. I'd recommend it, you don't need to open any ports and you also get a lot of security features, caching etc for free.
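For anyone curious, the basic flow looks roughly like this (the tunnel name and hostname below are illustrative; the hostname-to-local-service mapping goes in ~/.cloudflared/config.yml):

#Authenticate, create a named tunnel, point a DNS record at it, then run it
cloudflared tunnel login
cloudflared tunnel create homeserver
cloudflared tunnel route dns homeserver kiwix.example.com
cloudflared tunnel run homeserver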

  On 27/03/2025 at 07:04, SuperKid said:

Anything that is public on my home server I tunnel through Cloudflare using Cloudflared https://github.com/cloudflare/cloudflared. I'd recommend it, you don't need to open any ports and you also get a lot of security features, caching etc for free.


I'll definitely check it out; you're the 2nd person who has mentioned Cloudflare.  I've just been busy with other stuff and haven't taken the time to sit down and take a look at it.

  On 27/03/2025 at 07:04, SuperKid said:

Anything that is public on my home server I tunnel through Cloudflare using Cloudflared https://github.com/cloudflare/cloudflared. I'd recommend it, you don't need to open any ports and you also get a lot of security features, caching etc for free.


Is this like Tailscale?

Minor update.  All is going well for the most part, but I picked up some new information today from my server logs.

Just discovered a new #ByteDance scraper user agent: imageSpider.  More specifically:

"Mozilla/5.0 (compatible; imageSpider; scrapedia-receive@bytedance.com)"

Added to my list.
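In the aibots filter from earlier, that just means one more failregex line along these lines:

            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*imageSpider.*$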

I'm also adding #Alibaba IP ranges to a "drop" rule on my firewall because they've been aggressively scraping me (hundreds of thousands of requests per day), but they're not using accurate user agents.  They're ignoring robots.txt (big surprise) and pretending to be everything from Chrome 114 to Internet Explorer 6 rather than identifying themselves as bots.
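For anyone doing the same, the drop rule itself is a one-liner; the CIDR below is just a placeholder, so substitute whatever ranges actually show up in your own logs:

#Drop all traffic from an offending range (placeholder CIDR)
iptables -I INPUT -s 203.0.113.0/24 -j DROP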

I've put it behind a login for the time being.  I had something like 600,000 requests just from Alibaba IP addresses that didn't identify themselves as bots or scrapers, so they weren't easy to block with user-agent filtering.  I didn't have any issues with bandwidth or accessibility, but that's 600,000 requests from just one cloud provider hitting my spinning-rust hard drives, which I have to personally pay for when they die, made by bots run by corrupt mega corporations ignoring my polite requests that they not scrape me and that the information only be accessed by real humans.
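The login itself is just Apache authentication in front of the mirror; a minimal sketch assuming basic auth inside the vhost (the realm name and file path are illustrative):

        <Location />
                AuthType Basic
                AuthName "Kiwix mirror"
                AuthUserFile /etc/apache2/.htpasswd
                Require valid-user
        </Location>

Accounts go into the password file with htpasswd, e.g. "htpasswd -c /etc/apache2/.htpasswd someuser" (drop the -c after the first user).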

If any of y'all here were actually using my Kiwix mirror, I have no issue whatsoever creating a username and password for you; just hit me up using one of the methods listed on my personal site and I'll make one for you.

https://marcusadams.me


Added an extra filter to Fail2Ban.  I thought about just adding this to my existing aibots filter, but for the time being I'm keeping it separate because it's possible real humans may trigger this one, so as long as it doesn't start filling my inbox I'd like to be notified about these bans so I can adjust the filter as necessary in the future.

I'm still holding close to 10k unique banned IP addresses at any given time via the "aibots" filter that looks for the user-agent strings of known AI scrapers.  However, I've been getting an increasing amount of traffic trying to scrape the site with sanitized user-agent strings that just look like normal web browsers.

Because I enabled authentication, I can now see that they're racking up lots of 401 (Unauthorized) responses in the Apache "access.log" file, but they're not triggering anything in the Apache "error.log" file, which is where failed login attempts would appear.  Basically, if an actual human tries to log in with an invalid username and password, it doesn't immediately show up in "access.log" as a 401; it goes into "error.log" with a status message such as "user FOO not found".  The only way to trigger a 401 simply by visiting the site, as far as I'm aware, is to hit "Cancel" on the login prompt, or otherwise try to access files directly without properly authenticating.

So, given that I'm getting a few thousand 401 errors a day from sanitized user-agent strings with nothing showing up in "error.log", which means no attempt to log in properly, I added another jail/filter set to Fail2Ban to immediately ban anybody who triggers a 401.  This feels a bit nuclear, so I may need to adjust it in the future, but as far as I can tell no real humans are being inconvenienced, so all I'm doing is wasting the time of some AI scraper bots.

Example log entry:

61.170.149.70 - - [25/Jun/2025:20:01:04 -0400] "GET /content/mdwiki_en_all_maxi_2024-06/A/Neuroregeneration HTTP/1.1" 401 3287 "https://kiwix.marcusadams.me/content/mdwiki_en_all_maxi_2024-06/A/Neuroregeneration" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43"
	

Contents of /etc/fail2ban/filter.d/apache-401repeat.conf

#Fail2Ban filter for bots and scrapers that try to access
#files directly without entering credentials for apache2-auth
#and therefore trigger lots of 401 errors without triggering
#the apache-auth jail.
#
#Marcus Dean Adams
[Definition]
failregex = ^<HOST> .+\" 401 \d+ .*$

Contents of /etc/fail2ban/jail.d/apache-401repeat.local

[apache-401repeat]
enabled = true
ignoreip = 10.1.1.1
port = 80,443
filter = apache-401repeat
maxretry = 1
bantime = 672h
findtime = 10m
logpath = /var/log/apache2/access.log
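Once the jail is running, you can see what it has caught with fail2ban-client, e.g.:

#List currently banned IPs for this jail
fail2ban-client status apache-401repeat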
	

Oh, and all this traffic is AFTER I explicitly banned Alibaba's IP ranges that were absolutely blowing me up day and night.


Observation: two of the IP addresses that triggered this jail in the 30 or so minutes since I turned it on are owned by Microsoft.  I wonder if they're doing their own AI scraping/probing, or if that's just an Azure VM owned by somebody else.

