If you use Cloudflare, you can use the ruleset below in your WAF panel to block the current major players.
These AI bot trainers really should make their services opt-in instead of forcing an opt-out. They are hoovering up tons of free data and turning around and making a profit on it for themselves. I really prefer not to support them. If they want to use the data created on my site, they can do what one did with Reddit... pay for it.
Code:
(http.user_agent contains "claudebot") or (http.user_agent contains "CCBot") or (http.user_agent contains "ChatGPT-User") or (http.user_agent contains "GPTBot") or (http.user_agent contains "Omgili") or (http.user_agent contains "ImagesiftBot") or (http.user_agent contains "cohere-ai") or (http.user_agent contains "anthropic-ai") or (http.user_agent contains "Google-Extended") or (http.user_agent contains "Bytespider")
The diffbot one is the default for the package but can be changed by whoever is using it, so checking your logs for bot visits remains a good idea.
If you have shell access on your server and want to quickly check your logs for certain terms, grep is your friend. You do need read access to the logs, and if you have that, it is as simple as
cat access.log | grep searchword
where access.log is your HTTP server access log name and searchword is the word you want to search for.
An example using the search word bot, from today's log of my astro site.
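If you want a quick tally per crawler instead of scrolling through raw matches, the same grep idea can be wrapped in a small loop. This is only a rough sketch: the function name count_bots is made up for illustration, the list of names is taken from the WAF rule above, and it assumes a plain-text access log that includes the user agent on each line.

```shell
# Count how many log lines mention each crawler name.
# count_bots is a hypothetical helper; pass it the path to your access log.
count_bots() {
    log="$1"
    for bot in claudebot CCBot ChatGPT-User GPTBot Omgili ImagesiftBot \
               cohere-ai anthropic-ai Google-Extended Bytespider; do
        # -i ignore case, -c print the count of matching lines,
        # -F treat the name as a fixed string rather than a regex
        count=$(grep -i -c -F -- "$bot" "$log")
        printf '%s %s\n' "$count" "$bot"
    done | sort -rn   # busiest crawler first
}
```

Run it as count_bots access.log, swapping in your own log name. Names with a count of 0 simply did not show up in that log.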