Don't want to give free content to the AI bots?

If you use Cloudflare, you can use the ruleset below in your WAF panel to block the (currently) major players.
These AI bot trainers really should allow an opt-in to their services instead of forcing it to be an opt-out. They are hoovering up tons of free data and turning around and making a profit on it for themselves. I really prefer not to support them. They want to use the data created on my site, they can do like one did with Reddit... pay for it.

Code:
(http.user_agent contains "claudebot") or (http.user_agent contains "CCBot") or (http.user_agent contains "ChatGPT-User") or (http.user_agent contains "GPTBot") or (http.user_agent contains "Omgili") or (http.user_agent contains "ImagesiftBot") or (http.user_agent contains "cohere-ai") or (http.user_agent contains "anthropic-ai") or (http.user_agent contains "Google-Extended") or (http.user_agent contains "ByteSpider")

The diffbot one is the default for the package but can be changed by whomever is using it, so checking your logs for bot visits remains a good idea.
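
One caveat on the rule: as far as I know, string comparison in Cloudflare rule expressions is case-sensitive, so a fragment only matches when its capitalization is the same as in the bot's real user agent. The rules language has a lower() function that sidesteps this. Here is a minimal shell sketch that builds a case-insensitive version of the expression from a list of names. build_waf_expression is a made-up helper name, the names you pass in are up to you, and the output is meant to be pasted into the expression editor by hand.

```shell
# Build a Cloudflare WAF expression from a list of user-agent fragments.
# Each fragment is lowercased and compared against lower(http.user_agent),
# so the match no longer depends on how the bot capitalizes its name.
build_waf_expression() {
  rule=""
  for name in "$@"; do
    fragment=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
    clause="(lower(http.user_agent) contains \"$fragment\")"
    if [ -z "$rule" ]; then
      rule="$clause"
    else
      rule="$rule or $clause"
    fi
  done
  printf '%s\n' "$rule"
}

# Example call (the list is whatever you want to block):
# build_waf_expression claudebot CCBot GPTBot ByteSpider
```

Keeping the names in one list makes it easier to add a new bot later without hand-editing a long line of "or" clauses.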

If you have shell access on your server and want to check your logs for certain terms quickly, grep is your friend. You need permission to read the logs; if you have that, it is as simple as

Code:
cat access.log | grep searchword

where access.log is the name of your HTTP server access log and searchword is the word you want to search for.

An example using the search word "bot" on today's log from my astro site:

Screen Shot 2024-04-18 at 6.04.22 PM.png
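
If you want counts rather than raw lines, a small wrapper around grep can tally hits per crawler. This is only a sketch: count_bot_hits is a hypothetical helper, the bot names passed in are examples, and it assumes the user agent appears on each log line (as it does in the common combined log format).

```shell
# Count access-log lines per crawler name, ignoring case.
# Usage: count_bot_hits access.log claudebot gptbot bytespider
count_bot_hits() {
  log_file="$1"
  shift
  for bot in "$@"; do
    # -c prints the number of matching lines, -i ignores case,
    # -F treats the name as a fixed string rather than a regex.
    # grep exits non-zero when nothing matches, hence the "|| true".
    hits=$(grep -c -i -F -- "$bot" "$log_file" || true)
    printf '%s\t%s\n' "$bot" "$hits"
  done
}
```

The output is one line per name with a tab between the name and the count, so it can be sorted or pasted into a spreadsheet.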
 
People helping AI are like people who train their own replacement, or are they? Are they training the destroyer of humanity (Terminator movies)?
That aspect I really don't care about.
What I do care about is them taking content created on my site and gaining financial benefit from it without even asking or offering compensation, especially considering that I pay for the pipe they use to pull that data off my site.
I have a feeling that there will soon be laws put on the books dealing with exactly that type of action.
 
Yeah, it's definitely a case of people being exploited because they can't read the fine print, like with the .com thing we were discussing.
 
Just to give you a hint of how heavily some of these hit your site: this sudden rise came right after I added ClaudeBot to the block list.
By far the majority of those blocks (over about a 7 hour timespan) were from that one AI bot.
Screen Shot 2024-04-19 at 1.37.21 AM.png
 
Is there any downside to giving them access though?
 
Yeah. They usually use your content to build their responses with no attribution to your site, and no way for a user to find your site to get more than the piece they used.
Why would I want to give them information that is "hard earned" on many sites now so that they can benefit from it and the site it came from doesn't get squat?

Depending on how a couple of cases go I would not be surprised to see more and more sites being paywalled off, even if the paywall is limiting the full reading to signed in members, with guests only getting a small "taste".
Will there be people that run sites that don't care? Very likely. But there are others that feel that the AI bot scraping is infringing on their hard work.
 
Everyone is trying to get your info, so this is in the same boat. For instance, app deals giving you $1 burgers, even in this inflated economy, are designed to grab your info, which is incredibly valuable.
 
I’m at the point where I’ma block ClaudeBot as well. This is how high my bandwidth usage has been all day. It’s never been this high. There have been at least 20-30 ClaudeBots on my forum since 2pm.
IMG_9536.png
 
I don't know for certain whether ClaudeBot honors robots.txt (from what I've read, it appears to), but it does apparently try to read it... though with the block in place it isn't even allowed to get that far now. And there have been several of these attempts back to back, too.

Screen Shot 2024-04-27 at 11.20.04 AM.png
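
For crawlers that do honor robots.txt, the opt-out is a User-agent group followed by "Disallow: /". Here is a minimal sketch that prints such groups for a list of names. robots_optout is a hypothetical helper, and whether a given crawler actually respects its token is an assumption you would need to check against that crawler's documentation and your own logs. Keep in mind that a bot blocked at the WAF never gets to read robots.txt, so this only matters for bots you let through.

```shell
# Print robots.txt groups that disallow the whole site for each named crawler.
robots_optout() {
  for token in "$@"; do
    printf 'User-agent: %s\nDisallow: /\n\n' "$token"
  done
}

# Example: append opt-out groups to an existing robots.txt.
# robots_optout ClaudeBot GPTBot CCBot Google-Extended >> robots.txt
```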


An interesting read for those that were inquiring on why to block stuff like this.
The reddit one has some drift, but there are still some nuggets of knowledge in there.
 
I wonder why Google never used it against ChatGPT. That would be a hell of a copyright argument.
The internet and society revolve around ChatGPT nowadays, if you didn't know.
 
If user A posts a message on your forum that explains how to take a picture of such a galaxy, does this content belong to you or does it still belong to user A? Real question.

If the content does not belong to you even though you are the one hosting it, do you have the right to restrict its distribution via AI? You don't restrict search engines because they benefit you: your forum gains popularity and activity if its content is well indexed. AI gives you no such benefit, which leads you to block this type of distribution.

I don't blame you for anything, I'm just putting my foot down... it's ultimately a philosophical question.
 
The user holds the copyright to the content of the post. The post itself, as data on the site, belongs to the website per the TOS that they agreed to when they signed up for membership.
Just because they didn't take time to read it does not negate the fact that they have acknowledged it and agreed to abide by it by completing their signup routine.
So yes, I have the right to "restrict" it.... I can decide to paywall the entire site off. I can decide to limit what a guest sees to 3-10 lines of text. I can decide to only show every third, fifth or fiftieth word. I can decide to delete it. I can decide to move it to another area. I even have the right to edit it if it is in violation of the rules & regulations for the site.

As a guest, which is what a bot amounts to, they have no RIGHT to exploit MY bandwidth (at the rate some of them do), which I pay for, for their own financial gain. That is why several of them are entering into contracts with social sites to vacuum up the data on them: they know that if they don't, those sites will shut them off at the doorway. Hell, at one point before I blocked it, one of those bots had 240 connections as "guests" to my site in the span of a few minutes, and that count was climbing (that's when I started working on blocking the AI bots).

And you actually prove the point I was making. You see no issue with others "stealing" the data from elsewhere because it "helps" you. But that very help robs the site of what they need to stay relevant.... traffic.
Why would someone drill down deep enough in the AI response to see where it originated at and then go to that site? When those sites quit getting visits and participation, the data that the AI bots hoover up to give to you on a platter ceases to exist.
 
I view AI capturing all of our content as a necessary evil, just as Google uses all of our content to build its knowledge graph.

My personal take is that, even if I gateway my forums, these AI bots will find other sites to crawl and digest content. I'd rather my content be part of the mix, which means I have a shot at being at least a footnote.
 
Yeah, some don't mind.
I just have issues with big companies trying to get something for nothing, and go out of my way to make it harder on them.
And as for a "footnote". A lot of these AI bots don't even give that, making it hard to find (via them) where the data they relay came from.
At least Bing is doing it right, and AI engines like this I don't mind. It's the ones that grant no attribution that I have issues with.

Screen Shot 2024-05-30 at 1.19.37 AM.png


Screen Shot 2024-05-30 at 1.22.29 AM.png
Screen Shot 2024-05-30 at 1.27.31 AM.png


As you can see, this is a different AI service (Perplexity), and it hit my site. Because they give attribution, they got through.
It's the rogue ones that go bat-poop crazy hitting your site that I have the biggest issues with.
And I only check them once: if they don't attribute, they get blocked; if they do, they get to visit.
Everyone has to make their own decision, but if you aren't going to acknowledge the site you got it from, I'm gonna make it harder for you to get.
 
As a guest, which a bot comports to, they have no RIGHT to exploit MY bandwidth
That is certainly a problem.

As for a fair return for content that is posted and then used by AI, regulation will inevitably have to be created and enforced. Since artificial intelligence is new in our daily lives it will take a little time, but the EU, in any case, is working on it right now.

The actual post of the data belongs to the website per the TOS that they agreed to when they signed up for membership.
Thinking about it, is this appropriation honest?

Personally, I have never read the rules of a forum before signing up for it, just as I never read the huge blocks of text that must be accepted when I update my iPhone, for example...

So on a subject as important as intellectual property, a clear and visible notice that informs the future member should be displayed, not a paragraph lost in 100,000 characters...

But would it be engaging for a visitor to see a clear notice indicating that from the moment he clicks on the "POST" button, what he has just published no longer belongs to him?

Answer: NO

This is why this information is buried in the rules and no one dares to be honest enough to clearly notify future members.
 
Thinking about it, is this appropriation honest?

Personally, I have never read the rules of a forum before signing up for it as I never read the huge text blocks that must be accepted when I update my iPhone for example...

This is why this information is diluted in the regulation and no one dares to be so honest as to clearly notify its future members.
"honest" or not... you want to ride the ride, you agree to the rules. That pretty much applies across life. 🎢
And whether you read the rules or not, you accept them by continuing to register, download the item, etc. It is not the responsibility of the site owner, the program owner or the vendor to make sure YOU do due diligence. That is the responsibility we all have as adults. And if the failure to read the rules bites someone in the posterior, that person only has themselves to blame.

As the old saying goes... why beat a dead horse? Again, it's not my (or anyone else's) responsibility to force others to do their due diligence. They clearly acknowledge that they accept the terms and privacy policy.

Screen Shot 2024-05-30 at 3.09.11 AM.png


And to comply with GDPR, they are forced to acknowledge by manually checking that box that they agree to the terms and the privacy policy if they want to participate.

If they are too lazy/busy/self-absorbed to read them, then I'm not going to be their nanny and sit there and make sure that they do. We are each, as adults, responsible for our own actions.

If they don't want to agree to it, then they are welcome to close the registration browser window and continue participating as a guest. Nothing on my site is closed to access by guests. The only thing a guest cannot do is participate actively in the site.
 
You reason like a big company while you manage an astrophotography community...
Your members are customers who just had to read the contract. Too bad for them if they clicked without reading it... legally you are covered, but on a human level it's petty.
 
