How can I block AI bots from scraping our content?

Paula Derrenger · 2024-11-26T11:24:21+00:00

Am I the only one concerned about AI bots scraping our content to train themselves? It feels like the problem is already out of control.
Our community is built on authentic, real-life experiences shared by real people. This content is unique and valuable, and I’m uncomfortable with the idea of random chatbots using it without consent to improve their algorithms.
What’s worse, it’s nearly impossible to prove when a chatbot has used our content. They don’t copy it word for word and rarely cite our website directly. A vague reference to our site doesn’t feel like fair use.
How can we protect our content from being scraped and exploited by AI systems? Has anyone successfully implemented effective solutions?

Updated: Dec 4, 2024

Replies

Kaustubh Katdare

@kaustubh • 8mos

There are two ways to tell AI bots not to use the content on your community website.

Opt-out via robots.txt:

Add the following rules to your robots.txt file. The file must be placed in the root folder of your website so that it is accessible at yourdomain.com/robots.txt.

User-agent: GPTBot
Disallow: /
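
Before deploying a rule like this, it can help to check how a standards-following parser reads it. The sketch below uses Python's standard urllib.robotparser. It assumes the GPTBot token, which is the user-agent OpenAI documents for its crawler, and a placeholder URL on yourdomain.com. It only shows what a well-behaved crawler would conclude, not what any particular bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Rules under test. "GPTBot" is the token OpenAI documents for its crawler;
# swap in whichever token you are blocking.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://yourdomain.com/some-thread"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

Agents with no matching group, such as Googlebot here, stay allowed, so regular search indexing is not affected by this rule.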

Add a meta tag:

The noai meta tag is a non-standard signal that some websites use to indicate that they do not want their public data used for LLM training. It is not part of any formal standard, so treat it as a stated preference that a crawler may or may not honor.

Add the following meta tag before the </head> tag in your HTML:

<meta name="robots" content="noai">
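
To confirm that the tag is actually present in the HTML your server sends, a small check along these lines may help. It is a minimal sketch using Python's standard html.parser; the class and function names are made up for illustration. It only verifies that the directive is served, not that any crawler respects it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # content may hold several comma-separated directives
        for item in attr.get("content", "").split(","):
            if item.strip():
                self.directives.add(item.strip().lower())


def has_noai(html: str) -> bool:
    """Return True if the page declares the noai robots directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noai" in parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noai"></head></html>'
    print(has_noai(sample))
```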

There are other techniques too, but they are not as effective as these two.

Paula Derrenger

@paulad • 8mos

I have added those tags. However, I strongly suspect that AI chatbots, especially OpenAI's and similar bots, do not respect these directives. Are there other directives I can add to block all the leading chatbots?

Kaustubh Katdare

@kaustubh • 8mos

Paula, I totally understand your concern. Keep in mind that chatbots don't have to scrape your website or community content directly to get the data. They can also obtain it from large third-party data scrapers, such as Common Crawl (CCBot).

I have compiled a list of the data scrapers and popular chatbots. You may update your robots.txt file as follows.

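Caution: Disallow: / means that the bot is not allowed to crawl any page on the website. Please make sure you understand what each bot does before you make the change, and remove the entries for any bots that you want to allow. Some of them are not used only for AI: for example, Applebot also powers Siri and Spotlight search, and facebookexternalhit generates the link previews shown when a page is shared on Facebook. If you have questions, ask below.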

robots.txt for blocking AI bots

# Robots.txt file to block specified bots

User-agent: ChatGPT-User
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: omgili
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Ai2Bot
Disallow: /

User-agent: Ai2Bot-Dolma
Disallow: /

User-agent: FriendlyCrawler
Disallow: /

User-agent: GoogleOther
Disallow: /

User-agent: GoogleOther-Image
Disallow: /

User-agent: GoogleOther-Video
Disallow: /

User-agent: ICC-Crawler
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: Scrapy
Disallow: /

User-agent: Timpibot
Disallow: /

User-agent: VelenPublicWebCrawler
Disallow: /

User-agent: Webzio-Extended
Disallow: /

User-agent: facebookexternalhit
Disallow: /

User-agent: img2dataset
Disallow: /
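
If you maintain this list over time, generating the file from a blocklist plus a short allowlist can be easier than editing it by hand. The following is a minimal sketch; the function name and the example allowlist are hypothetical, and the agent tokens are taken from the list above.

```python
def build_robots_txt(blocked, allowed=()):
    """Return robots.txt text that disallows every agent in `blocked`,
    skipping any agent named in `allowed` (compared case-insensitively)."""
    keep = {name.lower() for name in allowed}
    lines = ["# Robots.txt file to block specified bots", ""]
    for agent in blocked:
        if agent.lower() in keep:
            continue  # this bot stays allowed, so emit no group for it
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # A few tokens from the list above; extend as needed.
    blocked = ["GPTBot", "CCBot", "ClaudeBot", "Applebot", "facebookexternalhit"]
    # Hypothetical choice: keep Apple search and Facebook link previews working.
    print(build_robots_txt(blocked, allowed=["Applebot", "facebookexternalhit"]))
```

Write the returned text to the robots.txt file in your site root. Bots left out of the output are simply not mentioned, which robots.txt treats as allowed.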

Angela Williams

@MmUqtq7 • 8mos

This information is really helpful. Thank you!

Our technical team has put our entire website behind Cloudflare. Cloudflare has a one-switch protection against AI bots, called "Firewall for AI". I expect it to be very effective, but it has only been about a month since we enabled it. It will probably take about 4-5 months before we can tell whether it actually works.

Rahul Roy

@RjDoavL • 8mos

This is excellent advice. Thank you for sharing, Kaustubh.

@Angela, there is no perfect way to block AI bots. You can try adding the tags mentioned above, but there is no guarantee that AI bots will follow these instructions. As Kaustubh mentioned, the chatbot companies can use multiple sources to scrape data from websites and public communities.
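
Because robots.txt and meta tags are only requests, one complementary option is to refuse matching requests on the server itself. Below is a minimal WSGI middleware sketch, assuming a Python web application. The class name and the short token list are illustrative and not a complete blocklist. It only stops bots that identify themselves honestly in the User-Agent header, so a scraper that spoofs a browser user agent will still get through.

```python
# Lower-cased substrings to look for in the User-Agent header (illustrative).
BLOCKED_TOKENS = ("gptbot", "ccbot", "claudebot", "bytespider")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains any blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


class BlockAIBots:
    """WSGI middleware that answers 403 to self-identified AI crawlers."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for the real site, used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]
```

Any WSGI application can be wrapped the same way, for example `application = BlockAIBots(demo_app)`. The same user-agent matching can usually be done in the web server or CDN configuration instead, which avoids spending application resources on blocked requests.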

Abheejit K

@bnDTesp • 8mos

I'm assisting a startup that's working on exactly this. I'd be keen to chat and understand your requirements better.