• Am I the only one concerned about AI bots scraping our content to train themselves? It feels like the problem is already out of control.

    Our community is built on authentic, real-life experiences shared by real people. This content is unique and valuable, and I’m uncomfortable with the idea of random chatbots using it without consent to improve their algorithms.

    What’s worse, it’s nearly impossible to prove when a chatbot has used our content. They don’t copy it word for word and rarely cite our website directly. A vague reference to our site doesn’t feel like fair use.

    How can we protect our content from being scraped and exploited by AI systems? Has anyone successfully implemented effective solutions?

Replies
  • Kaustubh Katdare

    Community Administrator1w

    There are two ways to tell AI bots not to use the content on your community website.

    1. Opt-out via robots.txt:

    Add the following code to your robots.txt file. The robots.txt file should be placed in the root folder of your website so that it's accessible at yourdomain.com/robots.txt.

    User-agent: GPTBot
    Disallow: /
    
    2. Add a meta tag:

    The noai meta tag is a non-standard directive that some AI crawlers treat as a signal that the website doesn't want its public data used for LLM training. Compliance is voluntary.

    Add the following meta tag before the </head> tag in your HTML:

    <meta name="robots" content="noai">
    

    There are other techniques too, but they are not as effective as these two.
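A quick way to confirm that a robots.txt rule reads the way you intend is to parse it locally. The sketch below is only an illustration: it uses Python's standard `urllib.robotparser`, the `is_blocked` helper is a hypothetical name, and `GPTBot` and `yourdomain.com` are sample values. It shows what a compliant crawler would conclude from the rules, not whether a given bot actually obeys them.

```python
from urllib.robotparser import RobotFileParser

# Sample rules in the same form as the robots.txt snippet above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://yourdomain.com/") -> bool:
    """Return True if the rules disallow `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "blocked:", is_blocked(ROBOTS_TXT, agent))
```

To test a live site instead, `RobotFileParser` can also fetch the file itself via `set_url(...)` followed by `read()`.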

  • Paula Derrenger

    Member1w

    I have added those tags. However, I strongly suspect that AI chatbots, especially OpenAI's and similar bots, do not respect these directives. Are there other directives I can add to block all the leading chatbots?

  • Kaustubh Katdare

    Community Administrator1w

    Paula, I totally understand your concern. Keep in mind that chatbots don't have to rely on direct scraping of your website or community content to get the data. They can get the data from large data-scrapers.

    I have compiled a list of the data scrapers and popular chatbots. You may update your robots.txt file as follows.

    Caution: Disallow: / means the bot will not be allowed to crawl any page on the website. Please make sure you understand the effect before you make the change. If you have questions, ask below. Also remove any bots that you want to allow.

    robots.txt for blocking AI bots

    # Robots.txt file to block specified bots
    
    User-agent: ChatGPT-User
    Disallow: /
    
    User-agent: Meta-ExternalFetcher
    Disallow: /
    
    User-agent: Amazonbot
    Disallow: /
    
    User-agent: Applebot
    Disallow: /
    
    User-agent: OAI-SearchBot
    Disallow: /
    
    User-agent: PerplexityBot
    Disallow: /
    
    User-agent: YouBot
    Disallow: /
    
    User-agent: Applebot-Extended
    Disallow: /
    
    User-agent: Bytespider
    Disallow: /
    
    User-agent: CCBot
    Disallow: /
    
    User-agent: ClaudeBot
    Disallow: /
    
    User-agent: Diffbot
    Disallow: /
    
    User-agent: FacebookBot
    Disallow: /
    
    User-agent: Google-Extended
    Disallow: /
    
    User-agent: GPTBot
    Disallow: /
    
    User-agent: Meta-ExternalAgent
    Disallow: /
    
    User-agent: omgili
    Disallow: /
    
    User-agent: anthropic-ai
    Disallow: /
    
    User-agent: Claude-Web
    Disallow: /
    
    User-agent: cohere-ai
    Disallow: /
    
    User-agent: Ai2Bot
    Disallow: /
    
    User-agent: Ai2Bot-Dolma
    Disallow: /
    
    User-agent: FriendlyCrawler
    Disallow: /
    
    User-agent: GoogleOther
    Disallow: /
    
    User-agent: GoogleOther-Image
    Disallow: /
    
    User-agent: GoogleOther-Video
    Disallow: /
    
    User-agent: ICC-Crawler
    Disallow: /
    
    User-agent: ImagesiftBot
    Disallow: /
    
    User-agent: PetalBot
    Disallow: /
    
    User-agent: Scrapy
    Disallow: /
    
    User-agent: Timpibot
    Disallow: /
    
    User-agent: VelenPublicWebCrawler
    Disallow: /
    
    User-agent: Webzio-Extended
    Disallow: /
    
    User-agent: facebookexternalhit
    Disallow: /
    
    User-agent: img2dataset
    Disallow: /
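Since the advice above is to remove any bots you want to allow, it can be easier to generate the file from a list than to edit it by hand. The following is a minimal sketch under that assumption; `build_robots_txt`, `AI_BOTS`, and the `allowed_bots` parameter are hypothetical names, and the short list is only a sample of the names from the file above.

```python
def build_robots_txt(blocked_bots, allowed_bots=()):
    """Build robots.txt text that disallows every bot not in `allowed_bots`."""
    allowed = {name.lower() for name in allowed_bots}
    lines = ["# Robots.txt file to block specified bots", ""]
    for bot in blocked_bots:
        if bot.lower() in allowed:
            continue  # Skip bots the site owner wants to keep.
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    return "\n".join(lines)


# A few names taken from the list above, for illustration only.
AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Applebot", "facebookexternalhit"]

if __name__ == "__main__":
    print(build_robots_txt(AI_BOTS,
                           allowed_bots=["Applebot", "facebookexternalhit"]))
```

The two bots allowed in this example were chosen deliberately: Applebot is Apple's search crawler and facebookexternalhit fetches link previews for shared posts, so blocking them may have side effects beyond AI training.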
    
  • Angela Williams

    Member4d

    This information is really helpful. Thank you!

    Our technical team has put our entire website behind Cloudflare. Cloudflare offers one-switch protection against AI bots, called "Firewall for AI". I expect it to be effective, but we enabled it only about a month ago, and it will take about 4-5 months before we can tell whether it works.

  • Rahul Roy

    Member2d

    This is excellent advice. Thank you for sharing, Kaustubh.

    @Angela: there is no perfect way to block AI bots. You can try adding the tags mentioned above, but there is no guarantee that AI bots will follow these instructions. As Kaustubh mentioned, chatbot companies can draw on multiple sources to scrape data from websites and public communities.
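Because robots.txt and meta tags are only requests, some sites also refuse matching requests on the server. The sketch below shows that idea in its simplest form and is not a complete defense: `should_block` and `BLOCKED_AGENT_TOKENS` are hypothetical names, the token list is a sample from the thread, and a scraper that sends a fake User-Agent header will pass straight through.

```python
from typing import Optional

# Sample tokens taken from the robots.txt list in this thread.
BLOCKED_AGENT_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider",
                        "PerplexityBot")


def should_block(user_agent_header: Optional[str]) -> bool:
    """Return True if the User-Agent header names one of the listed bots."""
    if not user_agent_header:
        return False  # A missing header is not treated as a known AI bot here.
    ua = user_agent_header.lower()
    return any(token.lower() in ua for token in BLOCKED_AGENT_TOKENS)


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
    # A web server or middleware would return HTTP 403 when this is True.
    print(should_block(sample))
```

A check like this would typically run in a web server rule or application middleware before the page is rendered.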

  • Abheejit K

    Member3h

    I'm assisting a startup that's working on exactly this. I would be keen to chat and understand your requirements better.
