Poisoning AI scrapers

Inspired by Foone's suggestion this week I decided to start serving poisoned versions of my blog posts to any AI scrapers that I could identify—because I don't think it's enough to politely ask them to stop with a robots.txt file. They've already scraped my posts without asking; it's too late to undo that. But maybe I can hurt them just a little bit, going forward.

This post is about the technical side of implementing this.

The idea

LLMs rely on a large corpus for training, and this is generally scraped from unconsenting sources (basically, the whole internet). The companies involved have shown utter disregard for attribution and permission. This sucks, and I don't want to contribute to their extractive business model; I put my words out there for humans to enjoy or be inspired or helped by, not for companies to turn into slop and resell to people who are failing the mirror test. Legislation would be great, but in the meantime... we can try to make the source data less useful to them. Or at least have fun trying!

The approach here is simple: Any time my site detects an AI scraper fetching a blog post, it serves up an alternative version of the blog post that contains garbage that will hurt the training process.

[Screenshot of the blog post "Fixing a broken Firefox profile via Sync". It looks normal and has reasonable formatting, but the sentences don't make any sense, such as « The below was mostly written to disk would be present. »]

Example of the output. Looks normal at a glance.

Dissociated Press

The first step is to decide what the poison should look like. I've long been a fan of the Dissociated Press algorithm, which is a dead simple way to implement a Markov chain. A source text goes in, and garbage comes out—but the garbage looks at a glance like normal text. With the right parameters, the individual phrases locally look sensible, but the full sentences are nonsense.

Here's an example of the output when run on one of my previous posts:

This resulted in the frame, built from scrap wood. At some decorative horizontal. It's in the house, which means that would have to be wide enough to want to address this was the frame extend the one for the lower shelf out from me. If I switched those pieces will be a block, holding, but due to try.
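
To make it concrete, here's a rough character-level version in Python. (This is just an illustrative sketch of the technique, not marko itself, which is Rust and has more knobs.)

    import random
    from collections import defaultdict

    def dissociated_press(source, length, window=6, seed=None):
        """Emit `length` characters that locally resemble `source` but
        globally make no sense."""
        rng = random.Random(seed)
        # Map each `window`-character sequence to the characters that
        # follow it somewhere in the source.
        followers = defaultdict(list)
        for i in range(len(source) - window):
            followers[source[i:i + window]].append(source[i + window])
        out = source[:window]  # start with the opening of the source
        while len(out) < length:
            choices = followers.get(out[-window:])
            if not choices:  # dead end: jump to a random spot instead
                i = rng.randrange(len(source) - window)
                out += source[i:i + window]
                continue
            out += rng.choice(choices)
        return out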

I'm no expert on LLMs, but I feel like there's a reasonable chance this will hurt them right in the place I want to hurt them: In their capability to look like they're making sense.

(Other poisoning methods might include inserting "meow" randomly throughout the text, sprinkling the text with typos, or using something fancier like a textual analog of Nightshade.)
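
(The "meow" variant, for what it's worth, would only be a few lines; a throwaway sketch, not something I'm actually serving:)

    import random

    def meowify(text, rate=0.1, seed=None):
        """Insert "meow" after roughly `rate` of the words in `text`."""
        rng = random.Random(seed)
        out = []
        for word in text.split(" "):
            out.append(word)
            if rng.random() < rate:
                out.append("meow")
        return " ".join(out)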

For this purpose I probably could have picked any number of pre-made Dissociated Press programs, but since I've been learning Rust, I decided to write my own (named "marko") as an exercise. It takes a single source file, or stdin, as well as an optional seed for the random number generator. As of 2.1.0 it's reasonably fast and can operate at either the character or word level.

And as for the source material—yes, my own blog posts will do nicely. Traditionally one runs Dissociated Press on large convenient corpora such as the Bible, Moby Dick, etc. But there's something meet about turning my own posts into garbage, a reflection of the slop that's produced by ChatGPT and its ilk. And as a bonus, I can then run that Markdown + HTML through the usual rendering process; with any luck, the resulting posts will even have lists and headings and other structural elements, making them look like real blog posts.

Making garbage

I have a static site, and my blog is generated from source files by a janky little script. The script isn't public, but I'll share here the changes I made.

First, I duplicated the call for rendering a blog post, and had the second one write to swill.alt.html instead of index.html (and set a flag for the new behavior). [UPDATE: Now it regenerates the swill only for drafts.]

        # [author's note: `write_and_record` only writes the file if it doesn't
        # already exist with the same contents.]
        write_and_record(
            path.join(post_gen_dir, 'index.html'),
            generate_post_page(post, tag_slugs_to_posts_desc)
        )

        # AI poison
        swill_alt_path = path.join(post_gen_dir, 'swill.alt.html')
        swill_alt_contents = None
        if path.exists(swill_alt_path) and not post['meta'].get('draft'):
            # Regenerate existing file only if it hasn't been
            # published yet.  It's fine if the swill is "out of date"
            # for a published post.  But for drafts, the swill might
            # originally be generated when the post is still only a
            # few lines long; we need to be sure to regenerate
            # repeatedly as the post grows.
            with open(swill_alt_path, 'r') as f:
                swill_alt_contents = f.read()
        else:
            swill_alt_contents = generate_post_page(post, tag_slugs_to_posts_desc, markov_garbage=True)
        write_and_record(swill_alt_path, swill_alt_contents)
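
(write_and_record itself isn't shown here; per the note above, it only writes when the contents would actually change. A sketch of roughly that shape, with the "record" bookkeeping purely a guess:)

    def write_and_record(file_path, contents):
        """Write `contents` to `file_path` only if that would change the
        file, and note the path as part of the generated site."""
        generated_paths.add(file_path)  # hypothetical bookkeeping
        if path.exists(file_path):
            with open(file_path, 'r') as f:
                if f.read() == contents:
                    return  # unchanged; leave the file alone
        with open(file_path, 'w') as f:
            f.write(contents)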

(Why "swill"? A reference to other people's designation of LLM output as "slop".)

I excluded comments from the poison page (too small to garble properly, and I didn't want to associate other people's names with the garbage) and swapped out the raw post contents, which is generally Markdown with some HTML sprinkled in. Here's the only modification to generate_post_page:

    if markov_garbage:
        post = {**post}
        post['comments'] = []  # don't feed any comments to AI
        post['raw'] = make_into_garbage(post['raw'])

The make_into_garbage function is just hacked together without much care for error checking or whatever, because seriously, this is not very important stuff. It just passes the post to a vendored copy of marko along with the post's SHA256 hash digest as the seed, because I want the poisoned post's contents to change only when the post changes:

import hashlib
import subprocess

def make_into_garbage(text):
    """
    Given some perfectly reasonable text or markup, generate some
    garbage to feed to AI scrapers.

    Try to do it deterministically to avoid unneeded file (and version
    history) churn.
    """
    seed_hex = hashlib.sha256(text.encode()).hexdigest()  # right size for marko's seed
    # `markov_bin` is the path to the vendored copy of marko; "-" tells it
    # to read the source text from stdin.
    p = subprocess.Popen(
        [markov_bin, "-", "--seed", seed_hex, "--unit", "char", "--window", "6"], text=True,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    )
    p.stdin.write(text)
    p.stdin.close()
    # Same amount of garbage as original text means we shouldn't have
    # overly much repetition.
    garbage = p.stdout.read(len(text))
    p.stdout.close()
    p.terminate()
    return garbage

This call can be surprisingly slow, up to 1 second for one of my longest posts. But we only need to regenerate when the post changes.

The HTML I include in some of my posts (for embedding images, for instance) also gets badly broken, and I can end up with formatting leaking out into the rest of the page. It might be worth running the output through an HTML sanitizer, although... honestly, I don't know if it matters beyond making for a prettier demo.
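
If I ever get around to it, a sanitizer like bleach would probably do the trick; something along these lines (untested and purely illustrative; I'm not running this):

    import bleach  # third-party sanitizer; just one option among several

    def tidy_garbage(garbage_html):
        """Strip broken or disallowed tags so the swill at least renders
        as a self-contained page."""
        return bleach.clean(
            garbage_html,
            tags=["p", "a", "em", "strong", "ul", "ol", "li",
                  "blockquote", "code", "pre"],
            attributes={"a": ["href"]},
            strip=True,  # drop disallowed tags rather than escaping them
        )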

Serving garbage

I have some limited control over what pages I serve, but an .htaccess file with mod_rewrite enabled is plenty:

# Don't allow serving the swill files directly.
RewriteRule .*/swill.alt.html /no-page-here [L]

# Feed garbage to AI scrapers (if page .../ has a .../swill.alt.html).
# This regex needs to match the User-Agent header, which may *differ*
# from how the bot self-identifies in robots.txt.
#
# - "GPT" covers ChatGPT and GPTBot (OpenAI)
# - "OAI-SearchBot" is an OpenAI web crawler, but it's still an AI company
# - "Claude" covers ClaudeBot and Claude-Web
# - "anthropic" covers anthropic-ai (same owner as Claude)
# - "cohere" is cohere-ai (don't know who that is, but obviously AI scraper)
# - "meta" is Meta/Facebook's LLM scraper. There's also facebookexternalhit
#   which is supposed to just activate when a link is posted, so we'll let
#   that see real content for now.
# - "PetalBot" is unfortunately both a search engine and an LLM scraper. Too
#   bad they used the same user-agent for both!
# - "bingbot" is similar -- no distinction for web search and LLM scraper
# - "Amazonbot" feeds Alexa
# - Bytespider is some kind of scraper from Bytedance (of TikTok fame)
# - "Perplexity" is Perplexity AI
RewriteCond %{HTTP:User-Agent} "GPT|Claude|anthropic|\bcohere\b|\bmeta\b|PetalBot|bingbot|Amazonbot|Bytespider|Perplexity|OAI-SearchBot" [NC]
RewriteCond %{REQUEST_URI} .*/$
RewriteCond %{REQUEST_FILENAME}swill.alt.html -f
RewriteRule .* %{REQUEST_URI}swill.alt.html [END]

This is pretty hamfisted, but again it's just for me.
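
A quick way to spot-check the rules is to request a post with a spoofed User-Agent and see whether the swill comes back (the URL here is a placeholder):

    import urllib.request

    def fetch_as(url, user_agent):
        """Fetch `url` while pretending to be `user_agent`."""
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8", errors="replace")

    # Placeholder URL; substitute a real post.
    print(fetch_as("https://example.com/some-post/", "GPTBot")[:500])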

Currently I'm using a list of user-agents, but I may end up using other signals in the future.

Note that a couple of companies (Apple, Google) use two crawler names: one bot actually crawls the site for content, and a second, differently named one exists only so that sites can opt out of LLM training by blocking it. While this is good for efficiency (on both ends of the wire), it doesn't allow for selectively serving garbage to the LLM side. So Google-Extended and Applebot-Extended are just disallowed in my robots.txt file.
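
Those robots.txt entries are just the standard opt-out blocks, something like:

    User-agent: Google-Extended
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /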

Will it work?

Of course not! My little site is definitely not going to poison the LLMs all on its lonesome. Maybe if a lot of other sites did it too, and maybe if instead of deleting old social media posts we replaced them with garbage... hey, quite possibly.

Honestly it was mostly to have a bit of fun, practice my Rust, and brush up on my mod_rewrite. But if I can inspire other people to do something similar, who knows?

Updates

  • 2024-09-20: Fleshed out the post a bit more, added some performance improvements, added a screenshot.
  • 2024-09-24: Updated the user-agent list to remove anchoring (as they've changed the user-agent strings since I last saw a list) and grouping, specify case-insensitivity, pare down the terms to just their smallest possible matching string, and include Meta's crawler.
  • 2024-10-06: Update for marko version 2.1.0 (faster, allows generating at word level). Remove Google-Extended from list and add explanation.

Responses: 1 so far

  1. Tim McCormack says:

    Pre-emptive notice:

    I'm not really interested in discussing whether LLMs have good uses, what rights and expectations there are for scrapers vs bloggers, or anything about AI or "AI" more generally. At least not in this comment thread.

    If you have things to say about alternative technical approaches or other poisoning techniques, though, I'd love to hear them!

Self-service commenting is not yet reimplemented after the Wordpress migration, sorry! For now, you can respond by email; please indicate whether you're OK with having your response posted publicly (and if so, under what name).