Prevent unwanted training, parsing or usage by AI / LLM systems

Is there a package to poison your pdf document?
(after some searching, I guess not, but would people be interested? - I’m not sure if I’m up yet to undertake such an adventure, but I would like to gather like-minded people around this :slight_smile:)

So, first of all, the obvious: if you’re pouring a lot of time in writing a nice document (with Typst, of course), you might not like it that all the AI companies are hoovering up you’re work to create unimaginative slop, while in the meantime DDoSing the internet, creating an ecological nightmare, trampling copyright, …
(placeholder - there should probably be a lot of references here, but if you didn’t live under rock the last few years, you might know what I’m talking about)
There are some tools for images (nightshade, …), there are of course prompt injections since there is no boundary between data & instructions, …

I think it would be good to have a Typst package that can automate the same things, but for static pdfs. It would be a best effort to protect your work and make it generally less attractive for scraping and parsing.

I guess some of the requirements should be:

  • Embed some user-hidden text, only visible for the scrapers/parsers
  • Choose from different protection mechanisms (different goals)
    • Poison machine learning with garbage
    • Prevent parsing by LLM tools in e.g. CVs or try prompt injection
  • Work with accessibility tools: make sure e.g. screen readers don’t need to read this or can skip over the poisoned content.
  • Setting to control how much overhead poison you want. More (hidden) text means also larger documents.
  • Check the efficacy with some real-world tests

The only previous work I encountered is brilliant-cv – Typst Universe, which has

(NEW) 4. AI Prompt and Keywords Injection
Fight against the abuse of ATS system or GenAI screening by injecting invisible AI prompt or keyword list automatically.

Btw, I have no prior experience with AI tools, since you might have figured out I dislike them :wink: , but experience will probably be needed to make this effective.

1 Like

I’m glad you list this as a requirement such a tool would need to fulfill. Unfortunately I’m fairly certain that this is incompatible with poisoning. If a screen reader can read a document, so can any other program. If you prevent programs from scanning your document, you’ll hit screen readers too.

The only difference is, AI scrapers will have more resources behind them than legitimate screen readers, so you’ll end up with a document that is not accessible, and yet the AI companies can still scrape it.

For anyone interested, the relevant code is here: injection.typ. I’m certain a screen reader would read this text. It’s a prompt injection, and would probably not meaningfully hurt the training of AI, even though it can potentially exploit a deployed LLM.