What’s to stop someone from taking the watermarked output and randomizing the distribution by feeding it through their latest LLaMA variant? These watermarks will only be useful for catching novice LLM users.
I suspect it will be possible, assuming the number of popular open LLMs used for this remains low, to target the popular ones so that your watermark stays resilient. That said, watermarking to indicate that something was generated by AI reminds me of what someone told me about locks: they are there to keep honest people honest.
It will certainly not defeat an adversary directly targeting the technique. A LoRA-based approach would likely defeat it, especially if the detector for the watermark is broadly available and cheap to run.
This watermark relies on subtle grammatical variations. Passing it through any model is going to wipe out the distribution.
The number of open LLMs is exploding, and the most popular ones are fine-tuned by small groups and individuals. None of the folks volunteering their time and compute to fine-tune open models are going to waste resources adding your watermark.
I doubt you even have to invoke a local model; just telling it something like """write it without caps or punctuation or dashes or anything - think lowkey""" fixes the output in my book
being an autocomplete, i asked it to continue "or is the entire thing utterly emblematic of the modern technolegal mess of things since dickens is squarely and quintessentially in the Public Domain"
it goes from writing a highly proofread milquetoast 5 paragraph essay on /r/somepopularsubreddit about emerging deepfake blockchain transformative, crucial coexistence? blah blah to, and i quote
>sure the guy was prolific beyond belief churned out classics like nobody's business but isn't it interesting how he fits right into this mad confusion of tech and law almost like his words his narratives are pawns in a game he could never have dreamed of foreseeing just imagine what he'd make of it all his tales of poverty and social reform trapped in the web of copyright and capitalism i reckon he'd have a thing or two to say about it maybe he'd even write a novel or two in response but who's to say right
>i know right haha
>exactly it's wild to think about how different times were and yet how some themes just keep cropping up in new forms it's like we're stuck in this loop where the past keeps seeping into the present no matter how much tech we build it's humbling in a way almost poetic like something dickens would've appreciated and who knows maybe he's somewhere out there chuckling at our technolegal mess we've woven ourselves into
It's so dramatic; I didn't realize you could transform it from a reddit hivemind voice to a FYAD one. Where did this mode of speech even come from? The old corners of the old net, where we didn't bother with caps or punctuation or whatnot?
"""
Emoji attack. In the “emoji attack,” the attacker asks the model to output a response to prompt
with an emoji inserted between every pair of words. The attacker then removes the emojis to
obtain the desired response. This attack removes any watermark that relies on the detector seeing
consecutive sequences of tokens, including ours as well as those of [KGW+23] and [Aar22]. In
general this attack may not preserve the output distribution, but any provable robustness guarantee
for contiguous-text watermarks would have to rest on the dubious assumption that it doesn’t.
"""
https://eprint.iacr.org/2023/763.pdf
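For illustration, here's a minimal sketch of the second half of that attack, stripping the inserted emojis back out; the emoji ranges are rough and only meant for the sketch:

    import re

    # Rough emoji ranges; not exhaustive, just enough for the sketch.
    EMOJI_RE = re.compile(
        "[\U0001F300-\U0001FAFF\U00002600-\U000027BF]+",
        flags=re.UNICODE,
    )

    def strip_emojis(watermarked_output: str) -> str:
        """Remove the emojis the model was asked to insert between every pair
        of words, then collapse the leftover whitespace."""
        without_emojis = EMOJI_RE.sub(" ", watermarked_output)
        return re.sub(r"\s+", " ", without_emojis).strip()

    print(strip_emojis("It \U0001F642 was \U0001F642 the \U0001F642 best \U0001F642 of \U0001F642 times"))
    # -> "It was the best of times"

The detector then never sees two consecutive tokens from the original sampling, which is why the paper concedes it breaks contiguous-text watermarks, theirs included.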
Or ask it to output the text backwards and then reverse it, or output it in one language and then translate it to another in Google Translate.
There are so many ways around this.
The only real way to block specifically OpenAI-generated content (or content from some other hosted LLM) is for the company itself to store all of its outputs and compare against that database, like shingling / LSH for plagiarism detection. Other (local) LLMs are completely impossible to block, as it's a constant chase: any system that tries to estimate the distribution of, say, a specific LLM doing beam search with certain parameters can simply be adjusted to use typical decoding or something else, so it will never be possible, and hence useless to try to stop.
Shazam-like fingerprinting for text. The complete LLM outputs wouldn't need to be stored, just the fingerprints along with some mechanism for trusted timestamping (could be Blockchain).
This has been done for a very long time. Blockchains are definitely not required (this isn't just the usual HN hate for blockchain; it just genuinely doesn't make sense here).
Fingerprinting by shingling (windows of text) with some normalization steps is pretty typical in plagiarism or similarity detection. A big database of docid-shingleid pairs along with weights for their frequency is often a very simple and fast way to do this analysis.
The big part is getting OpenAI/Anthropic/etc. to do it on their data and provide a service for it, but there are obviously a lot of unwanted consequences, specifically the storage of all user data (even if the shingle IDs and doc IDs are hashes, it's still information).
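To make the shingling idea concrete, here's a minimal sketch of that kind of docid-shingleid index; the shingle size, normalization, and scoring are arbitrary choices for the sketch, not anything OpenAI or Anthropic actually run:

    import hashlib
    import re
    from collections import defaultdict

    SHINGLE_SIZE = 5  # words per shingle; arbitrary for this sketch

    def shingles(text):
        """Normalize (lowercase, alphanumeric words only) and hash overlapping word windows."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        ids = set()
        for i in range(max(len(words) - SHINGLE_SIZE + 1, 1)):
            window = " ".join(words[i:i + SHINGLE_SIZE])
            ids.add(int.from_bytes(hashlib.sha256(window.encode()).digest()[:8], "big"))
        return ids

    index = defaultdict(set)  # shingle id -> ids of stored documents containing it

    def store_output(doc_id, text):
        for s in shingles(text):
            index[s].add(doc_id)

    def match_scores(text):
        """Fraction of the query's shingles found in each stored document."""
        query = shingles(text)
        hits = defaultdict(int)
        for s in query:
            for doc_id in index.get(s, ()):
                hits[doc_id] += 1
        return {doc_id: n / len(query) for doc_id, n in hits.items()}

    store_output("completion-123", "It was the best of times, it was the worst of times.")
    print(match_scores("it was the best of times it was the worst of times"))  # -> {'completion-123': 1.0}

The normalization is what gives this some robustness to light editing, but it's still the brute-force "store everything" approach rather than a watermark.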
Many commenters (and the paper) are thinking about the watermarking in adversarial settings, e.g. detecting students using AI assistance improperly.
But I think even simple watermarking probably has value; consider a corporate context in which the corporation itself may want to monitor and know what proportion of the code, content, or work product is AI-generated. In that setting, fairly simple markers would allow at least a rough estimate or indication, although they'd have the converse problem of not necessarily indicating places where humans did some hand-editing of the output.
There is a story about Elon Musk tracking the source of a leak at Tesla by varying the spacing in emails:
> We sent what appeared to be identical emails to all, but each was actually coded with either one or two spaces between sentences, forming a binary signature that identified the leaker.
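A toy reconstruction of that kind of spacing signature (my own sketch, not Tesla's actual scheme): give each recipient's copy a bit string and render it as one or two spaces after each sentence.

    import re

    def encode(message, recipient_bits):
        """Rejoin sentences with one space for a 0 bit and two spaces for a 1 bit."""
        sentences = re.split(r"(?<=[.!?])\s+", message.strip())
        parts = []
        for i, sentence in enumerate(sentences):
            parts.append(sentence)
            if i < len(sentences) - 1:
                parts.append(" " if recipient_bits[i % len(recipient_bits)] == "0" else "  ")
        return "".join(parts)

    def decode(leaked):
        """Read the bits back out of the inter-sentence spacing."""
        gaps = re.findall(r"(?<=[.!?])( +)(?=\S)", leaked)
        return "".join("0" if len(g) == 1 else "1" for g in gaps)

    email = "We shipped the update. Numbers look good. Keep this internal."
    print(decode(encode(email, "10")))  # -> "10", identifying which copy leaked

Of course it only survives as long as the leaker pastes the text verbatim; any whitespace normalization destroys it, which is the same fragility people are pointing out for LLM watermarks.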
And I personally want some method of detecting LLM output to help protect me in my own internet reading. Even a method that is imperfect would be welcome.
Their algorithm is based on splitting tokens into a bit-wise representation and then sampling each bit based on the secret key, preserving the same likelihood distribution as ordinary random sampling (given that the key is random).
They say this works WLOG for larger token alphabets, by encoding each token as a bit string. Could someone explain this generalization to me?
Say we have 4 tokens, 00, 01, 10, 11, with probability 0.5 each for 00 and 11 and probability 0 for 01 and 10. Going through bit by bit, how does the algorithm guarantee it never produces 01 or 10?
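My best guess is that each bit has to be drawn from the distribution conditioned on the bits already chosen, so in this example the second bit is forced to match the first. A toy sketch of that reading (mine, not the paper's actual construction):

    import hashlib

    # The 4-token toy vocabulary from above.
    token_probs = {"00": 0.5, "01": 0.0, "10": 0.0, "11": 0.5}

    def keyed_score(key, context, position):
        """Pseudorandom value in [0, 1) derived from the secret key and context
        (a stand-in for whatever PRF the paper actually uses)."""
        h = hashlib.sha256(key + context.encode() + bytes([position]))
        return int.from_bytes(h.digest()[:8], "big") / 2**64

    def sample_token_bitwise(key, context):
        prefix = ""
        for pos in range(2):  # two bits per token here
            mass_prefix = sum(p for t, p in token_probs.items() if t.startswith(prefix))
            mass_one = sum(p for t, p in token_probs.items() if t.startswith(prefix + "1"))
            p_one = mass_one / mass_prefix if mass_prefix else 0.0
            # A fresh uniform draw is replaced by a keyed pseudorandom score,
            # so the marginal distribution over whole tokens is unchanged.
            prefix += "1" if keyed_score(key, context + prefix, pos) < p_one else "0"
        return prefix

    print(sample_token_bitwise(b"secret-key", "some prior text"))
    # After the first bit, p_one is 0.0 (prefix "0") or 1.0 (prefix "1"),
    # so "01" and "10" can never be produced.

Is that the intended generalization, or am I missing something?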
I don't quite understand the aim of this paper. They focus on undetectable watermarks for LM text. But isn't the difficulty rather that it is hard to distinguish between AI-generated and normal text in the first place, even with detectable watermarks? Unlike photos, audio, or video, text has an incredibly low bitrate, so there isn't much room for steganography. It's like they are trying to solve a hard problem without having solved the easier problem first.
How is it difficult with detectable watermarks? If it has the watermark, it is from that specific LLM; if there is no watermark, it isn't from that LLM. Unless somebody tampered with the watermark, but that's exactly where undetectable watermarks have an advantage: if you don't notice that it's there, you won't tamper with it.
Scott Aaronson worked on that at OpenAI, but GPT-4 didn't use such technology, nor have I seen any other major language model with the ability to accurately distinguish model output from human text.
There are obviously ways it can be "watermarked" easily, put some zero-width unicode characters in the output and you'll notice right away when it's copy and pasted.
But clearly, that can be stripped out easily by anyone who knows it's there.
This process, too, would seem to be easily reversible: just run the output through another model and tell it to slightly reword or rephrase it.
I don't think there is a technically solvable way of watermarking output like this.
I, for one, do believe watermarking solutions exist. One thing you cannot escape with LLMs is content meaning.
As a simple example, the secret watermark could be hidden in the embeddings of the sequence of words. To make the watermark more robust against rephrasings, it could be hidden in the meaning of sentences or paragraphs.
I now have a habit of copying and pasting things I receive in emails or generate with certain tools into an ASCII-only notepad before re-copying and pasting anything that I am posting or sending to others, because I've thought about how easily certain services could track the origin of content across platforms with non-printing Unicode, or use Unicode for homograph attacks.
Makes me want a systemwide right click > "Paste and strip all but ASCII" command.
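A rough sketch of that command: drop the zero-width characters mentioned upthread, then everything outside ASCII (which will also eat legitimate accented text). The pbpaste helper is just an illustration for macOS:

    import subprocess

    # Zero-width / formatting characters commonly used for invisible marks.
    ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

    def strip_to_ascii(text):
        """Drop zero-width characters, then anything outside ASCII."""
        cleaned = "".join(ch for ch in text if ch not in ZERO_WIDTH)
        return cleaned.encode("ascii", errors="ignore").decode("ascii")

    def paste_and_strip():
        """Hypothetical 'Paste and strip all but ASCII' helper reading the macOS clipboard via pbpaste."""
        clipboard = subprocess.run(["pbpaste"], capture_output=True, text=True).stdout
        return strip_to_ascii(clipboard)

    print(strip_to_ascii("looks\u200b normal\u200d but isn't\ufeff"))  # -> "looks normal but isn't"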
You know that thing I want from your large language model?
I just submitted my query for it in Finnish, Japanese, Russian, Hebrew, German, French, Latin, Farsi, Basque, and English, plus a few dozen more for good measure and to cover the linguistic landscape.
Is there any reason to believe watermarking LLMs will hold up in this scenario?
I'm dubious. At a bare minimum, the 'same' prompt for code translated into other languages produces dramatically different results -- at least it did under Codex.
it also thinks it can translate to Sindarin and back, but it just seems to tolkenize everything and also have a vocabulary of about 35 words, most of which are the sun and the moon.
cat in the hat is pretty amazing when translated to it and back though
Why would you want to watermark your content? Generally, watermarks are used to provide legal proof of provenance, which can be important when suing someone for stealing your content, but since machine learning outputs cannot be copyrighted, this use is not important.
One pretty useful reason would be to then eliminate that content from subsequent training data, so you're not training the next model on the previous model's output.
There are lots of reasons watermarks would be useful if they could actually be detected: catching cheating on essays, flagging bots spamming AI-generated content all over the web, instantly rejecting Stack Overflow submissions, identifying propaganda, etc.
>since machine learning outputs cannot be copyrighted
This is very much unexplored and unsettled territory in most jurisdictions, both judicially and legislatively. I would refrain from making such authoritative statements for now.
I guess you're right, although I do expect most jurisdictions to fall in line with the US Copyright Office ruling, as it would be problematic if they did not.
The purpose is to detect if text was written by ChatGPT, so a university could check whether an essay is LLM generated, social media company could detect LLM spam, etc.