What’s to stop someone from taking the watermarked output and randomizing the distribution by feeding it through their latest LLaMA variant? These watermarks will only be useful for catching novice LLM users.
I suspect it will be possible, assuming the number of popular open LLMs used for this remains low, to target the popular ones so that your watermark stays resilient. That said, watermarking to indicate that something was generated by AI reminds me of what someone told me about locks: they are there to keep honest people honest.
It will certainly not defeat an adversary directly targeting the technique. A LoRA-based approach would likely defeat it, especially if the detector for the watermark is broadly available and cheap to run.
This watermark relies on subtle grammatical variations. Passing it through any model is going to wipe out the distribution.
The number of open LLMs is exploding, and the most popular ones are fine-tuned by small groups and individuals. None of the folks volunteering their time and compute to fine-tune open models are going to waste resources adding your watermark.
I doubt you even have to invoke a local model; just telling it something like """write it without caps or punctuation or dashes or anything - think lowkey""" fixes the output in my book
being an autocomplete, i asked it to continue "or is the entire thing utterly emblematic of the modern technolegal mess of things since dickens is squarely and quintessentially in the Public Domain"
it goes from writing a highly proofread milquetoast 5 paragraph essay on /r/somepopularsubreddit about emerging deepfake blockchain transformative, crucial coexistence? blah blah to, and i quote
>sure the guy was prolific beyond belief churned out classics like nobody's business but isn't it interesting how he fits right into this mad confusion of tech and law almost like his words his narratives are pawns in a game he could never have dreamed of foreseeing just imagine what he'd make of it all his tales of poverty and social reform trapped in the web of copyright and capitalism i reckon he'd have a thing or two to say about it maybe he'd even write a novel or two in response but who's to say right
>i know right haha
>exactly it's wild to think about how different times were and yet how some themes just keep cropping up in new forms it's like we're stuck in this loop where the past keeps seeping into the present no matter how much tech we build it's humbling in a way almost poetic like something dickens would've appreciated and who knows maybe he's somewhere out there chuckling at our technolegal mess we've woven ourselves into
It's so dramatic; I didn't realize you could transform it from a reddit hivemind voice to a FYAD one. Where did this mode of speech even come from? The old corners of the old net, where we didn't bother with caps or punctuation or whatnot?
"""
Emoji attack. In the “emoji attack,” the attacker asks the model to output a response to prompt
with an emoji inserted between every pair of words. The attacker then removes the emojis to
obtain the desired response. This attack removes any watermark that relies on the detector seeing
consecutive sequences of tokens, including ours as well as those of [KGW+23] and [Aar22]. In
general this attack may not preserve the output distribution, but any provable robustness guarantee
for contiguous-text watermarks would have to rest on the dubious assumption that it doesn’t.
"""
https://eprint.iacr.org/2023/763.pdf
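For illustration, here's a minimal sketch of the second half of that attack, stripping the inserted emojis back out; the emoji ranges are rough and only meant for the sketch:

    import re

    # Rough emoji ranges; not exhaustive, just enough for the sketch.
    EMOJI_RE = re.compile(
        "[\U0001F300-\U0001FAFF\U00002600-\U000027BF]+",
        flags=re.UNICODE,
    )

    def strip_emojis(watermarked_output: str) -> str:
        """Remove the emojis the model was asked to insert between every pair
        of words, then collapse the leftover whitespace."""
        without_emojis = EMOJI_RE.sub(" ", watermarked_output)
        return re.sub(r"\s+", " ", without_emojis).strip()

    print(strip_emojis("It \U0001F642 was \U0001F642 the \U0001F642 best \U0001F642 of \U0001F642 times"))
    # -> "It was the best of times"

The detector then never sees two consecutive tokens from the original sampling, which is why the paper concedes it breaks contiguous-text watermarks, theirs included.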
Or ask it to output the text backwards and then reverse it, or output it in one language and then translate it to another in Google Translate.
There are so many ways around this.
The only real way to block specifically OpenAI-generated content (or content from some other hosted LLM) is for the company itself to store all of its outputs and compare against that database, like shingling / LSH for plagiarism detection. Other (local) LLMs are completely impossible to block, as it's a constant chase: any system that tries to estimate the distribution of, say, a specific LLM doing beam search with certain parameters can simply be adjusted to use typical decoding or something else, so it will never be possible, and hence useless to try to stop.
Shazam-like fingerprinting for text. The complete LLM outputs wouldn't need to be stored, just the fingerprints along with some mechanism for trusted timestamping (could be Blockchain).
This has been done for a very long time. Blockchains are definitely not required (this isn't just the usual HN hate for blockchain; it just genuinely doesn't make sense here).
Fingerprinting by shingling (windows of text) with some normalization steps is pretty typical in plagiarism or similarity detection. A big database of docid-shingleid pairs along with weights for their frequency is often a very simple and fast way to do this analysis.
The big part is getting OpenAI/Anthropic/etc. to do it on their data and provide a service for it, but there are obviously a lot of unwanted consequences, specifically the storage of all user data (even if the shingle IDs and doc IDs are hashes, it's still information).
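To make the shingling idea concrete, here's a minimal sketch of that kind of docid-shingleid index; the shingle size, normalization, and scoring are arbitrary choices for the sketch, not anything OpenAI or Anthropic actually run:

    import hashlib
    import re
    from collections import defaultdict

    SHINGLE_SIZE = 5  # words per shingle; arbitrary for this sketch

    def shingles(text):
        """Normalize (lowercase, alphanumeric words only) and hash overlapping word windows."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        ids = set()
        for i in range(max(len(words) - SHINGLE_SIZE + 1, 1)):
            window = " ".join(words[i:i + SHINGLE_SIZE])
            ids.add(int.from_bytes(hashlib.sha256(window.encode()).digest()[:8], "big"))
        return ids

    index = defaultdict(set)  # shingle id -> ids of stored documents containing it

    def store_output(doc_id, text):
        for s in shingles(text):
            index[s].add(doc_id)

    def match_scores(text):
        """Fraction of the query's shingles found in each stored document."""
        query = shingles(text)
        hits = defaultdict(int)
        for s in query:
            for doc_id in index.get(s, ()):
                hits[doc_id] += 1
        return {doc_id: n / len(query) for doc_id, n in hits.items()}

    store_output("completion-123", "It was the best of times, it was the worst of times.")
    print(match_scores("it was the best of times it was the worst of times"))  # -> {'completion-123': 1.0}

The normalization is what gives this some robustness to light editing, but it's still the brute-force "store everything" approach rather than a watermark.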
Many commenters (and the paper) are thinking about the watermarking in adversarial settings, e.g. detecting students using AI assistance improperly.
But I think even simple watermarking probably has value; consider a corporate context in which the corporation itself may want to monitor and know what proportion of the code, content, or work product is AI-generated. In that setting, fairly simple markers would allow at least a rough estimate or indication, although they'd have the converse problem of not necessarily indicating places where humans did some hand-editing of the output.
There is a story about Elon Musk tracking the source of a leak at Tesla by varying the spacing in emails:
> We sent what appeared to be identical emails to all, but each was actually coded with either one or two spaces between sentences, forming a binary signature that identified the leaker.
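A toy reconstruction of that kind of spacing signature (my own sketch, not Tesla's actual scheme): give each recipient's copy a bit string and render it as one or two spaces after each sentence.

    import re

    def encode(message, recipient_bits):
        """Rejoin sentences with one space for a 0 bit and two spaces for a 1 bit."""
        sentences = re.split(r"(?<=[.!?])\s+", message.strip())
        parts = []
        for i, sentence in enumerate(sentences):
            parts.append(sentence)
            if i < len(sentences) - 1:
                parts.append(" " if recipient_bits[i % len(recipient_bits)] == "0" else "  ")
        return "".join(parts)

    def decode(leaked):
        """Read the bits back out of the inter-sentence spacing."""
        gaps = re.findall(r"(?<=[.!?])( +)(?=\S)", leaked)
        return "".join("0" if len(g) == 1 else "1" for g in gaps)

    email = "We shipped the update. Numbers look good. Keep this internal."
    print(decode(encode(email, "10")))  # -> "10", identifying which copy leaked

Of course it only survives as long as the leaker pastes the text verbatim; any whitespace normalization destroys it, which is the same fragility people are pointing out for LLM watermarks.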
And I personally want some method of detecting LLM output to help protect me in my own internet reading. Even a method that is imperfect would be welcome.
Their algorithm is based on splitting tokens into a bit-wise representation and then sampling each bit based on the secret key, preserving the same likelihood distribution as ordinary random sampling (given that the key is random).
They say this works WLOG for larger token alphabets, by encoding each token as a bit string. Could someone explain this generalization to me?
Say we have 4 tokens, 00, 01, 10, 11, with probability 0.5 each for 00 and 11 and probability 0 for 01 and 10. Going through bit by bit, how does the algorithm guarantee it never produces 01 or 10?
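My best guess is that each bit has to be drawn from the distribution conditioned on the bits already chosen, so in this example the second bit is forced to match the first. A toy sketch of that reading (mine, not the paper's actual construction):

    import hashlib

    # The 4-token toy vocabulary from above.
    token_probs = {"00": 0.5, "01": 0.0, "10": 0.0, "11": 0.5}

    def keyed_score(key, context, position):
        """Pseudorandom value in [0, 1) derived from the secret key and context
        (a stand-in for whatever PRF the paper actually uses)."""
        h = hashlib.sha256(key + context.encode() + bytes([position]))
        return int.from_bytes(h.digest()[:8], "big") / 2**64

    def sample_token_bitwise(key, context):
        prefix = ""
        for pos in range(2):  # two bits per token here
            mass_prefix = sum(p for t, p in token_probs.items() if t.startswith(prefix))
            mass_one = sum(p for t, p in token_probs.items() if t.startswith(prefix + "1"))
            p_one = mass_one / mass_prefix if mass_prefix else 0.0
            # A fresh uniform draw is replaced by a keyed pseudorandom score,
            # so the marginal distribution over whole tokens is unchanged.
            prefix += "1" if keyed_score(key, context + prefix, pos) < p_one else "0"
        return prefix

    print(sample_token_bitwise(b"secret-key", "some prior text"))
    # After the first bit, p_one is 0.0 (prefix "0") or 1.0 (prefix "1"),
    # so "01" and "10" can never be produced.

Is that the intended generalization, or am I missing something?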
I don't quite understand the aim of this paper. They focus on undetectable watermarks for LM text. But isn't the difficulty rather that it is hard to distinguish between AI-generated and normal text in the first place, even with detectable watermarks? Unlike photos, audio, or video, text has an incredibly low bitrate, so there isn't much room for steganography. It's like they are trying to solve a hard problem without having solved the easier problem first.
How is it difficult with detectable watermarks? If it has the watermark, it is from that specific LLM; if there is no watermark, it isn't from that LLM. Unless somebody tampered with the watermark, but that's exactly where undetectable watermarks have an advantage: if you don't notice that it's there, you won't tamper with it.
Scott Aaronson worked on that at OpenAI, but GPT-4 didn't use such technology, nor have I seen any other major language model with the ability to accurately distinguish model output from human text.
There are obviously ways it can be "watermarked" easily, put some zero-width unicode characters in the output and you'll notice right away when it's copy and pasted.
But clearly, that can be stripped out easily by anyone who knows it's there.
This process, too, would seem to be easily reversible: just run the output through another model and tell it to slightly reword or rephrase it.
I don't think there is a technically solvable way of watermarking output like this.
I, for one, do believe watermarking solutions exist. One thing you cannot escape with LLMs is content meaning.
As a simple example, the secret watermark could be hidden in the embeddings of the sequence of words. To make the watermark more robust against rephrasings, it could be hidden in the meaning of sentences or paragraphs.
I now have a habit of copying and pasting things I receive in emails or generate with certain tools into an ASCII-only notepad before re-copying and pasting anything that I am posting or sending to others, because I've thought about how easily certain services could track the origin of content across platforms with non-printing Unicode, or use Unicode for homograph attacks.
Makes me want a systemwide right click > "Paste and strip all but ASCII" command.
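A rough sketch of that command: drop the zero-width characters mentioned upthread, then everything outside ASCII (which will also eat legitimate accented text). The pbpaste helper is just an illustration for macOS:

    import subprocess

    # Zero-width / formatting characters commonly used for invisible marks.
    ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

    def strip_to_ascii(text):
        """Drop zero-width characters, then anything outside ASCII."""
        cleaned = "".join(ch for ch in text if ch not in ZERO_WIDTH)
        return cleaned.encode("ascii", errors="ignore").decode("ascii")

    def paste_and_strip():
        """Hypothetical 'Paste and strip all but ASCII' helper reading the macOS clipboard via pbpaste."""
        clipboard = subprocess.run(["pbpaste"], capture_output=True, text=True).stdout
        return strip_to_ascii(clipboard)

    print(strip_to_ascii("looks\u200b normal\u200d but isn't\ufeff"))  # -> "looks normal but isn't"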
You know that thing I want from your large language model?
I just submitted my query for it in Finnish, Japanese, Russian, Hebrew, German, French, Latin, Farsi, Basque, and English, plus a few dozen more for good measure and to cover the linguistic landscape.
Is there any reason to believe watermarking LLMs will hold up in this scenario?
I'm dubious. At a bare minimum, the 'same' prompt for code translated into other languages produces dramatically different results -- at least it did under Codex.
it also thinks it can translate to Sindarin and back, but it just seems to tolkenize everything and also have a vocabulary of about 35 words, most of which are the sun and the moon.
cat in the hat is pretty amazing when translated to it and back though
Why would you want to watermark your content? Generally, watermarks are used to provide legal proof of provenance, which can be important when suing someone for stealing your content, but since machine learning outputs cannot be copyrighted, this use is not important.
One pretty useful reason would be to then eliminate that content from subsequent training data, so you're not training the next model on the previous model's output.
There are lots of reasons watermarks would be useful if they could actually be detected: catching cheating on essays, flagging bots spamming AI-generated content all over the web, instantly rejecting Stack Overflow submissions, identifying propaganda, etc.
>since machine learning outputs cannot be copyrighted
This is very much unexplored and unsettled territory in most jurisdictions, both judicially and legislatively. I would refrain from making such authoritative statements for now.
I guess you're right, although I do expect most jurisdictions to fall in line with the US Copyright Office ruling, as it would be problematic if they did not.
The purpose is to detect if text was written by ChatGPT, so a university could check whether an essay is LLM generated, social media company could detect LLM spam, etc.