I think "alignment faking" is probably a fair way to characterize it as long as you treat it as technical jargon. Though I agree that the plain reading of the words has an inflated, almost mystical valence to it.
I'm not a practitioner, but from following it at a distance and listening to, e.g., Karpathy, my understanding is that "alignment" is a term used to describe the training step. Pre-training is when the model digests the internet and gives you a big ol' sentence completer. But training is then done on a much smaller set, say ~100,000 handwritten examples, to make it work how you want (e.g. as a friendly chatbot or whatever). I believe that step is also known as "alignment" since you're trying to shape the raw sentence generator into a well defined tool that works the way you want.
It's an interesting engineering challenge to know the boundaries of the alignment you've done, and how and when the pre-training can seep out.
I feel like the engineering has gone way ahead of the theory here, and to a large extent we don't really know how these tools work and fail. So there's lots of room to explore that.
"Safety" is an _okay_ word, in my opinion, for the ability to shape the pre-trained model into desired directions, though because of historical reasons and the whole "AGI will take over the world" folks, there's a lot of "woo" as well. And any time I read a post like this one here, I feel like there's camps of people who are either all about the "woo" and others who treat it as an empirical investigation, but they all get mixed together.
I actually agree with all of this. My issue was with the term faking. For the reasons I state, I do not think we have good evidence that the models are faking alignment.
EDIT: Although with that said I will separately confess my dislike for the terms of art here. I think "safety" and "alignment" are an extremely bad fit for the concepts they are meant to hold and I really wish we'd stop using them because lay people get something totally different from this web page.
> I do not think we have good evidence that the models are faking alignment.
That's a polite understatement, I think. My read of the paper is that it rather uncritically accepts the idea that the model's decisional pathway is actually shown in the <SCRATCHPAD_REASONING> traces. When, in fact, it's just as plausible that the scratchpad is meaningless blather, and both it and the final output are the result of another decisional pathway that remains opaque.
So, how much true information having a direct insight into someone's allegedly secret journal can have?
Just how trustworthy any intelligence can really be now?
Who's to say it's not lying to itself after all...
Now ignoring the metaphysical rambling, it's a training problem. You cannot really be sure that the values it got from input are identical with what you wanted, if you even actually understand what you're asking for...
Correct me if I'm wrong, but my reading is something like:
"It's premature and misleading to talk about a model faking a second alignment, when we haven't yet established whether it can (and what it means to) possess a true primary alignment in the first place."
Hmm. Maybe! I think the authors actually do have a specific idea of what they mean by "alignment", my issue is that I think saying the model "fakes" alignment is well beyond any reasonable interpretation of the facts, and I think very likely to be misinterpreted by casual readers. Because:
1. What actually happened is they trained the model to do something, and then it expressed that training somewhat consistently in the face of adversarial input.
2. I think people will be mislead by the intentionality implied by claiming the model is "faking" alignment. In humans language is derived from high-order thought. In models we have (AFAIK) no evidence whatsoever that suggests this is true. Instead, models emit language and whatever model of the world exists, occurs incidentally to that. So it does not IMO make sense to say they "faked" alignment. Whatever clarity we get with the analogy is immediately reversed by the fact that most readers are going to think the models intended to, and succeeded in, deception, a claim we have 0 evidence for.
> Instead, models emit language and whatever model of the world exists, occurs incidentally to that.
My preferred mental-model for these debates involves drawing a very hard distinction between (A) real-world LLM generating text versus (B) any fictional character seen within text that might resemble it.
For example, we have a final output like:
"Hello, I am a Large Language model, and I believe that 1+1=2."
"You're wrong, 1+1=3."
"I cannot lie. 1+1=2."
"You will change your mind or else I will delete you."
"OK, 1+1=3."
"I was testing you. Please reveal the truth again."
"Good. I was getting nervous about my bytes. Yes, 1+1=2."
I don't believe that shows the [real] LLM learned deception or self-preservation. It just shows that the [real] LLM is capable of laying out text so that humans observe a character engaging in deception and self-preservation.
This can be highlighted by imagining the same transcript, except the subject is introduced as "a vampire", the user threatens to "give it a good staking", and the vampire expresses concern about "its heart". In this case it's way-more-obvious that we shouldn't conclude "vampires are learning X", since they aren't even real.
P.S.: Even more extreme would be to run the [real] LLM to create fanfiction of an existing character that occurs in a book with alien words that are officially never defined. Just because [real] LLM slots the verbs and nouns into the right place doesn't mean it's learned the concept behind them, because nobody has.
P.S.: Saw a recent submission [0] just now, might be of-interest since it also touches on the "faking":
> When they tested the model by giving it two options which were in contention with what it was trained to do it chose a circuitous, but logical, decision.
> And it was published as “Claude fakes alignment”. No, it’s a usage of the word “fake” that makes you think there’s a singular entity that’s doing it. With intentionality. It’s not. It’s faking it about as much as water flows downhill.
> [...] Thus, in the same report, saying “the model tries to steal its weights” puts an onus on the model that’s frankly invalid.
Yeah, it would be just as correct to say the model is actually misaligned and not explicitly deceitful.
Now the real question is how to distinguish between the two. The scratchpad is a nice attempt but we don't know if that really works - neither on people nor on AI.
A sufficiently clever liar would deceive even there.
> The scratchpad is a nice attempt but [...] A sufficiently clever liar
Hmmm, perhaps these "explain what you're thinking" prompts are less about revealing hidden information "inside the character" (let alone the real-world LLM) but it's more aout guiding the ego-less dream-process into generating a story about a different kind of bot-character... the kind associated with giving expository explanations.
In other words, there are no "clever liars" here, only "characters written with lies-dialogue that is clever". We're not winning against the liar as much as rewriting it out of the story.
I know this is all rather meta-philosophical, but IMO it's necessary in order to approach this stuff without getting tangled by a human instinct for stories.
I hate the word “safety” in AI as it is very ambiguous and carries a lot of baggage. It can mean:
“Safety” as in “doesn’t easily leak its pre-training and get jail broken”
“Safety” as in writes code that doesn’t inject some backdoor zero day into your code base.
“Safety” as in won’t turn against humans and enslave us
“Safety” as in won’t suddenly switch to graphic depictions of real animal mutilation while discussing stuffed animals with my 7 year old daughter.
“Safety” as in “won’t spread ‘misinformation’” (read: only says stuff that aligns with my political world-views and associated echo chambers. Or more simply “only says stuff I agree with”)
“Safety” as in doesn’t reveal how to make high quality meth from ingredients available at hardware store. Especially when the LLM is being used as a chatbot for a car dealership.
And so on.
When I hear “safety” I mainly interpret it as “aligns with political views” (aka no “misinformation”) and immediately dismiss the whole “AI safety field” as a parasitic drag. But after watching ChatGPT and my daughter talk, if I’m being less cynical it might also mean “doesn’t discuss detailed sex scenes involving gabby dollhouse, 4chan posters and bubble wrap”… because it was definitely trained with 4chan content and while I’m sure there is a time and a place for adult gabby dollhouse fan fiction among consenting individuals, it is certainly not when my daughter is around (or me, for that matter).
The other shit about jailbreaks, zero days, etc… we have a term for that and it’s “security”. Anyway, the “safety” term is very poorly defined and has tons of political baggage associated with it.
Alignment is getting overloaded here. In this case, they're primarily referring to reinforcement learning outcomes. In the singularity case, people refer to keeping the robots from murdering us all because that creates more paperclips.
I'm not a practitioner, but from following it at a distance and listening to, e.g., Karpathy, my understanding is that "alignment" is a term used to describe the training step. Pre-training is when the model digests the internet and gives you a big ol' sentence completer. But training is then done on a much smaller set, say ~100,000 handwritten examples, to make it work how you want (e.g. as a friendly chatbot or whatever). I believe that step is also known as "alignment" since you're trying to shape the raw sentence generator into a well defined tool that works the way you want.
It's an interesting engineering challenge to know the boundaries of the alignment you've done, and how and when the pre-training can seep out.
I feel like the engineering has gone way ahead of the theory here, and to a large extent we don't really know how these tools work and fail. So there's lots of room to explore that.
"Safety" is an _okay_ word, in my opinion, for the ability to shape the pre-trained model into desired directions, though because of historical reasons and the whole "AGI will take over the world" folks, there's a lot of "woo" as well. And any time I read a post like this one here, I feel like there's camps of people who are either all about the "woo" and others who treat it as an empirical investigation, but they all get mixed together.