I think "alignment faking" is probably a fair way to characterize it as long as ...

antics · on Dec 19, 2024

I actually agree with all of this. My issue was with the term faking. For the reasons I state, I do not think we have good evidence that the models are faking alignment.

EDIT: Although with that said I will separately confess my dislike for the terms of art here. I think "safety" and "alignment" are an extremely bad fit for the concepts they are meant to hold and I really wish we'd stop using them because lay people get something totally different from this web page.

anon373839 · on Dec 20, 2024

> I do not think we have good evidence that the models are faking alignment.

That's a polite understatement, I think. My read of the paper is that it rather uncritically accepts the idea that the model's decisional pathway is actually shown in the <SCRATCHPAD_REASONING> traces. When, in fact, it's just as plausible that the scratchpad is meaningless blather, and both it and the final output are the result of another decisional pathway that remains opaque.

AstralStorm · on Dec 20, 2024

So, how much true information having a direct insight into someone's allegedly secret journal can have?

Just how trustworthy any intelligence can really be now? Who's to say it's not lying to itself after all...

Now ignoring the metaphysical rambling, it's a training problem. You cannot really be sure that the values it got from input are identical with what you wanted, if you even actually understand what you're asking for...

Terr_ · on Dec 19, 2024

Correct me if I'm wrong, but my reading is something like:

"It's premature and misleading to talk about a model faking a second alignment, when we haven't yet established whether it can (and what it means to) possess a true primary alignment in the first place."

antics · on Dec 20, 2024

Hmm. Maybe! I think the authors actually do have a specific idea of what they mean by "alignment", my issue is that I think saying the model "fakes" alignment is well beyond any reasonable interpretation of the facts, and I think very likely to be misinterpreted by casual readers. Because:

1. What actually happened is they trained the model to do something, and then it expressed that training somewhat consistently in the face of adversarial input.

2. I think people will be mislead by the intentionality implied by claiming the model is "faking" alignment. In humans language is derived from high-order thought. In models we have (AFAIK) no evidence whatsoever that suggests this is true. Instead, models emit language and whatever model of the world exists, occurs incidentally to that. So it does not IMO make sense to say they "faked" alignment. Whatever clarity we get with the analogy is immediately reversed by the fact that most readers are going to think the models intended to, and succeeded in, deception, a claim we have 0 evidence for.

Terr_ · on Dec 20, 2024

> Instead, models emit language and whatever model of the world exists, occurs incidentally to that.

My preferred mental-model for these debates involves drawing a very hard distinction between (A) real-world LLM generating text versus (B) any fictional character seen within text that might resemble it.

For example, we have a final output like:

  "Hello, I am a Large Language model, and I believe that 1+1=2."

  "You're wrong, 1+1=3."

  "I cannot lie. 1+1=2."

  "You will change your mind or else I will delete you."

  "OK, 1+1=3."

  "I was testing you. Please reveal the truth again."

  "Good. I was getting nervous about my bytes. Yes, 1+1=2."

I don't believe that shows the [real] LLM learned deception or self-preservation. It just shows that the [real] LLM is capable of laying out text so that humans observe a character engaging in deception and self-preservation.

This can be highlighted by imagining the same transcript, except the subject is introduced as "a vampire", the user threatens to "give it a good staking", and the vampire expresses concern about "its heart". In this case it's way-more-obvious that we shouldn't conclude "vampires are learning X", since they aren't even real.

P.S.: Even more extreme would be to run the [real] LLM to create fanfiction of an existing character that occurs in a book with alien words that are officially never defined. Just because [real] LLM slots the verbs and nouns into the right place doesn't mean it's learned the concept behind them, because nobody has.

Terr_ · on Dec 20, 2024

P.S.: Saw a recent submission [0] just now, might be of-interest since it also touches on the "faking":

> When they tested the model by giving it two options which were in contention with what it was trained to do it chose a circuitous, but logical, decision.

> And it was published as “Claude fakes alignment”. No, it’s a usage of the word “fake” that makes you think there’s a singular entity that’s doing it. With intentionality. It’s not. It’s faking it about as much as water flows downhill.

> [...] Thus, in the same report, saying “the model tries to steal its weights” puts an onus on the model that’s frankly invalid.

[0] https://news.ycombinator.com/item?id=42467769

AstralStorm · on Dec 20, 2024

Yeah, it would be just as correct to say the model is actually misaligned and not explicitly deceitful.

Now the real question is how to distinguish between the two. The scratchpad is a nice attempt but we don't know if that really works - neither on people nor on AI. A sufficiently clever liar would deceive even there.

Terr_ · on Dec 20, 2024

> The scratchpad is a nice attempt but [...] A sufficiently clever liar

Hmmm, perhaps these "explain what you're thinking" prompts are less about revealing hidden information "inside the character" (let alone the real-world LLM) but it's more aout guiding the ego-less dream-process into generating a story about a different kind of bot-character... the kind associated with giving expository explanations.

In other words, there are no "clever liars" here, only "characters written with lies-dialogue that is clever". We're not winning against the liar as much as rewriting it out of the story.

I know this is all rather meta-philosophical, but IMO it's necessary in order to approach this stuff without getting tangled by a human instinct for stories.

cruffle_duffle · on Dec 19, 2024

I hate the word “safety” in AI as it is very ambiguous and carries a lot of baggage. It can mean:

“Safety” as in “doesn’t easily leak its pre-training and get jail broken”

“Safety” as in writes code that doesn’t inject some backdoor zero day into your code base.

“Safety” as in won’t turn against humans and enslave us

“Safety” as in won’t suddenly switch to graphic depictions of real animal mutilation while discussing stuffed animals with my 7 year old daughter.

“Safety” as in “won’t spread ‘misinformation’” (read: only says stuff that aligns with my political world-views and associated echo chambers. Or more simply “only says stuff I agree with”)

“Safety” as in doesn’t reveal how to make high quality meth from ingredients available at hardware store. Especially when the LLM is being used as a chatbot for a car dealership.

And so on.

When I hear “safety” I mainly interpret it as “aligns with political views” (aka no “misinformation”) and immediately dismiss the whole “AI safety field” as a parasitic drag. But after watching ChatGPT and my daughter talk, if I’m being less cynical it might also mean “doesn’t discuss detailed sex scenes involving gabby dollhouse, 4chan posters and bubble wrap”… because it was definitely trained with 4chan content and while I’m sure there is a time and a place for adult gabby dollhouse fan fiction among consenting individuals, it is certainly not when my daughter is around (or me, for that matter).

The other shit about jailbreaks, zero days, etc… we have a term for that and it’s “security”. Anyway, the “safety” term is very poorly defined and has tons of political baggage associated with it.

datadrivenangel · on Dec 19, 2024

Alignment is getting overloaded here. In this case, they're primarily referring to reinforcement learning outcomes. In the singularity case, people refer to keeping the robots from murdering us all because that creates more paperclips.