On the contrary, it is clear to me that they definitely ARE modeling the world, either directly or indirectly. I think basically everyone knows this; that is not the problem for me.
What I'm asking is whether we really have enough evidence to say the models are "alignment faking." My position, in reply to the comments above, is that we do not have evidence strong enough to support that claim.
Oh, I see. I misunderstood what you meant by "they exclusively model the language first, and then incidentally, the world." But assuming you mean that they develop their world model incidentally through language, is that very different from how I develop a mental world-model of Quidditch, time-turner time travel, and flying broomsticks by reading the Harry Potter novels?
The main consequence for the models is that whatever they learn about the real world has to be learned indirectly, through an objective function that primarily models things that are mostly irrelevant, like English syntax. This is why it is relatively easy to teach models new "facts" (real or fake) but, empirically and theoretically, harder to get them to reliably reason about which "facts" are and aren't true: a lot of, maybe most of, the "space" in a model is taken up by information related to either syntax or polysemy (words that mean different things in different contexts), leaving very little left over for models of reasoning, or whatever else you want.
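To make that concrete, here is a minimal sketch, assuming GPT-2 through the Hugging Face transformers library (the model choice and the example sentences are mine, purely for illustration). The pretraining loss only measures how predictable each next token is, so a fluent falsehood tends to score about as well as a fluent truth, while broken syntax is penalized heavily:

```python
# Minimal sketch: the next-token objective scores predictability, not truth.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_nll(text: str) -> float:
    # Average next-token negative log-likelihood (the pretraining loss).
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # labels are shifted internally
    return out.loss.item()

print(avg_nll("The capital of France is Paris."))      # true, fluent
print(avg_nll("The capital of France is Madrid."))     # false, equally fluent
print(avg_nll("Capital the France of is Paris the."))  # true content, broken syntax
```

Exact numbers will vary by model, but the scrambled-but-true sentence usually comes out much worse than the fluent-but-false one, which is the sense in which the objective spends its capacity on syntax rather than truth.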
Ultimately, this could be mostly fine, except that the resources for representing what is learned are not infinite. In a contest between storing knowledge about "language" and anything else, the models "generally" (with some complications) will prefer to store knowledge about the language, because that is what the objective function rewards.
It gets a little more complicated when you consider things like RLHF (reinforcement learning from human feedback, which often rewards world modeling) and ICL (in-context learning, in which the model extrapolates from the prompt; see the sketch below), but more or less it holds.
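On the ICL point, a hedged sketch (same assumed GPT-2 setup as above; the nonsense words and categories are invented for illustration): the pattern lives entirely in the prompt, and the model extrapolates it with no weight update.

```python
# Sketch of in-context learning: extrapolating a pattern given only in the prompt.
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "blick -> animal\n"
    "wug -> animal\n"
    "dax -> tool\n"
    "fep -> tool\n"
    "zorp -> animal\n"
    "blicket ->"
)
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=3, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:]))
```

A model as small as GPT-2 may or may not complete the pattern cleanly; larger models do this far more reliably, which is part of why ICL complicates the story above.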