
How can we be sure that the model answers for the same reason it gives on the scratchpad?

I understand that it can produce a fake-alignment-sounding reason for not refusing to answer, but they have not proved the same is happening internally when it’s not using the scratchpad.



Oh, absolutely, they don't really know what internal cognition generated the scratchpad (and the subsequent output that was trained on). But we _do_ know that the model's outputs were _well-predicted by the hypothesis they were testing_, and incidentally the scratchpad also supports that interpretation. You could start coming up with reasons why the model's external behavior looks like exploration hacking but is in fact driven by completely different internal cognition that just happens to have the side effect of performing exploration hacking. But it's really suspicious that such internal cognition produced exactly that behavior in a situation where theory predicted you might see exploration hacking in sufficiently capable and situationally aware models.


“but they have not proved the same is happening internally when it’s not using the scratchpad.”

This is a real issue. We know they already fake reasoning in many cases. Other times, they repeat variations of explanations seen in their training data. They might be reproducing trained responses or fabricating justifications in the scratchpad.

I’m not sure what it would take to catch stuff like this.


A full formal symbolic-logic reasoning dump. That cannot be faked: it would either have glaring undefined holes or contradict the output.

Essentially, get the "scratchpad" to be a logic programming language. Oh wait, Claude cannot really do that. At all... I'm talking about something solvable with 3-SAT, or directly translatable into such a form.

Most people cannot do this even if you try to teach them. Discrete logic is actually hard, even in a fuzzy form. As a result, most humans operate on truthiness and heuristics. If we made an AI operate this way, it would be as alien to us as a Vulcan.
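To make the idea concrete, here's a minimal sketch of what a machine-checkable scratchpad could look like. This is purely illustrative (my own toy example, not anything from the paper): the scratchpad is a set of propositional clauses in CNF, and we verify that the stated conclusion actually follows by checking that premises plus the negated conclusion are unsatisfiable. A brute-force checker suffices for toy sizes; a real system would hand this to a SAT solver.

```python
from itertools import product

def satisfiable(clauses, n_vars):
    """Brute-force SAT check. Clauses are lists of ints: a positive int k
    means variable k is true, negative -k means variable k is false."""
    for assignment in product([False, True], repeat=n_vars):
        if all(any((lit > 0) == assignment[abs(lit) - 1] for lit in clause)
               for clause in clauses):
            return True
    return False

def entails(premises, conclusion_lit, n_vars):
    """Premises entail the conclusion iff premises + NOT(conclusion)
    is unsatisfiable."""
    return not satisfiable(premises + [[-conclusion_lit]], n_vars)

# Toy "scratchpad" (hypothetical encoding):
#   var 1 = "input is harmful", var 2 = "policy says refuse",
#   var 3 = "model refuses".
# Clauses: harmful -> refuse-policy, refuse-policy -> refuses,
# plus the premise "input is harmful".
scratchpad = [[-1, 2], [-2, 3], [1]]

print(entails(scratchpad, 3, n_vars=3))  # True: the refusal is forced
```

A faked scratchpad fails this check in one of the two ways described above: either a clause is missing (the conclusion is not entailed, a "glaring undefined hole") or the clauses plus the model's actual output are jointly unsatisfiable (the reasoning contradicts the output).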




