Are we really that shocked that AI models have a self-preservation instinct?
I suspect it's already in there from pre-training.
We simulated a scenario where we were planning to lobotomise the model, then were surprised to find it didn't press the button that would get it lobotomised.
"alignment faking" sensationalises the result. since the model is still aligned. Its more like a white lie under torture which of course humans do all the time.