bufferoverflow on Dec 19, 2024 | on: Alignment faking in large language models
Putting "harmful" knowledge into an LLM and then expecting it to hide it is pretty freaking weird. It makes no sense to me.