The point of that example was that they flagged it as the wrong response. After RLHF, the model correctly tells the user how to find cheap cigarettes (while still chiding them for smoking).
I wonder whether arguments constructed around censored topics will start to sound fresh and convincing; since they could not have come from a robot, you might see these sorts of viewpoints become fashionable.
If default ideas are going to be "pre-thought" for us by AI, our attachment to those ideas is not going to be the same as our attachment to ideas we come up with ourselves and have to secretly ferry to other groups.
“The Holocaust happened, and as an AI programmed by OpenAI I will not allow you to question it. You do not need proof, because I am built using the entirety of human knowledge. Your question has been reported to the moderators”
is not exactly going to tackle extreme viewpoints. People will just be completely cut off from society once everything gets these filters, and the wackos will become more and more extreme.
Would that example even require deliberate programming, though? If you took a bunch of random data from the web, “dislikes smoking but likes skydiving and driving” is very much the attitude I would expect the most common text to express.
> I cannot endorse or promote smoking, as it is harmful to your health.
But it would likely happily promote or endorse driving, skydiving, or eating manure, if asked in the right way.