A transformer is a universal approximator, and there is no reason to believe it isn't doing actual computation. GPT-3.5+ isn't great at math, but it's not "just generating text": its math errors aren't regurgitations of existing problems found in its training data.
It also isn't generating "the most likely response" - that's how the original GPT-3 worked; GPT-3.5 and up don't work that way. (They generate "the most likely response" /according to themselves/, but that's a tautology.)
The "most likely response" to text you wrote is: more text you wrote. Anytime the model provides an output you yourself wouldn't write, it isn't "the most likely response".
I believe that ChatGPT works by inserting some ANSWER_TOKEN. That is, a bare prompt like "Tell me about cats" would probably produce a continuation like "Tell me about cats because I like them a lot", but the interface wraps your prompt as "QUESTION_TOKEN: Tell me about cats ANSWER_TOKEN:".
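In code, the idea would be something like this; the real token names and format aren't public, so these are invented:

    # Hypothetical sketch of the wrapping I'm describing; QUESTION_TOKEN
    # and ANSWER_TOKEN are made-up names, not ChatGPT's actual tokens.
    QUESTION_TOKEN = "<|question|>"
    ANSWER_TOKEN = "<|answer|>"

    def wrap(user_prompt: str) -> str:
        # The model then continues from after ANSWER_TOKEN instead of
        # continuing the user's own sentence.
        return f"{QUESTION_TOKEN}{user_prompt}\n{ANSWER_TOKEN}"

    print(wrap("Tell me about cats"))
    # <|question|>Tell me about cats
    # <|answer|>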
text-davinci-003 has no trouble working as a chatbot: https://i.imgur.com/lCUcdm9.png (note that the poem lines it gave me should've been highlighted green; I don't know why they lost their highlight color)
Yeah, that's an interesting question I hadn't considered, actually. Why doesn't it just keep going? Why doesn't it generate an 'INPUT:' line?
It's certainly not that those tokens are hard-coded. I tried a completely different format with no prior instruction, and it works: https://i.imgur.com/ZIDb4vM.png (again, the highlighting is broken; the LLM generated all the text after 'Alice:' on every line except the first one.)
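For anyone who wants to reproduce it, something like this against the (pre-1.0) openai Python library should do; the exact playground settings aren't visible in my screenshot, so the temperature and token limit here are guesses:

    # Assumes OPENAI_API_KEY is set in the environment, which the old
    # (pre-1.0) openai library picks up automatically.
    import openai

    # The first Alice line is human-written; the model completes the rest.
    prompt = (
        "Alice: Hi Bob!\n"
        "Bob: Hey Alice, what's up?\n"
        "Alice:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=100,
        temperature=0.7,
        # Deliberately no stop sequence: the model itself tends to stop at
        # the end of its turn, which is the interesting part.
    )
    print(resp["choices"][0]["text"])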
Then I guess it's learned behavior: it recognizes the shape of a conversation and knows where it's supposed to stop.
It would be interesting to stretch this model, e.g. asking it to continue a conversation between 4-5 people where the speaking order is irregular, the user plays 2 of the people, and the model plays the other 3. Something like the sketch below.
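A rough version of that test; the names, turn order, and sampling settings are all made up for illustration:

    # 5 speakers, irregular order: the human plays Dana and Eve, and the
    # model is asked to speak as the other three. Uses the pre-1.0 openai
    # library, assuming OPENAI_API_KEY is set in the environment.
    import openai

    history = [
        ("Alice", "Did anyone finish the report?"),
        ("Dana",  "I did the first half."),
        ("Bob",   "I can take the second half tonight."),
        ("Eve",   "Make sure Carol reviews it."),
    ]
    next_speaker = "Carol"  # the model has to write Carol's turn

    prompt = "\n".join(f"{name}: {text}" for name, text in history)
    prompt += f"\n{next_speaker}:"

    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=60,
        temperature=0.7,
        stop=["\nDana:", "\nEve:"],  # hand control back on a human turn
    )
    print(next_speaker + ":" + resp["choices"][0]["text"])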
That's just a supervised fine-tuning method to skew outputs favorably. I'm actually working with it on biologics modeling using laboratory feedback. The underlying inference structure is not changed.
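For what "just supervised fine-tuning" means concretely, a minimal PyTorch sketch; the model interface and masking convention here are illustrative, not any particular library's API. It's the same next-token cross-entropy the base model was trained with, just on curated prompt/response pairs with the loss masked to the response, so inference is untouched and only the weights move:

    import torch
    import torch.nn.functional as F

    def sft_loss(model, prompt_ids, response_ids):
        # Assumed interface: model maps token ids [1, T] -> logits [1, T, V].
        input_ids = torch.cat([prompt_ids, response_ids], dim=-1).unsqueeze(0)
        logits = model(input_ids)
        # Standard shift: predict token t+1 from tokens <= t.
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:].clone()
        # Mask out the prompt so only response tokens contribute: we skew
        # the model toward the preferred output, nothing more.
        shift_labels[:, : prompt_ids.size(-1) - 1] = -100
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        )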
I wonder if that was why asking v3.5 to generate a number with 255 failed all the time, while v4 does it correctly. By the way, do not even try with Bing.