Advanced voice mode operates on audio tokens directly; it doesn't transcribe them into "text tokens" as an intermediate step the way the original version of voice mode did.
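To make that distinction concrete, here's a rough sketch of the two pipelines. Every name in it is a hypothetical placeholder, not anything OpenAI has published; it only exists to show where the text bottleneck sits:

```python
# Hypothetical sketch of the two architectures being contrasted; none of these
# functions are real APIs, they just make the "text bottleneck" visible.
from typing import Callable, Sequence

AudioTokens = Sequence[int]

def legacy_voice_mode(audio: bytes,
                      asr: Callable[[bytes], str],
                      text_llm: Callable[[str], str],
                      tts: Callable[[str], bytes]) -> bytes:
    """Cascade: speech -> text -> text -> speech. Tone and pitch are
    discarded at the ASR step, so the model never 'hears' them."""
    return tts(text_llm(asr(audio)))

def advanced_voice_mode(audio: bytes,
                        encode: Callable[[bytes], AudioTokens],
                        speech_llm: Callable[[AudioTokens], AudioTokens],
                        decode: Callable[[AudioTokens], bytes]) -> bytes:
    """End to end on audio tokens: no text intermediate, so prosody can in
    principle survive into the model's context and its output."""
    return decode(speech_llm(encode(audio)))
```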
We don't know whether that's due to inherent limitations of audio tokenisation or a byproduct of reinforcement learning. In my own usage, I noticed a significant degradation in capabilities over time after advanced voice mode was initially released. The model used to be able to sing, whisper, and imitate sounds and tone just fine, but I imagine this was not intended and has since been stunted via reinforcement learning.
I don't find the article's argument that this is due to tokenisation convincing.
> This is likely because they’re trained on a lot of data generated synthetically with text-to-speech and/or because understanding the tone of the voice (apparently) doesn’t help the models make more accurate predictions.
Right, but either whatever audio tokenization it's using doesn't encode pitch, or there's ~nothing in the training set where pitch is relevant. The first possibility is at least easy to probe, as sketched below.
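You could tokenize the same sine wave at two pitches with an open neural codec and see whether the discrete codes actually differ. EnCodec is just a stand-in here; we don't know what tokenizer OpenAI uses, so this only tests the general claim that learned audio tokens can carry pitch:

```python
# Probe: do a neural audio codec's discrete tokens change when only pitch changes?
# Requires: pip install encodec torch
import torch
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

def tokens_for_sine(freq_hz: float, sr: int = 24_000, seconds: float = 1.0) -> torch.Tensor:
    t = torch.arange(int(sr * seconds)) / sr
    wav = 0.5 * torch.sin(2 * torch.pi * freq_hz * t)
    wav = wav.unsqueeze(0).unsqueeze(0)              # [batch, channels, samples]
    with torch.no_grad():
        frames = model.encode(wav)                   # list of (codes, scale) frames
    return torch.cat([codes for codes, _ in frames], dim=-1)  # [B, n_codebooks, T]

a = tokens_for_sine(220.0)   # A3
b = tokens_for_sine(440.0)   # A4, one octave up
overlap = (a == b).float().mean().item()
print(f"token overlap between 220 Hz and 440 Hz: {overlap:.1%}")
# Low overlap would mean the codec's tokens do carry pitch, which would point the
# blame at training data or post-training rather than tokenization itself.
```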