
It's just a token predictor what do you expect? What we need are tools that embrace that and ping the agent to validate what it just said or double check. But the trade-off is that this might hamper their capabilities to some degree.


> It's just a token predictor what do you expect?

The point isn't that it's unexpected. It's that prior speech-to-text systems were much better about this particular failure mode: they were prone to spitting out entirely incorrect words, but not to rephrasing entire sentences.

This is a particularly bad failure mode because people don't notice it.

> What we need are tools that embrace that and ping the agent to validate what it just said or double check.

This is not a problem that can be fixed by throwing more AI at it. It's a problem shared by all such systems, whether they're audio-text transformers or LLMs. Agentic review would just further push the system towards creating output that looks correct, but is not.

LLM translation does the same, yielding more natural text, but generally not better translation. In several cases, especially the "easy" translation between similar languages (e.g. within a language group like Germanic or Nordic), LLM-powered translation is notably worse than more primitive "word & phrase book" systems: it tends to change the meaning of the text in order to produce good grammar, whereas the older systems would give crude or grammatically incorrect translations that still retained the core meaning.


I often (ish) translate between English and German, two languages I speak very well. The quality of translation is amazing and far better than what old systems did.

Maybe it depends on topics or length, for me it's usually 1-2 paragraphs of a German article to share online.


> The quality of translation is amazing and far better than what old systems did.

Are you native in both languages? If you are only native in one of them, it would be insightful to find out whether people with your skill set, but native in the other language, share your opinion.


It’s rather unlikely that the translation in one direction is great, but lacking in the other, while also being just good enough (compared to before) that my close-to-native English skill misses it, while the old Google Translate somehow magically made me think it was bad.

Sadly there are no examples here to compare.


> Maybe it depends on topics or length, for me it's usually 1-2 paragraphs of a German article to share online.

Same languages, same use case. My experience is different, on both Google Translate and others. ¯\_(ツ)_/¯


Haven’t used Google Translate in a long time, mostly because of quality issues before LLMs. DeepL was leading for a while, nowadays I’m very happy with Kagi Translate.


Older ML systems were much better at exposing their internal confidence. Plenty of papers reverse out this kind of interpretability for open-weight models. All the models exposed logprobs early on. This seems solvable if prioritized. The unintelligible words should have lower confidence. Getting per-token data for the output that aids in understanding the predictions is entirely feasible as an engineering effort - it just won't be enough to address all the problems - but it should help quite a bit.
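
To make the per-token idea concrete, here is a rough sketch of turning token logprobs into a per-word confidence. Everything in it is an assumption for illustration: the `(text, logprob)` token format, the leading-space word boundary convention, and the sample numbers are made up and not taken from any particular model or API.

```python
import math

def word_confidences(tokens):
    """Group (text, logprob) tokens into words and score each word.

    Assumption: a token that starts with a space begins a new word.
    A word's confidence is the probability of its least certain token.
    """
    words = []
    for text, logprob in tokens:
        prob = math.exp(logprob)  # logprob -> probability in (0, 1]
        if text.startswith(" ") or not words:
            words.append([text.strip(), prob])
        else:
            words[-1][0] += text
            words[-1][1] = min(words[-1][1], prob)
    return [(word, prob) for word, prob in words]

# Hypothetical decoder output for a dictated sentence.
tokens = [("Take", -0.01), (" met", -0.05), ("opro", -1.90),
          ("lol", -0.30), (" daily", -0.02)]

for word, conf in word_confidences(tokens):
    print(f"{word:12s} {conf:.2f}")
```

Taking the minimum over a word's tokens is one choice among several (a product or mean would also work); it is used here because a single badly guessed subword is enough to make the whole word suspect.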


While you're correct about what the audio models are - at least somewhat (they're not exactly like text-based LLMs) - you seem to brush his point away too quickly before fully exploring it.

This is a solvable issue; the current models and harnesses just aren't made with that assumption - hence they're doing "best effort while guessing if unsure".

Give it a few more months to years and things will likely settle how he pitched - at least in the context of note taking: only let it become "lore" if it didn't have to guess a word.

Currently there is basically only one mode - and it's optimized for conversation. The note taking is just glued on with that functionality as the backbone, and that's probably not going to stay.


> Give it a few more months to years and things will likely settle how he pitched - at least in the context of note taking: only let it become "lore" if it didn't have to guess a word.

I'm hesitant to admit even that. Like any computational linguistics problem, accuracy relies on coverage of all levels: from morphology, through syntax and semantics, to speech acts and world knowledge.

I worked with state-of-the-art speech recognition in a healthcare setting. The model was specifically trained on a small set of languages, with emphasis on covering medical terminology.

It worked great for conversations most of the time, but sometimes messed up very badly, for instance when a patient mentioned the name of a relative, a street address or a phone number. Spelling out an email address would mess it up completely.

It's just like when you're a horrible typist and rely on spell checking: the red squiggles are gone, but the story no longer makes sense. Or when you "autofix" a syntax error, but the meaning diverges from your intention.

As the technology improves, the number of wrong words decreases, but the mistakes get more severe.


> what do you expect?

If the prediction strength is below X, put an indicator that it couldn't make a valid prediction?
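
Something like this minimal sketch, assuming the system can hand over `(word, confidence)` pairs at all. The `[word?]` marker, the default threshold and the sample values are placeholders I made up, not anything an existing product does.

```python
def annotate(words, threshold=0.5):
    """Return the transcript with low-confidence words visibly marked."""
    marked = []
    for word, confidence in words:
        # Below the threshold: keep the guess, but flag it for the reader.
        marked.append(f"[{word}?]" if confidence < threshold else word)
    return " ".join(marked)

# Hypothetical recognizer output; the phone number is the shaky part.
words = [("call", 0.98), ("me", 0.97), ("at", 0.95),
         ("555", 0.41), ("0134", 0.38)]

print(annotate(words))  # call me at [555?] [0134?]
```

Picking X is the hard part: too low and the bad guesses slip through unmarked, too high and every other word gets flagged.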


> It's just a token predictor what do you expect?

Someone tell Altman



