Hacker News

It is interesting. I believe they may be hitting a wall as well.


A paper from a week ago found that models trained on multiple data modes perform an order of magnitude better than text-only models of the same or even larger size.


Genuinely curious, what does it mean in this context to perform better? (hopefully that doesn't come across as snarky as text sometimes does).


Some of these large models are able to do zero-shot learning and perform tasks they weren't explicitly trained on, since the training objective is very general.

Being able to perform more advanced kinds of zero-shot tasks would be one measure of "better", and accuracy on those tasks can be evaluated directly.
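To make the idea concrete, here's a minimal sketch of what zero-shot prompting looks like: the task is described entirely in the instruction, with no labeled examples in the prompt. The function name `zero_shot_prompt` and the sentiment task are illustrative assumptions, not anything from a specific API; the actual model call is omitted.

```python
# Zero-shot prompting sketch: the prompt states the task and the input,
# but contains no labeled examples (contrast with few-shot prompting,
# which would prepend several input/answer pairs).

def zero_shot_prompt(task_description: str, item: str) -> str:
    """Build a prompt that describes the task and supplies the input, with no examples."""
    return f"{task_description}\n\nInput: {item}\nAnswer:"

prompt = zero_shot_prompt(
    "Classify the sentiment of the following sentence as positive or negative.",
    "The film was a delight from start to finish.",
)
print(prompt)
```

The model is expected to complete the text after "Answer:", relying only on what it learned during pretraining.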


Any chance this could improve coding LLMs like Copilot? Or would that sort of thing be limited to source-code feeds (not that GitHub has a shortage)?


The next big step for coding LLMs will be context-window increases. Leaked docs have OpenAI pricing for up to 16K, I believe, 4x the current maximum. Now you're talking "write a class" instead of "complete this line" and maybe sometimes a method.


It can already reliably write a to-do web application.

With 16k and some other techniques, I’m guessing it could write a custom CMS database backed web application.


Not 16k, 32k. 8x the current window.


Nice


What is 16k referring to here


I’ve begun to grok it as “the amount of RAM I have to play in before I have to start sharding work.”

More literally and correctly, it’s the maximum number of tokens in the input and output combined, where a token is 4/3 of a word.

So we’re shifting from 5K words maximum to 40K (per sibling comment, who pointed out 32K context leaked as well)


A minor correction: a token is 3/4 of a word, i.e. it’s slightly smaller than a word, not larger.


Are you referring to PaLM-E? It didn't show any positive transfer for NLP tasks; in fact, the unfrozen model performed slightly worse after the fine-tune. That being said, PaLM-E wasn't really a multimodal model from the start: it's still basically a text model with a visual one glued on top. Whether a truly multimodal model will be better at reasoning and data efficiency is still an open question, though.


I figure it’s a stand-in for embodiment.


Can you share a link to this paper?



