
Can really fast inference (e.g. 1M tok/sec) make LLMs more intelligent? I am imagining you could run multiple agents simultaneously and use other LLMs to choose among or discard their outputs. Would the output look more like a real thought process, or would it remain just the same?


It is mentioned in the post:

> Traditional LLMs output everything they think immediately, without stopping to consider the best possible answer. New techniques like scaffolding, on the other hand, function like a thoughtful agent who explores different possible solutions before deciding. This “thinking before speaking” approach provides over 10x performance on demanding tasks like code generation, fundamentally boosting the intelligence of AI models without additional training.
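For what it's worth, a minimal version of that scaffold is easy to script yourself. Here's a rough sketch assuming an OpenAI-compatible endpoint (Cerebras exposes one; the base URL, model name, and credentials below are placeholders):

    from openai import OpenAI

    client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="...")  # placeholder credentials
    MODEL = "llama3.1-8b"  # placeholder model name

    def ask(prompt: str, temperature: float = 0.9) -> str:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return resp.choices[0].message.content

    def scaffolded_answer(question: str, n: int = 5) -> str:
        # "Think": sample several diverse candidate answers.
        candidates = [ask(question) for _ in range(n)]
        # "Speak": have the model judge the candidates and pick one.
        numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
        verdict = ask(
            f"Question: {question}\n\nCandidates:\n{numbered}\n\n"
            "Reply with only the number of the best candidate.",
            temperature=0.0,
        )
        digits = "".join(ch for ch in verdict if ch.isdigit())
        return candidates[int(digits) % n] if digits else candidates[0]

At n+1 calls per query this is exactly the 10x-ish token multiplier the post describes, which is why it only becomes pleasant at very high throughput.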


Is there a tool that provides functionality like this that you can layer on top of Cerebras's API, given that you are not worried about using 10x-50x more tokens per query?


Many of the results in the 'agent' literature require several agents and many iterations to produce an output. See some examples here [1]. Getting these results in seconds instead of minutes or hours would be incredible - and would help with iteration and experimentation to improve algorithms.

[1] https://langchain-ai.github.io/langgraph/tutorials/multi_age...


Fast inference can substitute for larger models in some circumstances. As you said, you can run the model multiple times. DeepMind took a detailed look at this; see https://arxiv.org/abs/2408.03314.
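The cheapest trick from that line of work is self-consistency-style voting: sample many reasoning paths and take a majority vote over the final answers. A rough sketch, again assuming an OpenAI-compatible endpoint (base URL and model name are placeholders):

    from collections import Counter
    from openai import OpenAI

    client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="...")  # placeholder

    def sample_final(question: str) -> str:
        resp = client.chat.completions.create(
            model="llama3.1-8b",  # placeholder model name
            messages=[{"role": "user", "content":
                question + "\nThink step by step, then put the final answer alone on the last line."}],
            temperature=0.8,
        )
        lines = resp.choices[0].message.content.strip().splitlines()
        return lines[-1] if lines else ""

    def self_consistent(question: str, n: int = 16) -> str:
        # n independent samples; at ~1M tok/sec this is seconds, not minutes.
        finals = [sample_final(question) for _ in range(n)]
        return Counter(finals).most_common(1)[0][0]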


Not just that, but at this sort of speed you could have a network of LLMs all talk through and discuss an answer before responding. You could literally script it to generate internal thoughts and challenges to itself before it replies.
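A toy version of that internal dialogue is just three chained calls (draft, self-challenge, revise), reusing the `ask` helper sketched upthread:

    def deliberate(question: str) -> str:
        # Draft, challenge, revise: three extra model calls per reply,
        # which only feels interactive at very high token throughput.
        draft = ask(f"Draft an answer to:\n{question}")
        critique = ask(f"Question: {question}\n\nDraft: {draft}\n\n"
                       "List the strongest objections to this draft.")
        return ask(f"Question: {question}\n\nDraft: {draft}\n\n"
                   f"Objections: {critique}\n\nWrite an improved final answer.")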



