
Can really fast inference (e.g. 1M tok/sec) make LLMs more intelligent? I am imagining you could run multiple agents simultaneously and use other LLMs to choose among or discard their outputs. Would the output look more like a real thought process, or would it remain just the same?


It is mentioned in the post:

> Traditional LLMs output everything they think immediately, without stopping to consider the best possible answer. New techniques like scaffolding, on the other hand, function like a thoughtful agent who explores different possible solutions before deciding. This “thinking before speaking” approach provides over 10x performance on demanding tasks like code generation, fundamentally boosting the intelligence of AI models without additional training.
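For what it's worth, a minimal version of that scaffold is easy to script yourself. Here's a rough sketch assuming an OpenAI-compatible endpoint (Cerebras exposes one; the base URL, model name, and credentials below are placeholders):

    from openai import OpenAI

    client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="...")  # placeholder credentials
    MODEL = "llama3.1-8b"  # placeholder model name

    def ask(prompt: str, temperature: float = 0.9) -> str:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return resp.choices[0].message.content

    def scaffolded_answer(question: str, n: int = 5) -> str:
        # "Think": sample several diverse candidate answers.
        candidates = [ask(question) for _ in range(n)]
        # "Speak": have the model judge the candidates and pick one.
        numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
        verdict = ask(
            f"Question: {question}\n\nCandidates:\n{numbered}\n\n"
            "Reply with only the number of the best candidate.",
            temperature=0.0,
        )
        digits = "".join(ch for ch in verdict if ch.isdigit())
        return candidates[int(digits) % n] if digits else candidates[0]

At n+1 calls per query this is exactly the 10x-ish token multiplier the post describes, which is why it only becomes pleasant at very high throughput.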


Is there a tool that provides functionality like this that you can layer on top of Cerebras's API, given that you are not worried about using 10x-50x more tokens per query?


Many of the results in the 'agent' literature require several agents and many iterations to produce an output. See some examples here [1]. Getting these results in seconds instead of minutes or hours would be incredible - and would help with iteration and experimentation to improve algorithms.

[1] https://langchain-ai.github.io/langgraph/tutorials/multi_age...


Fast inference can substitute for larger models in some circumstances. As you said, you can run the model multiple times. DeepMind took a detailed look at this; see https://arxiv.org/abs/2408.03314.
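The cheapest trick from that line of work is self-consistency-style voting: sample many reasoning paths and take a majority vote over the final answers. A rough sketch, again assuming an OpenAI-compatible endpoint (base URL and model name are placeholders):

    from collections import Counter
    from openai import OpenAI

    client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="...")  # placeholder

    def sample_final(question: str) -> str:
        resp = client.chat.completions.create(
            model="llama3.1-8b",  # placeholder model name
            messages=[{"role": "user", "content":
                question + "\nThink step by step, then put the final answer alone on the last line."}],
            temperature=0.8,
        )
        lines = resp.choices[0].message.content.strip().splitlines()
        return lines[-1] if lines else ""

    def self_consistent(question: str, n: int = 16) -> str:
        # n independent samples; at ~1M tok/sec this is seconds, not minutes.
        finals = [sample_final(question) for _ in range(n)]
        return Counter(finals).most_common(1)[0][0]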


Not just that, but at this sort of speed you could have a network of LLMs all talk through and discuss an answer before responding. You could literally script it to generate internal thoughts and challenges to itself before it replies.
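A toy version of that internal dialogue is just three chained calls (draft, self-challenge, revise), reusing the `ask` helper sketched upthread:

    def deliberate(question: str) -> str:
        # Draft, challenge, revise: three extra model calls per reply,
        # which only feels interactive at very high token throughput.
        draft = ask(f"Draft an answer to:\n{question}")
        critique = ask(f"Question: {question}\n\nDraft: {draft}\n\n"
                       "List the strongest objections to this draft.")
        return ask(f"Question: {question}\n\nDraft: {draft}\n\n"
                   f"Objections: {critique}\n\nWrite an improved final answer.")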



