
From page 4 of https://arxiv.org/abs/2302.13971:

> When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.

At $4/GPU-hour for an A100 80GB GPU, that's $4 * 2,048 GPUs * 21 days * 24 hours = $4,128,768.
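
A quick back-of-the-envelope check of those figures (the throughput, GPU count, and token count are from the paper; the $4/GPU-hour rate is an assumed cloud price, not something the paper states):

  # Rough sanity check of the training-time and cost figures above.
  tokens = 1.4e12            # training tokens, from the paper
  tok_per_sec_per_gpu = 380  # throughput, from the paper
  gpus = 2048
  price_per_gpu_hour = 4.0   # assumed $/GPU-hour

  seconds = tokens / (tok_per_sec_per_gpu * gpus)
  days = seconds / 86400
  cost = gpus * (seconds / 3600) * price_per_gpu_hour

  print(f"{days:.1f} days")   # ~20.8 days, matching the ~21-day figure
  print(f"${cost:,.0f}")      # ~$4.1M, close to the $4,128,768 above (which rounds up to a full 21 days)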



Hmmm… so a 7 billion parameter model could probably be trained on consumer GPUs for one or two orders of magnitude lower cost, particularly if you didn’t go well beyond Chinchilla-optimal training time.
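
A minimal sketch of why one to two orders of magnitude is plausible, assuming the widely used C ≈ 6·N·D training-FLOP approximation and the ~20 tokens/parameter Chinchilla rule of thumb (both are heuristics assumed here, not figures from the thread):

  # Compare training compute for LLaMA 65B (1.4T tokens) vs a 7B model
  # trained only to roughly Chinchilla-optimal token counts.
  def train_flops(params, tokens):
      return 6 * params * tokens  # common approximation for dense transformer training

  llama_65b = train_flops(65e9, 1.4e12)        # 65B on 1.4T tokens
  chinchilla_7b = train_flops(7e9, 20 * 7e9)   # 7B at ~20 tokens/param (~140B tokens)

  print(f"{llama_65b / chinchilla_7b:.0f}x")   # ~93x, i.e. roughly two orders of magnitude less compute

Whether consumer GPUs actually get you that full factor in dollars depends on their lower throughput and memory, so this only bounds the compute side of the estimate.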


The whole point of LLaMA is to go beyond Chinchilla optimal:

> The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference.



