
From page 4 of https://arxiv.org/abs/2302.13971:

> When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.

At $4/GPU-hour for an A100 80GB GPU, that's $4 * 2,048 GPUs * 21 days * 24 hours = $4,128,768.
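
A quick back-of-the-envelope check of those figures (the throughput, GPU count, and token count are from the paper; the $4/GPU-hour rate is an assumed cloud price, not something the paper states):

  # Rough sanity check of the training-time and cost figures above.
  tokens = 1.4e12            # training tokens, from the paper
  tok_per_sec_per_gpu = 380  # throughput, from the paper
  gpus = 2048
  price_per_gpu_hour = 4.0   # assumed $/GPU-hour

  seconds = tokens / (tok_per_sec_per_gpu * gpus)
  days = seconds / 86400
  cost = gpus * (seconds / 3600) * price_per_gpu_hour

  print(f"{days:.1f} days")   # ~20.8 days, matching the ~21-day figure
  print(f"${cost:,.0f}")      # ~$4.1M, close to the $4,128,768 above (which rounds up to a full 21 days)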



Hmmm… so a 7 billion parameter model could probably be trained on consumer GPUs for one or two orders of magnitude lower cost, particularly if you didn’t go well beyond Chinchilla-optimal training time.
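
A minimal sketch of why one to two orders of magnitude is plausible, assuming the widely used C ≈ 6·N·D training-FLOP approximation and the ~20 tokens/parameter Chinchilla rule of thumb (both are heuristics assumed here, not figures from the thread):

  # Compare training compute for LLaMA 65B (1.4T tokens) vs a 7B model
  # trained only to roughly Chinchilla-optimal token counts.
  def train_flops(params, tokens):
      return 6 * params * tokens  # common approximation for dense transformer training

  llama_65b = train_flops(65e9, 1.4e12)        # 65B on 1.4T tokens
  chinchilla_7b = train_flops(7e9, 20 * 7e9)   # 7B at ~20 tokens/param (~140B tokens)

  print(f"{llama_65b / chinchilla_7b:.0f}x")   # ~93x, i.e. roughly two orders of magnitude less compute

Whether consumer GPUs actually get you that full factor in dollars depends on their lower throughput and memory, so this only bounds the compute side of the estimate.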


The whole point of LLaMA is to go beyond Chinchilla optimal:

> The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference.



