I am by no means an expert, but I think it is a process that lets you train an LLM from another LLM without needing anywhere near as much compute or data as training from scratch. I think this was the thing DeepSeek pioneered. Don’t quote me on any of that though.
No, distillation is far older than DeepSeek. DeepSeek was impressive because of algorithmic improvements that allowed them to train a model of that size with vastly less compute than anyone expected, even accounting for distillation.
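For anyone unfamiliar: classic knowledge distillation (going back to Hinton et al., 2015) trains a smaller "student" model to match a larger "teacher" model's output distribution instead of just the hard labels. A rough PyTorch sketch of the core loss, purely illustrative and not DeepSeek's actual pipeline:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then push the
    # student's predictions toward the teacher's via KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Toy example: a batch of 4 positions over a vocabulary of 10 tokens.
teacher_logits = torch.randn(4, 10)                        # from the frozen teacher
student_logits = torch.randn(4, 10, requires_grad=True)    # from the student
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```

The student trains on the teacher's full probability distribution, which carries more signal per example than one-hot labels, which is part of why it needs less data.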
I also haven’t seen any hard data on how much they actually use distillation-like techniques. They for sure used a bunch of synthetically generated data to get better at reasoning, something that is now commonplace.
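In the LLM context the "distillation-like" part usually just means sampling outputs from a stronger model and fine-tuning on that text, rather than matching logits directly. A hedged sketch with the Hugging Face transformers API — the model name and prompts are placeholders, not anything DeepSeek actually used:

```python
from transformers import pipeline

# Placeholder teacher model; in practice you'd use whatever strong model you have access to.
teacher = pipeline("text-generation", model="gpt2")

prompts = [
    "Q: If a train travels 60 km in 45 minutes, what is its speed in km/h? A:",
    "Q: Solve for x: 2x + 3 = 11. A:",
]

# Sample completions from the teacher; these become the synthetic training targets.
synthetic_data = []
for prompt in prompts:
    out = teacher(prompt, max_new_tokens=64, do_sample=True, temperature=0.7)
    synthetic_data.append({"prompt": prompt, "completion": out[0]["generated_text"]})

# The (prompt, completion) pairs would then feed ordinary supervised fine-tuning
# of the student model, e.g. with the Trainer API or TRL's SFTTrainer.
```

That's why the line between "distillation" and "training on synthetic data" is blurry, and why it's hard to pin down how much of either went into any given model.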