If you're optimizing for lower power draw + higher throughput on Mac (especially in MLX), definitely keep an eye on the Desloth LLMs that are starting to appear.
Desloth models are essentially aggressive distillations of larger instruction models (think 7B → 1.3B or 2B) with quantization-aware training (QAT) baked in, built specifically for high tokens/sec at a minimal memory footprint. They're tiny but surprisingly capable for structured outputs, fast completions, and lightweight agent pipelines.
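For context, here's roughly what that workflow looks like with mlx-lm (`pip install mlx-lm`). The repo name is a placeholder since I can't link a specific Desloth checkpoint; `load()` and `generate()` are the standard mlx_lm entry points:

```python
# Minimal sketch with mlx-lm (pip install mlx-lm). The repo name below is a
# placeholder -- point it at whichever small distilled checkpoint you're testing.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/SMALL-DISTILLED-MODEL-4bit")  # placeholder repo

prompt = 'Return a JSON object with keys "city" and "country" for Paris.'
# verbose=True streams the output and prints prompt/generation tok-per-sec stats.
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
```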
I'm seeing Desloth-tier models consistently hit >50 tok/sec on M1/M2 hardware without the fans ever spinning up, especially with low-bit quants like Q4_K_M or Q5_0 (those are GGUF/llama.cpp formats; the MLX-side equivalent is its native 4-bit or 8-bit quantization via `mlx_lm.convert -q`).
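If you want to sanity-check tok/sec numbers on your own machine, a quick-and-dirty timing pass looks like this (same placeholder model as above; a sketch, not a rigorous benchmark):

```python
# Rough throughput check -- results vary with quant level, prompt length,
# and thermal state.
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/SMALL-DISTILLED-MODEL-4bit")  # placeholder repo

prompt = "Explain KV caching in two sentences."
start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Approximate: re-tokenizes the output to count generated tokens. Elapsed
# time includes prompt processing, so this is a blended figure, not pure
# decode speed.
n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/sec")
```

(`verbose=True` in `generate()` also prints prompt and generation tok/sec directly, if you'd rather skip the manual timing.)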
If you care about efficiency per watt and low-latency inference (vs. maximum capability), these newer Desloth-style architectures are going to be a serious unlock.