
Oh, 8x H200 is nice. For llama.cpp, definitely look at https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locall... - llama.cpp has a high-throughput mode which should be helpful.

You should be able to get 40 to 50 tokens/s at a minimum. High-throughput mode plus a small draft model might get you to 100 tokens/s generation.
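A rough sketch of what the draft-model setup looks like with llama-server's speculative decoding flags. The flag names below match recent llama.cpp builds (check `llama-server --help` on your version), and both model filenames are placeholders, not actual release names:

```shell
# Serve the main model with a small draft model for speculative decoding.
# Model paths are placeholders; substitute your actual GGUF files.
llama-server \
  -m Qwen3-Coder-Q4_K_M.gguf \
  -md Qwen3-Coder-draft-small.gguf \
  --draft-max 16 \
  --draft-min 1 \
  -ngl 99 \
  -c 32768 \
  --parallel 8
```

`-md` points at the draft model, `--draft-max`/`--draft-min` bound how many tokens are speculated per step, `-ngl 99` offloads all layers to the GPUs, and `--parallel` enables batched concurrent request slots for throughput.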


