
I did! I do think Mistral models are pretty okay, but even the 4-bit quantized version runs at about 16 tokens/second: more or less usable, but a big step down from the MoE options.

Might have to swap out Ollama for vLLM though and see how different things are.



> Might have to swap out Ollama for vLLM though and see how different things are.

Oh, that might be it. GGUF is slower than, say, AWQ if you want 4-bit, or fp8 if you want the best quality (especially on the Ada architecture, which I think your GPUs are).
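
Roughly, it's just a matter of which quantized checkpoint you point vLLM at. A minimal sketch with vLLM's Python API, assuming a pre-quantized AWQ checkpoint (the model id below is a placeholder, not a recommendation):

    from vllm import LLM, SamplingParams

    # Placeholder model id: point this at whatever pre-quantized AWQ repo you use.
    # Swap quantization="awq" for quantization="fp8" (and an fp8 checkpoint)
    # to get the higher-quality option on Ada GPUs.
    llm = LLM(model="your-org/mistral-small-awq", quantization="awq")

    outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)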

edit: vLLM is also better for tensor parallelism and for batched inference; some agentic tools can issue multiple queries in parallel. We run Devstral fp8 on 2x A6000 (old, not even Ada), and even with Marlin kernels we get ~35-40 t/s generation and ~2-3k t/s prompt processing on a single session, with ~4 parallel sessions supported at full context. In practice it works with 6 people using it concurrently, since not all sessions reach max context. You'd get about half of that on 2x L4, but you should see higher generation t/s since you have Ada GPUs (native fp8 support).
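
For a rough idea of the shape of that setup, here's a sketch with vLLM's Python API, tensor parallel across two GPUs with fp8; the model id, context length, and sequence limit are illustrative assumptions, not our exact config:

    from vllm import LLM, SamplingParams

    # Sketch only: shard one fp8 model across two GPUs and let vLLM's
    # scheduler batch a handful of concurrent sessions.
    llm = LLM(
        model="your-org/devstral-small-fp8",  # placeholder model id
        quantization="fp8",
        tensor_parallel_size=2,    # split weights across both cards
        max_model_len=32768,       # per-session context budget (assumption)
        max_num_seqs=4,            # roughly 4 full-context sessions in flight
        gpu_memory_utilization=0.90,
    )

    # Batched generation: these prompts are scheduled together, which is
    # where the parallel-session throughput comes from.
    prompts = ["Summarize this diff: ...", "Write a unit test for ..."]
    for out in llm.generate(prompts, SamplingParams(max_tokens=256)):
        print(out.outputs[0].text)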



