What sort of hardware will run Qwen3-Coder-480B-A35B-Instruct?
With performance apparently comparable to Sonnet, some heavy Claude Code users could be interested in running it locally. They have instructions for configuring it for use with Claude Code. Huge usage bills are regularly shared on X, so maybe it could even be economical (say, for a team of 6 or so sharing a local instance).
Any significant benefits at 3 or 4 bit? I have access to twice that much VRAM and system RAM but of course that could potentially be better used for KV cache.
So dynamic quants like what I upload are not actually 4bit! It's a mixture of 4bit to 8bit with important layers being in higher precision! I wrote about our method here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
For coding you want more precision so the higher the quant the better.
But there is debate over whether a smaller model at a higher quant is better than a larger one at a lower quant. You need to test for yourself with your use cases, I'm afraid.
Edit: They did announce that smaller variants will be released.
I can say that this really works great; I'm a heavy user of the Unsloth dynamic quants. I run DeepSeek V3/R1 at Q3, and ERNIE-300B and Kimi K2 at Q3 too. Amazing performance. I run Qwen3-235B at both Q4 and Q8 and can barely tell the difference, so much so that I just keep Q4 since it's twice as fast.
LLMs usually have about 3.6 bits of data per parameter. You're losing a lot of information if quantized to 2 bits. 4 bit quants are the sweet spot where there's not much quality loss.
I would say that three or four bit are likely to be significantly better. But that’s just from my previous experience with quants. Personally, I try not to use anything smaller than a Q4.
Interesting, so with enough memory bandwidth, even a server CPU has enough compute to do inference on a rather large model? Enough to compete against an M4 GPU?
Edit: I just asked ChatGPT and it says that even with no memory bandwidth bottleneck, I can still only achieve around 1 token/s from a 96-core CPU.
For a single user prompting with one or few prompts at a time, compute is not the bottleneck. Memory bandwidth is. This is because the entire model's weights must be run through the algorithm many times per prompt. This is also why multiplexing many prompts at the same time is relatively easy and effective, as many matrix multiplications can happen in the time it takes to do a single fetch from memory.
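For a rough sense of that, here's a back-of-the-envelope sketch (all figures are illustrative assumptions, not benchmarks): single-stream decode speed is roughly memory bandwidth divided by the bytes of weights read per token, which is why bandwidth rather than core count dominates.

```python
# Rough sketch: single-stream decode is bandwidth-bound, so
# tokens/s ~= memory bandwidth / bytes of weights read per token.
# All numbers below are illustrative assumptions, not benchmarks.

def est_tokens_per_s(active_params_b: float, bits_per_weight: float,
                     bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# A dense 70B model at Q4 on dual-channel DDR5 (~80 GB/s) vs a wide
# unified-memory machine (~800 GB/s): the ratio tracks bandwidth.
for name, bw in [("dual-channel DDR5", 80), ("wide unified memory", 800)]:
    print(f"{name}: ~{est_tokens_per_s(70, 4, bw):.1f} tok/s")
```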
> This is because the entire model's weights must be run through the algorithm many times per prompt.
And this is why I'm so excited about MoE models! qwen3:30b-a3b runs at the speed of a 3B parameter model. It's completely realistic to run on a plain CPU with 20 GB RAM for the model.
Yes, but with a 400B parameter model, at fp16 it's 800GB right? So with 800GB/s memory bandwidth, you'd still only be able to bring them in once per second.
Edit: actually forgot the MoE part, so that makes sense.
Approximately, yes. For MoE models, there is less required bandwidth, as you're generally only processing the weights from one or two experts at a time. Though which experts fire can change from token to token, so it's best if they all fit in RAM. The sort of machines hyperscalers are using to run these things have essentially 8x APUs, each with about that much bandwidth, connected to other similar boxes via InfiniBand or 800 Gbps Ethernet. Since it's relatively straightforward to split up the matrix math for parallel computation, segmenting the memory in this way allows for near-linear increases in memory bandwidth and inference performance, and is effectively the same thing you're doing when adding GPUs.
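A quick sketch of why the MoE split matters so much for that bandwidth budget, assuming the ~480B-total / ~35B-active figures implied by the model name and a hypothetical 800 GB/s memory system:

```python
# Sketch of why MoE helps at decode time: only the active experts'
# weights are read per token. Assumes ~480B total / ~35B active params
# (per the model name) and a hypothetical 800 GB/s memory system.

def tok_per_s(params_b, bits, bw_gb_s):
    return bw_gb_s / (params_b * bits / 8)   # GB/s over GB read per token

BW = 800  # GB/s, assumed
print("dense 480B, Q4:   ", round(tok_per_s(480, 4, BW), 1), "tok/s upper bound")
print("MoE 35B active, Q4:", round(tok_per_s(35, 4, BW), 1), "tok/s upper bound")
# Caveat: which experts fire changes per token, so all ~480B params
# still need to sit in (fast) memory even though only ~35B are read.
```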
Out of curiosity I've repeatedly compared the tokens/sec of various open weight models and consistently come up with this: tokens/sec/USD is near constant.
If a $4,000 Mac does something at X tok/s, a $400 AMD PC on pure CPU does it at 0.1*X tok/s.
Assuming good choices for how that money is spent. You can always waste more money. As others have said, it's all about memory bandwidth. AMD's "AI Max+ 395" is gonna make this interesting.
And of course you can always just not have enough RAM to even run the model. This tends to happen with consumer discrete GPUs not having that much VRAM, they were built for gaming.
Intel already has a great-value GPU. Everyone wants them to disrupt the game and destroy the product niches. Its general-purpose compute performance is pretty bad, alas, but maybe that doesn't matter for AI?
I'm not sure if there are higher-capacity GDDR6 or GDDR7 chips to buy. I somewhat doubt you can add more without more channels, but AMD did just ship the R9700, based on the RX 9070 but with double the RAM. Something like Strix Halo, an APU with more LPDDR channels, could work. Word is that Strix Halo's 2027 successor, Medusa Halo, will go to 6 channels, and it's hard to see a significant advantage without that win; the compute is already somewhat throughput-constrained, and a leap in memory bandwidth will definitely be required. Dual-channel 128-bit isn't enough! (Some quick channel math is sketched below.)
There's also the MRDIMM standard, which multiplexes multiple chips. That promises a doubling of both capacity and throughput.
Apple has definitely done two brilliant, costly things: putting very wide (but not especially fast) memory on package (Intel dabbled in something similar with regular-width RAM in the consumer space a while ago with Lakefield), and then tiling multiple dies together, so that if they had four good chips next to each other they could ship them as one. An incredibly clever maneuver to get fantastic yields and to scale very big.
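For a sense of scale on the channel-count point above, the simple bandwidth arithmetic looks like this (bus widths and transfer rates are assumed for illustration, not official specs):

```python
# Simple bandwidth math: GB/s = (bus_width_bits / 8) * transfer rate in GT/s.
# Figures below are assumed/illustrative, not official specs.

def bandwidth_gb_s(bus_width_bits: int, mt_per_s: int) -> float:
    return bus_width_bits / 8 * mt_per_s / 1000

print("Dual-channel DDR5-5600 (128-bit):        ", bandwidth_gb_s(128, 5600), "GB/s")
print("Strix-Halo-class 256-bit LPDDR5X-8000:   ", bandwidth_gb_s(256, 8000), "GB/s")
print("Hypothetical 384-bit (6-ch) LPDDR5X-8000:", bandwidth_gb_s(384, 8000), "GB/s")
```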
It's not faster at running Qwen3-Coder, because Qwen3-Coder does not fit in 96GB, so can't run at all. My goal here is to run Qwen3-Coder (or similarly large models).
Sure you can build a cluster of RTX 6000s but then you start having to buy high-end motherboards and network cards to achieve the bandwidth necessary for it to go fast. Also it's obscenely expensive.
I mean, sure, it's not quite 512GB levels, but you can get 128GB on a Ryzen AI Max chipset, which has unified memory like Apple's. They're also pretty reasonably priced; I saw an AI Max 370 with 96GB on Amazon earlier for a shade over £1000. I guess you could boost that with an eGPU to gain a bit extra, but 64GB would likely be the max you could add, so still not quite enough to run full Qwen3 Coder at a decent quant, though not far off. Hopefully the next gen will offer more RAM, or another model comes out that can beat Q3 with fewer params.
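For a rough idea of what fits in 128GB, here's the weights-only arithmetic for a ~480B-parameter model (it ignores KV cache and runtime overhead, so real GGUF files run somewhat larger):

```python
# Rough weights-only footprint for a ~480B-parameter model at various
# average bits per weight (ignores KV cache, activations, and overhead).
PARAMS = 480e9
for bits in (2, 3, 4, 8, 16):
    gb = PARAMS * bits / 8 / 1e9
    fits = "fits" if gb <= 128 else "does not fit"
    print(f"{bits}-bit: ~{gb:.0f} GB -> {fits} in 128 GB unified memory")
```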
That's very informative, thanks! So a DGX H200 should be able to run it at 16-bit precision. If I recall correctly, the current hourly rate should be around $25. Not sure what the throughput is, though.
To run the real version with the benchmarks they give, you would need the non-quantized, non-distilled version. So I am guessing that means a cluster of 8 H200s if you want to be more or less up to date. They have B200s now, which are much faster but also much more expensive: $300,000+.
You will see people making quantized distilled versions but they never give benchmark results.
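As a sanity check on the 8x H200 guess (assuming 480B parameters at 2 bytes each and 141GB of HBM per H200):

```python
# Quick sanity check on the "cluster of 8 H200s" guess.
# Assumes 480B params at 16-bit (2 bytes each) and 141 GB HBM per H200.
weights_gb = 480e9 * 2 / 1e9            # ~960 GB of weights at FP16/BF16
hbm_gb = 8 * 141                        # ~1128 GB across one 8-GPU node
print(f"weights: ~{weights_gb:.0f} GB, HBM: {hbm_gb} GB, "
      f"headroom for KV cache: ~{hbm_gb - weights_gb:.0f} GB")
```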
Oh, you can run the Q8_0 / Q8_K_XL, which is nearly equivalent to FP8 (maybe off by 0.01% or less) -> you will need 500GB of VRAM + RAM + disk space. Via MoE layer offloading, it should function OK.
With RAM you would need at least 500GB to load it, plus some 100-200GB more for context. Pair it with a 24GB GPU and the speed will be at least 10 t/s, I estimate.
Oh yes, for FP8 you will need 500GB-ish, and 4-bit around 250GB. Offloading MoE experts / layers to RAM will definitely help, and as you mentioned, a 24GB card should be enough!
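Putting those numbers together, a rough memory budget for the Q8-with-offloading setup described above might look like this (the context allowance and the split are assumptions, not measurements):

```python
# Sketch of the memory budget described above (all figures rough):
# ~480B params at ~8 bits/weight, a 24 GB GPU holding the always-active
# layers, and everything else (MoE experts + KV cache) in system RAM.
weights_gb = 480e9 * 8 / 8 / 1e9        # ~480 GB of Q8 weights
kv_cache_gb = 100                        # assumed allowance for long context
gpu_vram_gb = 24
ram_needed_gb = weights_gb + kv_cache_gb - gpu_vram_gb
print(f"~{ram_needed_gb:.0f} GB of system RAM on top of the 24 GB GPU")
```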