What sort of hardware will run Qwen3-Coder-480B-A35B-Instruct?
With performance apparently comparable to Sonnet, some heavy Claude Code users could be interested in running it locally. They have instructions for configuring it for use with Claude Code. Huge usage bills are regularly shared on X, so maybe it could even be economical (say, for a team of 6 or so sharing a local instance).
Any significant benefits at 3 or 4 bit? I have access to twice that much VRAM and system RAM but of course that could potentially be better used for KV cache.
So dynamic quants like what I upload are not actually 4bit! It's a mixture of 4bit to 8bit with important layers being in higher precision! I wrote about our method here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
For coding you want more precision so the higher the quant the better.
But there is debate over whether a smaller model at a higher quant is better than a larger one at a lower quant. You need to test for yourself with your use cases, I'm afraid.
Edit: They did announce that smaller variants will be released.
I can say that this really works great; I'm a heavy user of the Unsloth dynamic quants. I run DeepSeek V3/R1 at Q3, and ERNIE-300B and Kimi K2 at Q3 too. Amazing performance. I run Qwen3-235B at both Q4 and Q8 and can barely tell the difference, so much so that I just keep Q4 since it's twice as fast.
LLMs usually have about 3.6 bits of data per parameter. You're losing a lot of information if quantized to 2 bits. 4 bit quants are the sweet spot where there's not much quality loss.
I would say that three or four bit are likely to be significantly better. But that’s just from my previous experience with quants. Personally, I try not to use anything smaller than a Q4.
Interesting, so with enough memory bandwidth, even a server CPU has enough compute to do inference on a rather large model? Enough to compete against an M4 GPU?
Edit: I just asked ChatGPT and it says that even with no memory bandwidth bottleneck, I can still only achieve around 1 token/s from a 96-core CPU.
For a single user prompting with one or few prompts at a time, compute is not the bottleneck. Memory bandwidth is. This is because the entire model's weights must be run through the algorithm many times per prompt. This is also why multiplexing many prompts at the same time is relatively easy and effective, as many matrix multiplications can happen in the time it takes to do a single fetch from memory.
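For a rough sense of that, here's a back-of-the-envelope sketch (all figures are illustrative assumptions, not benchmarks): single-stream decode speed is roughly memory bandwidth divided by the bytes of weights read per token, which is why bandwidth rather than core count dominates.

```python
# Rough sketch: single-stream decode is bandwidth-bound, so
# tokens/s ~= memory bandwidth / bytes of weights read per token.
# All numbers below are illustrative assumptions, not benchmarks.

def est_tokens_per_s(active_params_b: float, bits_per_weight: float,
                     bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# A dense 70B model at Q4 on dual-channel DDR5 (~80 GB/s) vs a wide
# unified-memory machine (~800 GB/s): the ratio tracks bandwidth.
for name, bw in [("dual-channel DDR5", 80), ("wide unified memory", 800)]:
    print(f"{name}: ~{est_tokens_per_s(70, 4, bw):.1f} tok/s")
```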
> This is because the entire model's weights must be run through the algorithm many times per prompt.
And this is why I'm so excited about MoE models! qwen3:30b-a3b runs at the speed of a 3B parameter model. It's completely realistic to run on a plain CPU with 20 GB RAM for the model.
Yes, but with a 400B parameter model, at fp16 it's 800GB right? So with 800GB/s memory bandwidth, you'd still only be able to bring them in once per second.
Edit: actually forgot the MoE part, so that makes sense.
Approximately, yes. For MoE models, there is less required bandwidth, as you're generally only processing the weights from one or two experts at a time. Though which experts fire can change from token to token, so it's best if they all fit in RAM. The sort of machines hyperscalers are using to run these things have essentially 8x APUs, each with about that much bandwidth, connected to other similar boxes via InfiniBand or 800 Gbps Ethernet. Since it's relatively straightforward to split up the matrix math for parallel computation, segmenting the memory in this way allows for near-linear increases in memory bandwidth and inference performance, and is effectively the same thing you're doing when adding GPUs.
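A quick sketch of why the MoE split matters so much for that bandwidth budget, assuming the ~480B-total / ~35B-active figures implied by the model name and a hypothetical 800 GB/s memory system:

```python
# Sketch of why MoE helps at decode time: only the active experts'
# weights are read per token. Assumes ~480B total / ~35B active params
# (per the model name) and a hypothetical 800 GB/s memory system.

def tok_per_s(params_b, bits, bw_gb_s):
    return bw_gb_s / (params_b * bits / 8)   # GB/s over GB read per token

BW = 800  # GB/s, assumed
print("dense 480B, Q4:   ", round(tok_per_s(480, 4, BW), 1), "tok/s upper bound")
print("MoE 35B active, Q4:", round(tok_per_s(35, 4, BW), 1), "tok/s upper bound")
# Caveat: which experts fire changes per token, so all ~480B params
# still need to sit in (fast) memory even though only ~35B are read.
```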
Out of curiosity I've repeatedly compared the tokens/sec of various open weight models and consistently come up with this: tokens/sec/USD is near constant.
If a $4,000 Mac does something at X tok/s, a $400 AMD PC on pure CPU does it at 0.1*X tok/s.
Assuming good choices for how that money is spent. You can always waste more money. As others have said, it's all about memory bandwidth. AMD's "AI Max+ 395" is gonna make this interesting.
And of course you can always just not have enough RAM to even run the model. This tends to happen with consumer discrete GPUs not having that much VRAM, they were built for gaming.
Intel already has a great-value GPU. Everyone wants them to disrupt the game and destroy the product niches. Its general-purpose compute performance is pretty bad, alas, but maybe that doesn't matter for AI?
I'm not sure if there are higher-capacity GDDR6 or GDDR7 chips to buy. I somewhat doubt you can add more without more channels, but AMD did just ship the R9700, based on the RX 9070 but with double the RAM. Something like Strix Halo, an APU with more LPDDR channels, could work. Word is that Strix Halo's 2027 successor, Medusa Halo, will go to 6 channels, and it's hard to see a significant advantage without that win; the compute is already somewhat throughput-constrained, and a leap in memory bandwidth will definitely be required. Dual-channel 128-bit isn't enough! (Some quick channel math is sketched below.)
There's also the MRDIMM standard, which multiplexes multiple chips. That promises a doubling of both capacity and throughput.
Apple has definitely done two brilliant, costly things: putting very wide (but not especially fast) memory on package (Intel dabbled in something similar with regular-width RAM in the consumer space a while ago with Lakefield), and then tiling multiple dies together, so that if they had four good chips next to each other they could ship them as one. An incredibly clever maneuver to get fantastic yields and to scale very big.
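For a sense of scale on the channel-count point above, the simple bandwidth arithmetic looks like this (bus widths and transfer rates are assumed for illustration, not official specs):

```python
# Simple bandwidth math: GB/s = (bus_width_bits / 8) * transfer rate in GT/s.
# Figures below are assumed/illustrative, not official specs.

def bandwidth_gb_s(bus_width_bits: int, mt_per_s: int) -> float:
    return bus_width_bits / 8 * mt_per_s / 1000

print("Dual-channel DDR5-5600 (128-bit):        ", bandwidth_gb_s(128, 5600), "GB/s")
print("Strix-Halo-class 256-bit LPDDR5X-8000:   ", bandwidth_gb_s(256, 8000), "GB/s")
print("Hypothetical 384-bit (6-ch) LPDDR5X-8000:", bandwidth_gb_s(384, 8000), "GB/s")
```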
It's not faster at running Qwen3-Coder, because Qwen3-Coder does not fit in 96GB, so can't run at all. My goal here is to run Qwen3-Coder (or similarly large models).
Sure you can build a cluster of RTX 6000s but then you start having to buy high-end motherboards and network cards to achieve the bandwidth necessary for it to go fast. Also it's obscenely expensive.
I mean, sure, it's not quite 512GB levels, but you can get 128GB on a Ryzen AI Max chipset, which has unified memory like Apple's. They're also pretty reasonably priced; I saw an AI Max 370 with 96GB on Amazon earlier for a shade over £1000. I guess you could boost that with an eGPU to gain a bit extra, but 64GB would likely be the max you could add, so still not quite enough to run full Qwen3 Coder at a decent quant, though not far off. Hopefully the next gen will offer more RAM, or another model comes out that can beat Q3 with fewer params.
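For a rough idea of what fits in 128GB, here's the weights-only arithmetic for a ~480B-parameter model (it ignores KV cache and runtime overhead, so real GGUF files run somewhat larger):

```python
# Rough weights-only footprint for a ~480B-parameter model at various
# average bits per weight (ignores KV cache, activations, and overhead).
PARAMS = 480e9
for bits in (2, 3, 4, 8, 16):
    gb = PARAMS * bits / 8 / 1e9
    fits = "fits" if gb <= 128 else "does not fit"
    print(f"{bits}-bit: ~{gb:.0f} GB -> {fits} in 128 GB unified memory")
```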
That's very informative, thanks! So a DGX H200 should be able to run it at 16-bit precision. If I recall correctly, the current hourly rate should be around $25. Not sure what the throughput is, though.
To run the real version with the benchmarks they give, you would need the non-quantized, non-distilled version. So I am guessing that means a cluster of 8 H200s if you want to be more or less up to date. They have B200s now, which are much faster but also much more expensive: $300,000+.
You will see people making quantized distilled versions but they never give benchmark results.
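As a sanity check on the 8x H200 guess (assuming 480B parameters at 2 bytes each and 141GB of HBM per H200):

```python
# Quick sanity check on the "cluster of 8 H200s" guess.
# Assumes 480B params at 16-bit (2 bytes each) and 141 GB HBM per H200.
weights_gb = 480e9 * 2 / 1e9            # ~960 GB of weights at FP16/BF16
hbm_gb = 8 * 141                        # ~1128 GB across one 8-GPU node
print(f"weights: ~{weights_gb:.0f} GB, HBM: {hbm_gb} GB, "
      f"headroom for KV cache: ~{hbm_gb - weights_gb:.0f} GB")
```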
Oh, you can run the Q8_0 / Q8_K_XL, which is nearly equivalent to FP8 (maybe off by 0.01% or less) -> you will need 500GB of VRAM + RAM + disk space. Via MoE layer offloading, it should function OK.
With RAM you would need at least 500GB to load it, plus some 100-200GB more for context. Pair it with a 24GB GPU and the speed will be at least 10 t/s, I estimate.
Oh yes, for FP8 you will need 500GB-ish, and 4-bit around 250GB. Offloading MoE experts / layers to RAM will definitely help, and as you mentioned, a 24GB card should be enough!
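Putting those numbers together, a rough memory budget for the Q8-with-offloading setup described above might look like this (the context allowance and the split are assumptions, not measurements):

```python
# Sketch of the memory budget described above (all figures rough):
# ~480B params at ~8 bits/weight, a 24 GB GPU holding the always-active
# layers, and everything else (MoE experts + KV cache) in system RAM.
weights_gb = 480e9 * 8 / 8 / 1e9        # ~480 GB of Q8 weights
kv_cache_gb = 100                        # assumed allowance for long context
gpu_vram_gb = 24
ram_needed_gb = weights_gb + kv_cache_gb - gpu_vram_gb
print(f"~{ram_needed_gb:.0f} GB of system RAM on top of the 24 GB GPU")
```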