Are those outputs actually from the 671B model? The 671B model needs 8xH200 GPUs...

sylware · on Feb 4, 2025

Nope, you can run the 671B model entirely on CPU and storage. It will take longer to get tokens out of it, but it will work.

I've heard there are some optimizations for CPU inference from storage, so it should be a tad "less slow".

Time to split that RAM among your CPU cores and mmap blocks of weights for inference from storage.
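
To make the mmap idea concrete, here is a minimal sketch and not how any real inference engine does it. It assumes a made-up weight format (a flat file of little-endian float32 values, row-major), and the names `write_weights`, `mmap_matvec`, and `demo` are hypothetical. It only shows the OS paging in the block of weights you are multiplying right now instead of loading the whole file into RAM.

```python
import mmap
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per float32 (assumed storage format)


def write_weights(path, matrix):
    """Store a row-major float32 matrix as a flat binary file (hypothetical format)."""
    with open(path, "wb") as f:
        for row in matrix:
            f.write(struct.pack(f"<{len(row)}f", *row))


def mmap_matvec(path, rows, cols, x, block_rows=2):
    """Compute W @ x, reading W from storage one block of rows at a time."""
    out = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for start in range(0, rows, block_rows):
            stop = min(start + block_rows, rows)
            # Slicing the map copies out only the bytes backing this block of rows.
            block = mm[start * cols * FLOAT_SIZE : stop * cols * FLOAT_SIZE]
            values = struct.unpack(f"<{(stop - start) * cols}f", block)
            for r in range(stop - start):
                row = values[r * cols : (r + 1) * cols]
                out.append(sum(w * xi for w, xi in zip(row, x)))
    return out


def demo():
    """Write a tiny 3x2 weight file, then multiply it by a vector via mmap."""
    matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    x = [0.5, -1.0]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.bin")
        write_weights(path, matrix)
        return mmap_matvec(path, rows=3, cols=2, x=x)
```

Calling `demo()` multiplies the toy 3x2 matrix by the vector block by block. With real model weights the same pattern means throughput is bounded by how fast storage can feed pages, which is why it works but is slow.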

Anaphylaxis · on Feb 5, 2025

Sure, but he explicitly said 'GPU Servers', which makes it unlikely that he used the CPU for inference. So the question about which GPU setup they used is still valid.