Can't they make a GPU instead? Please save us!

AlotOfReading · 2025-06-06T15:34:09 1749224049

A GPU is a very different beast that relies much more heavily on having a gigantic team of software developers supporting it. A CPU is (comparatively) straightforward. You fab and validate a world class design, make sure compiler support is good enough, upstream some drivers and kernel support, and make sure the standard documentation/debugging/optimization tools are all functional. This is incredibly difficult, but achievable because these are all standardized and well understood interface points.

With GPUs you have all these challenges while also building a massively complicated set of custom compilers and interfaces on the software side, while at the same time trying to keep broken user software written against some other company's interface not only functional, but performant.

esafak · 2025-06-06T15:41:02 1749224462

It's not the GPU I want per se but its ability to run ML tasks. If you can do that with your CPU fine!

AlotOfReading · 2025-06-06T16:30:35 1749227435

Echoing the other comment, this isn't easier. I was on a team that did it. The ML team was overheard by media complaining that we were preventing them from achieving their goals because we had taken 2 years to build something that didn't beat the latest hardware from Nvidia, let alone keep pace with how fast their demands had grown.

mdaniel · 2025-06-06T20:25:53 1749241553

I don't need it to beat the latest from nvidia, just be affordable, available, and have user servicable ram slots so "48gb" isn't such an ooo-ahh amount of memory

I couldn't find any buy it now links but 512gb sticks don't seem to be fantasies, either: https://news.samsung.com/global/samsung-develops-industrys-f...

mdaniel · 2025-06-07T01:50:40 1749261040

Now that I'm back at my computer, can search harder and it seems one can legitimately buy 256GB sticks at approximately USD$2000 a pop <https://www.ebay.com/itm/267177294719> or 128GB for $790 <https://www.ebay.com/itm/205354052535>

Since it seems A100s top out at 80GB, and appear to start at $10,000 I'd say it's a steal

Yes, I'm acutely aware that bandwidth matters, but my mental model is the rest of that sentence is "up to a point," since those "self hosted LLM" threads are filled to the brim with people measuring tokens-per-minute or even running inference on CPU

I'm not hardware adjacent enough to try such a stunt, but there was also recently a submission of a BSD-3-Clause implementation of Google's TPU <https://news.ycombinator.com/item?id=44111452>

AlotOfReading · 2025-06-07T20:37:56 1749328676

What would you want from on-gpu slots that you don't get from existing mechanisms to give your GPU access to system memory?

mdaniel · 2025-06-08T01:01:14 1749344474

prelude: I realized that I typed out a ton of words but in the end engineering is all about tradeoffs, so, fine if there's a way I can teach some existing GPU, or some existing PCIe TPU, to access system RAM over an existing PCIe slot, that sounds like a fine step forward. I just don't have a lot of experience in that setup to know if only certain video cards allow that or what

Bearing in mind the aforementioned "I'm not a hardware guy," my mental model of any system RAM access for GPUs is:

  1. copy weights from SSD to RAM
  2. trigger GPU with that RAM location
  3. GPU copies weights over PCIe bus to do calculation
  4. GPU copies activations over PCIe bus back to some place in RAM
  5. goto 3

If my understanding is correct, this PCIe (even at 16 lanes) is still shared with everything else on the motherboard that is also using PCIe, to say nothing of the actual protocol handshaking since it's a common bus and thus needs contention management. I would presume doing such a stunt would at bare minimum need to contend with other SSD traffic and the actual graphical part of the GPU's job[1][2]

Contrast this with memory socket(s) on the "GPU's mainboard" where it is, what, 3mm of trace wires away from ripping the data back and forth between its RAM and its processors, only choosing to PCIe the result out to RAM. It can have its own PCIe to speak to other sibling GPGPU setups for doing multi-device inference[3]

I would entertain people saying "but what a waste having 128GB of RAM only usable for GPGPU tasks" but if all these folks are right in claiming that it's the end of software engineering as we know it, I would guess it's not going to be that idle

1: I wish I had actually made a bigger deal out of wanting a GPGPU since for this purpose I don't care at all what DirectX or Vulkan whatever it runs

2: furthermore, if the "just use system RAM" was such a hot idea, I don't think it would be 2025 and we still have graphics cards with only 8GB of RAM on them. I'm not considering the Apple architecture because they already solder RAM and mark it up so much that normal people can't afford a sane system anyway, so I give no shits how awesome their unified architecture is

3: I also should have drew more attention to the inference need, since AIUI things like the TPUs I have on my desk aren't (able to do|good at) training jobs but that's where my expertise grinds to a halt because I have no idea why that is or how to fix it

AlotOfReading · 2025-06-08T04:11:48 1749355908

Oh, it's not a good idea at all from a performance perspective to use system memory because it's slow as heck. The important thing is that you can do it. Some way of allowing the GPU to page in data from system RAM (or even storage) on an as-needed basis has been supported by Nvidia since at least Tesla generation.

There's actually a multitude of different ways now that each have their own performance tradeoffs like direct DMA from the Nvidia card, data copied via CPU, GPU direct storage, and so on. You seem to understand the gist though, so these are mainly implementation details. Sometimes there's weird limitations with one method like limited to Quadro, or only up to a fixed percentage of system memory.

The short answer is that all of them suck to different degrees and you don't want to use them if possible. They're enabled by default for virtually all systems because they significantly simplify CUDA programming. DDR is much less suitable than GDDR for feeding a bandwidth hungry monster like a GPU, PCI introduces high latency and further constructions, and any CPU involvement is a further slowdown. This would also apply to socketed memory on a GPU though: Significantly slower and less bandwidth.

There's also some additional downsides to accessing system RAM that we don't need to get into, like sometimes losing the benefits of caching and getting full cost memory accesses every time.

mdaniel · 2025-06-08T17:08:32 1749402512

That's interesting, thanks for making me aware. I'll try to dig up some reading material, but in some sense this is going the opposite of how I want the world to work because nvidia is already a supply chain bottleneck and so therefore saying "the solution to this supply and demand is more CUDA" doesn't get me where I want to go

> any CPU involvement is a further slowdown. This would also apply to socketed memory on a GPU though: Significantly slower and less bandwidth

I am afraid what I'm about to say doubles down on my inexperience, but: I could have sworn that series of problems is what DMA was designed to solve: peripherals do their own handshaking without requiring the CPU's involvement (aside from the "accounting" bits of marking regions as in-use). And thus if a GPGPU comes already owning its own RAM, it most certainly does not need to ask the CPU to do jack squat to talk to its own RAM because there's no one else who could possibly be using it

I was looking for an example of things that carried their own RAM and found this, which strictly speaking is what I searched for but is mostly just funny so I hope others get a chuckle too: a SCSI ram disk <https://micha.freeshell.org/ramdisk/RAM_disk.jpg>

AlotOfReading · 2025-06-08T17:23:16 1749403396

Sorry if that was confusing. I was trying to communicate a generality about multiple very different means of accessing the memory: the way we currently build GPUs is a local maximum for performance. Changing anything, even putting dedicated memory on sockets, has a dramatic and negative impact on performance. The latest board I've worked on saw the layout team working overtime to place the memory practically on top of the chip and they were upset it couldn't be closer.

Also, other systems have similar technologies, I'm just mentioning Nvidia as an example.

kvemkon · 2025-06-06T22:49:41 1749250181

And now, 4 years later, I still can choose only among micron and hynix for consumer DDR5 DIMM. No samsung or nanya which I could order right now.

While micron (crucial) 64GB DDR5 (SO-)DIMMs are available since few months.

mort96 · 2025-06-06T15:48:44 1749224924

Well that's even more difficult because not only do you need drivers for the widespread graphics libraries Vulkan, OpenGL and Direct3D, but you also need to deal with the GPGPU mess. Most software won't ever support your compute-focused GPU because you won't support CUDA.

Bolwin · 2025-06-06T17:47:37 1749232057

I mean you most certainly can. Pretty much every ml library has cpu support

esafak · 2025-06-06T18:31:50 1749234710

Not theoretically, but practically, viably.

Asraelite · 2025-06-06T15:47:58 1749224878

> make sure compiler support is good enough

Do compilers optimize for specific RISC-V CPUs, not just profiles/extensions? Same for drivers and kernel support.

My understanding was that if it's RISC-V compliant, no extra work is needed for existing software to run on it.

Arnavion · 2025-06-06T16:59:26 1749229166

You want to optimize for specific chips because different chips have different capabilities that are not captured by just what extensions they support.

A simple example is that the CPU might support running two specific instructions better if they were adjacent than if they were separated by other instructions ( https://en.wikichip.org/wiki/macro-operation_fusion ). So the optimizer can try to put those instructions next to each other. LLVM has target features for this, like "lui-addi-fusion" for CPUs that will fuse a `lui; addi` sequence into a single immediate load.

A more complex example is keeping track of the CPU's internal state. The optimizer models the state of the CPU's functional units (integer, address generation, etc) so that it has an idea of which units will be in use at what time. If the optimizer has to allocate multiple instructions that will use some combination of those units, it can try to lay them out in an order that will minimize stalling on busy units while leaving other units unused.

That information also tells the optimizer about the latency of each instruction, so when it has a choice between multiple ways to compute the same operation it can choose the one that works better on this CPU.

See also: https://myhsu.xyz/llvm-sched-model-1/ https://myhsu.xyz/llvm-sched-model-1.5/

If you don't do this your code will still run on your CPU. It just won't necessarily be as optimal as it could be.

Bolwin · 2025-06-06T17:46:48 1749232008

Wonder if we could generalize this so you can just give the optimizer a file containing all this info, without needing to explicitly add support for each cpu

frankchn · 2025-06-06T18:39:50 1749235190

These configuration files exist (https://llvm.org/docs/TableGen/, https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...) but it is very complicated because the processors themselves are very complicated.

AlotOfReading · 2025-06-06T16:11:38 1749226298

The major compilers optimize for microarchitecture, yes. Here's the tablegen scheduling definition behind LLVM's -mtune=sifive-670 flag as an example: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...

It's not that things won't run, but this is necessary for compilers to generate well optimized code.

speedgoose · 2025-06-06T15:53:06 1749225186

I hope to see dedicated GPU coprocessors disappear sooner rather than later, just like arithmetic coprocessors did.

wtallis · 2025-06-06T16:13:32 1749226412

Arithmetic co-processors didn't disappear so much as they moved onto the main CPU die. There were performance advantages to having the FPU on the CPU, and there were no longer significant cost advantages to having the FPU be separate and optional.

For GPUs today and in the foreseeable future, there are still good reasons for them to remain discrete, in some market segments. Low-power laptops have already moved entirely to integrated GPUs, and entry-level gaming laptops are moving in that direction. Desktops have widely varying GPU needs ranging from the minimal iGPUs that all desktop CPUs now already have, up to GPUs that dwarf the CPU in die and package size and power budget. Servers have needs ranging from one to several GPUs per CPU. There's no one right answer for how much GPU to integrate with the CPU.

otabdeveloper4 · 2025-06-06T18:24:40 1749234280

By "GPU" they probably mean "matrix multiplication coprocessor for AI tasks", not actually a graphics processor.

wtallis · 2025-06-06T19:16:56 1749237416

That doesn't really change anything. The use cases for a GPU in any given market segment don't change depending on whether you call it a GPU.

And for low-power consumer devices like laptops, "matrix multiplication coprocessor for AI tasks" is at least as likely to mean NPU as GPU, and NPUs are always integrated rather than discrete.

otabdeveloper4 · 2025-06-07T15:07:13 1749308833

Yes it does change something.

A GPU needs to run $GAME from $CURRENT_YEAR at 60 fps despite the ten million SLoC of shit code and legacy cruft in $GAME. That's where the huge expense for the GPU manufacturer lies.

Matrix multiplication is a solved probelm and we need to implement it just once in hardware. At some point matrix multiplication will be ubiquitous like floating-point is now.

wtallis · 2025-06-07T23:28:35 1749338915

You're completely ignoring that there are several distinct market segments that want hardware to do AI/ML. Matrix multiplication is not something you can implement in hardware just once.

NVIDIA's biggest weakness right now is that none of their GPUs are appropriate for any system with a lower power budget than a gaming laptop. There's a whole ecosystem of NPUs in phone and laptop SoCs targeting different tradeoffs in size, cost, and power than any of NVIDIA's offerings. These accelerators represent the biggest threat NVIDIA's CUDA monopoly has ever faced. The only response NVIDIA has at the moment is to start working with MediaTek to build laptop chips with NVIDIA GPU IP and start competing against pretty much the entire PC ecosystem.

At the same time, all the various low-power NPU architectures have differing limitations owing to their diverse histories, and approximately none of them currently shipping were designed from the beginning with LLMs in mind. On the timescale of hardware design cycles, AI is still a moving target.

So far, every laptop or phone SoC that has shipped with both an NPU and a GPU has demonstrated that there are some AI workloads where the NPU offers drastically better power efficiency. Putting a small-enough NVIDIA GPU IP block onto a laptop or phone SoC probably won't be able to break that trend.

In the datacenter space, there are also tradeoffs that mean you can't make a one-size-fits-all chip that's optimal for both training and inference.

In the face of all the above complexity, the question of whether a GPU-like architecture retains any actual graphics-specific hardware is a silly question. NVIDIA and AMD have both demonstrated that they can easily delete that stuff from their architectures to get more TFLOPs for general compute workloads using the same amount of silicon.

touisteur · 2025-06-06T19:35:07 1749238507

Wondering how you'd classify Gaudi, tenstorrent-stuff, groq, or lightmatter's photonic thing.

Calling something a GPU tends to make people ask for (good, performant) support for opengl, Vulkan, direct3d... which seem like a huge waste of effort if you want to be an "AI-coprocessor".

wtallis · 2025-06-06T19:57:33 1749239853

> Wondering how you'd classify Gaudi, tenstorrent-stuff, groq, or lightmatter's photonic thing.

Completely irrelevant to consumer hardware, in basically the same way as NVIDIA's Hopper (a data center GPU that doesn't do graphics). They're ML accelerators that for the foreseeable future will mostly remain discrete components and not be integrated onto Xeon/EPYC server CPUs. We've seen a handful of products where a small amount of CPU gets grafted onto a large GPU/accelerator to remove the need for a separate host CPU, but that's definitely not on track to kill off discrete accelerators in the datacenter space.

> Calling something a GPU tends to make people ask for (good, performant) support for opengl, Vulkan, direct3d... which seem like a huge waste of effort if you want to be an "AI-coprocessor".

This is not a problem outside the consumer hardware market.

otabdeveloper4 · 2025-06-07T15:21:10 1749309670

Consumer hardware and AI inference are joined at the hip right now due to perverse historical reasons.

AI inference's big bottleneck right now is RAM and memory bandwidth, not so much compute per se.

If we redid AI inference from scratch without consumer gaming considerations then it probably wouldn't be a coprocessor at all.

saltcured · 2025-06-06T18:34:53 1749234893

Aspects of this has been happening for a long time, as SIMD extensions and as multi-core packaging.

But, there is much more to discrete GPUs than vector instructions or parallel cores. It's very different memory and cache systems with very different synchronization tradeoffs. It's like an embedded computer hanging off your PCI bus, and this computer does not have the same stable architecture as your general purpose CPU running the host OS.

In some ways, the whole modern graphics stack is a sort of integration and commoditization of the supercomputers of decades ago. What used to be special vector machines and clusters full of regular CPUs and RAM has moved into massive chips.

But as other posters said, there is still a lot more abstraction in the graphics/numeric programming models and a lot more compiler and runtime tools to hide the platform. Unless one of these hidden platforms "wins" in the market, it's hard for me to imagine general purpose OS and apps being able to handle the massive differences between particular GPU systems.

It would easily be like prior decades where multicore wasn't taking off because most apps couldn't really use it. Or where special things like the "cell processor" in the playstation required very dedicated development to use effectively. The heterogeneity of system architectures makes it hard for general purpose reuse and hard to "port" software that wasn't written with the platform in mind.

rjsw · 2025-06-06T16:27:15 1749227235

That was one of the ideas behind Larrabee [1]. You can run Mesa on the CPU today using the llvmpipe backend.

https://en.wikipedia.org/wiki/Larrabee_(microarchitecture)