
g-mork · 2026-03-11T09:20:56 1773220856

CPU compute is infinity times less expensive and much easier to work with in general

boxed · 2026-03-11T09:30:48 1773221448

Less expensive how? The reason GPUs are used is because they are more efficient. You CAN run matmul on CPUs for sure, but it's going to be much slower and take a ton more electricity. So to claim it's "less expensive" is weird.

dspillett · 2026-03-11T10:25:25 1773224725

In situations where you have spare CPU power but no spare GPU power, because your GPU(s) & VRAM are busy with other tasks, you might prefer to use what you have rather than pay for an upgrade (even if that means the task will run more slowly).

If you are wanting to run this on a server to pipe the generated speech to a remote user (live, or generating it to send at some other appropriate moment) and your server resources don't have GPUs, then you either have to change your infrastructure, use CPU, or not bother.

Renting GPU access on cloud systems can be more expensive than CPU, especially if you only need GPU processing for specific, occasionally run tasks. Spinning up a VM to serve a request then pulling it down is rarely as quick as cloud providers like to suggest in advertising, so you end up keeping things alive longer than absolutely needed, meaning the spot-pricing rates quoted are lower than what you end up paying.
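A rough sketch of that last point in Python. The numbers are made up purely for illustration, and `effective_hourly_rate` is a hypothetical helper, not any provider's billing formula.

```python
def effective_hourly_rate(quoted_rate, useful_seconds, overhead_seconds):
    """Cost per hour of useful work when boot, teardown and keep-alive time is billed too."""
    if useful_seconds <= 0:
        raise ValueError("useful_seconds must be positive")
    billed_seconds = useful_seconds + overhead_seconds
    return quoted_rate * billed_seconds / useful_seconds


# Hypothetical figures: $1.00/hour quoted, 60 s of real work,
# 180 s of boot time plus idle keep-alive around it.
rate = effective_hourly_rate(1.00, 60, 180)
print(f"effective rate: ${rate:.2f} per useful hour")
```

The shorter and rarer the task, the larger the overhead is relative to the useful work, and the further the real rate drifts from the quoted one.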

g-mork · 2026-03-11T10:21:37 1773224497

This is far too simplistic, you can't discuss perf per watt unless you're talking about a job running at any decent level of utilisation. Numbers like that only matter for larger scale high utilisation services, meanwhile Intel boxes mastered the art of power efficient idle modes decades ago while almost any contemporary GPU still isn't even remotely close, and you can pick up 32 core boxes like that for pennies on the dollar.

Even if utilisation weren't a metric, "efficient" can be interpreted in so many ways as to be pointless to try and apply in the general case. I consider any model I can foist into a Lambda function "efficient" because of secondary concerns you simply cannot meaningfully address with GPU hardware at present (elasticity and manageability for example). That it burns more energy per unit output is almost meaningless to consider for any kind of workload where Lambda would be applicable.

It's the same for any edge-deployed software where "does it run on CPU?" translates to "does the general purpose user have a snowball's chance in hell of running it?", having to depend on 4GB of CUDA libraries to run a utility fundamentally changes the nature and applicability of any piece of software

A few years ago we had smaller cuts of Whisper running at something like 0.5x realtime on CPU, people struggled along anyway. Now we have Nvidia's speech model family comfortably exceeding 2x real time on older processors with far improved word error rate. Which would you prefer to deploy to an edge device? Which improves the total number of addressable users? Turns out we never needed GPUs for this problem in the first place, the model architecture mattered all along, as did the question, "does it run on CPU?".

It's not even clear cut when discussing raw achievable performance. With a CPU-friendly speech model living in a Lambda, no GPU configuration will come close to the achievable peak throughput for the same level of investment. Got a year-long audio recording to process once a year? Slice it up and Lambda will happily chew through it at 500 or 1000x real time
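Back-of-envelope version of the fan-out arithmetic, as a minimal Python sketch. It assumes each worker runs at the 2x real time mentioned above, that chunks are independent, and it ignores cold starts, upload time and concurrency limits. The 10-minute chunk size and the function names are assumptions for illustration.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def plan_fan_out(audio_seconds, chunk_seconds, per_worker_speedup, target_speedup):
    """Return (chunks, workers, wall_clock_seconds) for an idealised parallel run."""
    chunks = math.ceil(audio_seconds / chunk_seconds)
    # Aggregate speedup is roughly workers * per-worker speedup,
    # and there is no point running more workers than chunks.
    workers = min(chunks, math.ceil(target_speedup / per_worker_speedup))
    wall_clock = audio_seconds / (workers * per_worker_speedup)
    return chunks, workers, wall_clock


chunks, workers, wall_clock = plan_fan_out(
    audio_seconds=SECONDS_PER_YEAR,
    chunk_seconds=600,        # assumed 10-minute slices
    per_worker_speedup=2.0,   # "exceeding 2x real time" from above
    target_speedup=1000.0,
)
print(chunks, workers, round(wall_clock / 3600, 2))
```

Under those assumptions, 1000x real time needs about 500 concurrent workers, and a year of audio finishes in under nine hours of wall-clock time.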

woadwarrior01 · 2026-03-11T11:04:32 1773227072

GPUs are a near monopoly. There are at least a handful of big players in the CPU space. Competition alone makes the latter space a lot cheaper.

Also, for inference (and not training) there are other ways to efficiently do matmuls besides the GPU. You might want to look up Apple's undocumented AMX CPU ISA, and also this thing that vendors call the "Neural Engine" in their marketing (capabilities and the term's specific meaning vary broadly from vendor to vendor).

For small 1-3B parameter transformers like TADA, both these options are much more energy efficient, compared to GPU inference.

g-mork · 2026-02-27T22:14:18 1772230458

Yep same, I'd sooner starve than cut my Anthropic sub

rvnx · 2026-02-27T22:29:59 1772231399

If tomorrow Claude pricing changes and they start charging real API costs like 2000+ USD, and there is another service, "NotReallyClaude", that is a bit less good but costs 200 USD, then what would you do?

bandrami · 2026-02-28T02:53:46 1772247226

Man, they really got good at hitting that dopamine button, huh?

g-mork · 2026-02-26T00:37:52 1772066272

Don't forget Wine ships a faithful notepad.exe reimplementation. It should run just fine on Windows

edit: just checked the version that ships with Steam on Linux, yep, works great in a VM

g-mork · 2026-02-24T23:32:25 1771975945

How does this compare to Parakeet, which runs wonderfully on CPU?

g-mork · 2026-02-23T13:36:36 1771853796

The massive DC overbuild matches demand, prices normalise somewhat in 3-5 years.

The massive DC overbuild does not match demand, prices tank in 3-5 years.

Third possibility: some approach like Taalas renders the current storyline meaningless. Would put 3 in 10 odds of this happening but I'd looove to see it.

Fourth: entire planet gets profoundly sick of emdashes, we all move back into caves and live in eternal gratitude of the moment humanity woke up to how little all of this really matters.

g-mork · 2026-02-23T13:02:13 1771851733

What would that achieve? Here, have 1.5% discount on your subnet purchase

g-mork · 2026-02-21T03:27:48 1771644468

how do you make CC talk via a proxy? I had a few googles for this and got nowhere

behnamoh · 2026-02-21T03:35:19 1771644919

set Anthropic base URL in CC to your proxy server and map each model to your preferred models (I keep opus↔opus but technically you can do opus↔gpt-5.3, etc.). then check the incoming messages for the string that triggers compaction (it's a system prompt btw) and modify that message before it hits the LLM server.
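A minimal sketch of the rewrite step such a proxy might apply to each request body before forwarding it, in Python. It assumes the body follows the Anthropic Messages API shape (a top-level `model`, and a `system` field that is either a string or a list of text blocks). `MODEL_MAP`, `COMPACTION_MARKER` and `rewrite_request` are hypothetical names; the actual string that identifies a compaction request isn't given here, so the marker is a placeholder. The HTTP server and forwarding are left out.

```python
import copy

# Hypothetical mapping from the model name the client asks for
# to the model the proxy actually forwards to.
MODEL_MAP = {"opus": "opus"}

# Placeholder only: the real compaction trigger string is not given in the thread.
COMPACTION_MARKER = "<compaction-marker>"


def rewrite_request(body, model_map=MODEL_MAP, marker=COMPACTION_MARKER,
                    replacement="Summarise the conversation so far."):
    """Return a copy of the request body with the model remapped and any
    system prompt containing the marker replaced."""
    out = copy.deepcopy(body)  # never mutate the caller's dict
    if "model" in out:
        out["model"] = model_map.get(out["model"], out["model"])
    system = out.get("system")
    if isinstance(system, str):
        if marker in system:
            out["system"] = replacement
    elif isinstance(system, list):
        for block in system:
            if (isinstance(block, dict) and block.get("type") == "text"
                    and marker in block.get("text", "")):
                block["text"] = replacement
    return out
```

Keeping the rewrite as a pure function makes it easy to test on captured request bodies before wiring it into whatever proxy server you use.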

tyre · 2026-02-21T03:35:05 1771644905

have you tried asking CC to build something that does it? I'm guessing it could.

g-mork · 2026-02-20T23:22:01 1771629721

I do like the idea of an aftermarket of ancient LLM chips that still have tons of useful life on text processing tasks etc. They don't talk about their architecture much, I wonder how well power can scale down. 200W for such a small model is not something I see happening in a laptop any time soon. Pretty hilarious implications for moat-building of the big providers too.

gen220 · 2026-02-21T02:38:31 1771641511

Yea I mean this is the first publishable draft of a startup cooking on this.

I'm confident there are at least 1-2 OOMs of improvement to come here in terms of the (intelligence : wattage) ratio.

I really thought we were going to need to see a couple of dramatic OOM-improvement changes to the model composition / software layer, in order to get models of Opus 3.7's capability running on our laptops.

This release tells me that eventual breakthrough won't even be strictly necessary, imo.

g-mork · 2026-02-21T03:30:54 1771644654

The way I imagine it in 2-4 years we're going to be hit with a triple glut of better architecture, massive oversupply of hardware and potentially one or two hardware efforts like this really taking off. It's pretty crazy we're already 4 years in and outside of very niche / low availability solutions, it's still either GPU or bust

gen220 · 2026-02-23T16:04:40 1771862680

That's interesting! How do you see "oversupply of hardware" playing out?

Is it because we stop doing ~2024-style, large-scale training (marginal returns aren't worth it)? Or because supply way outpaces the training+inference demand?

AFAIU if the trend lines / S-curves keep chugging along as they are, we won't hit hardware oversupply for a long, long time without some sort of AI training winter.

g-mork · 2026-02-20T18:02:22 1771610542

One of these things, however old, coupled with robust tool calling is a chip that could remain useful for decades. Baking in incremental updates of world knowledge isn't all that useful. It's kinda horrifying if you think about it, this chip among other things contains knowledge of Donald Trump encoded in silicon. I think this is a way cooler legacy for Melania than the movie haha.

g-mork · 2026-02-20T17:36:41 1771609001

The irony with Zitron is that summarising his astoundingly verbose anti-AI articles is one of the most consistently productive uses I've had for AI