Less expensive how? GPUs are used because they are more efficient. You CAN run matmul on CPUs for sure, but it's going to be much slower and take a ton more electricity. So to claim it's "less expensive" is weird.
In situations where you have spare CPU power but no spare GPU power, because your GPU(s) and VRAM are already busy with other tasks, you might prefer to use what you have rather than pay for an upgrade (even if that means the task will run more slowly).
If you are wanting to run this on a server to pipe the generated speech to a remote user (live, or generating it to send at some other appropriate moment) and your server resources don't have GPUs, then you either have to change your infrastructure, use CPU, or not bother.
Renting GPU access on cloud systems can be more expensive than CPU, especially if you only need GPU processing for specific, occasional tasks. Spinning up a VM to serve a request and then tearing it down is rarely as quick as cloud providers like to suggest in their advertising, so you end up keeping things alive longer than strictly needed, which means the quoted spot-pricing rates are lower than what you end up paying.
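To put the last point in numbers, here is a minimal sketch of the effective rate when a VM is billed for longer than it does useful work. The function name and the figures in the usage note are made up for illustration, not real cloud prices.

```python
def effective_rate(quoted_rate_per_hour: float,
                   useful_seconds: float,
                   billed_seconds: float) -> float:
    """Cost per hour of *useful* work for a VM billed for billed_seconds.

    All inputs are hypothetical; plug in your own provider's numbers.
    """
    if useful_seconds <= 0:
        raise ValueError("useful_seconds must be positive")
    if billed_seconds < useful_seconds:
        raise ValueError("billed_seconds cannot be less than useful_seconds")
    # Spin-up, idle, and teardown time is paid for but produces nothing,
    # so the quoted rate is scaled by billed / useful time.
    return quoted_rate_per_hour * billed_seconds / useful_seconds
```

For example, a VM quoted at 1.00 per hour that works for 60 seconds but is billed for 300 seconds costs `effective_rate(1.00, 60, 300)`, i.e. five times the quoted rate per useful hour.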
This is far too simplistic, you can't discuss perf per watt unless you're talking about a job running at any decent level of utilisation. Numbers like that only matter for larger scale high utilisation services, meanwhile Intel boxes mastered the art of power efficient idle modes decades ago while almost any contemporary GPU still isn't even remotely close, and you can pick up 32 core boxes like that for pennies on the dollar.
Even if utilisation weren't a metric, "efficient" can be interpreted in so many ways as to be pointless to try and apply in the general case. I consider any model I can foist into a Lambda function "efficient" because of secondary concerns you simply cannot meaningfully address with GPU hardware at present (elasticity and manageability for example). That it burns more energy per unit output is almost meaningless to consider for any kind of workload where Lambda would be applicable.
It's the same for any edge-deployed software where "does it run on CPU?" translates to "does the general purpose user have a snowball's chance in hell of running it?". Having to depend on 4GB of CUDA libraries to run a utility fundamentally changes the nature and applicability of any piece of software.
A few years ago we had smaller cuts of Whisper running at something like 0.5x realtime on CPU, and people struggled along anyway. Now we have Nvidia's speech model family comfortably exceeding 2x real time on older processors with far improved word error rate. Which would you prefer to deploy to an edge device? Which improves the total number of addressable users? Turns out we never needed GPUs for this problem in the first place; the model architecture mattered all along, as did the question, "does it run on CPU?".
It's not even clear cut when discussing raw achievable performance. With a CPU-friendly speech model living in a Lambda, no GPU configuration will come close to the achievable peak throughput for the same level of investment. Got a year-long audio recording to process once a year? Slice it up and Lambda will happily chew through it at 500 or 1000x real time.
GPUs are a near monopoly. There are at least a handful of big players in the CPU space. Competition alone makes the latter space a lot cheaper.
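A rough back-of-the-envelope sketch of that year-long recording example. It assumes the 500x or 1000x figure is aggregate throughput across many parallel invocations, and that each worker manages roughly the 2x real time mentioned above; both function names and that per-worker figure are assumptions for illustration.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # one non-leap year of audio


def wall_clock_hours(audio_seconds: float, aggregate_speedup: float) -> float:
    """Hours needed to process the audio at a given multiple of real time."""
    return audio_seconds / aggregate_speedup / 3600


def workers_needed(aggregate_speedup: float, per_worker_speedup: float) -> int:
    """Concurrent workers required to reach the aggregate speedup."""
    return math.ceil(aggregate_speedup / per_worker_speedup)


if __name__ == "__main__":
    for speedup in (500, 1000):
        hours = wall_clock_hours(SECONDS_PER_YEAR, speedup)
        workers = workers_needed(speedup, per_worker_speedup=2.0)
        print(f"{speedup}x real time: {hours:.2f} h wall clock, "
              f"about {workers} concurrent workers at 2x each")
```

Under those assumptions, a year of audio takes under a day of wall clock at 500x and under nine hours at 1000x, using a few hundred concurrent workers.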
Also, for inference (and not training) there are other ways to efficiently do matmuls besides the GPU. You might want to look up Apple's undocumented AMX CPU ISA, and also this thing that vendors call the "Neural Engine" in their marketing (capabilities and the term's specific meaning vary broadly from vendor to vendor).
For small 1-3B parameter transformers like TADA, both these options are much more energy efficient than GPU inference.
If tomorrow Claude pricing changes and they start charging real API costs like 2000+ USD, and there is another service, "NotReallyClaude", that is a bit less good but costs 200 USD, then what would you do?
The massive DC overbuild matches demand, prices normalise somewhat in 3-5 years.
The massive DC overbuild does not match demand, prices tank in 3-5 years.
Third possibility: some approach like Taalas renders the current storyline meaningless. Would put 3 in 10 odds of this happening but I'd looove to see it.
Fourth: entire planet gets profoundly sick of emdashes, we all move back into caves and live in eternal gratitude of the moment humanity woke up to how little all of this really matters.
set Anthropic base URL in CC to your proxy server and map each model to your preferred models (I keep opus↔opus but technically you can do opus↔gpt-5.3, etc.). then check the incoming messages for the string that triggers compaction (it's a system prompt btw) and modify that message before it hits the LLM server.
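A minimal sketch of the rewrite step such a proxy might perform, shown as a pure function over the parsed JSON request body. It assumes the body carries a top-level `model` field and a `system` field that is either a string or a list of text blocks. The model names, the trigger string, and the replacement text below are placeholders, not the real values, and the HTTP forwarding part is left out.

```python
import copy

# Placeholder values for illustration only; substitute your own.
MODEL_MAP = {"incoming-opus-name": "upstream-model-name"}
COMPACTION_TRIGGER = "<string that identifies the compaction prompt>"
REPLACEMENT_PROMPT = "<your own compaction instructions>"


def rewrite_request(body: dict, model_map: dict,
                    trigger: str, replacement: str) -> dict:
    """Return a copy of the request with the model remapped and any
    system prompt containing `trigger` swapped for `replacement`."""
    out = copy.deepcopy(body)  # never mutate the caller's request

    # Map the requested model to the preferred upstream model, if listed.
    if "model" in out:
        out["model"] = model_map.get(out["model"], out["model"])

    system = out.get("system")
    if isinstance(system, str):
        if trigger in system:
            out["system"] = replacement
    elif isinstance(system, list):
        for block in system:
            if (isinstance(block, dict) and block.get("type") == "text"
                    and trigger in block.get("text", "")):
                block["text"] = replacement
    return out
```

The proxy would call `rewrite_request` on each incoming body before forwarding it upstream. Replacing the whole matching prompt is just one choice; you could equally edit it in place.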
I do like the idea of an aftermarket of ancient LLM chips that still have tons of useful life on text processing tasks etc. They don't talk about their architecture much, I wonder how well power can scale down. 200W for such a small model is not something I see happening in a laptop any time soon. Pretty hilarious implications for moat-building of the big providers too.
Yea I mean this is the first publishable draft of a startup cooking on this.
I'm confident there are at least 1-2 OOMs of improvement to come here in terms of the (intelligence : wattage) ratio.
I really thought we were going to need to see a couple of dramatic OOM-improvement changes to the model composition / software layer, in order to get models of Opus 3.7's capability running on our laptops.
This release tells me that eventual breakthrough won't even be strictly necessary, imo.
The way I imagine it, in 2-4 years we're going to be hit with a triple glut of better architecture, massive oversupply of hardware and potentially one or two hardware efforts like this really taking off. It's pretty crazy we're already 4 years in and, outside of very niche / low availability solutions, it's still either GPU or bust.
That's interesting! How do you see "oversupply of hardware" playing out?
Is it because we stop doing ~2024-style, large-scale training (marginal returns aren't worth it)? Or because supply way outpaces the training+inference demand?
AFAIU if the trend lines / S-curves keep chugging along as they are, we won't hit hardware oversupply for a long, long time without some sort of AI training winter.
One of these things, however old, coupled with robust tool calling is a chip that could remain useful for decades. Baking in incremental updates of world knowledge isn't all that useful. It's kinda horrifying if you think about it, this chip among other things contains knowledge of Donald Trump encoded in silicon. I think this is a way cooler legacy for Melania than the movie haha.