dist-epoch's comments | Hacker News

Because that's where the compute happens, in those "verbose" tokens. A transformer has a fixed size; it can only do so many math operations in one pass. If your problem is hard, you need more passes.

Asking it to be shorter is like doing fewer iterations of a numerical integration algorithm.
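The analogy can be made concrete with a toy integrator: a fixed-size step applied more times gives a better answer, just as a fixed-size transformer applied over more tokens gets more total compute. This is an illustration only, not a claim about transformer internals.

```python
import math

def trapezoid(f, a, b, n):
    """Approximate the integral of f over [a, b] with n trapezoids."""
    h = (b - a) / n
    total = (f(a) + f(b)) / 2
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

exact = 2.0  # integral of sin(x) over [0, pi]
for n in (4, 64, 1024):
    approx = trapezoid(math.sin, 0.0, math.pi, n)
    print(n, abs(exact - approx))  # error shrinks as n grows
```

Each trapezoid is the same cheap operation; only the count changes. Cutting n to make the output "shorter" directly cuts accuracy.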


Yeah, but all the models available in ChatGPT have reasoning (IIRC) - they could use reasoning tokens to do the 'compute' and still show the user a succinct response that directly answers the query.

OpenClaw Peter is using Codex to analyze and de-duplicate PRs, extract the good ideas from them, and then re-implement them.

> I spun up 50 codex in parallel, let them analyze the PR and generate a JSON report with various signals, comparing with vision, intent (much higher signal than any of the text), risk and various other signals. Then I can ingest all reports into one session and run AI queries/de-dupe/auto-close/merge as needed on it.

Some people bitch, others are real engineers solving novel problems.

https://x.com/steipete/status/2025591780595429385?s=20
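A minimal sketch of the ingest/de-dupe step described above, assuming a hypothetical per-PR report schema with "pr" and "intent" fields - steipete's actual signals and schema are not shown in the tweet:

```python
import json
from collections import defaultdict
from pathlib import Path

def load_reports(report_dir):
    """Read every per-PR JSON report produced by the analysis agents."""
    return [json.loads(p.read_text()) for p in Path(report_dir).glob("*.json")]

def group_by_intent(reports):
    """Bucket PRs that declare the same intent, so duplicates surface."""
    groups = defaultdict(list)
    for r in reports:
        groups[r["intent"].strip().lower()].append(r["pr"])
    return {intent: prs for intent, prs in groups.items() if len(prs) > 1}
```

Grouping on a normalized "intent" string is a stand-in for whatever similarity signal the real workflow uses; the point is that once each PR is reduced to a structured report, de-duplication is an ordinary data-processing problem.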


I know someone who started making a game by building his own engine. Five years later he had half an engine and zero games made with it.

Most of the people I know who are into herding AI spend most of their time doing just that, but I can't say I've seen them accomplish much more than other colleagues, even the ones just using built-in AI or copy-pasting code from an AI chat.


> Some people bitch, others are real engineers solving novel problems

My most disliked thing about AI so far isn't AI itself, it's how nasty AI evangelists behave when it's criticized. You don't have to attack and/or insult people; you could have just left out that last bit.


You're mistaking a reaction to trolling for an inability to handle criticism of AI.

It's funny seeing programmers' minds shut down when faced with an easy-to-fix problem (too many PRs) just because they hate AI.


That's an incredibly dismissive attitude to a real problem.

> Same kit I bought 8 months ago for 90 EUR is now 400+. That's not normal market dynamics.

That's exactly normal market dynamics during an acute shortage. Remember 2020, when filtering face masks went up in price 10-100x?


You could be charitable and say the bus is narrow because it has to travel a long distance, which makes it hard to route a lot of traces.

It's not. It's narrow even between the CPU and RAM. That's just the way x86 is designed. Nvidia and AMD by contrast have the luxury of being able to rearchitect their single-board computers each generation as long as they honor the PCIe interface.

It is also true that having a 384-bit memory bus shared with the video card would necessitate a redesigned PCIe slot as well as an outrageous number of traces on the motherboard, though.


Traditionally, the width of GPU memory interfaces was many times greater than that of CPU memory interfaces.

However, the maximum width in consumer GPUs, 1024-bit, was reached many years ago.

Since then the width of the memory interfaces in consumer GPUs has been decreasing continuously, and this decrease has been only partially compensated by higher memory clock frequencies. This reduction has been driven by NVIDIA, in order to increase their profit margins by reducing the memory cost.

Nowadays, most GPU owners must be content with a memory interface no wider than 192-bit, as in the RTX 5070, which is only 50% wider than a desktop CPU's and much narrower than a workstation or server CPU's.

The reason why main-memory access from a GPU is slow has nothing to do with the width of the CPU memory interface. It is caused by the fact that the GPU reaches main memory through PCIe, so it is limited by the throughput of at most 16 PCIe lanes, which is much lower than that of either the GPU memory interface or the CPU memory interface.
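Rough peak numbers behind that claim - theoretical, per direction, ignoring PCIe encoding and protocol overhead:

```python
# Peak rates, back of the envelope:
pcie5_x16 = 16 * 32 / 8        # 16 lanes x 32 GT/s x 1 bit/lane -> 64 GB/s
cpu_ddr5  = 128 / 8 * 6.0      # 128-bit DDR5-6000 desktop CPU   -> 96 GB/s
gpu_gddr7 = 192 / 8 * 28.0     # 192-bit GDDR7 (RTX 5070 class)  -> 672 GB/s
print(pcie5_x16, cpu_ddr5, gpu_gddr7)
```

Even on PCIe 5.0, the x16 link is the narrowest pipe in the chain, well below either memory interface it sits between.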


ThreadRipper has 8 memory channels versus 2 for a desktop AMD CPU. It's not an x86 limitation.

"x86" as in the computer architecture, not the ISA. Why do you think they put extra channels instead of just having a single 512-bit bus?

The memory interface of CPUs is made wider by adding more channels because there are no memory modules with a 512-bit interface. Thus you must add multiples of the module width to the CPU memory interface.

This has nothing to do with x86, but it is determined by the JEDEC standards for DRAM packages and DRAM modules. The ARM server CPUs use the same number of memory channels, because they must use the same memory modules.

A standard DDR5 memory module has an interface width of 64, 72, or 80 bits, depending on how many extra bits are available for ECC. The interface of a module is partitioned into 2 channels, to allow concurrent accesses at different memory addresses. Although current memory channels are therefore 32/36/40 bits wide, few people are aware of this, so by "memory channel" most people mean 64 bits (72 bits with ECC), because that was the channel width in older memory generations.

Not counting ECC bits, most desktop and laptop CPUs have a 128-bit memory interface, some cheaper server and workstation CPUs have a 256-bit memory interface, many server CPUs and some workstation CPUs have a 512-bit memory interface, while the state-of-the-art server CPUs have a 768-bit memory interface.
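Those widths line up with whole DDR5 modules; a sketch of the arithmetic, assuming 64-bit non-ECC modules each split into two sub-channels:

```python
# CPU interface width must be a multiple of the 64-bit module width.
module_bits = 64
widths = {}
for modules in (2, 4, 8, 12):
    widths[modules * module_bits] = modules * 2  # DDR5 sub-channels, 32-bit each
print(widths)  # {128: 4, 256: 8, 512: 16, 768: 24}
```

This is why interfaces grow in 64-bit steps (128, 256, 512, 768) rather than arbitrary widths: the JEDEC module is the quantum.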

For comparison, RTX 5070 has a 192-bit memory interface, RTX 5080 has a 256-bit memory interface and RTX 5090 has a 512-bit memory interface. However, the GDDR7 memory has a transfer rate that is 4 to 5 times higher than DDR5, which makes the GPU interfaces faster, despite their similar or even lower widths.
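Back-of-envelope peak bandwidths for that comparison, using representative (not exact) GDDR7 transfer rates:

```python
def peak_gbps(width_bits, gtps):
    """Peak bandwidth in GB/s: bus width in bytes times transfers/second."""
    return width_bits / 8 * gtps

cpu_ddr5 = peak_gbps(128, 6.0)    # desktop CPU, DDR5-6000  ->   96 GB/s
rtx5070  = peak_gbps(192, 28.0)   # 192-bit GDDR7           ->  672 GB/s
rtx5090  = peak_gbps(512, 28.0)   # 512-bit GDDR7           -> 1792 GB/s
print(cpu_ddr5, rtx5070, rtx5090)
```

The 192-bit RTX 5070 bus is only 50% wider than the desktop CPU's 128-bit bus, but the ~4-5x higher transfer rate puts it far ahead in bandwidth.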


The best camera is the camera you have on you.

Smartphones have terrible camera ergonomics, yet they killed the compact dedicated camera.


You can't have it both ways. As a library author, you choose MIT to encourage commercial usage because companies are afraid of the GPL, but then you complain that companies are actually using it in an MIT-licensed way without contributing back.

I can find it off-putting regardless, especially since I'm not the person who released it under the MIT license.

License it GPL, and it will be fed to a model as training data to recreate it copyright-free anyway.

Training falls outside of copyright concerns because of fair use, so proprietary or free is orthogonal. This is how the world is currently trending.

But you were lucky. You were in the right places at the right time, just didn't realize it.

This is lack of vision, not lack of luck.


That's not the real thinking, it's a super summarized view of it.

Musk said Grok 5 is currently being trained and that it has 7 trillion params (Grok 4 had 3 trillion).

My understanding is that all recent gains are from post training and no one (publicly) knows how much scaling pretraining will still help at this point.

Happy to learn more about this if anyone has more information.


I still remember Gemini 1.5 Ultra and GPT-4.5 as extremely strong in some areas that no benchmark captures. It was probably not economical to serve them on a $20 subscription, but they felt different and smarter in some ways. The benchmarks seem to be missing something, because Flash 3 was very close to 3 Pro on some benchmarks, but much, much dumber.

You gain more benefit spending compute on post-training than on pre-training.

But scaling pre-training is still worth it if you can afford it.


What a wild world, sending 50 emails costs money :)
