
https://openai.com/index/hello-gpt-4o/

I see evaluations comparing it with Claude, Gemini, and Llama there on the GPT-4o post.


“You are absolutely right, and I apologize for the confusion.”


Today I did some comparisons of GPT-5.1-Codex-Max (on high) in the Codex CLI versus Gemini 3 Pro in the Gemini CLI.

- As a general observation, Gemini is harder to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement that intention, and only then answer the question. In one case, it took me five rounds of rewriting my prompt in various ways before I could get it to just answer the question instead of writing code.

- Subjectively, it seemed to me that the code that Gemini wrote was more similar to code that I, as a senior-level developer, would have written than what I have been used to from recent iterations of GPT-5.1. The code seemed more readable-by-default and not merely technically correct. I was happy to see this.

- Gemini seems to have a tendency to put its "internal dialogue" into comments. For example, "// Here we will do X because of reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.". Very annoying.

I did two concrete head-to-head comparisons where both models had the same code and the same prompt.

First, both models were given a high-level overview of some new functionality that we needed and told to create a detailed plan for implementing it. Both models' plans were then reviewed by me and also by both models (in fresh conversations). All three of us agreed that Codex's plan was better. In particular, Codex's plan was more comprehensive and showed a better understanding of how to integrate the new functionality naturally into the existing code.

Then (in fresh conversations), both models were told to implement that plan. Afterwards, again, all three of us compared the resulting solutions. And, again, all three of us agreed that Codex's implementation was better.

Notably, Gemini (1) hallucinated database column names, (2) ignored parts of the functionality that the plan called for, and (3) did not produce code that was integrated as well with the existing codebase. In its favor, it did produce a better version of a particular finance-related calculation function than Codex did.

Overall, Codex was the clear winner today. Hallucinations and ignored requirements are big problems that are very annoying to deal with when they happen. Additionally, Gemini's tendencies to include odd comments and to jump past the discussion phase of projects both make it more frustrating to work with, at this stage.


Try checking your temperature setting in any tool using Gemini.

"For Gemini 3, we strongly recommend keeping the temperature parameter at its default value of 1.0.While previous models often benefited from tuning temperature to control creativity versus determinism, Gemini 3's reasoning capabilities are optimized for the default setting. Changing the temperature (setting it below 1.0) may lead to unexpected behavior, such as looping or degraded performance, particularly in complex mathematical or reasoning tasks."

https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high
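
For what it's worth, in the google-genai Python SDK the temperature is set via GenerateContentConfig; here is a minimal sketch that leaves it at the recommended default (the model name is only an example, and it assumes your API key is in the environment):

    # Minimal sketch using the google-genai Python SDK (pip install google-genai).
    # The model name is only an example; check the docs for current names.
    from google import genai
    from google.genai import types

    client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # example model name
        contents="Explain what this function does: ...",
        config=types.GenerateContentConfig(
            temperature=1.0,  # keep the default; lower values can cause looping
        ),
    )
    print(response.text)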


Anthropic doesn't even allow temperature changes when you turn thinking on.


This tells you all you need to know about benchmarks:

Didn't Google proudly tout their Gemini 3 as beating everything under the sun in every benchmark imaginable, by a wide margin?


> - As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.

This has been an annoying Gemini feature since the beginning. I ask it to evaluate, check or analyse something, tab away and come back to it rewriting half the fucking codebase.

Please Google, use a percentage of your billions and add a "plan" mode to Gemini-cli, just like Claude has, and I'd use your stuff a lot more often. The 1M context is excellent for large-scale reviews, but its tendency to start writing code on its own is a pain in my ass.


Yeah, I can't get Gemini to stop and think. Even if I tell it not to write code, it will rewrite the code block each time.


Ok, so this post is a joke of some kind (there was no 1989 version of Blue Prince).

But it raises an interesting question: would it have been possible to implement that upside down floppy disk puzzle in a game?

1. Was it even possible to insert floppy disks upside down? I lived through the floppy disk era in my childhood, but I have to admit I can't remember if the drives would even let you do this.

2. If the answer to #1 is yes, would there be any way of programmatically detecting the floppy-disk-was-inserted-the-wrong-way state?


There are in fact two-sided floppies! IIRC they behave a lot like the two sides of a cassette tape: the floppy drive only reads from one side at a time.

A fun fact in that regard: the game Karateka (an actual game for the Apple II) had an easter egg, where the team realized that their game entirely fit in the capacity of one side of a floppy, so they put a second copy of the game on the other side, but set up so that it would render upside-down.

I'd not be surprised if the inclusion of that detail in this post was directly inspired by Karateka.


The Apple II had a non-linear layout of video memory, so programmer Jordan Mechner used a layer of indirection where he had an array of pointers to rows of screen memory.

They realized that inverting the screen was as simple as inverting the row-pointer array. Then they managed to convince Broderbund to ship a double-sided floppy with that change in the software.
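
A loose sketch of that trick (in Python rather than the original 6502 assembly, and with a made-up stand-in for the real interleaved address table):

    # Loose sketch of the idea; the real thing was 6502 assembly and a
    # precomputed table matching the Apple II's interleaved hi-res layout.
    N_ROWS = 192          # Apple II hi-res screen height in rows
    BYTES_PER_ROW = 40    # 280 pixels at 7 pixels per byte

    def row_base_address(row: int) -> int:
        # Stand-in for the interleaved layout; pretend it's linear here.
        return 0x2000 + row * BYTES_PER_ROW

    # The indirection layer: all drawing code looks its target row up here.
    row_table = [row_base_address(r) for r in range(N_ROWS)]

    # Flipping the screen is just reversing the pointer table;
    # none of the drawing routines have to change.
    flipped_table = list(reversed(row_table))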


The Apple II used single-sided floppy drives, so it was possible to insert a disk upside down to store data on the other side.

If the other side contains other data it should be easy to detect the disk was inserted upside down just by reading it.
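
Conceptually the check is tiny; a hedged sketch (the sector-reading interface and the signature are made up for illustration):

    # Conceptual sketch only: read_sector and BOOT_SIGNATURE are hypothetical,
    # standing in for whatever low-level disk access a real game would have.
    BOOT_SIGNATURE = b"SIDE-A"  # made-up marker written to side A at mastering time

    def disk_is_upside_down(read_sector) -> bool:
        """Return True if the expected side-A marker is missing."""
        first_sector = read_sector(track=0, sector=0)
        return not first_sector.startswith(BOOT_SIGNATURE)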


Yes. Lots of software was double-sided. With a small hand punch you could make single-sided disks double-sided (even if the manufacturer said no-no).

https://www.webcommand.net/wp-content/uploads/2019/07/commod...


That’s true of 5.25” floppies. The newer, higher capacity, 3.5” floppies had both sides accessible without physically flipping, so all drives only supported inserting the disks in one orientation.

But the Apple II mainly used 5.25” floppies. So I’m not correcting you, just adding more context.


5.25 floppies also taste better.


Semi-related: one of the Zelda DS games required you to close the DS (so the top and bottom screens met), which moved a mark from the top screen to the bottom one. It was infuriating for me; I only figured it out after closing the DS in frustration. Not really something you can do with modern portables, but clever in retrospect.


1. No. For an obvious and good reason.


We're talking about 5.25-inch floppies. It was easy to insert those in any way imaginable, including several wrong ones ;)


Yep, my memory was bad.

In my defense, so were 5.25" floppies. Literally the worst.


1. (edited) Yes, but you couldn't run it.

1.a. ...unless you altered the shape of the floppy.

https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/5....


You absolutely could put in disks upside down.


I recall doing this on my BBC micro with 5.25" disks. In fact, some disks were deliberately designed for this, and had a 'notch' (which you would cover with some tape to make read-only) on both the left and right, so you could set the read-only state for each side individually.

The version of Elite that I played had the standard version on one side and, on the other, a version for the "BBC Master" (which had an extra 64 KiB of RAM) with more colours than the standard version.


If anyone knows of a steelman version of the "AGI is not possible" argument, I would be curious to read it. I also have trouble understanding what goes into that point of view.


If you genuinely want the strongest statement of it, read The Emperor's New Mind followed by Shadows of the Mind, both by Roger Penrose.

These books often get shallowly dismissed in terms that imply he made some elementary error in his reasoning, but that's not the case. The dispute is more about the assumptions on which his argument rests, which go beyond mathematical axioms and include statements about the nature of human perception of mathematical truth. That makes it a philosophical debate more than a mathematical one.

Personally, I strongly agree with the non-mathematical assumptions he makes, and am therefore persuaded by his argument. It leads to a very different way of thinking about many aspects of maths, physics and computing than the one I acquired by default from my schooling. It's a perspective that I've become increasingly convinced by over the 30+ years since I first read his books, and one that I think acquires greater urgency as computing becomes an ever larger part of our lives.


Can you critique my understanding of his argument?

1. Any formal mathematical system (including computers) has true statements that cannot be proven within that system.

2. Humans can see the truth of some such unprovable statements.

Which is basically Gödel's Incompleteness Theorem. https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...
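
For reference, a compact (and deliberately informal) way to put point 1 in symbols:

    % Informal paraphrase of Gödel's first incompleteness theorem:
    % for any consistent, effectively axiomatized theory $T$ that can
    % express basic arithmetic, there is a sentence $G_T$ with
    \[
      T \nvdash G_T
      \quad\text{and}\quad
      T \nvdash \neg G_T,
    \]
    % even though $G_T$ (which in effect says "I am not provable in $T$")
    % is true in the standard model of arithmetic.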

Maybe a more ELI5 version:

1. Computers follow set rules

2. Humans can create rules outside the system of rules that they follow

Is number 2 an accurate portrayal? It seems rather suspicious. It seems more likely that we just haven't been able to fully express the rules under which humans operate.


Notably, those true statements can be proven in a higher-level mathematical system. So why wouldn't we say that humans are likewise operating within a certain system ourselves, and likewise have true statements that we can't prove? We just wouldn't be aware of them.


>likewise we have true statements that we can’t prove

Yes, and "can't" as in it is absolutely impossible. Not that we simple haven't been able to due to information or tech constraints.

Which is an interesting implication. That there are (or may be) things that are true which cannot be proved. I guess it kinda defies an instinct I have that at least in theory, everything that is true is provable.


That's too brief to capture it, and I'm not going to try to summarise(*). The books are well worth a read regardless of whether you agree with Penrose. (The Emperor's New Mind is a lovely, wide-ranging book on many topics, but Shadows of the Mind is only worth it if you want to go into extreme detail on the AI argument and its counterarguments.)

* I will mention though that "some" should be "all" in 2, but that doesn't make it a correct statement of the argument.


Is it too brief to capture it? Here is a one sentence statement I found from one of his slides:

>Turing’s version of Gödel’s theorem tells us that, for any set of mechanical theorem-proving rules R, we can construct a mathematical statement G(R) which, if we believe in the validity of R, we must accept as true; yet G(R) cannot be proved using R alone.

I have no doubt the books are good but the original comment asked about steelmanning the claim that AGI is impossible. It would be useful to share the argument that you are referencing so that we can talk about it.


That's a summary of Godel's theorem, which nobody disputes, not of Penrose's argument that it implies computers cannot emulate human intelligence.

I'm really not trying to evade further discussion. I just don't think I can sum that argument up. It starts with basically "we can perceive the truth not only of any particular Godel statement, but of all Godel statements, in the abstract, so we can't be algorithms because an algorithm can't do that" but it doesn't stop there. The obvious immediate response is to say "what if we don't really perceive its truth but just fool ourselves into thinking we do?" or "what if we do perceive it but we pay for it by also wrongly perceiving many mathematical falsehoods to be true?". Penrose explored these in detail in the original book and then wrote an entire second book devoted solely to discussing every such objection he was aware of. That is the meat of Penrose' argument and it's mostly about how humans perceive mathematical truth, argued from the point of view of a mathematician. I don't even know where to start with summarising it.

For my part, with a vastly smaller mind than his, I think the counterarguments are valid, as are his counter-counterarguments, and the whole thing isn't properly decided and probably won't be for a very long time, if ever. The intellectually neutral position is to accept it as undecided. To "pick a side" as I have done is on some level a leap of faith. That's as true of those taking the view that the human mind is fundamentally algorithmic as it is of me. I don't dispute that their position is internally consistent and could turn out to be correct, but I do find it annoying when they try to say that my view isn't internally consistent and can never be correct. At that point they are denying the leap of faith they are making, and from my point of view their leap of faith is preventing them seeing a beautiful, consistent and human-centric interpretation of our relationship to computers.

I am aware that despite being solidly atheist, this belief (and I acknowledge it as such) of mine puts me in a similar position to those arguing in favour of the supernatural, and I don't really mind the comparison. To be clear, neither Penrose nor I am arguing that anything is beyond nature, rather that nature is beyond computers, but there are analogies and I probably have more sympathy with religious thinkers (while rejecting almost all of their concrete assertions about how the universe works) than most atheists. In short, I do think there is a purely unique and inherently uncopyable aspect to every human mind that is not of the same discrete, finite, perfectly cloneable nature as digital information. You could call it a soul, but I don't think it has anything to do with any supernatural entity, I don't think it's immortal (anything but), I don't think it is separate from the body or in any sense "non-physical", and I think the question of where it "goes to" when we die is meaningless.

I realise I've gone well beyond Penrose' argument and rambled about my own beliefs, apologies for that. As I say, I struggle to summarise this stuff.


Thank you for taking the time to clarify. Lots to chew on here.


Gonna grab those, thanks for the recommendation.

If you are interested in the opposite point of view, I can really recommend "Vehicles: Experiments in Synthetic Psychology" by V. Braitenberg.

Basically builds up to "consciousness as emergent property" in small steps.


Thanks, I will have a read of that. The strongest I've seen before on the opposing view to Penrose was Daniel Dennett.


Dennett, Darwin's Dangerous Idea, p. 448

... No wonder Penrose has his doubts about the algorithmic nature of natural selection. If it were, truly, just an algorithmic process at all levels, all its products should be algorithmic as well. So far as I can see, this isn't an inescapable formal contradiction; Penrose could just shrug and propose that the universe contains these basic nuggets of nonalgorithmic power, not themselves created by natural selection in any of its guises, but incorporatable by algorithmic devices as found objects whenever they are encountered (like the oracles on the toadstools). Those would be truly nonreducible skyhooks.

Skyhook is Dennett's term for an appeal to the supernatural.


Braitenberg emphasised the importance of analog circuits though.


To be honest, the core of Penrose's idea is pretty stupid: that we can understand mathematics despite the incompleteness theorem being a thing, therefore our brains must use quantum effects that allow us to understand it. Instead of just saying, you know, that we use a heuristic and just guess that it's true. I'm pretty sure a classical system can do that.


I'm sure if you email him explaining how stupid he is he'll send you his Nobel prize.

Less flippantly, Penrose has always been extremely clear about which things he's sure of, such as that human intelligence involves processes that algorithms cannot emulate, and which things he puts forward as speculative ideas that might help answer the questions he has raised. His ideas about quantum mechanical processes in the brain are very much on the speculative side, and after a career like his I think he has more than earned the right to explore those speculations.

It sounds like you probably would disagree with his assumptions about human perception of mathematical truth, and it's perfectly valid to do so. Nothing about your comment suggests you've made any attempt to understand them, though.


I want to ignore the flame fest developing here. But, in case you are interested in hearing a doubter's perspective, I'll try to express one view. I am not an expert on Penrose's ideas, but see this as a common feature in how others try to sell his work.

Starting with "things he's sure of, such as that human intelligence involves processes that algorithms cannot emulate" as a premise makes the whole thing an exercise in Begging the Question when you try to apply it to explain why an AI won't work.


"That human intelligence involves processes that algorithms cannot emulate" is the conclusion of his argument. The premise could be summed up as something like "humans have complete, correct perception of mathematical truth", although there is a lot of discussion of in what sense it is "complete" and "correct" as, of course, he isn't arguing that any mathematician is omniscient or incapable of making a mistake.

Linking those two is really the contribution of the argument. You can reject both or accept both (as I've said elsewhere I don't think it's conclusively decided, though I know which way my preferences lie), but you can't accept the premise and reject the conclusion.


Hmm, I am less than certain this isn't still begging the question, just with different phrasing. I.e. I see how they are "linked" to the point they seem almost tautologically the same rather than a deductive sequence.


You realise that this isn’t even a reply so much as a series of insults dressed up in formal language?

Yes, of course you do.


It wasn't intended as an insult and I apologise if it comes across as such. It's easy to say things on the internet that we wouldn't say in person.

It did come from a place of annoyance, after your middlebrow dismissal of Penrose' argument as "stupid".


And you do it again, you apologise while insulting me. When challenged you refuse to defend the points you brought up, so that you can pretend to be right rather than be proved wrong. Incompleteness theorem is where the idea came from, but you don’t want to discuss that, you just want to drop the name, condescend to people and run away.


Here are the substantive things you've said so far (i.e. the bits that aren't calling things "stupid" and taking umbrage at imagined slights):

1. You think that instead of actually perceiving mathematical truth we use heuristics and "just guess that it's true". This, as I've already said, is a valid viewpoint. You disagree with one of Penrose' assumptions. I don't think you're right but there is certainly no hard proof available that you're not. It's something that (for now, at least) it's possible to agree to disagree on, which is why, as I said, this is a philosophical debate more than a mathematical one.

2. You strongly imply that Penrose simply didn't think of this objection. This is categorically false. He discusses it at great length in both books. (I mentioned such shallow dismissals, assuming some obvious oversight on his part, in my original comment.)

3 (In your latest reply). You think that Godel's incompleteness theorem is "where the idea came from". This is obviously true. Penrose' argument is absolutely based on Godel's theorem.

4. You think that somehow I don't agree with point 3. I have no idea where you got that idea from.

That, as far as I can see, is it. There isn't any substantive point made that I haven't already responded to in my previous replies, and I think it's now rather too late to add any and expect any sort of response.

As for communication style, you seem to think that writing in a formal tone, which I find necessary when I want to convey information clearly, is condescending and insulting, whereas dismissing things you disagree with as "stupid" on the flimsiest possible basis (and inferring dishonest motives on the part of the person you're discussing all this with) is, presumably, fine. This is another point on which we will have to agree to disagree.


The dismissal is on point.

The whole category of ideas of "Magic Fairy Dust is required for intelligence, and thus, a computer can never be intelligent" is extremely unsound. It should, by now, just get thrown out into the garbage bin, where it rightfully belongs.


In what way is it unsound?

To be clear, any claim that we have mathematical proof that something beyond algorithms is required is unsound, because the argument is not mathematical. It rests on assumptions about human perception of mathematical truth that may or may not be correct. So if that's the point you're making I don't dispute it, although to say an internally consistent alternative viewpoint should be "thrown out into the garbage" on that basis is unwarranted. The objection is just that it doesn't have the status of a mathematical theorem, not that it is necessarily wrong.

If, on the other hand, you think that it is impossible for anything more than algorithms to be required, that the idea that the human mind must be equivalent to an algorithm is itself mathematically proven, then you are simply wrong. Any claim that the human mind has to be an algorithm rests on exactly the same kind of validly challengeable philosophical assumptions (specifically the physical Church-Turing thesis) that Penrose' argument does.

Given two competing, internally consistent world-views that have not yet been conclusively separated by evidence, the debate about which is more likely to be true is not one where either "side" can claim absolute victory in the way that so many people seem to want to on this issue, and talk of tossing things in the garbage isn't going to persuade anybody that's leaning in a different direction.


It is unsound because not only does it demand the existence of a physical process that cannot be computed (so far, none found, and not for lack of searching), but it also demands that such a physical process would conveniently be found to be involved in the functioning of the human brain, and also that it would be vital enough that you can't just replace it with something amenable to computation at a negligible loss of function.

It needs too many unlikely convenient coincidences. The telltale sign of wishful thinking.

At the same time: we have a mounting pile of functions that were once considered "exclusive to human mind" and are now implemented in modern AIs. So the case for "human brain must be doing something Truly Magical" is growing weaker and weaker with each passing day.


This is the usual blurring of lines you see in dismissals of Penrose. You call the argument "unsound" as if it contains some hard error of logic and can be dismissed as a result, but what you state are objections to the assumptions (not the reasoning) based on your qualitative evaluation of various pieces of evidence, none of which are conclusive.

There's nothing wrong with seeing the evidence and reaching your own conclusions, but I see exactly the same evidence and reach very different ones, as we interpret and weight it very differently. On the "existence of a physical process that cannot be computed", I know enough of physics (I have a degree in it, and a couple of decades of continued learning since) to know how little we know. I don't find any argument that boils down to "it isn't among the things we've figured out therefore it doesn't exist" remotely persuasive. On the achievements of AI, I see no evidence of human-like mathematical reasoning in LLMs and don't expect to, IMO demos and excitable tweets notwithstanding. My goalpost there, and it has never moved and never will, is independent, valuable contributions to frontier research maths - and lots of them! I want the crank-the-handle-and-important-new-theorems-come-out machine that people have been trying to build since computers were invented. I expect a machine implementation of human-like mathematical thought to result in that, and I see no sign of it on the horizon. If it appears, I'll change my tune.

I acknowledge that others have different views on these issues and that however strongly I feel I have the right of it, I could still turn out to be wrong. I would enjoy some proper discussion of the relative merits of these positions, but it's not a promising start to talk about throwing things in the garbage right at the outset or, like the person earlier in this thread, call the opposing viewpoint "stupid".


There is no "hard error of logic" in saying "humans were created by God" either. There's just no evidence pointing towards it, and an ever-mounting pile of evidence pointing otherwise.

Now, what does compel someone to go against a pile of evidence this large and prop up an unsupported hypothesis that goes against it not just as "a remote and unlikely possibility, to be revisited if any evidence supporting it emerges", but as THE truth?

Sheer wishful thinking. Humans are stupid dumb fucks.

Most humans have never "contributed to frontier research maths" in their entire lives either. I sure didn't, I'm a dumb fuck myself. If you set the bar of "human level intelligence" at that, then most of humankind is unthinking cattle.

"Advanced mathematical reasoning" is a highly specific skill that most humans wouldn't learn in their entire lives. Is it really a surprise that LLMs have a hard time learning it too? They are further along it than I am already.


I don't know if we're even able to continue with the thread this old, but this is fun so I'll try to respond.

You're correct to point out that defending my viewpoint as merely internally consistent puts me in a position analogous to theists, and I volunteered as much elsewhere in this thread. However, the situation isn't really the same since theists tend to make wildly internally inconsistent claims, and claims that have been directly falsified. When theists reduce their ideas to a core that is internally consistent and has not been falsified they tend to end up either with something that requires surrendering any attempt at establishing the truth of anything ourselves and letting someone else merely tell us what is and is not true (I have very little time for such views), or with something that doesn't look like religion as typically practised at all (and which I have a certain amount of sympathy for).

As far as our debate is concerned, I think we've agreed that it is about being persuaded by evidence rather than considering one view to to have been proven or disproven in a mathematical sense. You could consider it mere semantics, but you used the word "unsound" and that word has a particular meaning to me. It was worth establishing that you weren't using it that way.

When it comes to the evidence, as I said I interpret and weight it differently than you. Merely asserting that the evidence is overwhelmingly against me is not an effective form of debate, especially when it includes calling the other position "stupid" (as has happened twice now in this thread) and especially not when the phrase "dumb fuck" is employed. I know I come across as comically formal when writing about this stuff, but I'm trying to be precise and to honestly acknowledge which parts of my world view I feel I have the right to assert firmly and which parts are mere beliefs-on-the-basis-of-evidence-I-personally-find-persuasive. When I do that, it just tends to end up sounding formal. I don't often see the same degree of honesty among those I debate this with here, but that is likely to be a near-universal feature of HN rather than a failing of just the strong AI proponents here. At any rate "stupid dumb fucks" comes across as argument-by-ridicule to me. I don't think I've done anything to deserve it and it's certainly not likely to change my mind about anything.

You've raised one concrete point about the evidence, which I'll respond to: you've said that the ability to contribute to frontier research maths is possessed only by a tiny number of humans and that a "bar" of "human level" intelligence set there would exclude everyone else.

I don't consider research mathematicians to possess qualitatively different abilities to the rest of the population. They think in human ways, with human minds. I think the abilities that are special to human mathematicians relative to machine mathematicians are (qualitatively) the same abilities that are special to human lawyers, social workers or doctors relative to machine ones. What's special about the case of frontier maths, I claim, is that we can pin it down. We have an unambiguous way of determining whether the goal I decided to look for (decades ago) has actually been achieved. An important-new-theorem-machine would revolutionise maths overnight, and if and when one is produced (and it's a computer) I will have no choice but to change my entire world view.

For other human tasks, it's not so easy. Either the task can't be boiled down to text generation at all or we have no unambiguous way to set a criterion for what "human-like insight" putatively adds. Maths research is at a sweet spot: it can be viewed as pure text generation and the sort of insight I'm looking for is objectively verifiable there. The need for it to be research maths is not because I only consider research mathematicians to be intelligent, but because a ground-breaking new theorem (preferably a stream of them, each building on the last) is the only example I can think of where human-like insight would be absolutely required, and where the test can be done right now (and it is, and LLMs have failed it so far).

I dispute your "level" framing, BTW. I often see people with your viewpoint assuming that the road to recreating human intelligence will be incremental, and that there's some threshold at which success can be claimed. When debating with someone who sees the world as I do, assuming that model is begging the question. I see something qualitative that separates the mechanism of human minds from all computers, not a level of "something" beyond which I think things are worthy of being called intelligent. My research maths "goal" isn't an attempt to delineate a feat that would impress me in some way, while all lesser feats leave me cold. (I am already hugely impressed by LLMs.) My "goal" is rather an attempt to identify a practically-achievable piece of evidence that would be sufficient for me to change my world view. And that, if it ever happens, will be a massive personal upheaval, so strong evidence is needed - certainly stronger than "HN commenter thinks I'm a dumb fuck".


AI does not need to be conscious for it to harm us.


Isn't the question more whether it needs to be conscious to actually be intelligent?


My layman thought about that is that, with consciousness, the medium IS the consciousness -- the actual intelligence is in the tangible material of the "circuitry" of the brain. What we call consciousness is an emergent property of an unbelievably complex organ (that we will probably never fully understand or be able to precisely model). Any models that attempt to replicate those phenomena will be of lower fidelity and/or breadth than "true intelligence" (though intelligence is quite variable, of course)... But you get what I mean, right? Our software/hardware models will always be orders of magnitude less precise or exhaustive than what already happens organically in the brain of an intelligent life form. I don't think AGI is strictly impossible, but it will always be a subset or abstraction of "real"/natural intelligence.


I think it's also the case that you can't replicate something actually happening, by describing it.

Baseball stats aren't a baseball game. Baseball stats so detailed that they describe the position of every subatomic particle to the Planck scale during every instant of the game to arbitrarily complete resolution still aren't a baseball game. They're, like, a whole bunch of graphite smeared on a whole bunch of paper or whatever. A computer reading that recording and rendering it on a screen... still isn't a baseball game, at all, not even a little. Rendering it on a holodeck? Nope, 0% closer to actually being the thing, though it's representing it in ways we might find more useful or appealing.

We might find a way to create a conscious computer! Or at least an intelligent one! But I just don't see it in LLMs. We've made a very fancy baseball-stats presenter. That's not nothing, but it's not intelligence, and certainly not consciousness. It's not doing those things, at all.


I think you're tossing around words like "always" or "never" too lightly, with no justification behind them. Why do you think that no matter how much effort is spent, fully understanding the human brain will always be impossible? Always is a really long time. As long as we keep doing research to increasingly precisely model the universe around us, I don't see what would stop this from happening, even if it takes many centuries or millennia. Most people who argue this justify their point by asserting that there is some unprovable quality of the human brain which can't be modeled at all and can only be created in one way - which both lacks substance and seems arbitrary, since I don't think that this relationship provably exists for anything else that we do know about. It seems like a way to justify that humans and only humans are special.


This is how I (also as a layman) look at it as well.

AI right now is limited to trained neural networks, and while they function sort of like a brain, there is no neurogenesis. The trained neural network cannot grow, cannot expand on its own, and is constrained by the silicon it is running on.

I believe that true AGI will require hardware and models that are able to learn, grow and evolve organically. The next step required for that in my opinion is biocomputing.


The only thing I can come up with is that compressing several hundred million years of natural selection of animal nervous systems into another form, but optimised by gradient descent instead, just takes a lot of time.

Not that we can't get there by artificial means, but correctly simulating the environment interactions, the sequence of progression, and getting all the details right might take hundreds to thousands of years of compute, rather than on the order of a few months.

And it might be that you can get functionally close, but hit a dead end, and maybe hit several dead ends along the way, all of which are close but no cigar. Perhaps LLMs are one such dead end.


I don't disagree, but I think the evolution argument is a red herring. We didn't have to re-engineer horses from the ground up along evolutionary lines to get to much faster and more capable cars.


The evolution thing is kind of a red herring in that we probably don't have to artificially construct the process of evolution, though your reasoning isn't a good explanation for why the "evolution" reason is a red herring: Yeah, nature already established incomprehensibly complex organic systems in these life forms -- so we're benefiting from that. But the extent of our contribution is making some select animals mate with others. Hardly comparable to building our own replacement for some millennia of organic iteration/evolution. Luckily we probably don't actually need to do that to produce AGI.


Most arguments and discussions around AGI talk past each other about the definitions of what is wanted or expected, mostly because sentience, intelligence, consciousness are all unagreed upon definitions and therefore are undefined goals to build against.

Some people do expect AGI to be a faster horse; to be the next evolution of human intelligence that's similar to us in most respects but still "better" in some aspects. Others expect AGI to be the leap from horses to cars; the means to an end, a vehicle that takes us to new places faster, and in that case it doesn't need to resemble how we got to human intelligence at all.


True, but I think this reasoning is a category error: we were and are capable of rationally designing cars. We are not today doing the same thing with AI, we’re forced to optimize them instead. Yes, the structure that you optimize around is vitally important, but we’re still doing brute force rather than intelligent design at the end of the day. It’s not comparing like with like.


Even this is a weak idea. There's nothing that restricts the term 'AGI' to a replication of animal intelligence or consciousness.


> correctly simulating the environment interactions, the sequence of progression, getting the all the details right, might take hundreds to thousands of years of compute

Who says we have to do that? Just because something was originally produced by natural process X, that doesn't mean that exhaustively retracing our way through process X is the only way to get there.

Lab grown diamonds are a thing.


Who says that we don’t? The point is that the bounds on the question are completely unknown, and we operate on the assumption that the compute time is relatively short. Do we have any empirical basis for this? I think we do not.


The overwhelming majority of animal species never developed (what we would consider) language processing capabilities. So agi doesn't seem like something that evolution is particularly good at producing; more an emergent trait, eventually appearing in things designed simply to not die for long enough to reproduce...


Define "animal species", if you mean vertebrates, you might be surprised by the modern ethological literature. If you mean to exclude non-vertebrates ... you might be surprised by the ethological literature too.

If you just mean majority of spp, you'd be correct, simply because most are single celled. Though debate is possible when we talk about forms of chemical signalling.


Yeah, it's tricky to talk about in the span of a comment. I work on Things Involving Animals - animals provide an excellent counter-current to discussion around AGI, in numerous ways.

One interesting parallel was the gradual redefinition of language over the course of the 20th century to exclude animals as their capabilities became more obvious. So, when I say 'language processing capacities', I mean it roughly in the sense of Chomsky-era definitions, after the goal posts had been thoroughly moved away from much more inclusive definitions.

Likewise, we've been steadily moving the bar on what counts as 'intelligence', both for animals and machines. Over the last couple decades the study of animal intelligence has been more inclusive, IMO, and recognizes intelligence as capabilities within the specific sensorium and survival context of the particular species. Our study of artificial intelligence is still very crude by comparison, and is still in the 'move the goalposts so that humans stay special' stage of development...


I suppose intelligence can be partitioned as less than, equal to, or greater than human. Given the initial theory depends on natural evidence, one could argue there's no proof that "greater than human" intelligence is possible - depending on your meaning of AGI.

But then intelligence too is a dubious term. An average mind with infinite time and resources might have eventually discovered general relativity.


The steelman would be that knowledge is possible outside the domain of Science. So the opposing argument to evolution as the mechanism for us (the "general intelligence" of AGI) would be that the pathway from conception to you is not strictly material/natural.

Of course, that's not going to be accepted as "Science", but I hope you can at least see that point of view.


The Penrose-Lucas argument is the best bet: https://en.wikipedia.org/wiki/Penrose%E2%80%93Lucas_argument

The basic idea is that either the human mind is NOT a computation at all (and is instead spooky, unexplainable magic of the universe) and thus can't be replicated by a machine, OR it's an inconsistent machine with contradictory logic. This is a deduction based on Gödel's incompleteness theorems.

But most people who believe AGI is possible would say the human mind is the latter. Technically we don't have enough information today to know either way, but we know the human mind (including memories) is fallible, so while we don't have enough information to prove the mind is an incomplete system, we have enough to believe it is. But that's also kind of a paradox, because that "belief" in unproven information is a cornerstone of consciousness.


The real point isn't AGI; it's that the limiting factor on the speed of knowledge is empiricism, not intelligence.

An infinitely intelligent creature still has to build the Standard Model from scratch. We're leaning too hard on a deductive conception of the world, when the reality is that it took hundreds of thousands of years for humans as intelligent as we are to split the atom.


I think the best argument against us ever finding AGI is that the search space is too big and the dead ends are too many. It's like wandering through a monstrously huge maze with hundreds of very convincingly fake exits that lead to pit traps. The first "AGI" may just be a very convincing Chinese room that kills all of humanity before we can ever discover an actual AGI.

The necessary conditions for "Kill all Humanity" may be the much more common result than "Create a novel thinking being." To the point where it is statistically improbable for the human race to reach AGI. Especially since a lot of AI research is specifically for autonomous weapons research.


Is there a plausible situation where a humanity-killing superintelligence isn't vulnerable to nuclear weapons?

If a genuine AGI-driven human extinction scenario arises, what's to stop the world's nuclear powers from using high-altitude detonations to produce a series of silicon-destroying electromagnetic pulses around the globe? It would be absolutely awful for humanity don't get me wrong, but it'd be a damn sight better than extinction.


Physically, maybe not, but an AGI would know that, would think a million times faster than us, and would have incentive to prioritize disabling our abilities to do that. Essentially, if an enemy AGI is revealed to us, it's probably too late to stop it. Not guaranteed, but a valid fear.


What stops them is: being politically captured by an AGI.

Not to mention that the whole idea of "radiation pulses destroying all electronics" is cheap sci-fi, not reality. A decently well prepared AGI can survive a nuclear exchange with more ease than human civilization would.


I think it's much more likely that a non-AGI platform will kill us before AGI even happens. I'm thinking the doomsday weapon from Doctor Strangelove more than Terminator.


If you have a wide enough definition of AGI, having a baby is making "AGI." It's a human-made, generally intelligent thing. What people mean by the "A", though, is that we have some kind of inorganic machine realizing the traits of "intelligence" in the medium of a computer.

The first leg of the argument would be that we aren’t really sure what general intelligence is or if it’s a natural category. It’s sort of like “betterness.” There’s no general thing called “betterness” that just makes you better at everything. To get better at different tasks usually requires different things.

I would be willing to concede to the AGI crowd that there could be something behind g that we could call intelligence. There’s a deeper problem though that the first one hints at.

For AGI to be possible, whatever trait or traits make up “intelligence” need to have multiple realizablity. They need to be at least realizable in both the medium of a human being and at least some machine architectures. In programmer terms, the traits that make up intelligence could be tightly coupled to the hardware implementation. There are good reasons to think this is likely.

Programmers and engineers like myself love modular systems that are loosely coupled and cleanly abstracted. Biology doesn't work this way: things at the molecular level can have very specific effects on the macro scale and vice versa. There's little in the way of clean separation of layers. Who is to say that some of the specific ways we work at a cellular level aren't critical to being generally intelligent? That's an "ugly" idea but lots of things in nature are ugly. Is it a coincidence too that humans are well adapted to getting around physically, can live in many different environments, etc.? There's also stuff from the higher level: does living physically and socially in a community of other creatures play a key role in our intelligence? Given how human beings who grow up absent those factors are developmentally disabled in many ways, it would seem so. It could be there's a combination of factors here, where very specific micro and macro aspects of being a biological human turn out to contribute, and you need the perfect storm of these aspects to get a generally intelligent creature. Some of these aspects could be realizable in computers, but others might not be, at least in a computationally tractable way.

It’s certainly ugly and goes against how we like things to work for intelligence to require a big jumbly mess of stuff, but nature is messy. Given the only known case of generally intelligent life is humans, the jury is still out that you can do it any other way.

Another commenter mentioned horses and cars. We could build cars that are faster than horses, but speed is something that is shared by all physical bodies and is therefore eminently multiply realizable. But even here, there are advantages to horses that cars don’t have, and which are tied up with very specific aspects of being a horse. Horses generally can go over a wider range of terrain than cars. This is intrinsically tied to them having long legs and four hooves instead of rubber wheels. They’re only able to have such long legs because of their hooves too because the hooves are required to help them pump blood when they run, and that means that in order for them to pump their blood successfully they NEED to run fast on a regular basis. there’s a deep web of influence both on a part-to-part, and the whole macro-level behaviors of horses. Having this more versatile design also has intrinsic engineering trade-offs. A horse isn’t ever going to be as fast as a gas powered four-wheeled vehicle on flat ground but you definitely can’t build a car that can do everything a horse can do with none of the drawbacks. Even if you built a vehicle that did everything a horse can do, but was faster, I would bet you it would be way more expensive and consume much more energy than a horse. There’s no such thing as a free lunch in engineering. You could also build a perfect replica of a horse at a molecular level and claim you have your artificial general horse.

Similarly, human beings are good at a lot of different things besides just being smart. But maybe you need to be good at seeing, walking, climbing, acquiring sustenance, etc. In order to be generally intelligent in a way that’s actually useful. I also suspect our sense of the beautiful, the artistic is deeply linked with our wider ability to be intelligent.

Finally it’s an open philosophical question whether human consciousness is explainable in material terms at all. If you are a naturalist, you are methodologically committed to this being the case — but that’s not the same thing as having definitive evidence that it is so. That’s an open research project.


In short, by definition, computers are symbol manipulating devices. However complex the rules of symbol manipulation, it is still a symbol manipulating device, and therefore neither intelligent nor sentient. So AGI on computers is not possible.


This is not an argument at all, you just restate your whole conclusion as an assumption ("a symbol manipulating device is incapable of cognition").

It's not even a reasonable assumption (to me), because I'd assume an exact simulation of a human brain to have the exact same cognitive capabilities (which is inevitable, really, unless you believe in magic).

And machines are well capable of simulating physics.

I'm not advocating for that approach because it is obviously extremely inefficient; we did not achieve flight by replicating flapping wings either, after all.


You can assume whatever you want to, but if you were right, then the human brain itself would be nothing more than a symbol manipulating device. While that is not necessarily a falsifiable stance, the really interesting questions are what consciousness is, and how we recognise consciousness.


A computer can simulate a human brain at the subatomic level (in theory). Do you agree this would be "sentient and intelligent" and not just symbol manipulating?

If yes, everything else is just optimization.


Say we do have a 1:1 representation of the human brain in software. How could we know if we're talking to a conscious simulation of a human being, versus some kind of philosophical zombie which appears conscious but isn't?

Without a solid way to differentiate 'conscious' from 'not conscious' any discussion of machine sentience is unfalsifiable in my opinion.


How do you tell the difference in other humans? Do you just believe them because they claim to be conscious instead of pointing a calibrated and certified consciousness-meter at them?


I obviously can't prove they're conscious in a rigorous way, but it's a reasonable assumption to make that other humans are conscious. "I think therefore I am" and since there's no reason to believe I'm exceptional among humans, it's more likely than not that other humans think too.

This assumption can't be extended to other physical arrangements though, not unless there's conclusive evidence that consciousness is a purely logical process as opposed to a physical one. If consciousness is a physical process, or at least a process with a physical component, then there's no reason to believe that a simulation of a human brain would be conscious any more than a simulation of biology is alive.


So, what if I told you that some humans have been vat-grown without brains and had a silicon brain emulator inserted into their skulls. Are they p-zombies? Would you demand x-rays before talking to anyone? What would you use then to determine consciousness?

Relying on these status quo proxy-measures (looks human :: 99.9% likely to have a human brain :: has my kind of intelligence) is what gets people fooled even by basic AI (without G) fake scams.


It's interesting that the benchmark they are choosing to emphasize (in the one chart they show and even in the "fast" name of the model) is token output speed.

I would have thought it an uncontroversial view among software engineers that token quality is much more important than token output speed.


It depends how fast.

If an LLM is often going to be wrong anyway, then being able to try prompts quickly and then iterate on those prompts, could possibly be more valuable than a slow higher quality output.

Ad absurdum, if it could ingest and work on an entire project in milliseconds, then it has much greater value to me than a process which might take a day to do the same, even if the likelihood of success is also strongly affected.

It simply enables a different method of interactive working.

Or it could supply 3 different suggestions in-line while working on something, rather than a process which needs to be explicitly prompted and waited on.

Latency can have critical impact on not just user experience but the very way tools are used.

Now, will I try Grok? Absolutely not, but that's a personal decision due to not wanting anything to do with X, rather than a purely rational decision.


>If an LLM is often going to be wrong anyway, then being able to try prompts quickly and then iterate on those prompts, could possibly be more valuable than a slow higher quality output.

Before MoE was a thing, I built what I called the Dictator, which was one strong model working with many weaker ones to achieve a similar result as MoE, but all the Dictator ever got was Garbage In, so guess what came out?


Sounds more like a Mixture of Idiots.


That doesn't seem similar to MoE at all.


Well, I really didn't provide sufficient detail to make that determination either way.


You just need to scale out more. As you approach infinite monkeys, sorry - models, you'll surely get the result you need.


why's this guy getting downvoted? SamA says we need a Dyson Sphere made of GPUs surrounding the solar system and people take it seriously but this guy takes a little piss out of that attitude and he's downvoted?

this site is the fucking worst


Maybe because this site is full of people with differing opinions and stances on things, and react differently to what people say and do?

Not sure who was taking SamA seriously about that; personally I think he's a ridiculous blowhard, and statements like that just reinforce that view for me.

Please don't make generalizations about HN's visitors'/commenters' attitudes on things. They're never generally correct.


Besides being a faster slot machine, to the extent that they're any good, a fast agentic LLM would be very nice to have for codebase analysis.


For 10% less time you can get 10% worse analysis? I don’t understand the tradeoff.


I mean, if that's literally what the numbers are, sure, maybe that's not great. But what if it's 10% less time and 3% worse analysis? Maybe that's valuable.


> If an LLM is often going to be wrong anyway, then being able to try prompts quickly and then iterate on those prompts, could possibly be more valuable than a slow higher quality output.

Asking any model to do things in steps is usually better too, as opposed to feeding it three essays.


I thought the current vibe was doing the former to produce the latter and then use the output as the task plan?


I don't know what other people are doing, I mostly use LLMs:

* Scaffolding

* Ask it what's wrong with the code

* Ask it for improvements I could make

* Ask it what the code does (amazing for old code you've never seen)

* Ask it to provide architect level insights into best practices

One area where they all seem to fail is lesser-known packages: they tend to reference old functionality that is not there anymore, or never was; they hallucinate. Which is part of why I don't ask them for too much.

Junie did impress me, but it was very slow, so I would love to see a version of Junie using this version of Grok; it might be worthwhile.


> Ask it what's wrong with the code

That's phase 1; ask it to "think deeply" (a Claude keyword, it only works with the Anthropic models) while doing that. Then ask it to make a detailed plan for solving the issue, write that into current-fix.md, and add clearly testable criteria for when the issue is solved.

Now you manually check whether the criteria sound plausible; if they don't, its analysis failed and its output was worthless.

But if it sounds good, you can then start a new session and ask it to read the-markdown-file and implement the change.

Now you can plausibility-check the diff, and you're likely done.

But as the sister comment pointed out, agentic coding really breaks apart with large files like you usually have in brownfield projects.


> amazing for old code you've never seen

not if you have too much! a few hundred thousand lines of code and you can't ask shit!

plus, you just handed over your company's entire IP to whoever hosts your model


If Apple keeps improving things, you can run the model locally. I'm able to run models on my MacBook with an M4 that I can't even run on my 3080 GPU (mostly due to VRAM constraints), and they run reasonably fast. Would the 3080 be faster? Sure, but the M4 is also plenty fast, to the point where I'm not sitting there waiting longer than I wait for a cloud model to "reason" and look things up.
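
For what it's worth, most local runners expose an OpenAI-compatible endpoint, so existing client libraries work as-is; a sketch assuming something like Ollama on its default port (the model tag is just an example):

    # Sketch assuming a local runner (e.g. Ollama) serving an OpenAI-compatible
    # API on its default port; the model tag below is only an example.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's default endpoint
        api_key="unused",                      # local servers generally ignore this
    )

    response = client.chat.completions.create(
        model="qwen2.5-coder",  # whatever model you've pulled locally
        messages=[{"role": "user", "content": "Explain what this regex does: ^a+b$"}],
    )
    print(response.choices[0].message.content)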

I think the biggest thing for offline LLMs will have to be consistency for having them search the web with an API like Google's or some other search engines API, maybe Kagi could provide an API for people who self-host LLMs (not necessarily for free, but it would still be useful).


It's a fair trade-off for smaller companies where the IP or the software is a necessary evil, not the main unique value added. It's hard to see what evil anyone would do with crappy legacy code.

The IP risk taken may be well worth the productivity boost.


I hope that in the future, tooling and MCP will be better, so agents can directly check what functionality exists in the installed package version instead of hallucinating.


That's far from the worst metric that xAI has come up with...

https://xcancel.com/elonmusk/status/1958854561579638960


what's wrong with rapid updates to an app?


I have a coworker who outshines everybody else in number of commits and pushes in any given time period. It’s pretty amazing the number they can accomplish!

Of course, 95% of them are fixing things they broke in earlier commits and their overall quality is the worst on the team. But, holy cow, they can output crap faster than anyone I’ve seen.


That metric doesn't really tell you anything. Maybe I'm making rapid updates to my app because I'm a terrible coder and I keep having to push out fixes to critical bugs. Maybe I'm bored and keep making little tweaks to the UI, and for some reason think that's worth people's time to upgrade. (And that's another thing: frequent upgrades can be annoying!)

But sure, ok, maybe it could mean making much faster progress than competitors. But then again, it could also mean that competitors have a much more mature platform, and you're only releasing new things so often because you're playing catch-up.

(And note that I'm not specifically talking about LLMs here. This metric is useless for pretty much any kind of app or service.)


It's like measuring how fast your car can go by counting how often you clean the upholstery.

There's nothing wrong with doing it, but it's entirely unrelated to performance.


I don't think he was saying their release cadence is a direct metric on their model performance. Just that the team iterates and improves the app user experience much more quickly than on other teams.


He seems to be stating that app release cadence correlates with internal upgrades that correlate with model performance. There is no reason for this to be true. He does not seem to be talking about user experience.


Oh c'mon, I know it's usually best to try to interpret things in the most charitable way possible, but clearly Musk was implying the actual meat of things, the model itself, is what's being constantly improved.

But even if your interpretation is correct, frequency of releases still is not a good metric. That could just mean that you have a lot to fix, and/or you keep breaking and fixing things along the way.


It's a fucking chat. How many times a day do you need to ship an update?


They aren't a metric for showing you are better than the competition.


It's a metric for showing you can move more quickly on product improvements. Anyone who has worked on a product team at a large tech company knows how much things get slowed down by process bloat.


See the reply, currently at #2 on that Twitter thread, from Jamie Voynow.


After trying Cerebras's free API (not affiliated), which delivers Qwen Coder 480b and gpt-oss-120b at a mind-boggling ~3000 tps, output speed was the first thing I checked when considering a model for speed. I just wish Cerebras had a better overall offering on their cloud: usage is capped at 70M tokens/day, and people are reporting that the cap is easily hit and highly crippling for daily coding.
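
For anyone who wants to try it: the endpoint is OpenAI-compatible, so a minimal sketch looks something like the below. The base URL is the documented one, but the model slug is an assumption on my part; check their current model list.

  from openai import OpenAI

  # Minimal sketch against Cerebras's OpenAI-compatible endpoint.
  # NOTE: the model slug is an assumption; look up the current names in their docs.
  client = OpenAI(
      base_url="https://api.cerebras.ai/v1",
      api_key="YOUR_CEREBRAS_API_KEY",
  )

  resp = client.chat.completions.create(
      model="qwen-3-coder-480b",  # assumed slug; gpt-oss-120b is the other model mentioned
      messages=[{"role": "user", "content": "Rewrite this recursive function iteratively: ..."}],
  )
  print(resp.choices[0].message.content)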


They have a "max" plan with 120m tokens/day limit for $200/month: https://www.cerebras.ai/blog/introducing-cerebras-code


depends for what.

For autocompleting simple functions (string manipulation, function definitions, etc), the quality bar is pretty easy to hit, and speed is important.

If you're just vibe coding, then yeah, you want quality. But if you know what you're doing, I find having a dumber fast model is often nicer than a slow smart model that you still need to correct a bit, because it's easier to stay in flow state.

With the slow reasoning models, the workflow is more like working with another engineer, where you have to review their code in a PR


Speed absolutely matters. Of course if the quality is trash then it doesn't matter, but a model that's on par with Claude Sonnet 4 AND very speedy would be an absolute game changer in agentic coding. Right now you craft a prompt, hit send and then wait, and wait, and then wait some more, and after some time (anywhere from 30 seconds to minutes later) the agent finishes its job.

It's not long enough for you to context switch to something else, but long enough to be annoying and these wait times add up during the whole day.

It also discourages experimentation if you know that every prompt will potentially take multiple minutes to finish. If it instead finished in seconds then you could iterate faster. This would be especially valuable in the frontend world where you often tweak your UI code many times until you're satisfied with it.


For agentic workflows, speed and good tool use are the most important thing. Agents should use tools for things by design, and that can include reasoning tools and oracles. The agent doesn't need to be smart, it just needs a line to someone who is that can give the agent a hyper-detailed plan to follow.


Tbh I kind of disagree; there are certain use cases where speed would legitimately be much more interesting, such as generating a massive amount of HTML. Though I agree this makes it look like even more of a joke for anything serious.

They do reduce the costs, though!


> I would have thought it an uncontroversial view among software engineers that token quality is much more important than token output speed.

We already know that in most software domains, fast (as in, getting it done faster) is better than 100% correct.


To a point. If gpt5 takes 3 minutes to produce output and qwen3 does it in 10 seconds, and the agent can iterate 5 times and still finish before gpt5, why do I care if gpt5 one-shot it and qwen took 5 iterations?


It doesn’t though. Fast but dumb models don’t progressively get better with more iterations.


There are many ways to skin a cat.

Often all it takes is to reset to a checkpoint or undo and adjust the prompt a bit with additional context and even dumber models can get things right.

I've used grok code fast plenty this week alongside gpt 5 when I need to pull out the big guns and it's refreshing using a fast model for smaller changes or for tasks that are tedious but repetitive during things like refactoring.


Yes fast/dumb models are useful! But that's not what OP said - they said they can be as useful as the large models by iterating them.

Do you use them successfully in cases where you just had to re-run them 5 times to get a good answer, and was that a better experience than going straight to GPT 5?


That very much depends on the use case.

Different models for different things.

Not everyone is solving complicated things every time they hit cmd-k in Cursor or use autocomplete, and they can easily switch to a different model when working harder stuff out via longer form chat.


ChatGPT 5 takes 5 times as long to finish, and still produces garbage.


Fast inference can change the entire dynamic of working with these tools. At the typical speeds, I usually try to do something else while the model works. When the model works really fast, I can easily wait for it to finish.

So the total difference includes the cost of context switching, which is big.

Potentially speed matters less in a scenario that is focused on more autonomous agents running in the background. However I think most usage is still highly interactive these days.


I'm more curious whether it's based on Grok 3 or what; I used to get reasonable answers from Grok 3. If that's the case, the trick that works for Grok and basically any model out there is to ask for things in order and piecemeal, not all at once. Some models will be decent at the 'all at once' approach, but when I and others have asked in steps, we got much better output. I'm not yet sure how I feel about Grok 4; I have not really been impressed by it.


I agree. Coding faster than humans can review it is pointless. Between fast, good, and cheap, I'd prioritize good and cheap.

Fast is good for tool use and synthesizing the results.


Fast can buy you a little quality by getting more inference on the same task.

I use Opus 4.1 exclusively in Claude Code but then I also use zen-mcp server to get both gpt5 and gemini-2.5-pro to review the code and then Opus 4.1 responds. I will usually have eyeballed the code somewhere in the middle here but I'm not fully reviewing until this whole dance is done.

I mean, I obviously agree with you in that I've chosen the slowest models available at every turn here, but my point is I would be very excited if they also got faster because I am using a lot of extra inference to buy more quality before I'm touching the code myself.


  > I use Opus 4.1 exclusively in Claude Code but then I also use zen-mcp server to get both gpt5 and gemini-2.5-pro to review the code and then Opus 4.1 responds.
I'd love to hear how you have this set up.


This is a nice setup. I wonder how much it helps in practice? I suspect most of the problems opus has for me are more context related, and I’m not sure more models would help. Speculation on my part.


Here are my own anecdotes from using o3-pro recently.

My primary use case where I am willing to wait 10-20 minutes for an answer from the "big slow" model (o3-pro) is code review of large amounts of code. I have been comparing results on this task from the three models above.

Oddly, I see many cases where each model will surface issues that the other two miss. In previous months when running this test (e.g., Claude 3.7 Sonnet vs o1-pro vs earlier Gemini), that wasn't the case. Back then, the best model (o1-pro) would almost always find all the issues that the other models found. But now it seems they each have their own blindspots (although they are also all better than the previous generation of models).

With that said, I am seeing Claude Opus 4 (w/extended thinking) be distinctly worse, in that it misses problems which o3-pro and Gemini find. It seems fairly consistent that Opus will be the worst out of the three (despite sometimes noticing things the others do not).

Whether o3-pro or Gemini 2.5 Pro is better is less clear. o3-pro will report more issues, but it also has a tendency to confabulate problems. My workflow involves providing the model with a diff of all changes, plus the full contents of the files that were changed. o3-pro seems to have a tendency to imagine and report problems in the files that were not provided to it. It also has an odd new failure mode, which is very consistent: it gets confused by the fact that I provide both the diff and the full file contents. It "sees" parts of the same code twice and will usually report that there has accidentally been some code duplicated. Base o3 does this as well. None of the other models get confused in that way, and I also do not remember seeing that failure mode with o1-pro.

Nevertheless, o3-pro seems to find real issues that Gemini 2.5 Pro and Opus 4 miss more often than vice versa.

Back in the o1-pro days, it was fairly straightforward in my testing for this use case that o1-pro was simply better across the board. Now with o3-pro compared particularly with Gemini 2.5 Pro, it's no longer clear whether the bonus of occasionally finding a problem that Gemini misses is worth the trouble of (1) waiting way longer for an answer and (2) sifting through more false positives.

My other common code-related use case is actually writing code. Here, Claude Code (with Opus 4) is amazing and has replaced all my other use of coding models, including Cursor. I now code almost exclusively by peer programming with Claude Code, allowing it to be the code writer while I oversee and review. The OpenAI competitor to Claude Code, called Codex CLI, feels distinctly undercooked. It has a recurring problem where it seems to "forget" that it is an agent that needs to go ahead and edit files, and it will instead start to offer me suggestions about how I can make the change. It also hallucinates running commands on a regular basis (e.g., I tell it to commit the changes we've done, and outputs that it has done so, but it has not.)

So where will I spend my $200 monthly model budget? Answer: Claude, for nearly unlimited use of Claude Code. For highly complex tasks, I switch to Gemini 2.5 Pro, which is still free in AI Studio. If I can wait 10+ minutes, I may hand it to o3-pro. But once my ChatGPT Pro subscription expires this month, I may either stop using o3-pro altogether, or I may occasionally use it as a second opinion by paying on-demand through the API.


> With that said, I am seeing Claude Opus 4 (w/extended thinking) be distinctly worse, in that it misses problems which o3-pro and Gemini find. It seems fairly consistent that Opus will be the worst out of the three (despite sometimes noticing things the others do not).

I've found the same thing: Claude is more likely to miss a bug than o3 or Gemini, but also more likely to catch something o3 and Gemini missed. If I had to pick one model I'd pick o3 or Gemini, but if I had to pick a second model I'd pick Opus.

It also seems to have a much higher false positive rate, whereas Gemini seems to have the lowest false positive rate.

Basically, o3 and Gemini are better, but also more correlated, which gives Opus a lot of value.


For the code review use case, maybe you can try creating the diff with something like `git diff -U99999`, and then sending only the diff.
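
A rough sketch of what I mean; the helper name, prompt wording, and base ref are just placeholders, not from any particular tool:

  import subprocess

  # Build a review prompt from a single full-context diff, so the model never
  # sees the same code twice (once in the diff, once in a separate full file).
  def build_review_prompt(base_ref: str = "main") -> str:
      diff = subprocess.run(
          ["git", "diff", "-U99999", base_ref],
          capture_output=True, text=True, check=True,
      ).stdout
      return (
          "Review the following change for bugs, regressions, and risky edits. "
          "Only report issues visible in this diff.\n\n" + diff
      )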


Even without including employer health insurance costs, real wages are up 67% since 1980.

Source: https://fred.stlouisfed.org/graph/?g=1JxBn

Details: uses the "Wage and salary accruals per full-time-equivalent employee" time series, which is the broadest wage measure for FTE employees, and adjusts for inflation using the PCE price index, which is the most economically meaningful measure of "how much did prices change for consumers" (and is the inflation index that the Fed targets)


How has inflation behaved since 1980?


It's probably worth noting that the "real" in "real wages" indicates that the number is already inflation adjusted.


It rose 2.75% per year (239% over 45 years).

Source with details: https://fred.stlouisfed.org/graph/?g=1JxIa


Can you walk me through how to reach this '239%' number? Thank you.


You can hover over places on the chart to get exact values. In January 1980, the index was at 37.124. In April 2025, it was at 125.880.

Then calculate cumulative inflation as the proportional change in the price level, like this:

(P_final - P_initial) / P_initial = (125.880 - 37.124) / 37.124 = 2.39

This shows that the overall price level (the cumulative inflation embodied in the PCEPI) has increased by about 2.39 times over the period, which is 239%.


The thing that bugs me to no end when talking about inflation in a historical context is that everyone forgets to consider that the consumption indexes it's calculated from (PCEPI, CPI, etc.) are NOT static; they are changed over time quite arbitrarily, often in ways that make inflation seem lower than it actually is for the consumer.

Overall, historical comparisons of inflation numbers are so imprecise as to be practically worthless the longer the timescale. You can expect the real figure to be much greater for consumers, given the political incentive to lie about inflation data.


1.0275 ^ 45 = 3.389

3.389 - 1 (to account for the increase) = 2.389 ~ 239%
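
Or, as a quick sanity check in Python, using the index values quoted above:

  # PCEPI values from the FRED series quoted upthread
  p_initial, p_final = 37.124, 125.880

  cumulative = p_final / p_initial - 1                # ~2.39, i.e. ~239%
  annualized = (p_final / p_initial) ** (1 / 45) - 1  # ~0.0275, i.e. ~2.75% per year

  print(f"cumulative: {cumulative:.1%}, annualized: {annualized:.2%}")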


I also don't have that tweet saved, but I do remember it.


No, this doesn't seem to be correct, although confusion regarding model names is understandable.

o4-mini-high is the label on chatgpt.com for what in the API is called o4-mini with reasoning={"effort": "high"}. Whereas o4-mini on chatgpt.com is the same thing as reasoning={"effort": "medium"} in the API.

o3 can also be run via the API with reasoning={"effort": "high"}.

o3-pro is different than o3 with high reasoning. It has a separate endpoint, and it runs for much longer.

See https://platform.openai.com/docs/guides/reasoning?api-mode=r...
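
For reference, in the API the effort level is just a request parameter. Here's a minimal sketch with the Python SDK; the prompt is a placeholder, and model availability depends on your account:

  from openai import OpenAI

  client = OpenAI()

  # o4-mini with effort "high" is what chatgpt.com labels "o4-mini-high";
  # the plain "o4-mini" in the UI corresponds to the default "medium" effort.
  resp = client.responses.create(
      model="o3",
      reasoning={"effort": "high"},
      input="Summarize the trade-offs between B-trees and LSM-trees.",
  )
  print(resp.output_text)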


OpenAI started strong in the naming department (ChatGPT, DALL-E) then fell off so hard since.


It's arguable ChatGPT is not such a great name, either. The general public has no idea what GPT means and will often swap the letters around. It does benefit from being unique, however.


They are working on it: https://jules.google/


I got the feeling Jules was targeted at web (a la GitHub) PR workflows. Is it not?

The Claude Code UX is nice imo, but I didn't get the impression Jules is that.


At Google, our PR flow and editing is all done in web based tools. Except for the nerds who like vi.


people don't use local editors? it's weird to lock people into workflows like that


Damn... you guys don't use proper text editors?

