Today I did some comparisons of GPT-5.1-Codex-Max (on high) in the Codex CLI versus Gemini 3 Pro in the Gemini CLI.
- As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.
- Subjectively, it seemed to me that the code that Gemini wrote was more similar to code that I, as a senior-level developer, would have written than what I have been used to from recent iterations of GPT-5.1. The code seemed more readable-by-default and not merely technically correct. I was happy to see this.
- Gemini seems to have a tendency to put its "internal dialogue" into comments. For example, "// Here we will do X because of reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.". Very annoying.
I did two concrete head-to-head comparisons where both models had the same code and the same prompt.
First, both models were told to take a high-level overview of some new functionality that we needed and were told to create a detailed plan for implementing it. Both models' plans were then reviewed by me and also by both models (in fresh conversations). All three of us agreed that Codex's plan was better. In particular, Codex was better at being more comprehensive and at understanding how to integrate the new functionality more naturally into the existing code.
Then (in fresh conversations), both models were told to implement that plan. Afterwards, again, all three of us compared the resulting solutions. And, again, all three of us agreed that Codex's implementation was better.
Notably, Gemini (1) hallucinated database column names, (2) ignored parts of the functionality that the plan called for, and (3) did not produce code that was integrated as well with the existing codebase. In its favor, it did produce a better version of a particular finance-related calculation function than Codex did.
Overall, Codex was the clear winner today. Hallucinations and ignored requirements are big problems that are very annoying to deal with when they happen. Additionally, Gemini's tendencies to include odd comments and to jump past the discussion phase of projects both make it more frustrating to work with, at this stage.
"For Gemini 3, we strongly recommend keeping the temperature parameter at its default value of 1.0.While previous models often benefited from tuning temperature to control creativity versus determinism, Gemini 3's reasoning capabilities are optimized for the default setting. Changing the temperature (setting it below 1.0) may lead to unexpected behavior, such as looping or degraded performance, particularly in complex mathematical or reasoning tasks."
> - As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.
This has been an annoying Gemini feature since the beginning. I ask it to evaluate, check or analyse something, tab away and come back to it rewriting half the fucking codebase.
Please Google, use a percentage of your billions and add a "plan" mode to Gemini-cli - just like Claude has and I'd use your stuff a lot more often. The 1M context is excellent for large scale reviews, but its tendency to start writing code on its own is a pain in my ass.
- As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.
- Subjectively, it seemed to me that the code that Gemini wrote was more similar to code that I, as a senior-level developer, would have written than what I have been used to from recent iterations of GPT-5.1. The code seemed more readable-by-default and not merely technically correct. I was happy to see this.
- Gemini seems to have a tendency to put its "internal dialogue" into comments. For example, "// Here we will do X because of reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.". Very annoying.
I did two concrete head-to-head comparisons where both models had the same code and the same prompt.
First, both models were told to take a high-level overview of some new functionality that we needed and were told to create a detailed plan for implementing it. Both models' plans were then reviewed by me and also by both models (in fresh conversations). All three of us agreed that Codex's plan was better. In particular, Codex was better at being more comprehensive and at understanding how to integrate the new functionality more naturally into the existing code.
Then (in fresh conversations), both models were told to implement that plan. Afterwards, again, all three of us compared the resulting solutions. And, again, all three of us agreed that Codex's implementation was better.
Notably, Gemini (1) hallucinated database column names, (2) ignored parts of the functionality that the plan called for, and (3) did not produce code that was integrated as well with the existing codebase. In its favor, it did produce a better version of a particular finance-related calculation function than Codex did.
Overall, Codex was the clear winner today. Hallucinations and ignored requirements are big problems that are very annoying to deal with when they happen. Additionally, Gemini's tendencies to include odd comments and to jump past the discussion phase of projects both make it more frustrating to work with, at this stage.