In my tests[0] it does only slightly better than Kimi K2.5.
Kimi K2.6 seems to struggle most with puzzle/domain-specific and trick-style exactness tasks, where it shows frequent instruction misses and wrong-answer failures.
It is probably a great coding model, but a bit less intelligent overall than the SOTA models.
I tried it on OpenRouter and set max tokens to 8192, and every response is truncated, even in non-thinking mode. Maybe there's an issue with the deployment, but the link you shared also shows it generating tons of output tokens.
I was initially excited by 4.7, as it does a lot better in my tests, but their reasoning/pricing is really weird and unpredictable.
Apart from that, in real-life usage, gpt-5.3-codex is ~10x cheaper in my case, simply because of the cached input discount (otherwise it would still be around 3-4x cheaper anyway).
Medium reasoning has regressed since 4.6, while None and Max have improved in our benchmark.
We suspect that this is how Claude tries to cope with the increased user base.
Note that Google and OpenAI probably did something similar long ago.
> Instruction following. Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.
Yay! They finally fixed instruction following, so people can stop bashing my benchmarks[0] for being broken just because Opus 4.6 did poorly on them and called my tests broken...