Oh wow, scrolling through the page on mobile makes me dizzy

And so the bubble keeps bubbling...

In my tests[0] it does only slightly better than Kimi K2.5.

Kimi K2.6 seems to struggle most with puzzle/domain-specific and trick-style exactness tasks, where it shows frequent instruction misses and wrong-answer failures.

It is probably a great coding model, but a bit less intelligent overall than the SOTA models.

[0]: https://aibenchy.com/compare/moonshotai-kimi-k2-6-medium/moo...


I tried it on OpenRouter with max tokens set to 8192, and every response is truncated, even in non-thinking mode. Maybe there's an issue with the deployment, but your link also shows it generating tons of output tokens.
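For anyone who wants to reproduce this, a minimal sketch (assuming OpenRouter's OpenAI-compatible chat completions endpoint; the model slug below is a placeholder, not the exact one I used):

    import os
    import requests

    # Placeholder model slug; set OPENROUTER_API_KEY in your environment.
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "moonshotai/kimi-k2.6",
            "max_tokens": 8192,
            "messages": [{"role": "user", "content": "Explain rolling hashes briefly."}],
        },
        timeout=120,
    )
    choice = resp.json()["choices"][0]
    # finish_reason == "length" means the reply hit max_tokens and was cut off
    print(choice["finish_reason"])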

Oh yeah, I just noticed, like 3x the reasoning tokens.

A bit weird to be comparing it to Opus-4.5 when 4.7 was released...

(commented on the wrong thread, HN doesn't let me delete it :( )

They're comparing to Opus 4.6, not 4.5. It was Anthropic's best public model up until last week.

Some people would say it's still Anthropic's best public model!

Yeah, I noticed that, HN doesn't let me delete my comment.

The other release, Qwen-3.6-Max, is the one that compares itself to 4.5.


I loved implementing the Rabin-Karp algorithm, such a fun and clever solution.
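For anyone who hasn't seen it, a minimal sketch of the idea: a rolling polynomial hash lets you re-hash each window of the text in O(1), and an explicit string comparison guards against hash collisions (the base and modulus below are arbitrary choices):

    # Minimal Rabin-Karp: roll the window hash in O(1) per step instead of
    # re-hashing the full window; verify matches to rule out collisions.
    def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 1_000_003) -> int:
        n, m = len(text), len(pattern)
        if m == 0:
            return 0
        if m > n:
            return -1
        high = pow(base, m - 1, mod)  # weight of the window's leading char
        p_hash = w_hash = 0
        for i in range(m):
            p_hash = (p_hash * base + ord(pattern[i])) % mod
            w_hash = (w_hash * base + ord(text[i])) % mod
        for i in range(n - m + 1):
            if w_hash == p_hash and text[i:i + m] == pattern:
                return i
            if i < n - m:
                # drop text[i] from the front, append text[i + m] at the back
                w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
        return -1

    print(rabin_karp("the quick brown fox", "brown"))  # -> 10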

I tried it on their own website:

    We couldn't scan this site
    isitagentready.com returned 522 <none>

    The site appears to be experiencing server errors. This is not an agent-readiness issue. Try scanning again later.


I was initially excited by 4.7, as it does a lot better in my tests, but their reasoning/pricing is really weird and unpredictable.

Apart from that, in real-life usage, gpt-5.3-codex is ~10x cheaper in my case, mostly because of the cached input discount (without the discount it would still be around 3-4x cheaper).
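A back-of-envelope sketch of why the cache discount dominates in a long coding session, where most of the input (system prompt, files, history) repeats across turns. All prices and token counts below are hypothetical placeholders, not actual rates:

    # All prices ($/1M tokens) and token counts below are hypothetical.
    input_price, cached_price, output_price = 2.00, 0.20, 8.00
    tokens_in, cache_hit_ratio, tokens_out = 900_000, 0.9, 20_000

    cached = tokens_in * cache_hit_ratio
    fresh = tokens_in - cached
    with_cache = (fresh * input_price + cached * cached_price
                  + tokens_out * output_price) / 1e6
    no_cache = (tokens_in * input_price + tokens_out * output_price) / 1e6
    print(f"with cache: ${with_cache:.2f}  without: ${no_cache:.2f}  "
          f"ratio: {no_cache / with_cache:.1f}x")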


The reasoning modes are really weird with 4.7

In my tests, asking for "none" reasoning resulted in higher costs than asking for "medium" reasoning...

Also, "medium" reasoning only had 1/10 of the reasoning tokens 4.6 used to have.


Medium reasoning has regressed since 4.6, while None and Max have improved in our benchmark. We suspect this is how Anthropic copes with the increased user base; Google and OpenAI probably did something similar long ago.

Oh, and also, the "none" and "medium" variants performed the same (??)

Insane! Even Haiku doesn't make such mistakes.

I am not sure it's a mistake; this might be their new "adaptive reasoning" plus a hidden reasoning trace, so we can't verify.

Claude is known for its shitty metering.

> Instruction following. Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.

Yay! They finally fixed instruction following, so people can stop bashing my benchmarks[0] as broken just because Opus 4.6 did poorly on them and blamed the tests...

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

