In my tests[0] it does only slightly better than Kimi K2.5.
Kimi K2.6 seems to struggle most with puzzle/domain-specific and trick-style exactness tasks, where it shows frequent instruction misses and wrong-answer failures.
It is probably a great coding model, but a bit less intelligent overall than the SOTA models.
I tried it on OpenRouter and set max tokens to 8192, and every response is truncated, even in non-thinking mode. Maybe there's an issue with the deployment, but the link you shared also shows it generating tons of output tokens.
I was initially excited by 4.7, as it does a lot better in my tests, but their reasoning/pricing is really weird and unpredictable.
Apart from that, in real-life usage, gpt-5.3-codex is ~10x cheaper in my case, simply because of the cached input discount (otherwise it would still be around 3-4x cheaper anyway).
Medium reasoning has regressed since 4.6, while None and Max have improved in our benchmark.
We suspect that this is how Claude tries to cope with the increased user base.
Note that Google and OpenAI probably did something similar long ago.
> Instruction following. Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.
Yay! They finally fixed instruction following, so people can stop bashing my benchmarks[0] for being broken just because Opus 4.6 did poorly on them and called my tests broken...