Yeah, they mention a benchmark I'm seeing for the first time (Terminal-Bench 2.0) and are supposedly leading in it, while for some reason SWE Bench is down from Sonnet 4.5.

Curious to see some third-party testing of this model. Based on the benchmarks, it currently seems to improve primarily on "general non-coding and visual reasoning".



They are not even leading in Terminal-Bench... GPT 5.1-codex is better than Gemini 3 Pro



