Yeah, they mention a benchmark I'm seeing the first time (Terminal-Bench 2.0) an... | Hacker News


cube2222 45 days ago | on: Gemini 3

Yeah, they mention a benchmark I'm seeing for the first time (Terminal-Bench 2.0) and are supposedly leading in, while for some reason their SWE-bench score is lower than Sonnet 4.5's. Curious to see some third-party testing of this model. Based on the benchmarks, it currently seems to improve primarily on "general non-coding and visual reasoning".

nico1207 45 days ago

They are not even leading in Terminal-Bench... GPT 5.1-codex is better than Gemini 3 Pro.
