
All of this will also depend on the model's settings (reasoning effort, temperature, top_k, etc.).

Which is why your benchmarks should generally be a bit broader (>10 questions for a personal setup); otherwise you overfit to noise.
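To put a rough number on the noise: a minimal sketch, assuming each question is an independent pass/fail trial and using a hypothetical true pass rate of 70%. The function name and the sample sizes are made up for illustration, and the normal approximation is crude at small n, so treat the output as an order-of-magnitude guide only.

```python
import math


def pass_rate_stderr(p: float, n: int) -> float:
    """Standard error of an observed pass rate over n independent questions."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical true pass rate; the spread is what matters, not the value itself.
TRUE_RATE = 0.7

for n in (5, 10, 50, 200):
    se = pass_rate_stderr(TRUE_RATE, n)
    # 1.96 standard errors is the usual approximate 95% interval half-width.
    half_width = 1.96 * se * 100
    print(f"n={n:>3}: observed score can swing about +/- {half_width:.0f} points")
```

With only a handful of questions the interval is wide enough that two settings (or two models) can swap places between runs purely by chance, which is the overfitting-to-noise problem.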


