Hey, yes — the reported results don't impose any time or token limit on the benchmarks. We ran our baselines with the same config (temperature 0.6, max_tokens 32k) but set a 600-second timeout per instance; otherwise it would have taken forever to benchmark with the resources we had. There's a note on this in the implementation details section of the paper.
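For concreteness, that generation config might look like this in a vLLM-based harness (a minimal sketch; the serving stack and the 1.5B checkpoint name are assumptions, not details confirmed in the thread):

    from vllm import LLM, SamplingParams

    # Config from the comment above: temperature 0.6, 32k max new tokens.
    # The checkpoint name is an assumption (the thread only says "a 1.5B
    # model" run with DeepSeek-R1's suggested settings); swap in the
    # actual model being evaluated.
    llm = LLM(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        max_model_len=40960,  # headroom for the prompt on top of 32k generation
    )
    params = SamplingParams(temperature=0.6, max_tokens=32768)

    outputs = llm.generate(["<benchmark question prompt>"], params)
    print(outputs[0].outputs[0].text)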


GPQA-Diamond is 200 questions. Any GPU since 2019 with 12GB of VRAM should be able to run tens if not hundreds of queries for a 1.5B model in parallel.


If we benchmark GPQA-Diamond with DeepSeek-R1's suggested configuration (temperature 0.6, 32k max_tokens) and assume every instance generates the maximum tokens, that's 6.4M tokens. Without batching, on a single H100 at 80 tok/s, that takes roughly 22 hours. And to run at 32k context length on a single H100, a 1.5B model needs ~15-20 GB of VRAM, so you cannot run tens or hundreds of queries in parallel.
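Spelling out that arithmetic (a sketch using only the figures quoted above — 200 questions, 32k max_tokens, and 80 tok/s as the assumed single-stream H100 decode speed):

    # Worst-case cost of GPQA-Diamond at 32k max_tokens, using the figures
    # from the comment above (80 tok/s assumed unbatched H100 throughput).
    n_questions = 200
    max_tokens = 32_000
    tok_per_sec = 80

    total_tokens = n_questions * max_tokens    # 6,400,000 tokens
    hours = total_tokens / tok_per_sec / 3600  # ~22.2 hours

    print(f"{total_tokens / 1e6:.1f}M tokens -> ~{hours:.0f} hrs without batching")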

MMLU-Pro is 12,000 instances. To keep the total runtime tractable, we set a 600-second timeout for each instance.
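As an illustration, a per-instance timeout like the one described could be enforced along these lines (a sketch only; `run_instance` is a hypothetical stand-in for whatever generation call the harness actually makes):

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def run_instance(prompt: str) -> str:
        # Hypothetical per-instance generation call; replace with the
        # real client used by the benchmark harness.
        raise NotImplementedError

    def run_with_timeout(prompt: str, timeout_s: float = 600.0) -> str | None:
        # Give each benchmark instance at most `timeout_s` seconds to finish.
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(run_instance, prompt)
        try:
            return future.result(timeout=timeout_s)
        except TimeoutError:
            return None  # timed-out instances are treated as unanswered
        finally:
            # Don't block on the straggler; Python threads can't be
            # force-killed, so the underlying call may still run to completion.
            pool.shutdown(wait=False)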



