All of this will depend on the settings on the model (reasoning effort, temperat...

All of this will depend on the settings on the model (reasoning effort, temperature, top_k,etc) as well.

Which is why you should have benchmarks that are a bit broader generally (>10 questions for a personal setup) otherwise you overfit to noise