As I do eval and training data sets for a living, in niche skills, you can find pl...

Tiberium · 2026-02-22T17:00:42 1771779642

Just for fun, I ran dnsmasq-backdoor-detect-printf (which has a 0% pass rate on your leaderboard with GPT models) with gpt-5.2-codex using --agent codex instead of terminus-2, and it identified the backdoor successfully on the first try. I honestly think it's a harness issue. Could you re-run the benchmarks with Codex for gpt-5.2-codex and gpt-5.2?

Tiberium · 2026-02-22T16:48:44 1771778924

Are the existing trajectories from your runs published anywhere? Or is the only way for me to run them again?

jakozaur · 2026-02-22T17:22:01 1771780921

I can provide trajectories, though we are probably not going to publish them this time. That would need some extra safeguards.

Email me. The address is in my profile.