
Thanks, fixed


Thank you for your question! We haven't published a formal evaluation yet, but it's something we're working toward. For now we rely mostly on human review to monitor and assess LLM outputs, and we maintain a golden test suite with regex-based checks that runs against every release to keep quality consistent over time.

Our key metrics include the time and cost per agentic loop, as well as the false positive rate for a full end-to-end test. If you have any specific benchmarks or evaluation metrics you'd suggest, we'd be happy to hear them!
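For anyone curious what a regex-based golden check can look like, here is a rough sketch (all names, prompts, and patterns are hypothetical, not our actual suite):

    import re

    # Hypothetical golden case: a prompt paired with regex patterns the LLM
    # output must (or must not) match. Everything here is illustrative only.
    GOLDEN_SUITE = [
        {
            "name": "login_flow_summary",
            "prompt": "Summarize the steps to log in.",
            "must_match": [r"(?i)\b(email|username)\b", r"(?i)\bpassword\b"],
            "must_not_match": [r"(?i)\b(error|exception)\b"],
        },
    ]

    def evaluate_case(case, output: str) -> bool:
        """True if the output matches every required pattern and no forbidden one."""
        required = all(re.search(p, output) for p in case["must_match"])
        forbidden = any(re.search(p, output) for p in case["must_not_match"])
        return required and not forbidden

    def run_suite(generate, suite=GOLDEN_SUITE):
        """`generate` is whatever calls the model; the result dict gates a release."""
        return {c["name"]: evaluate_case(c, generate(c["prompt"])) for c in suite}

The appeal of regex checks is that they're cheap and deterministic, so they can gate every release without extra model calls, while the more nuanced judging stays with human review.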


What counts as a false positive here? Is it when the agent falsely passes, or when it falsely "finds a bug"? And regardless of which: why don't you include the other as a key metric?

I'm not aware of any existing evals or shared metrics, but measuring a testing agent's performance seems pretty important.

What is your tool’s FPR on your golden suite?


Great question! Yes, GPT Driver runs according to the test prompt each time, which makes it resilient to small changes. To speed up execution, we also use a caching mechanism that runs quickly if nothing has changed, and only uses the models when needed.
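Roughly, you can think of the cache like this (an illustrative sketch, not our actual implementation): key on the test prompt plus a fingerprint of the current screen, replay cached actions on a hit, and only call the model on a miss.

    import hashlib

    # Hypothetical cache keyed on the test prompt plus a fingerprint of the
    # current screen state; the model is consulted only on a cache miss.
    _action_cache = {}

    def _key(prompt, screen_fingerprint):
        return hashlib.sha256(f"{prompt}|{screen_fingerprint}".encode()).hexdigest()

    def next_actions(prompt, screen_fingerprint, call_model):
        """Replay cached actions when nothing has changed; otherwise ask the model."""
        key = _key(prompt, screen_fingerprint)
        if key not in _action_cache:
            _action_cache[key] = call_model(prompt, screen_fingerprint)
        return _action_cache[key]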


From what we understand, the term "GPT" was deemed too general for OpenAI to claim as its own.

https://www.theverge.com/2024/2/16/24075304/trademark-pto-op...


Thank you.


One of the Christians here :) Christian would suggest itself, I guess?

