
Thanks, fixed


Thank you for your question! We haven't published a formal evaluation yet, but it's something we're working toward. For now we rely mostly on human review to monitor and assess LLM outputs, and we maintain a golden test suite with regex-based checks that runs against every release to keep quality consistent over time.

Our key metrics include the time and cost per agentic loop, as well as the false positive rate for a full end-to-end test. If you have any specific benchmarks or evaluation metrics you'd suggest, we'd be happy to hear them!
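For anyone curious what a regex-based golden check can look like, here is a rough sketch (all names, prompts, and patterns are hypothetical, not our actual suite):

    import re

    # Hypothetical golden case: a prompt paired with regex patterns the LLM
    # output must (or must not) match. Everything here is illustrative only.
    GOLDEN_SUITE = [
        {
            "name": "login_flow_summary",
            "prompt": "Summarize the steps to log in.",
            "must_match": [r"(?i)\b(email|username)\b", r"(?i)\bpassword\b"],
            "must_not_match": [r"(?i)\b(error|exception)\b"],
        },
    ]

    def evaluate_case(case, output: str) -> bool:
        """True if the output matches every required pattern and no forbidden one."""
        required = all(re.search(p, output) for p in case["must_match"])
        forbidden = any(re.search(p, output) for p in case["must_not_match"])
        return required and not forbidden

    def run_suite(generate, suite=GOLDEN_SUITE):
        """`generate` is whatever calls the model; the result dict gates a release."""
        return {c["name"]: evaluate_case(c, generate(c["prompt"])) for c in suite}

The appeal of regex checks is that they're cheap and deterministic, so they can gate every release without extra model calls, while the more nuanced judging stays with human review.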


What counts as a false positive here? Is it when the agent falsely passes, or when it falsely "finds a bug"? And regardless of which: why don't you include the other as a key metric?

I'm not aware of any existing evals or shared metrics, but measuring a testing agent's performance seems pretty important.

What is your tool’s FPR on your golden suite?


Great question! Yes, GPT Driver runs according to the test prompt each time, which makes it resilient to small changes. To speed up execution, we also use a caching mechanism that runs quickly if nothing has changed, and only uses the models when needed.
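Roughly, you can think of the cache like this (an illustrative sketch, not our actual implementation): key on the test prompt plus a fingerprint of the current screen, replay cached actions on a hit, and only call the model on a miss.

    import hashlib

    # Hypothetical cache keyed on the test prompt plus a fingerprint of the
    # current screen state; the model is consulted only on a cache miss.
    _action_cache = {}

    def _key(prompt, screen_fingerprint):
        return hashlib.sha256(f"{prompt}|{screen_fingerprint}".encode()).hexdigest()

    def next_actions(prompt, screen_fingerprint, call_model):
        """Replay cached actions when nothing has changed; otherwise ask the model."""
        key = _key(prompt, screen_fingerprint)
        if key not in _action_cache:
            _action_cache[key] = call_model(prompt, screen_fingerprint)
        return _action_cache[key]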


From what we understand, the term "GPT" was deemed too general for OpenAI to claim as its own.

https://www.theverge.com/2024/2/16/24075304/trademark-pto-op...


Thank you.


One of the Christians here :) Christian would suggest itself, I guess?

