November 20, 2025 · Technical
Designing an eval harness that doesn't lie to you
Most LLM evaluations give you a false sense of confidence. Here's how we built one that surfaces real decisions.
Phil Glazer · Founder
Decision time: Minutes
The problem with most LLM evaluation frameworks is that they optimize for the wrong thing: aggregate scores that hide the failures that matter.
What's Wrong with Benchmarks
A model scoring 85% on your eval suite tells you almost nothing about whether it will work in production. What you actually need to know (sketched in code after this list):
- Where does it fail catastrophically?
- How does it handle edge cases specific to your domain?
- Is it getting better at the things that matter to your users?
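One way to make those questions answerable is to stop averaging and start slicing. Here's a minimal sketch in Python; the names (`CaseResult`, `summarize`, the slice tags) are hypothetical, not our harness's actual API. The idea is that every case carries a domain slice and a severity flag, so catastrophic failures and weak slices stay visible instead of washing out into one number:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CaseResult:
    """One eval case, tagged so failures can be grouped, not averaged away."""
    tag: str                    # domain slice, e.g. "edge:empty_input"
    passed: bool
    catastrophic: bool = False  # e.g. fabricated a citation, leaked PII

def summarize(results: list[CaseResult]) -> dict[str, dict]:
    """Per-slice pass rates and catastrophic counts; never one aggregate score."""
    by_tag: dict[str, dict] = defaultdict(
        lambda: {"total": 0, "passed": 0, "catastrophic": 0}
    )
    for r in results:
        s = by_tag[r.tag]
        s["total"] += 1
        s["passed"] += int(r.passed)
        s["catastrophic"] += int(r.catastrophic)
    return dict(by_tag)
```

An 85% aggregate and a report that says "edge:empty_input fails 9 times out of 10, twice catastrophically" come from the same raw results; only the second one tells you anything about production.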
Decision-Oriented Evaluation
We flipped the framing. Instead of "how accurate is this model?", we ask "what decision can I make based on this eval run?"
Each eval run surfaces a decision: ship this model, investigate these failure modes, or collect more training data in these areas.
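To make that concrete, here's a hedged sketch of a decision rule over the per-slice summary above. The thresholds (20 cases per slice, a 95% bar) and the exact wording of the outcomes are illustrative, not our production values:

```python
def decide(summary: dict[str, dict]) -> str:
    """Map an eval run to one decision: ship, investigate, or collect data."""
    # Any catastrophic failure blocks a ship decision outright.
    bad = [name for name, s in summary.items() if s["catastrophic"] > 0]
    if bad:
        return f"investigate catastrophic failures in: {', '.join(bad)}"
    # Slices with too few cases can't support a ship decision.
    thin = [name for name, s in summary.items() if s["total"] < 20]
    if thin:
        return f"collect more eval data for: {', '.join(thin)}"
    # Ship only if every slice clears the bar; otherwise name the weak ones.
    weak = [name for name, s in summary.items() if s["passed"] / s["total"] < 0.95]
    if weak:
        return f"investigate failure modes in: {', '.join(weak)}"
    return "ship"
```

The point is that the return value is an action, not a number: a dashboard built on something like `decide()` tells the team what to do next, rather than how to feel about 85%.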
Want to see this in action?
We build systems like this for clients. Let's talk.