November 20, 2025

Technical

Designing an eval harness that doesn't lie to you

Most LLM evaluations give you a false sense of confidence. Here's how we built one that surfaces real decisions.

Phil Glazer, Founder
6 min read
Decision time: Minutes

The problem with most LLM evaluation frameworks is that they optimize for the wrong thing: aggregate scores that hide the failures that matter.

What's Wrong with Benchmarks

A model scoring 85% on your eval suite tells you almost nothing about whether it will work in production. What you need to know is:

  • Where does it fail catastrophically?
  • How does it handle edge cases specific to your domain?
  • Is it getting better at the things that matter to your users?
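One way to make those questions first-class is to record a severity and domain tags on every case, so failures can be sliced instead of averaged away. Here's a minimal sketch in Python; the names (Severity, CaseResult, summarize) and the result shape are illustrative, not our production schema:

    from dataclasses import dataclass
    from enum import Enum

    class Severity(Enum):
        # How bad a failure is, not just whether one occurred.
        PASS = "pass"
        MINOR = "minor"                # wrong but harmless, e.g. formatting
        CATASTROPHIC = "catastrophic"  # would hurt a user if shipped

    @dataclass
    class CaseResult:
        # One eval case, tagged so failures can be sliced by what matters.
        case_id: str
        domain_tags: list[str]         # e.g. ["billing", "edge:empty-input"]
        severity: Severity

    def summarize(results: list[CaseResult]) -> dict:
        # The aggregate score, plus the breakdowns that actually inform a decision.
        total = len(results)
        failures = [r for r in results if r.severity is not Severity.PASS]
        catastrophic = [r for r in failures if r.severity is Severity.CATASTROPHIC]
        by_tag: dict[str, int] = {}
        for r in failures:
            for tag in r.domain_tags:
                by_tag[tag] = by_tag.get(tag, 0) + 1
        return {
            "accuracy": (total - len(failures)) / total if total else 0.0,
            "catastrophic_cases": [r.case_id for r in catastrophic],
            "failures_by_tag": by_tag,
        }

The point of the shape: the single accuracy number is still there, but it can never hide a catastrophic case or a cluster of failures in one domain.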

Decision-Oriented Evaluation

We flipped the framing. Instead of "how accurate is this model?", we ask "what decision can I make based on this eval run?"

Each eval run surfaces a decision: ship this model, investigate these failure modes, or collect more training data in these areas.
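To make that concrete, here is a sketch of the mapping from a run summary (the dict shape from summarize() above) to one of those three decisions. The threshold and names are placeholders for illustration, not our actual gates:

    from enum import Enum

    class Decision(Enum):
        SHIP = "ship this model"
        INVESTIGATE = "investigate failure modes"
        COLLECT_DATA = "collect more training data"

    def decide(summary: dict, accuracy_floor: float = 0.95) -> tuple[Decision, list[str]]:
        # Map one eval-run summary to one explicit decision,
        # plus the evidence behind it.
        if summary["catastrophic_cases"]:
            # Any catastrophic failure blocks a ship, whatever the aggregate score.
            return Decision.INVESTIGATE, summary["catastrophic_cases"]
        if summary["accuracy"] < accuracy_floor:
            # Failures concentrated in specific tags point at a data gap,
            # not a modeling problem: collect more examples there.
            weak = sorted(summary["failures_by_tag"],
                          key=summary["failures_by_tag"].get, reverse=True)
            return Decision.COLLECT_DATA, weak
        return Decision.SHIP, []

Notice that the function returns evidence alongside the decision: the case IDs or tags that drove it. An eval run that says "investigate" without saying where is just an aggregate score with extra steps.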

Want to see this in action?

We build systems like this for clients. Let's talk.