Benchmarks

Workflow evals for systems, not leaderboard theater.

We care less about whether a model can solve a clean puzzle and more about whether a system can complete a messy workflow repeatedly, cheaply, and with enough evidence to be trusted.

Collecting cases

Coding-agent efficiency

Tasks for measuring whether context, memory, prompt reuse, tool design, and model routing reduce cost without hurting engineering output.

Spec in progress

Enterprise data navigation

Tasks for finding, transforming, validating, and explaining information across spreadsheets, exports, finance models, and legacy systems.

Collecting cases

Cost-adjusted routing

Comparing frontier models, small models, retrieval, caching, and deterministic code on quality, latency, and spend.
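
One way to make this comparison concrete is to keep only the approaches that no other approach beats on quality, latency, and spend at once. The sketch below is a minimal illustration under assumptions, not a published spec: the `Candidate` fields, the `pareto_front` helper, and every number in the sample data are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One approach measured on the same task set (all fields hypothetical)."""
    name: str
    quality: float    # higher is better, e.g. fraction of tasks passed
    latency_s: float  # lower is better
    cost_usd: float   # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least_as_good = (
        a.quality >= b.quality
        and a.latency_s <= b.latency_s
        and a.cost_usd <= b.cost_usd
    )
    strictly_better = (
        a.quality > b.quality
        or a.latency_s < b.latency_s
        or a.cost_usd < b.cost_usd
    )
    return at_least_as_good and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]


# Placeholder numbers for illustration only; these are not measured results.
sample = [
    Candidate("frontier-model", quality=0.9, latency_s=8.0, cost_usd=0.5),
    Candidate("small-model", quality=0.7, latency_s=2.0, cost_usd=0.05),
    Candidate("small-model+retrieval", quality=0.8, latency_s=3.0, cost_usd=0.08),
    Candidate("slow-small-model", quality=0.7, latency_s=4.0, cost_usd=0.10),
]

front_names = [c.name for c in pareto_front(sample)]
```

Reporting the whole front rather than a single winner leaves the quality-versus-spend trade-off visible to whoever has to choose.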

Design notes

Workflow completion

Evaluating whether a system completes an end-to-end business workflow with enough auditability to be trusted.

Methodology

How we want evals to behave.

  • Task definitions should be public enough to criticize.
  • Scores should include quality, latency, cost, and verification burden.
  • Failure modes should be documented, not hidden behind aggregate accuracy.
  • Synthetic tasks are acceptable only when they preserve the structure of real work.
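
As a rough illustration of the reporting principles above, a per-run record could carry all four dimensions plus a failure label, so that a summary never collapses to one number. This is a sketch under assumptions: the `RunResult` fields, the use of reviewer minutes as a proxy for verification burden, and the `summarize` helper are all hypothetical names chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunResult:
    """Outcome of one system run on one task (all fields are hypothetical)."""
    task_id: str
    passed: bool
    latency_s: float
    cost_usd: float
    review_minutes: float  # assumed proxy for verification burden
    failure_mode: Optional[str] = None  # short label, set only when passed is False


def summarize(runs: list[RunResult]) -> dict:
    """Report every dimension side by side, with failures broken out by mode."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    failures = Counter(
        r.failure_mode or "unlabeled" for r in runs if not r.passed
    )
    return {
        "pass_rate": sum(r.passed for r in runs) / n,
        "mean_latency_s": sum(r.latency_s for r in runs) / n,
        "mean_cost_usd": sum(r.cost_usd for r in runs) / n,
        "mean_review_minutes": sum(r.review_minutes for r in runs) / n,
        "failure_modes": dict(failures),
    }
```

Keeping `failure_modes` as its own key means a reader sees which failures occurred and how often, not just how many runs passed.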

Task submissions

Suggest a benchmark task.

We are especially interested in workflows where correctness is obvious after the fact but hard for a generic model to infer up front.

Submit a benchmark task