Enterprise data navigation
Tasks for finding, transforming, validating, and explaining information across spreadsheets, exports, finance models, and legacy systems.
Benchmarks
We care less about whether a model can solve a clean puzzle and more about whether a system can complete a messy workflow repeatedly, cheaply, and with enough evidence to be trusted.
Tasks for measuring whether context, memory, prompt reuse, tool design, and model routing reduce cost without hurting engineering output.
Comparing frontier models, small models, retrieval, caching, and deterministic code on quality, latency, and spend.
Evaluating whether a system completes an end-to-end business workflow and leaves enough of an audit trail for the result to be trusted.
Methodology
Task submissions
We are especially interested in workflows where correctness is obvious after the fact but hard for a generic model to infer up front.
Submit a benchmark task