
Benchmark task

Submit a benchmark task.

Send a real task where coding agents are expensive, unreliable, or hard to evaluate. The task may shape future benchmark specs and tools.

A good benchmark task describes

  • The exact job the agent is supposed to complete.
  • What information the agent needs to inspect.
  • How success can be checked after the fact.
  • Where cost, context, tools, or reliability make the task hard.
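One way to check a submission against the four points above is to write it down as a small record before sending it. The sketch below is only an illustration: the `BenchmarkTask` class, its field names, and the `missing_fields` helper are hypothetical and are not an official Vorp Labs schema or submission format.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkTask:
    """Hypothetical record mirroring the four points in the list above."""

    job: str                 # the exact job the agent is supposed to complete
    inputs: list[str]        # information the agent needs to inspect
    success_check: str       # how success can be checked after the fact
    difficulty: list[str] = field(default_factory=list)  # cost, context, tools, reliability

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are empty or whitespace-only."""
        missing = []
        if not self.job.strip():
            missing.append("job")
        if not self.inputs:
            missing.append("inputs")
        if not self.success_check.strip():
            missing.append("success_check")
        if not self.difficulty:
            missing.append("difficulty")
        return missing


# Example values are invented for illustration only.
task = BenchmarkTask(
    job="Upgrade a service from one major framework version to the next",
    inputs=["repository source", "CI configuration", "migration guide"],
    success_check="Existing test suite passes and the service starts cleanly",
    difficulty=["large context", "long-running tool calls"],
)
print(task.missing_fields())  # prints []
```

An empty result means all four points are covered; any name in the returned list marks a part of the description that still needs filling in.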

What happens next

  1. We triage for a concrete cost, reliability, eval, or workflow-architecture problem.
  2. If there is a fit, we ask for a small sample: traces, prompts, tool lists, repo instructions, workflow notes, or anonymized task examples.
  3. The first output is a scoped path: what to inspect, what to measure, and where savings or leverage are most likely.

Direct email

For lightweight notes, use research@vorplabs.com.

Start a Diagnostic | Vorp Labs