
Benchmark stub

Coding-agent token efficiency should be measured as workflow quality per dollar.

This benchmark track tests whether context architecture, memory, prompt reuse, MCP design, and model routing actually reduce spend while preserving the outputs engineering teams care about.
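One way to make "workflow quality per dollar" concrete is sketched below. It is an illustration only, not the benchmark's scoring method: the `RunResult` fields, the 0-to-1 quality score, and the per-million-token prices are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One agent run. All field names here are hypothetical."""
    quality: float       # assumed rubric score in [0, 1]
    input_tokens: int
    output_tokens: int


def run_cost_usd(run: RunResult, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Token spend in dollars, given prices per million tokens."""
    return (run.input_tokens * usd_per_m_input
            + run.output_tokens * usd_per_m_output) / 1_000_000


def quality_per_dollar(run: RunResult, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Quality score divided by dollar cost; higher is better."""
    cost = run_cost_usd(run, usd_per_m_input, usd_per_m_output)
    if cost <= 0:
        raise ValueError("cost must be positive to compute quality per dollar")
    return run.quality / cost


# Placeholder prices, not real vendor pricing.
PRICE_IN, PRICE_OUT = 3.0, 15.0

# Same quality, but the second run sends a quarter of the input context.
long_briefing = RunResult(quality=0.9, input_tokens=400_000, output_tokens=20_000)
lean_context = RunResult(quality=0.9, input_tokens=100_000, output_tokens=20_000)

long_score = quality_per_dollar(long_briefing, PRICE_IN, PRICE_OUT)
lean_score = quality_per_dollar(lean_context, PRICE_IN, PRICE_OUT)
```

Reporting the quality score and the cost alongside the ratio keeps a cheap but broken run from looking like a win.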

Dimensions

Cost is not meaningful unless quality stays visible.

Context reuse

How often the agent can retrieve stable project facts, prior decisions, and debugging lessons instead of rediscovering them.
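A reuse rate could be computed from a per-fact log, as in this minimal sketch. The event labels `"retrieved"` and `"rediscovered"` are hypothetical, and it assumes each needed fact is logged exactly once per run.

```python
from collections import Counter


def context_reuse_rate(events: list[str]) -> float:
    """Share of needed facts that came from memory rather than rediscovery.

    `events` holds one label per fact the agent needed:
    "retrieved" (pulled from stored context) or "rediscovered"
    (re-derived by reading code, rerunning commands, and so on).
    """
    counts = Counter(events)
    total = counts["retrieved"] + counts["rediscovered"]
    if total == 0:
        raise ValueError("no retrieval or rediscovery events logged")
    return counts["retrieved"] / total


# Three facts came from memory; one had to be rediscovered.
example_rate = context_reuse_rate(
    ["retrieved", "retrieved", "rediscovered", "retrieved"]
)
```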

Task completion

Whether the workflow still lands correct code, tests, summaries, and handoffs after reducing token-heavy context.

Tool overhead

How much context is consumed by MCP tool definitions, schemas, and responses before useful work begins.
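The sketch below estimates how much of a context window tool definitions occupy before the first task token arrives. It relies on two assumptions: a crude four-characters-per-token heuristic, which a real tokenizer should replace, and made-up tool definitions that only mimic the general shape of a tool schema.

```python
import json
import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (an assumption)."""
    return math.ceil(len(text) / 4)


def tool_definition_tokens(tools: list[dict]) -> int:
    """Estimated tokens consumed by serialized tool definitions."""
    return sum(estimate_tokens(json.dumps(tool, sort_keys=True)) for tool in tools)


def overhead_share(tools: list[dict], context_window: int) -> float:
    """Fraction of the context window taken by tool definitions alone."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return tool_definition_tokens(tools) / context_window


# Hypothetical tool definitions, for illustration only.
example_tools = [
    {
        "name": "search_issues",
        "description": "Search the issue tracker by keyword.",
        "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
    {
        "name": "run_tests",
        "description": "Run the project test suite and return failures.",
        "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
    },
]

share = overhead_share(example_tools, context_window=200_000)
```

Running the same measurement with only the tools relevant to a task shows how much a trimmed tool set would save.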

Routing quality

Whether routine substeps can move to retrieval, deterministic commands, cheaper models, or open-source models without quality loss.
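"Without quality loss" needs a concrete acceptance rule. The following is a minimal sketch of one, assuming per-task pass/fail results for the same task set under both configurations; the function name, the returned keys, and the zero-drop default tolerance are illustrative choices, not part of the benchmark spec.

```python
def routing_verdict(
    baseline_passes: list[bool],
    routed_passes: list[bool],
    baseline_cost_usd: float,
    routed_cost_usd: float,
    max_quality_drop: float = 0.0,
) -> dict:
    """Compare a routed configuration against the baseline on the same tasks."""
    if not baseline_passes or len(baseline_passes) != len(routed_passes):
        raise ValueError("both runs must cover the same non-empty task set")

    baseline_rate = sum(baseline_passes) / len(baseline_passes)
    routed_rate = sum(routed_passes) / len(routed_passes)
    quality_drop = baseline_rate - routed_rate

    return {
        "quality_drop": quality_drop,
        "cost_saved_usd": baseline_cost_usd - routed_cost_usd,
        # Accept only if routing is cheaper and quality stays within tolerance.
        "accept": quality_drop <= max_quality_drop and routed_cost_usd < baseline_cost_usd,
    }


# Same pass pattern at a quarter of the cost: accepted.
kept_quality = routing_verdict([True, True, True, False], [True, True, True, False], 2.0, 0.5)

# Cheaper, but one more task fails: rejected under the default tolerance.
lost_quality = routing_verdict([True, True, True, False], [True, True, False, False], 2.0, 0.5)
```

Pass rates hide which tasks regressed, so a fuller check would also compare results task by task.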

Candidate tasks

Early tasks are intentionally practical.

  • Fix a previously-seen bug after retrieving the prior lesson.
  • Modify a repo with concise project instructions versus a long pasted briefing.
  • Choose between MCP tools when only a subset is relevant to the task.
  • Route a simple refactor away from a frontier model while preserving test quality.

Status

Collecting real workflows before publishing scores.

The benchmark spec will be useful only if it reflects real engineering-agent work. Submissions that involve repeated context, expensive prompts, MCP overhead, or model-routing questions are especially welcome.

Submit a benchmark task