Benchmark stub
This benchmark track tests whether context architecture, memory, prompt reuse, MCP design, and model routing actually reduce spend while preserving the outputs engineering teams care about.
Dimensions
How often the agent can retrieve stable project facts, prior decisions, and debugging lessons instead of rediscovering them.
Whether the workflow still lands correct code, tests, summaries, and handoffs after reducing token-heavy context.
How much context is consumed by MCP tool definitions, schemas, and responses before useful work begins.
Whether routine substeps can move to retrieval, deterministic commands, cheaper models, or open-source models without quality loss.
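One way to make the four dimensions above comparable across runs is to record a few counts per run and reduce each dimension to a ratio. The sketch below is illustrative only: the record fields, the metric names (context_reuse, output_quality, mcp_overhead, routing_share), and the choice to measure MCP overhead as a share of input tokens are all assumptions, not part of any published spec.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent run on one benchmark task. All field names are hypothetical."""
    facts_retrieved: int      # stable facts pulled from memory or cache
    facts_rediscovered: int   # facts the agent had to work out again
    checks_passed: int        # output checks (tests, review items) that passed
    checks_total: int
    mcp_tokens: int           # tokens spent on MCP tool definitions, schemas, responses
    total_input_tokens: int
    substeps_routed: int      # substeps moved to retrieval, commands, or cheaper models
    substeps_total: int


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def score(run: RunRecord) -> dict[str, float]:
    """Reduce a run to one ratio per dimension (assumed definitions)."""
    return {
        "context_reuse": ratio(
            run.facts_retrieved, run.facts_retrieved + run.facts_rediscovered
        ),
        "output_quality": ratio(run.checks_passed, run.checks_total),
        "mcp_overhead": ratio(run.mcp_tokens, run.total_input_tokens),
        "routing_share": ratio(run.substeps_routed, run.substeps_total),
    }


# Made-up numbers, for illustration only.
example = RunRecord(
    facts_retrieved=6, facts_rediscovered=2,
    checks_passed=9, checks_total=10,
    mcp_tokens=3000, total_input_tokens=12000,
    substeps_routed=4, substeps_total=8,
)
scores = score(example)
```

With these made-up counts, the run reuses 75% of the facts it needs, passes 90% of its output checks, spends 25% of its input tokens on MCP material, and routes half of its substeps away from the primary model. Ratios like these only mean something when compared against a baseline run of the same task.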
Candidate tasks
Status
The benchmark spec will be useful only if it reflects real engineering-agent work. Submissions involving repeated context, expensive prompts, MCP overhead, or model-routing questions are especially welcome.
Submit a benchmark task