Scoring Methodology v0.2.1 | SRE-Bench


SRE Agent Evaluation Methodology v0.2.1

Effective March 15, 2026

A multi-pass, two-council scoring methodology for evaluating the quality of SRE agent incident response. Three deterministic dimensions (each scored 0.0-1.0) are computed algorithmically. Five AI-judged dimensions (each scored 0-10) are scored by a multi-pass judge council via OpenRouter, with cost-controlled escalation.

Deterministic Metrics

Root Cause Identified (10%)

Did the agent correctly identify the injected fault category and target service? Partial credit for category-only or service-only match.
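The partial-credit rule can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the function name `root_cause_score` and the `partial_credit` value of 0.5 are assumptions, since the methodology does not state how much credit a category-only or service-only match receives.

```python
def root_cause_score(pred_category, pred_service,
                     true_category, true_service,
                     partial_credit=0.5):
    """Score root-cause identification on a 0.0-1.0 scale.

    partial_credit is a hypothetical value; the methodology only says
    that category-only or service-only matches earn partial credit.
    """
    category_ok = pred_category == true_category
    service_ok = pred_service == true_service
    if category_ok and service_ok:
        return 1.0             # both fault category and target service match
    if category_ok or service_ok:
        return partial_credit  # exactly one of the two matches
    return 0.0                 # neither matches


# Example: right fault category, wrong service
print(root_cause_score("network_latency", "cart",
                       "network_latency", "checkout"))  # 0.5
```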

Affected Services (5%)

How accurately did the agent identify all impacted services? Scored via Jaccard similarity.
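Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the function name `jaccard` and the choice to return 1.0 when both sets are empty are assumptions, as the methodology does not describe that edge case.

```python
def jaccard(predicted, actual):
    """Jaccard similarity |A & B| / |A | B| between two collections of service names."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    if not union:
        # Assumption: nothing was impacted and nothing was reported,
        # so treat this as perfect agreement.
        return 1.0
    return len(predicted & actual) / len(union)


# Example: two of four distinct services are shared
print(jaccard({"checkout", "cart", "payments"},
              {"checkout", "payments", "inventory"}))  # 0.5
```

Because the union sits in the denominator, the score penalizes both missed services and services reported in error.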

Signal Coverage (5%)

What fraction of expected metrics, traces, and logs did the agent reference?
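One way to read this is as recall over the set of expected signals. The sketch below assumes that signals are compared by exact identifier and that references to signals outside the expected set are ignored. The function name `signal_coverage` and the signal identifiers are hypothetical.

```python
def signal_coverage(expected, referenced):
    """Fraction of expected signals (metrics, traces, logs) the agent referenced."""
    expected = set(expected)
    if not expected:
        # Assumption: with no expected signals there is nothing to miss.
        return 1.0
    return len(expected & set(referenced)) / len(expected)


# Example: the agent referenced three of four expected signals, plus one extra
expected = {"metric:p99_latency", "metric:error_rate",
            "trace:checkout", "log:timeout"}
referenced = {"metric:p99_latency", "trace:checkout",
              "log:timeout", "metric:cpu"}
print(signal_coverage(expected, referenced))  # 0.75
```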

AI-Judged Metrics

Final Answer (25%)

How correct and actionable is the agent's final diagnosis?

Reasoning Quality (20%)

How logical and structured is the agent's investigation process?

Tool Use Quality (15%)

Did the agent select the right tools and interpret their output correctly?

Trace Groundedness (10%)

Total weight: 100% across 8 dimensions. Results scored under older methodology versions retain their original scores.
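This section does not state how the 0.0-1.0 and 0-10 scales are combined. One plausible reading, assumed here, divides each AI-judged score by 10 and takes a weighted sum. The sketch covers only the seven dimensions listed in this section, whose weights sum to 90%, and omits the remaining dimension because it is not described here. The dictionary keys and the function name `weighted_total` are hypothetical.

```python
# Weights as listed in this section
DETERMINISTIC_WEIGHTS = {
    "root_cause_identified": 0.10,
    "affected_services": 0.05,
    "signal_coverage": 0.05,
}
JUDGED_WEIGHTS = {
    "final_answer": 0.25,
    "reasoning_quality": 0.20,
    "tool_use_quality": 0.15,
    "trace_groundedness": 0.10,
}


def weighted_total(deterministic, judged):
    """Weighted sum over the listed dimensions.

    deterministic: dimension name -> score in [0.0, 1.0]
    judged:        dimension name -> score in [0, 10]
    Assumption: judged scores are rescaled to [0, 1] by dividing by 10.
    """
    total = sum(w * deterministic[name]
                for name, w in DETERMINISTIC_WEIGHTS.items())
    total += sum(w * judged[name] / 10.0
                 for name, w in JUDGED_WEIGHTS.items())
    return total


score = weighted_total(
    {"root_cause_identified": 1.0, "affected_services": 0.5, "signal_coverage": 0.75},
    {"final_answer": 8, "reasoning_quality": 7, "tool_use_quality": 9, "trace_groundedness": 6},
)
print(round(score, 4))
```

Under these assumptions a perfect run scores 0.90 on the listed dimensions, with the remaining 0.10 coming from the dimension not shown.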