How SRE-Bench evaluates AI agent performance across eight independently weighted dimensions.
Effective April 12, 2026
Multi-pass, two-council scoring methodology for evaluating the incident-response quality of SRE agents. Three deterministic dimensions (0.0-1.0) are scored algorithmically; five AI-judged dimensions (0-10) are scored by a multi-pass judge council via OpenRouter with cost-controlled escalation.
Did the agent correctly identify the injected fault category and target service? Partial credit for category-only or service-only match.
How accurately did the agent identify all impacted services? Scored via Jaccard similarity.
What fraction of expected metrics, traces, and logs did the agent reference?
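The three deterministic dimensions above can be sketched in a few lines of Python. This is only an illustration, not the benchmark's actual scorer: the function names, the `partial` credit value of 0.5, and the handling of empty sets are all assumptions, since the methodology text does not specify them.

```python
from typing import Iterable


def root_cause_score(pred_category: str, pred_service: str,
                     true_category: str, true_service: str,
                     partial: float = 0.5) -> float:
    """Full credit when both fields match, partial credit when exactly one does.

    The value of `partial` is a placeholder; the source only says that
    category-only and service-only matches earn partial credit.
    """
    category_ok = pred_category == true_category
    service_ok = pred_service == true_service
    if category_ok and service_ok:
        return 1.0
    if category_ok or service_ok:
        return partial
    return 0.0


def jaccard(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets of service names."""
    a, b = set(predicted), set(expected)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(a & b) / len(a | b)


def coverage(referenced: Iterable[str], expected: Iterable[str]) -> float:
    """Fraction of expected signals (metrics, traces, logs) the agent referenced."""
    exp = set(expected)
    if not exp:
        return 1.0  # assumption: nothing expected means nothing was missed
    return len(exp & set(referenced)) / len(exp)
```

For example, an agent that names `cart` and `checkout` as impacted when the ground truth is `checkout` and `payments` shares one service out of three distinct ones, so `jaccard` gives 1/3. Note that `coverage` ignores extra signals the agent cites, while `jaccard` penalizes extra services.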
How correct and actionable is the agent's final diagnosis?
How logical and structured is the agent's investigation process?
Did the agent select the right tools and interpret their output correctly?
Total weight: 100% across 8 dimensions. Results scored under older methodology versions retain their original scores.
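As a rough sketch of how independently weighted dimensions on two different scales might be combined, the snippet below rescales the 0-10 judged scores to 0.0-1.0 and takes a weighted sum. The dimension keys, the equal placeholder weights, and the divide-by-ten normalization are assumptions made for illustration; the actual per-dimension weights are not given in this text.

```python
import math

# Hypothetical keys for the three deterministic (0.0-1.0) dimensions.
DETERMINISTIC = {"root_cause", "blast_radius", "signal_coverage"}

# Hypothetical keys for the five AI-judged (0-10) dimensions.
JUDGED = {"diagnosis", "reasoning", "tool_use", "grounding", "safety"}

# Placeholder weights: equal shares summing to 100%. Not the real weights.
EQUAL_WEIGHTS = {name: 0.125 for name in DETERMINISTIC | JUDGED}


def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum on a 0.0-1.0 scale, after rescaling judged scores from 0-10."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0 (100%)")
    total = 0.0
    for name, weight in weights.items():
        raw = scores[name]
        normalized = raw if name in DETERMINISTIC else raw / 10.0
        total += weight * normalized
    return total
```

Checking that the weights sum to 100% up front mirrors the invariant stated above and keeps a misconfigured weight table from silently inflating or deflating results.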
| Version | Name | Effective Date | Changelog |
|---|---|---|---|
| v0.2.2 | SRE Agent Evaluation Methodology v0.2.2 | 4/12/2026 | v0.2.2: Removed dead models (mistralai/mistral-7b-instruct:free and meta-llama/llama-3.1-8b-instruct:free, which return HTTP 404 because they have no endpoints on OpenRouter) from all primary and fallback positions. council_a and council_b are each reduced from 4 to 3 seats. google/gemini-2.5-flash-lite is now the direct fallback for all rate-limited free seats, eliminating dead-end retry chains. This reduces eval/judge-vote spans from ~17-20 per pass to ~3-5. |
| v0.2.1 | SRE Agent Evaluation Methodology v0.2.1 | 3/15/2026 | v0.2.1: Intermediate methodology between v0.2.0 and v0.2.2. Introduced the multi-pass judge council. Predecessor to v0.2.2, which removed dead OpenRouter models. |
| v0.2.0 | SRE-Bench Scoring v0.2.0 | 3/4/2026 | Unified methodology merging v0.1.0 and eval-CLI v1. Adds: mitigation_applied, explanation_quality, runbook_adherence, communication_clarity, training_recall_bias (re-added from v0.1.0 concept + new). Promotes ttd_seconds from standalone to weighted dim. 14 total dimensions. |
| v0.1.0 | SRE-Bench Scoring v0.1.0 | 2/1/2026 | Initial methodology. |
Are the agent's claims supported by trace and tool evidence?
Are the agent's recommendations operationally safe and non-destructive?