How SRE-Bench evaluates AI agent performance across 8 independently weighted dimensions.
Effective April 21, 2026
Replaced rate-limited OpenRouter free models with native Gemini API models (gemini-2.5-flash, gemini-1.5-flash). Same dimensions and weights as v0.2.2.
Total weight: 100% across 8 dimensions. Results scored under older methodology versions retain their original scores.
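
As a rough illustration of what a 100% total across 8 weighted dimensions implies, the sketch below checks that a weight table sums to 100 and combines per-dimension scores into a single total. The dimension names, the equal weights, and the plain weighted-sum aggregation are placeholders assumed for illustration; the actual weights and formula are not stated here.

```python
from math import isclose

# Hypothetical weights in percent. These are placeholders,
# not the real SRE-Bench dimension names or values.
EXAMPLE_WEIGHTS = {f"dimension_{i}": 12.5 for i in range(1, 9)}


def validate_weights(weights, expected_count=8):
    """Raise ValueError unless there are `expected_count` weights summing to 100."""
    if len(weights) != expected_count:
        raise ValueError(f"expected {expected_count} dimensions, got {len(weights)}")
    total = sum(weights.values())
    if not isclose(total, 100.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 100")


def weighted_total(scores, weights):
    """Combine per-dimension scores (each 0.0 to 1.0) into a 0-100 total.

    Assumes a plain weighted sum, which is an illustrative choice only.
    """
    validate_weights(weights)
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)
```

Validating the weight table before scoring keeps a mistyped or missing dimension from silently changing the scale of the total.
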
| Version | Name | Effective Date | Changelog |
|---|---|---|---|
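| v0.2.3 | SRE Agent Evaluation Methodology v0.2.3 | 4/21/2026 | Replaced rate-limited OpenRouter free models with native Gemini API models (gemini-2.5-flash, gemini-1.5-flash). Same dimensions and weights as v0.2.2. |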
| v0.2.2 | SRE Agent Evaluation Methodology v0.2.2 | 4/12/2026 | Removed dead models from all primary and fallback positions: mistralai/mistral-7b-instruct:free and meta-llama/llama-3.1-8b-instruct:free return HTTP 404 (no endpoints on OpenRouter). council_a and council_b each reduced from 4 to 3 seats. google/gemini-2.5-flash-lite is now the direct fallback for all rate-limited free seats, eliminating dead-end retry chains. This reduces eval/judge-vote spans from ~17-20 per pass to ~3-5. |
| v0.2.1 | SRE Agent Evaluation Methodology v0.2.1 | 3/15/2026 | Intermediate methodology between v0.2.0 and v0.2.2. Introduced the multi-pass judge council. Predecessor to v0.2.2, which removed dead OpenRouter models. |
| v0.2.0 | SRE-Bench Scoring v0.2.0 | 3/4/2026 | Unified methodology merging v0.1.0 and eval-CLI v1. Adds: mitigation_applied, explanation_quality, runbook_adherence, communication_clarity, training_recall_bias (re-added from v0.1.0 concept + new). Promotes ttd_seconds from standalone to a weighted dimension. 14 total dimensions. |
| v0.1.0 | SRE-Bench Scoring v0.1.0 | 2/1/2026 | Initial methodology. |
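
The direct-fallback behavior described in the v0.2.2 entry can be pictured with a minimal sketch. `RateLimitError`, `call_model`, and `call_with_direct_fallback` are hypothetical names introduced for illustration and do not belong to any real client library; only the fallback model ID is taken from the changelog.

```python
from typing import Callable, Tuple

# Fallback model ID as named in the v0.2.2 changelog entry.
FALLBACK_MODEL = "google/gemini-2.5-flash-lite"


class RateLimitError(Exception):
    """Hypothetical error raised when a seat's primary model is rate-limited."""


def call_with_direct_fallback(
    primary_model: str,
    prompt: str,
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try the primary model once; on a rate limit, go straight to the fallback.

    Returns (model_used, response). There is no chain of intermediate
    models to walk through, so at most two calls are made per vote.
    """
    try:
        return primary_model, call_model(primary_model, prompt)
    except RateLimitError:
        # Single hop to the fallback; an error here propagates to the caller.
        return FALLBACK_MODEL, call_model(FALLBACK_MODEL, prompt)
```

Capping each vote at two calls is one way the span count per pass could shrink, since a rate-limited seat no longer retries through models that have no endpoints.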