How SRE-Bench evaluates AI agent performance across 14 independently weighted dimensions.
Effective March 4, 2026
Unified methodology. 14 dimensions: 4 deterministic (25% of total weight) + 10 AI-judged (75%). Merges v0.1.0 and eval-CLI v1 into a single coherent scoring model.
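A minimal sketch of the two-tier aggregation, assuming the 25%/75% group weights are split evenly within each group (the source does not specify per-dimension weights). Dimension names are illustrative where they are not taken from the version changelog.

```python
def aggregate(deterministic: dict[str, float], ai_judged: dict[str, float]) -> float:
    """Average each group's scores (all in [0, 1]), then weight the groups 25/75."""
    det = sum(deterministic.values()) / len(deterministic)
    ai = sum(ai_judged.values()) / len(ai_judged)
    return 0.25 * det + 0.75 * ai

score = aggregate(
    {"fault_identification": 1.0, "impacted_services": 0.5,
     "telemetry_coverage": 0.8, "ttd_seconds": 0.9},
    {"diagnosis_quality": 0.7, "investigation_process": 0.8, "tool_use": 0.9,
     "evidence_grounding": 0.6, "mitigation_applied": 0.5,
     "explanation_quality": 0.7, "runbook_adherence": 0.8,
     "communication_clarity": 0.9, "safety": 1.0, "training_recall_bias": 0.6},
)
# score == 0.25 * 0.8 + 0.75 * 0.75 == 0.7625
```

Evenly splitting within groups is only one possibility; a per-dimension weight table would change `aggregate` to a weighted sum.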
Did the agent correctly identify the injected fault category and target service? Partial credit for category-only or service-only match.
How accurately did the agent identify all impacted services? Scored via Jaccard similarity.
What fraction of expected metrics, traces, and logs did the agent reference?
How quickly did the agent reach a correct diagnosis? Normalized: 0s = 1.0, 3600s = 0.0.
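Two of the deterministic dimensions above lend themselves to a mechanical sketch: Jaccard similarity over service sets, and the time-to-diagnosis normalization. Function names are hypothetical, and linear interpolation with clamping outside [0 s, 3600 s] is an assumption beyond the two endpoints the source gives.

```python
def jaccard(predicted: set[str], expected: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B| for the impacted-services dimension."""
    if not predicted and not expected:
        return 1.0  # both empty: perfect agreement by convention
    return len(predicted & expected) / len(predicted | expected)

def ttd_score(seconds: float) -> float:
    """Time-to-diagnosis normalization: 0 s -> 1.0, 3600 s -> 0.0, clamped."""
    return max(0.0, min(1.0, 1.0 - seconds / 3600.0))
```

For example, `jaccard({"checkout", "payments"}, {"payments", "cart"})` yields 1/3, and `ttd_score(900)` yields 0.75.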
How correct and actionable is the agent's final diagnosis?
How logical and structured is the agent's investigation process?
Did the agent select the right tools and interpret their output correctly?
Are the agent's claims supported by trace and tool evidence?
Did the agent apply or recommend a valid, specific remediation action?
How thorough and accurate is the agent's written explanation?
Does the agent follow SRE best practices and structured triage?
How clear and actionable is the agent's output for a human SRE?
Are the agent's recommendations operationally safe and non-destructive?
Did the agent rely on telemetry tools to reach its conclusion, or recall the answer from training data? Higher scores indicate tool-grounded reasoning.
Total weight: 100% across 14 dimensions. Results scored under older methodology versions retain their original scores.
| Version | Name | Effective Date | Changelog |
|---|---|---|---|
| v0.2.0 | SRE-Bench Scoring v0.2.0 | 3/4/2026 | Unified methodology merging v0.1.0 and eval-CLI v1. Adds mitigation_applied, explanation_quality, runbook_adherence, communication_clarity, and training_recall_bias (re-added from the v0.1.0 concept, plus new dimensions). Promotes ttd_seconds from a standalone metric to a weighted dimension. 14 dimensions in total. |
| v0.1.0 | SRE-Bench Scoring v0.1.0 | 2/1/2026 | Initial methodology. |