Effective March 15, 2026
Multi-pass two-council scoring methodology for evaluating SRE agent incident response quality. Three deterministic dimensions (0.0-1.0) scored algorithmically, five AI-judged dimensions (0-10) scored by a multi-pass judge council via OpenRouter with cost-controlled escalation.
Did the agent correctly identify the injected fault category and target service? Partial credit for category-only or service-only match.
How accurately did the agent identify all impacted services? Scored via Jaccard similarity.
What fraction of expected metrics, traces, and logs did the agent reference?
How correct and actionable is the agent's final diagnosis?
How logical and structured is the agent's investigation process?
Did the agent select the right tools and interpret their output correctly?
Total weight: 100% across 8 dimensions. Results scored under older methodology versions retain their original scores.
Are the agent's claims supported by trace and tool evidence?
Are the agent's recommendations operationally safe and non-destructive?