manage-slos
Assist with Harness Service Reliability Management (SRM) tasks that the MCP server currently supports: pulling recent deployments for incident correlation, generating on-call handover reports from deployment/execution history, and drafting operational runbooks. Use when asked about SLOs, error budgets, burn-rate alerts, on-call handover, runbooks, or incident triage. Do NOT use for DORA metrics (use dora-metrics instead) or chaos experiments (use chaos-experiment instead). Trigger phrases: SLO, SLI, error budget, burn rate, on-call handover, runbook, service reliability, incident response, on-call shift.
Skill body
Manage SLOs / SRM
Limitation: The MCP server does not currently expose
slo,slo_alert, ormonitored_serviceresource types. SLO definitions, burn-rate alerts, and monitored-service configuration must be created and edited via the Harness UI under Service Reliability Management. This skill covers the parts of the SRM workflow that are supported via MCP: deployment correlation, on-call handover reports, and operational runbooks.
What this skill can do via MCP
| Workflow | Supported today |
|---|---|
| Define an SLO or SLI | ❌ Use the Harness UI |
| Configure error-budget / burn-rate alerts | ❌ Use the Harness UI |
| Configure a monitored service | ❌ Use the Harness UI |
| Correlate deployments with an incident | ✅ via execution |
| Summarize recent releases for on-call handover | ✅ via execution, service, environment |
| Draft an operational runbook | ✅ (LLM-authored; pulls context from MCP) |
Instructions
Step 1: Establish Scope
Call MCP tool: harness_list
Parameters:
resource_type: "project"
org_id: "<organization>"
Step 2: Incident Triage — Correlate Deployments
When the user reports an active incident:
- Identify the affected service and environment.
- Pull recent executions that deployed the service.
Call MCP tool: harness_list
Parameters:
resource_type: "execution"
org_id: "<organization>"
project_id: "<project>"
# filter by service or environment as needed
- Correlate incident start time with deployment timestamps.
- Pull the failing execution’s details:
Call MCP tool: harness_get
Parameters:
resource_type: "execution"
resource_id: "<execution_id>"
org_id: "<organization>"
project_id: "<project>"
- Guide the user through structured RCA: blast radius, suspected root cause, mitigation steps, rollback candidate.
Step 3: On-Call Handover Report
Gather from the user: outgoing/incoming engineers, shift window, owned services.
Pull recent executions and services:
Call MCP tool: harness_list
Parameters:
resource_type: "execution"
org_id: "<organization>"
project_id: "<project>"
Call MCP tool: harness_list
Parameters:
resource_type: "service"
org_id: "<organization>"
project_id: "<project>"
Generate a structured handover covering: active/recent incidents the user describes, deployments during the shift, services with elevated failure counts, and items requiring attention.
For SLO burn-rate and error-budget status, direct the user to the SRM UI — the MCP server does not expose these metrics.
Step 4: Operational Runbook (LLM-Authored)
Gather from the user: service name, team, tech stack, dependencies, SLO targets, common failure modes.
Structure the runbook with:
- Service overview (purpose, owners, tech stack)
- Health checks (pointers to SRM monitored-service dashboard in the Harness UI)
- Common alerts with response procedures
- Escalation paths
- Rollback procedures (reference relevant pipelines via
harness_list resource_type: pipeline) - Dependency contacts
Defining SLOs (UI-Only Today)
When the user asks to define an SLO, burn-rate alert, or monitored service, respond:
- Gather requirements (service tier, health sources, SLO targets, rolling window, SLI type).
- Explain that SLO CRUD is not exposed via MCP today and link the user to the Harness SRM UI.
- Offer to draft the SLO spec (name, target %, SLI type, burn-rate thresholds) as text the user can paste into the UI.
- Suggested burn-rate alert windows: 14.4×/1h (page), 6×/6h (ticket), 1×/3d (log).
Examples
- “Our payment service is down, help me triage” — Pull recent
executions for the service, correlate with incident start, suggest rollback candidate. - “Generate an on-call handover report” — Pull executions and services during the shift, summarize with active issues.
- “Create a runbook for the auth-service” — Draft runbook using MCP to list pipelines/services/environments for accurate references.
- “Define SLOs for our payment-gateway service” — Draft the SLO spec as text; point to the Harness SRM UI for creation.
- “Configure burn-rate alerts” — Draft the alert config; point to the SRM UI.
Performance Notes
- When correlating incidents with deployments, pull a wide enough execution window (±30 min) to catch slow-burn failures.
- For handover reports, include both successful and failed executions — a streak of successes is useful context.
Troubleshooting
“SLO not found” or “Monitored service not found”
- These resources are not exposed by the MCP server today. Manage them in the Harness UI under Service Reliability Management.
Incident correlation missing executions
- Confirm
org_idandproject_idscope the service - Broaden the execution filter time window
- Check that the service’s pipelines actually ran (no deploy = no execution)