Agent Skill · SigNoz

signoz-investigating-alerts

Diagnose why a SigNoz alert fired by correlating the alert's own signal with neighbor signals (error rate, latency, throughput, CPU/memory), traces, and logs around the fire window — and rank likely causes. Make sure to use this skill whenever the user asks "why did this alert fire", "what caused alert X", "investigate this alert", "RCA for the alert that paged me", "what's wrong with [service]" in the context of a recent fire, or otherwise asks for a root-cause analysis of a firing or recently-fired alert. Read-only — does not modify any alert or notification.

Provider: SigNoz Path in repo: plugins/signoz/skills/signoz-investigating-alerts/SKILL.md

Skill body

Alert Investigate

Diagnose why a SigNoz alert fired. The skill correlates the alert’s own signal with neighbor signals around the fire window, and surfaces a ranked list of likely causes with supporting evidence. It is the companion to signoz-explaining-alerts — explain decodes the rule statically; investigate diagnoses a specific incident.

Prerequisites

This skill calls SigNoz MCP server tools heavily (signoz:signoz_get_alert, signoz:signoz_get_alert_history, signoz:signoz_execute_builder_query, signoz:signoz_query_metrics, signoz:signoz_search_traces, signoz:signoz_search_logs, signoz:signoz_get_trace_details, etc.). Before running the workflow, confirm the signoz:signoz_* tools are available. If they are not, the SigNoz MCP server is not installed or configured — run signoz-mcp-setup first to initialize or repair the MCP connection. The investigation depends on correlating multiple MCP queries; without the server there is no way to ground the analysis.

When to use

Use this skill when the user wants to:

Do NOT use when the user wants to:

Required inputs

Input Required Source if missing
Alert identifier (rule ID or name) yes $ARGUMENTS[0] or recent context
Time window no default to most recent fire from signoz:signoz_get_alert_history

If the alert name is fuzzy, this skill is best-effort (read-only):

  1. Call signoz:signoz_list_alert_rules, paginate, fuzzy-match the name.
  2. State the interpretation: “Investigating fire of ‘High Error Rate — Checkout’ (id 42) at 14:32 UTC. If you meant a different alert or fire, tell me.”
  3. Proceed.

If the alert has never fired in the lookback window, stop: there is nothing to investigate. Respond with:

“Alert ‘[name]’ has not fired in the last 7d, so there is no fire window to investigate. Use signoz-explaining-alerts to walk through the rule, or check whether the alert is enabled.”

Workflow

The investigation runs in three tiers with strict early-stop gates. Tier 1 always runs. Tier 2 runs only if tier 1 confirms a real fire. Tier 3 runs only if tier 2 surfaces correlated anomalies. Skipping the gates produces hundreds of unnecessary trace/log queries on quiet alerts.

Step 1: Resolve alert + fire window (Tier 0)

  1. Resolve the alert id via signoz:signoz_list_alert_rules (paginated) if not given.
  2. Call signoz:signoz_get_alert for the full rule config — needed to know what query, threshold, and resource scope the alert evaluated.
  3. Call signoz:signoz_get_alert_history with a 7d lookback. From the response:
    • Pick the fire window. Default to the most recent fire. If the user passed an explicit time window via $ARGUMENTS[1], honor it.
    • Note the fire pattern:
      • one-off → single fire with a long quiet period before/after.
      • sustained → fires that stayed firing for ≥ 1 evaluation cycle.
      • flapping → ≥ 3 fires within a 1h window, alternating fire/resolve.
      • recurring → fires at regular intervals (cron-like, e.g., every hour).
    • The pattern tells you what to expect from tiers 2/3.

Step 2: Tier 1 — what fired and how hard

This tier always runs. It establishes the fire is real (vs. transient threshold tickle or flap) and quantifies the magnitude.

  1. Re-run the alert’s primary query over a window centered on the fire start: [fire_start - 30m, fire_start + 30m].
    • Use signoz:signoz_execute_builder_query for builder/formula alerts.
    • Use signoz:signoz_query_metrics for PromQL alerts.
  2. Compute:
    • Peak value during the fire window.
    • Threshold breach magnitude: (peak - threshold) / threshold * 100 for “above” alerts, inverted for “below”.
    • Fire duration: how long the breach lasted.
    • Pre-fire baseline: average value in the 30m before fire start.
  3. Early-stop gate: if the breach magnitude is < 10% over the threshold AND the fire duration is < 1 evaluation window, classify as “marginal fire” — the alert may be too sensitive. Skip tiers 2 and 3 and go to Step 5 with a single hypothesis: “threshold may be too tight, recommend tuning.”

Step 3: Tier 2 — neighbor signals vs baseline

Run only if Tier 1 confirms a real breach. Pull related signals for the same resource scope as the alert and compare the fire window to a baseline window.

  1. Pick a baseline window. Use the same hour, previous day (fire_start - 24h, fire_start - 24h + fire_duration). If the alert fired during a known-anomalous time (deploy, weekly job), note it in the output but still proceed.

  2. Look up neighbor signals for the alert’s resource type. See references/neighbor-signals.md for the lookup table. Common cases:
    • Service-level alert (service.name = X): pull error rate, p95/p99 latency, request throughput, dependency error rates if trace data is available.
    • Host / VM alert (host.name = X): CPU, memory, disk I/O, network I/O.
    • K8s pod / namespace alert: pod restarts, container CPU/memory limits, node pressure, recent rollouts.
  3. For each neighbor signal:
    • Query both windows (fire + baseline) via signoz:signoz_execute_builder_query or signoz:signoz_query_metrics.
    • Compute the delta (% change in fire window vs baseline).
    • Rank by absolute delta.
  4. Early-stop gate: if no neighbor signal shows ≥ 25% deviation from baseline, classify as “isolated fire — the alert’s own signal moved but nothing else did.” This is unusual and worth surfacing. Skip Tier 3 and go to Step 5 with hypotheses focused on the alert’s own query (likely causes: data source change, instrumentation change, downstream silent failure that only shows in this metric).

Step 4: Tier 3 — traces and logs at the fire window

Run only if Tier 2 found correlated neighbor anomalies. Drill into specific failing operations.

  1. Traces (if the alert is service-scoped and traces are available):
    • Call signoz:signoz_search_traces for the fire window with filter: service.name = <scope> AND hasError = true. Cap at top 20.
    • Group results by name (operation) and error_message. Surface the top 3 by frequency with a representative trace ID for each.
    • Optionally call signoz:signoz_get_trace_details on one representative to extract specific span attributes (DB statement, downstream URL, status code).
  2. Logs for the fire window:
    • Call signoz:signoz_search_logs with filter: <scope_filter> AND severity_text IN ('ERROR', 'FATAL'). Cap at top 20 most recent.
    • Group by body pattern (or exception.type if present). Surface the top 3 distinct messages with counts.
  3. Cross-reference: do the traces and logs point at the same downstream service, dependency, or code path? If so, that becomes the leading hypothesis.

See references/baseline-comparison.md for query templates that pair fire-window and baseline-window calls cleanly.

Step 5: Build the structured output

Use this exact section order. Lead with a TL;DR — engineers under pressure scan the top first and stop reading once they have what they need. Compression plus proof: every claim cites the MCP query that produced it; no generic “check logs / verify connectivity” filler.

1. TL;DR — one or two sentences, no more. Leading hypothesis, overall confidence, blast radius, and the single most useful next action. Example:

“checkoutservice error rate hit 12.4% (threshold 5%) for 8m at 14:32 UTC — most likely cause is payments-api timing out (high confidence). Open trace 7af3a09b… to see the failing call.”

If no hypothesis reaches medium confidence, the leading line is “No clear root cause found.” rather than a low-confidence guess dressed up as the answer.

2. What fired The alert (id, name), the fire window (absolute UTC + relative), peak magnitude (“error rate hit 12.4% vs. 5% threshold — 148% over”), fire duration, and the fire pattern (one-off / sustained / flapping / recurring / marginal).

3. Investigation trail A scannable list of what was checked, with ✅ for confirmed signals and ❌ for ruled out, each followed by a one-line finding. The point is that the reader can see what work the AI did and what it found — this is where trust is earned. Example:

4. Likely causes (ranked, max 3) Each cause has three parts:

If only Tier 1 ran (marginal fire / no neighbor anomalies), output fewer hypotheses with low confidence and explicitly call out the limitation.

5. Ruled out Short but explicit. List candidates the evidence eliminated and the one-line reason why. Skip the section if there’s nothing meaningful to rule out — but if you considered something and dropped it, say so here so the user doesn’t waste time re-checking it.

6. Suggested next steps Action items the user can take. Be concrete and use SigNoz-native handles so the user can act immediately:

Do not pad with generic advice (“verify connectivity”, “check dashboards”) — that’s noise during an active incident.

Mirroring as navigation chips. Mirror up to 3 of these “Suggested next steps” as host follow-up intents — the most actionable, alert-scoped ones. Keep the rest in the report prose so the user has the full picture. The chip surface is capped; the prose is not.

Out of scope (v1)

Guardrails

Examples

User: “Why did the checkout error rate alert fire?”

Agent:

  1. Resolves alert: “High Error Rate — Checkout” (id 42).
  2. signoz:signoz_get_alert_history → most recent fire 2h ago at 14:32 UTC, sustained for 8m, single fire (not flapping).
  3. Tier 1: re-runs error-rate formula over [14:02, 15:02]. Peak error rate 12.4% (vs 5% threshold — 148% over). Pre-fire baseline 0.3%. Real fire, not marginal.
  4. Tier 2: pulls neighbor signals for service.name = checkout:
    • p99 latency: 4.1s vs 320ms baseline (+1180%).
    • Throughput: -42% (drop).
    • Downstream payments error rate: 18% vs 0.2% baseline (+8900%).
    • CPU/memory: flat (no resource pressure).
  5. Tier 3: traces for service.name = checkout, hasError = true in the fire window — top operation POST /checkout/submit, top error message “context deadline exceeded calling payments-api”. 30 traces, all hitting the same downstream URL. Logs show matching “payments client timeout” lines, 142 occurrences.
  6. Output:

    TL;DR: checkoutservice error rate hit 12.4% (threshold 5%) for 8m at 14:32 UTC. Most likely cause: payments-api timing out (high confidence — converging trace + log + neighbor evidence). Open trace 7af3a09b… to see the failing call.

    • What fired: alert 42 fired 2h ago at 14:32 UTC, sustained 8m. Error rate peaked at 12.4% (148% over threshold).
    • Investigation trail:
      • ✅ Tier 1 — peak 12.4% vs 5% threshold, pre-fire baseline 0.3%. Real fire.
      • ✅ Tier 2 — payments error rate +8900%, p99 latency +1180%, throughput −42%.
      • ❌ CPU / memory pressure on checkout — flat.
      • ✅ Tier 3 — 30 error traces all hit payments-api with context deadline exceeded; 142 matching timeout logs.
    • Likely causes (high confidence): payments service errors cascading into checkout. Evidence converges across topology (checkout → payments edge), temporal precedence (payments errors lead checkout p99), and shared entity (every error trace targets the payments-api URL).
    • Ruled out: checkout-side resource pressure (CPU/memory flat); upstream traffic spike (throughput dropped, didn’t spike).
    • Next steps: open trace 7af3a09b… in the SigNoz UI; if payments has its own alert, run signoz-explaining-alerts on it, otherwise signoz-generating-queries for payments error rate over the last 4h.

User: “Investigate the flapping CPU alert on prod-api-3”

Agent:

  1. Resolves alert (id 88, host.name = prod-api-3).
  2. History: 7 fires in last 1h, alternating fire/resolve every 8-12 minutes — flapping pattern.
  3. Tier 1: peak 84% (threshold 80%, only 5% over). Each fire lasted 2-4 minutes. Marginal — the value hovered near threshold. matchType = at_least_once made each blip trigger.
  4. Early-stop kicks in. Skip tiers 2/3.
  5. Output:

    TL;DR: alert 88 has been flapping on prod-api-3 — 7 fires in the last 1h, each 2–4m, all within 5% of the 80% threshold. No clear root cause; this is threshold tuning, not an incident. Switch matchType to on_average or add hysteresis to stop the noise.

    • What fired: alert 88, host prod-api-3, fire pattern flapping. 7 fires in the last 1h, each 2–4m. Peak 84% (5% over the 80% threshold).
    • Investigation trail:
      • ✅ Tier 1 — every fire was within 5% of threshold; duration short; baseline already at 70–75%. Marginal fire — early-stop triggered, Tier 2/3 skipped.
    • Likely causes (low / co-occurring signal only): threshold tuned too tight or matchType is too sensitive. Evidence: every fire was within 5% of threshold; baseline already runs at 70–75%. Not promoted to a “cause” — single signal, no corroboration.
    • Ruled out: real CPU saturation incident (peaks too small and short-lived; baseline already near threshold).
    • Next steps: change matchType to on_average (smooths transient spikes) OR raise threshold to 85% with hysteresis (recoveryTarget: 75). Use signoz:signoz_update_alert to apply.

User: “RCA for the log volume alert that fired last night”

Agent:

  1. Resolves alert: “Error Log Volume Spike” (id 14, no service filter — groupBy service.name).
  2. History: fired at 03:12 UTC, sustained 22m, broke down by service in the alert annotations: service.name = inventory was the firing series.
  3. Tier 1: re-runs log count for inventory in fire window. Peak 3,400 errors/min vs 1,000/min threshold (240% over). Pre-fire baseline 12/min. Real, large fire.
  4. Tier 2: neighbor signals for service.name = inventory:
    • Request error rate: +600%.
    • p99 latency: +30% (mild).
    • CPU: -80% (collapsed). Memory: -60%.
    • Pod restarts (k8s): 4 in fire window.
  5. Tier 3: logs for inventory in fire window. Top message: “OOMKilled restarting” (1,200 occurrences). Top trace error: graceful-shutdown exceptions.
  6. Output:

    TL;DR: log volume alert 14 fired at 03:12 UTC for service.name = inventory, sustained 22m at 240% over threshold. Most likely cause: inventory pods OOM-killed and restarted 4 times (high confidence). Check container memory limits for the inventory deployment.

    • What fired: alert 14 fired at 03:12 UTC for service inventory, sustained 22m, 240% over threshold.
    • Investigation trail:
      • ✅ Tier 1 — peak 3,400 errors/min vs 1,000/min threshold; pre-fire baseline 12/min. Real fire.
      • ✅ Tier 2 — request error rate +600%; CPU/memory collapsed (−80%/−60%); 4 pod restarts in window.
      • ❌ p99 latency — only +30%, not a latency-driven incident.
      • ✅ Tier 3 — top log message “OOMKilled restarting” (1,200 occurrences); top trace error: graceful-shutdown exceptions.
    • Likely causes (high confidence): inventory pods OOM-killed and restarted 4 times during the window. Evidence converges across topology (single service), temporal precedence (memory fell to zero before error spike), shared entity (all log lines from service.name = inventory), and a single coherent pattern (OOM → restart → graceful-shutdown noise).
    • Ruled out: a true application error-rate change (errors are restart noise, not request-path failures); upstream traffic surge (throughput unchanged).
    • Next steps: check container memory limits for inventory pods; review recent deploys; consider whether the alert should exclude restart-related error patterns or whether the underlying OOM is the real concern.

Additional resources

Skill frontmatter

argument-hint: [time window]