Agent Skill · SigNoz

signoz-explaining-alerts

Describe what an existing SigNoz alert rule does in plain language — the signal it watches, the threshold and evaluation behavior, the notification routing, and a one-line fire-frequency summary so the user knows whether the alert has been active. Make sure to use this skill whenever the user asks "what does this alert do", "explain alert X", "walk me through this rule", "how does my [Y] alert work", "is this alert configured correctly", or otherwise asks for an interpretation of an existing alert's configuration. Static explanation only — for diagnosing a specific firing incident, use `signoz-investigating-alerts`.

Provider: SigNoz
Path in repo: plugins/signoz/skills/signoz-explaining-alerts/SKILL.md

Skill body

Alert Explain

Decode an existing SigNoz alert’s configuration into a plain-language explanation. The skill is read-only and stays focused on the rule itself: what it watches, when it fires, where it notifies. A single line of fire-frequency data is included to ground the explanation, but this skill does not investigate any specific fire — that is signoz-investigating-alerts’s job.

Prerequisites

This skill calls SigNoz MCP server tools (signoz:signoz_get_alert, signoz:signoz_list_alert_rules, signoz:signoz_get_alert_history). Before running the workflow, confirm the signoz:signoz_* tools are available. If they are not, run signoz-mcp-setup first to initialize or repair the MCP connection. Do not guess at alert configuration from the rule name alone.

When to use

Use this skill when the user wants to:

Do NOT use when the user wants to:

Required inputs

| Input | Required | Source if missing |
| --- | --- | --- |
| Alert identifier (rule ID or name) | yes | $ARGUMENTS, recent context, or fuzzy match |

If the input is missing or ambiguous, this skill is best-effort (not strict — read-only operations are cheap to recover from):

  1. Call signoz:signoz_list_alert_rules, paginate through every page, and find the closest name match.
  2. State the interpretation in the response: “Interpreting your request as alert ‘High Error Rate — Checkout’ (id 42). If you meant a different one, tell me the name or id.”
  3. Proceed with the explanation. The user can correct after.

Workflow

Step 1: Resolve the alert

If the user provided a numeric id, skip to Step 2. Otherwise:

  1. Call signoz:signoz_list_alert_rules and keep requesting the next page while pagination.hasMore is true, until the full list is walked.
  2. Match by name (case-insensitive substring). If multiple match, present the candidates and ask which one (interactive) or pick the closest and flag the assumption (autonomous).
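The resolution steps above can be sketched roughly as follows. This is a minimal illustration, not the MCP tool's actual interface: `fetch_page` is a hypothetical callable standing in for signoz:signoz_list_alert_rules, and the page shape (`rules`, `pagination.hasMore`) is assumed from the field name mentioned above.

```python
from typing import Callable

# Assumed page shape (hypothetical):
# {"rules": [{"id": 42, "name": "..."}], "pagination": {"hasMore": bool}}
Page = dict


def list_all_rules(fetch_page: Callable[[int], Page]) -> list[dict]:
    """Walk every page until pagination.hasMore is false."""
    rules: list[dict] = []
    page_no = 1
    while True:
        page = fetch_page(page_no)
        rules.extend(page.get("rules", []))
        if not page.get("pagination", {}).get("hasMore", False):
            return rules
        page_no += 1


def match_by_name(rules: list[dict], query: str) -> list[dict]:
    """Case-insensitive substring match on the rule name."""
    needle = query.casefold()
    return [r for r in rules if needle in r.get("name", "").casefold()]


# Fake two-page listing used only for this demonstration.
_pages = {
    1: {"rules": [{"id": 42, "name": "Checkout Error Rate"}],
        "pagination": {"hasMore": True}},
    2: {"rules": [{"id": 17, "name": "Host CPU"}],
        "pagination": {"hasMore": False}},
}
all_rules = list_all_rules(lambda n: _pages[n])
candidates = match_by_name(all_rules, "checkout")
```

If `candidates` holds more than one rule, that is the point to ask the user or to pick the closest match and flag the assumption.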

Step 2: Fetch the full configuration

Call signoz:signoz_get_alert with the rule id. This is mandatory — the list response does not include the full condition / thresholds / notification settings, and explanations based on the name alone are guesses.

Step 3: Pull a one-line fire-frequency summary

Call signoz:signoz_get_alert_history for the rule with a 7-day lookback. From the response, derive a single line:

Fired N times in the last 7d (last fire: <time since last fire>).

If the alert never fired in the window, say so explicitly: “Has not fired in the last 7d.” If the alert is disabled, mention that and skip the history line.

This single line grounds the explanation. Do not drill into specific fires here — that’s signoz-investigating-alerts.
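As a sketch of how that line could be derived, assume the history response has already been reduced to a list of fire timestamps. That shape is an assumption; the real signoz:signoz_get_alert_history payload may differ. The helper names here are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def humanize(delta: timedelta) -> str:
    """Render an elapsed duration as a short relative time, e.g. '2h ago'."""
    seconds = int(delta.total_seconds())
    if seconds < 3600:
        return f"{seconds // 60}m ago"
    if seconds < 86400:
        return f"{seconds // 3600}h ago"
    return f"{seconds // 86400}d ago"


def fire_summary(fired_at: list[datetime], now: datetime,
                 disabled: bool = False, lookback_days: int = 7) -> str:
    """Build the one-line fire-frequency summary."""
    if disabled:
        # Disabled rules skip the history line entirely.
        return "Alert is disabled."
    cutoff = now - timedelta(days=lookback_days)
    recent = [t for t in fired_at if cutoff <= t <= now]
    if not recent:
        return f"Has not fired in the last {lookback_days}d."
    noun = "time" if len(recent) == 1 else "times"
    last = humanize(now - max(recent))
    return (f"Fired {len(recent)} {noun} in the last "
            f"{lookback_days}d (last fire: {last}).")


now = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
fires = [now - timedelta(hours=2), now - timedelta(days=3),
         now - timedelta(days=6), now - timedelta(days=10)]
summary = fire_summary(fires, now)
```

The fire from 10 days ago falls outside the window, so only three fires are counted.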

Step 4: Build the explanation

The single most useful thing for the user is a tight summary. Lead with a TL;DR that directly answers the question they asked, not a generic alert summary. The TL;DR is the only thing some users will read — burying their answer under a fixed template forces them to scroll for what they wanted in the first place.

Match the TL;DR shape to the user’s question:

Always include the fire-frequency line and disabled status if non-default — those ground every kind of TL;DR. But put the answer to the user’s specific question first.

After the TL;DR, write the explanation in prose, organized into the four sections below. Skip any section that has nothing meaningful to add — empty severity labels, default notification settings, vanilla annotations don’t deserve a header. Short and skimmable beats perfunctorily complete; the user is not reading a checklist.

1. What it watches — one short paragraph. Combine signal type (metrics / logs / traces / exceptions), what the query measures, and scope. Translate the query to operational language; for formulas, name each sub-query (A, B, …) and state what F1 (or whichever selectedQueryName triggers) computes — e.g. “F1 = A × 100 / B → error percentage”. Decode filter operators (= equals, != not equals, IN / NOT IN, LIKE / ILIKE, CONTAINS, REGEXP, EXISTS / NOT EXISTS); enumerate IN / NOT IN value lists so the user can verify them. Name each groupBy dimension and its practical effect (“fires separately per service” for service.name).
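To make the formula and groupBy wording concrete, here is a small worked sketch. The per-service counts are invented for illustration, and the 5% threshold is borrowed from the checkout example later in this skill. It shows why a rule grouped by service.name fires separately per service.

```python
from typing import Optional


def f1(a: float, b: float) -> Optional[float]:
    """F1 = A * 100 / B; returns None when B is zero (nothing to divide by)."""
    return None if b == 0 else a * 100 / b


# Hypothetical per-service values: (A = errored spans, B = total spans).
counts = {"checkout": (12, 200), "cart": (1, 400), "search": (0, 0)}

# One F1 series per groupBy value, so each service is evaluated on its own.
error_pct = {svc: f1(a, b) for svc, (a, b) in counts.items()}

# Services whose error percentage is above the 5% target.
breaching = sorted(
    svc for svc, pct in error_pct.items() if pct is not None and pct > 5
)
```

The `search` entry has no traffic, so the formula yields no value for it rather than 0%; that gap is the case an absent-data check would have to cover.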

For anomaly rules (ruleType: anomaly_rule), explicitly state that the threshold is in standard deviations from the learned pattern, not the raw value — this is the most common point of confusion. Include algorithm (zscore), seasonality (hourly / daily / weekly), and how lower/higher targets shift sensitivity (lower → more noise, higher → only extreme deviations).
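The following sketch illustrates what "threshold in standard deviations" means. It is a deliberately simplified z-score over a flat history and ignores seasonality; it is not SigNoz's actual anomaly implementation, and the sample values are made up.

```python
from statistics import mean, stdev


def z_score(value: float, history: list[float]) -> float:
    """Distance of value from the historical mean, in standard deviations."""
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        return 0.0  # flat history: treat as no deviation (assumption)
    return (value - mean(history)) / sigma


def is_anomalous(value: float, history: list[float], target: float = 3.0) -> bool:
    """Fire when the absolute z-score exceeds the target."""
    return abs(z_score(value, history)) > target


# Hypothetical latency samples in ms for the same hour on previous days.
history = [100.0, 102.0, 98.0, 101.0, 99.0]

extreme = is_anomalous(110.0, history)               # about 6.3 std devs away
mild_at_3 = is_anomalous(103.0, history)             # about 1.9 std devs away
mild_at_1_5 = is_anomalous(103.0, history, target=1.5)
```

The same reading of 103 ms is quiet at target 3 but fires at target 1.5, which is the "lower target means more noise" effect described above.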

2. When it fires — one paragraph covering threshold + timing. Decode the threshold spec into plain English using these mappings:

State each threshold tier’s name, target, targetUnit, and attached channels. Always state the threshold in targetUnit, not the native query unit (e.g. “fires when p99 exceeds 500 ms”, not “…exceeds 500 000 000 ns”). Note recoveryTarget if set (hysteresis); if absent, mention flap risk when the value hovers near the boundary. Describe timing as “checks every <frequency> over the last <evalWindow>”, and mention that with at_least_once a single-point breach triggers, while all_the_times requires the full window.
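A rough sketch of how the match types differ over one evaluation window, assuming an "above" comparison only. The match-type names come from this skill; the evaluation logic, the treatment of an empty window, and the sample values are illustrative assumptions rather than SigNoz's rule engine.

```python
from statistics import mean

NS_PER_MS = 1_000_000


def breaches(values: list[float], target: float, match_type: str) -> bool:
    """Decide whether a window of values breaches an 'above' threshold."""
    if not values:
        return False  # assumption: an empty window never fires here
    if match_type == "at_least_once":
        return any(v > target for v in values)
    if match_type == "all_the_times":
        return all(v > target for v in values)
    if match_type == "on_average":
        return mean(values) > target
    raise ValueError(f"unsupported match type: {match_type}")


# Hypothetical p99 latency per minute over a 5-minute window, in nanoseconds.
window_ns = [420_000_000, 610_000_000, 480_000_000, 450_000_000, 430_000_000]

# Convert to the target unit (ms) before comparing or describing the threshold.
window_ms = [v / NS_PER_MS for v in window_ns]

results = {
    mt: breaches(window_ms, 500, mt)
    for mt in ("at_least_once", "all_the_times", "on_average")
}
```

One minute at 610 ms is enough for at_least_once, while all_the_times and on_average (mean 478 ms) stay quiet for the same window.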

3. Where it notifies — channels per tier (resolved by name from signoz_list_notification_channels if needed), notificationSettings.groupBy (how notifications are bundled), renotify (interval + which states), usePolicy (label-based routing). Skip this section entirely if notification settings are vanilla and the user already saw the channel in the TL;DR.

4. Notable concerns — flag only what’s non-default and worth the user’s attention. Don’t list every absent field; focus on the high-leverage ones:

If none of these apply, omit the section. Better silent than padded.

If the user asked only “what does this alert do”, stop here. The audit (Step 5) is for “is it configured correctly” / “audit this” / “anything I should change” requests.

Step 5: Assess the configuration (only if asked)

The user may ask “is this alert reasonable” alongside the explanation. Only assess when asked or when the request implies it (audit, review, “is this configured correctly”). Keep assessment grounded in what’s actually in the config:

Step 6: Offer next steps

Surface up to 3 follow-up intents based on what the explanation revealed — things like investigating a recent fire, running the underlying query to see current values, adjusting a threshold, or creating a related alert for a coverage gap. Use your judgment; do not pad to 3.

Skip follow-ups entirely when the user is purely inspecting (“what does this alert do?”) and signals no further intent. No chips beat wrong chips.

Guardrails

Examples

User: “Explain my checkout error rate alert”

Agent:

  1. signoz:signoz_list_alert_rules (paginated) → finds “High Error Rate — Checkout” id 42.
  2. signoz:signoz_get_alert id=42 → traces formula (A errored spans / B total spans × 100), single critical threshold at 5%, op=1, matchType=1, channel pagerduty-oncall.
  3. signoz:signoz_get_alert_history id=42 lookback=7d → fired 3 times, last fire 2h ago.
  4. Replies:

    TL;DR: Fires when checkout error rate (errored spans / total spans) exceeds 5% at any point in the last 5 min, pages pagerduty-oncall. Fired 3 times in the last 7d (last 2h ago).

    What it watches — traces from service.name = 'checkout'. Query A counts spans with hasError = true, query B counts all spans, F1 = A × 100 / B is the error percentage; the alert triggers on F1.

    When it fires: op=above, matchType=at_least_once, target 5%. Checks every 1 min over the last 5 min, so a single elevated minute fires it.

    Notable — no recoveryTarget (could flap near 5%); no alertOnAbsent (a checkout outage that drops traffic to zero won’t trigger this rule).

  5. Offers: “Want me to investigate the fire from 2h ago?”

User: “Explain alert rule 17 — is it well configured?”

Agent:

  1. signoz:signoz_get_alert id=17 → metrics, system.cpu.utilization, two thresholds (warning 80%, critical 90%), matchType=3 on_average, filter deployment.environment.name = 'production', groupBy host.name.
  2. signoz:signoz_get_alert_history id=17 → has not fired in last 7d.
  3. The user asked “is it well configured?” — the TL;DR leads with the verdict, not a generic description. Then explanation + full audit:

    TL;DR: Mostly well-configured (dual-threshold + on_average is the right call for CPU), but two gaps to fix: (1) no alertOnAbsent — a crashed agent silently stops reporting and this rule won’t catch it; (2) annotations are default — a runbook URL would help on-call. Has not fired in the last 7d (could mean production is healthy or thresholds too loose).

    What it watches: system.cpu.utilization on deployment.environment.name = 'production', grouped by host.name (fires per host).

    When it fires: op=above, matchType=on_average, two tiers (warning 80%, critical 90%) over a 5-min window. Short spikes don’t fire; smoothing is correct for CPU.

    Notable concerns — same as TL;DR fixes above; nothing else non-default.

  4. Offers next steps.

User: “Help me understand this anomaly detection alert” (alert id 88)

Agent:

  1. signoz:signoz_get_alert id=88 → ruleType: anomaly_rule, algorithm=zscore, seasonality=daily, target 3, metric http.server.request.duration, scope service.name = 'api-gateway'.
  2. History: fired 1 time in last 7d.
  3. Replies:

    TL;DR: Fires when api-gateway request latency deviates by more than 3 standard deviations (not raw latency, not a fixed value) from its learned daily pattern. Fired once in the last 7d.

    What it watches: http.server.request.duration for service.name = 'api-gateway', evaluated as a Z-score anomaly with daily seasonality. The model learns the typical pattern for each hour of day, so peak-hour latency won’t false-trigger if it matches the historical norm for that hour.

    When it fires — when |Z-score| > 3, i.e. the value is more than 3 standard deviations away from the expected pattern. Lower target → more sensitive (more noise); higher → only extreme deviations. The threshold is not in seconds or milliseconds.

  4. Offers to investigate the recent fire.

Skill frontmatter

argument-hint: