signoz-creating-alerts
Create a new SigNoz alert rule from a natural-language intent — threshold, anomaly, log-volume, error-rate, latency, or absent-data alerts across metrics, logs, traces, and exceptions. Make sure to use this skill whenever the user says "alert me when…", "notify me if…", "set up monitoring for…", "page me on…", "create an alert for…", or asks for a new alert/notification rule, even if they don't say the word "alert" explicitly. Also use it when someone asks to be notified about error rates, latency spikes, log volume, CPU/memory pressure, or anomalous behavior on a service or host.
Skill body
Alert Create
Build a SigNoz alert from a user’s natural-language intent. The skill targets two consumers: an autonomous AI SRE agent that runs without a human in the loop, and a human at a Claude Code / Codex / Cursor prompt. Both go through the same flow.
Prerequisites
This skill calls SigNoz MCP server tools (signoz:signoz_create_alert,
signoz:signoz_list_alert_rules, signoz:signoz_get_field_keys, etc.). Before running the
workflow, confirm the signoz:signoz_* tools are available. If they are not,
run signoz-mcp-setup first to initialize or repair the MCP connection. Do not
try to fall back to raw HTTP calls or fabricate alert configs without the MCP
tools.
When to use
Use this skill when the user wants to:
- Create, set up, or configure a new alert rule.
- Get paged or notified when a metric, log volume, latency, or error rate crosses a threshold.
- Detect anomalous behavior on a service, host, or signal.
- Catch silent data loss (“alert if data stops arriving from X”).
Do NOT use when the user wants to:
- Understand what an existing alert monitors → signoz-explaining-alerts.
- Diagnose why an existing alert fired → signoz-investigating-alerts.
- Modify thresholds, queries, or routing on an existing alert → call signoz:signoz_update_alert directly.
Required inputs (strict)
Alert creation is a write operation against a shared system. Guessing here creates noisy alerts on the wrong service that someone else has to clean up. The skill enforces a strict input contract:
| Input | Required | Source if missing |
|---|---|---|
| Alert intent (NL goal) | yes | $ARGUMENTS or recent user turn |
| Resource attribute filter (e.g. service.name, k8s.namespace.name, host.name) | yes | discover via signoz:signoz_get_field_keys + signoz:signoz_get_field_values |
| Threshold value(s) | inferred from intent | derive a sensible default and surface in the preview |
| Severity | inferred from intent | default warning unless the condition is intrinsically critical (see Severity defaults in Step 5); an explicit cue such as “page”, “wake up”, or “critical” promotes to critical |
| Notification channel | yes | signoz:signoz_list_notification_channels + offer “create new” |
If a required input is missing and cannot be discovered, stop before
calling any write tool and ask the user. The host application decides
how the question is surfaced (a structured clarification tool, inline
<assistant_question> tags, an interactive prompt, etc.) — follow the
host’s UI rendering rules.
What to include in the question:
- What is missing — name the input concretely (e.g. “which resource-attribute filter to use”).
- Candidate lists populated from your discovery calls: concrete values per attribute the user can pick from. Example shape: service.name → frontend, checkout, payments; host.name → prod-api-1, prod-db-1.
- Allow free-form input so the user can name a value you didn’t surface.
In autonomous mode (no human), escalate to the caller or fill the gap
from upstream context. Either way, do not proceed to
signoz:signoz_create_alert with a guessed value.
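The input contract above can be sketched as a small gate that runs before any write tool. This is only an illustration: the AlertInputs container and both function names are hypothetical, and the filter string format is an assumption, not the MCP server's syntax.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertInputs:
    """Hypothetical container for the strictly required inputs."""
    intent: Optional[str] = None           # natural-language goal
    resource_filter: Optional[str] = None  # e.g. "service.name = 'checkout'" (assumed format)
    channel: Optional[str] = None          # notification channel name


def missing_required(inputs: AlertInputs) -> List[str]:
    """Return the names of required inputs that are still unresolved."""
    required = {
        "alert intent": inputs.intent,
        "resource attribute filter": inputs.resource_filter,
        "notification channel": inputs.channel,
    }
    return [name for name, value in required.items() if not value]


def may_call_write_tool(inputs: AlertInputs) -> bool:
    """Write tools are allowed only when nothing required is missing."""
    return not missing_required(inputs)
```

The list returned by missing_required doubles as the "what is missing" part of the clarification question.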
Workflow
Step 1: Parse intent and check what’s missing
Extract from the user’s request:
- What to monitor — signal type (metrics / logs / traces / exceptions) and the specific condition (CPU, error rate, p99 latency, log count, …).
- Resource scope — which service, host, namespace, or environment.
- Threshold — numeric value and comparison (“above 80%”, “below 100/s”).
- Severity — implicit from urgency words (“page” → critical, default warning otherwise).
- Channel — explicit channel name if the user provided one.
Map signal phrasing to alert type:
| User says | alertType | signal |
|---|---|---|
| metric, CPU, memory, latency, request rate | METRIC_BASED_ALERT | metrics |
| log, error logs, log volume, log pattern | LOGS_BASED_ALERT | logs |
| trace, span, latency p99, slow requests | TRACES_BASED_ALERT | traces |
| exception, stack trace, crash | EXCEPTIONS_BASED_ALERT | (clickhouse_sql) |
If resource scope is missing, run discovery (Step 2). If still missing after discovery, stop and ask the user (see Required inputs above).
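As a rough sketch, the table's signal-to-alertType pairing can be held in a lookup keyed by the four signal types named in Step 1. The function name is hypothetical, and refusing unknown signals (rather than picking a default) reflects this skill's no-guessing rule.

```python
# Pairing taken from the table above; keys are the Step 1 signal types.
SIGNAL_TO_ALERT_TYPE = {
    "metrics": "METRIC_BASED_ALERT",
    "logs": "LOGS_BASED_ALERT",
    "traces": "TRACES_BASED_ALERT",
    "exceptions": "EXCEPTIONS_BASED_ALERT",
}


def alert_type_for(signal: str) -> str:
    """Return the alertType for a signal, or raise instead of guessing."""
    key = signal.strip().lower()
    if key not in SIGNAL_TO_ALERT_TYPE:
        raise ValueError(f"unknown signal {signal!r}; ask the user")
    return SIGNAL_TO_ALERT_TYPE[key]
```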
Step 2: Discover resource attributes and metric names
When the user does not name a service / host / namespace, the SigNoz MCP guideline applies: always prefer a resource-attribute filter. Discover candidates instead of guessing:
- Call signoz:signoz_get_field_keys with fieldContext=resource to enumerate resource attributes for the chosen signal.
- Call signoz:signoz_get_field_values for the most likely attribute (typically service.name, then host.name, then k8s.namespace.name) to get concrete values.
- If the user mentioned a metric by name, call signoz:signoz_list_metrics with a search term to verify the exact OTel metric name. Wrong names create alerts that never fire.
Surface the candidates in your clarification request (see Required inputs above). Do not pick one.
Step 3: Check for duplicate alerts
Once the scope is resolved (either provided by the user or discovered in Step 2), check for existing alerts before probing data or authoring a new config — both are wasted work if the user wants to update an existing rule instead.
Call signoz:signoz_list_alert_rules and paginate through every page —
pagination.hasMore is true until you have walked the full list. This lists
configured alert rules (the durable state); do not use signoz:signoz_list_alerts,
which returns currently triggered/active alert instances and will silently
miss rules that are configured but not firing right now. Check for existing
rules that match the user’s intent (same signal + same scope + similar
threshold). If a likely duplicate exists, surface it and ask whether to
create a new one anyway, modify the existing one (out of scope here — use
signoz:signoz_update_alert), or cancel.
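The pagination loop might look like the sketch below. Only pagination.hasMore comes from the text above; fetch_page is a hypothetical wrapper around signoz:signoz_list_alert_rules, and the offset parameter and the "data" key are assumptions about its request and response shape.

```python
from typing import Callable, Dict, Iterator


def iter_all_rules(fetch_page: Callable[[int], Dict]) -> Iterator[Dict]:
    """Yield every alert rule, following pagination.hasMore until it is false."""
    offset = 0
    while True:
        page = fetch_page(offset)        # hypothetical wrapper around the list tool
        rules = page.get("data", [])     # assumed key for the rule list
        yield from rules
        offset += len(rules)
        has_more = page.get("pagination", {}).get("hasMore", False)
        # An empty page also ends the loop so a bad hasMore flag cannot spin forever.
        if not has_more or not rules:
            break
```

Duplicate matching (same signal, same scope, similar threshold) then runs over the full list rather than over page 1 only.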
Step 4: Probe data existence for the chosen filter (fail fast)
Before authoring any alert config, confirm the specific combination the alert will watch (metric × service × any other filter) actually emits data. The most common silent failure is “metric exists in the catalog and the service exists in the catalog, but the service doesn’t emit that metric” — each piece checks out in isolation, the alert saves successfully, and it silently never fires.
Run a single probe over the last 1 hour using the same filter the alert will use, but with the simplest aggregation that confirms data exists:
- Metrics: signoz:signoz_execute_builder_query with count() (or count_distinct(service.name) if scope-discovering). Use signoz:signoz_query_metrics when you already have a concrete metricName: it auto-applies aggregation defaults and accepts filter/groupBy, but requires a concrete metricName (no PromQL, no filter-only probes).
- Logs: signoz:signoz_aggregate_logs with count() over the filter.
- Traces: signoz:signoz_aggregate_traces with count() over the filter.
Inspect the result:
- Probe returns rows → proceed to Step 5.
- Probe returns empty → STOP. Do not build an alert config the user will then be asked to throw away. Stop and ask the user (see Required inputs above), describing what was missing and offering concrete recovery:
  - Service doesn’t emit the metric → call signoz:signoz_get_field_values signal=metrics name=service.name metricName=<metric> to list the services that do emit it; let the user pick a different service or a different metric.
  - Wrong attribute name (service instead of service.name) → suggest the semantic-convention name and re-probe.
  - Service emits the metric but not in the expected time range → widen the probe window once (e.g. last 24h) before declaring no-data.
Exception — log-based crash / panic / OOMKilled / FATAL alerts. These intentionally have zero matches in a healthy system. The probe will return empty by design. Do not stop; instead, surface the zero-match result and ask the user to confirm before save. Treat this exception narrowly: it applies to “alert me when bad thing happens” log queries, not to alerts that depend on continuous data flow.
This probe is cheap (one query, ~100ms), and catching the no-data case early avoids the worst UX failure mode of this skill — the user reading through a fully-authored JSON payload and only then learning the alert can never fire.
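One way to picture the branching in this step is a pure decision function over the probe result. The function name, flag names, and returned labels are all hypothetical; the sketch only encodes the rules stated above (rows proceed, crash-style log alerts ask for confirmation, everything else widens once and then stops).

```python
def next_step_after_probe(row_count: int,
                          is_crash_log_alert: bool = False,
                          window_already_widened: bool = False) -> str:
    """Map a Step 4 probe result to the next action."""
    if row_count > 0:
        return "proceed"              # data exists, continue to Step 5
    if is_crash_log_alert:
        return "confirm_zero_match"   # empty by design; user confirms before save
    if not window_already_widened:
        return "widen_window"         # retry once, e.g. over the last 24h
    return "stop_and_ask"             # still empty: never author the config
```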
Step 5: Build the alert config
The MCP server is the source of truth for the alert JSON schema, threshold
codes, and validation rules. Read the signoz://alert/instructions and
signoz://alert/examples MCP resources for the canonical, version-current
shape.
For most user intents, the config is one of a small number of patterns:
| Pattern | Example intents |
|---|---|
| Single-metric threshold | “alert when CPU > 80%”, “p99 latency > 2s” |
| Log volume threshold | “more than N error logs/min” |
| Trace-based count or p-tile | “p99 span duration > 2s on checkout” |
| Error-rate formula (A/B*100) — see “Common query shapes” below | “error rate > 5%” |
| Anomaly detection (Z-score) | “alert me on anomalous traffic” |
| Absent-data alert | “alert if data stops arriving” |
| ClickHouse SQL alert: author SQL using the schema in signoz://alert/examples | non-trivial joins, custom aggregations the builder cannot express |
| PromQL alert: delegate to signoz-generating-queries for the query, then return here | when user already has PromQL |
Threshold op and matchType values. v2alpha1 accepts the
human-readable strings ("above", "on_average"); the legacy numeric
codes ("1", "3") are also accepted but harder to read in the UI. Prefer
the words. Anomaly rules only support op: "above" — the engine
already treats z-score breaches as two-sided when the threshold is
positive, so "above_or_below" is rejected and unnecessary.
| Comparison | op | Evaluation behavior | matchType |
|---|---|---|---|
| above / exceeds / > | "above" | breach at any point | "at_least_once" |
| below / under / < | "below" | breach for entire window | "all_the_times" |
| equal / = | "equals" | average breaches | "on_average" |
| not equal / != | "not_equals" | sum breaches | "in_total" |
| | | last value breaches | "last" |
Defaults the skill applies (and surfaces in the preview):
- evalWindow: 5m0s, frequency: 1m0s. Change only if the intent implies a slower or faster cadence.
- matchType: "on_average" for CPU / memory / latency (smooths transient spikes).
- matchType: "at_least_once" for error counts / error rates (catches any breach).
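These defaults can be sketched as a small helper. The condition labels ("cpu", "error_rate", and so on) are hypothetical names chosen for illustration; the values come from the list above, and anything outside it returns no matchType so the choice is surfaced instead of guessed.

```python
from typing import Dict, Optional

DEFAULT_EVAL_WINDOW = "5m0s"
DEFAULT_FREQUENCY = "1m0s"

SMOOTHED = {"cpu", "memory", "latency"}       # average out transient spikes
ANY_BREACH = {"error_count", "error_rate"}    # a single breach matters


def default_match_type(kind: str) -> Optional[str]:
    """Return the documented default matchType, or None when there is none."""
    if kind in SMOOTHED:
        return "on_average"
    if kind in ANY_BREACH:
        return "at_least_once"
    return None


def evaluation_defaults(kind: str) -> Dict[str, Optional[str]]:
    """Bundle the defaults that should be shown in the preview."""
    return {
        "evalWindow": DEFAULT_EVAL_WINDOW,
        "frequency": DEFAULT_FREQUENCY,
        "matchType": default_match_type(kind),
    }
```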
Severity defaults — derive from the intrinsic urgency of the alert, not
just the user’s words. The user saying “alert me” doesn’t force warning
when the condition itself describes a critical event. Use this table; an
explicit user cue overrides it (“just FYI” → demote, “page me” / “wake me
up” → promote).
| Alert intent | Default severity |
|---|---|
| Pod crash / OOMKilled / CrashLoopBackOff / panic / FATAL log signals | critical |
| Service down / no-data on a production service | critical |
| Error rate above any non-trivial threshold (>1%) | critical |
| Error logs / exception spikes | warning |
| Latency degradation (p95/p99 above target) | warning |
| CPU / memory / disk pressure | warning |
| Request-rate / traffic anomaly | warning |
| SLO budget burn (info-level burn rate) | info / warning |
When the user’s intent is ambiguous on severity (no urgency cue, no
clearly-critical condition), default to warning and surface the choice
in the preview so they can adjust.
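A minimal sketch of the severity rule, assuming hypothetical intent labels for the table rows. Two things here are assumptions rather than documented behavior: demotion lowers the severity by exactly one level, and cue detection is a plain substring match. The SLO budget-burn row is left out because the table gives it two possible values.

```python
INTRINSIC_SEVERITY = {
    "crash": "critical",            # pod crash / OOMKilled / CrashLoopBackOff / panic / FATAL
    "service_down": "critical",     # service down or no-data on production
    "error_rate": "critical",       # above a non-trivial threshold (>1%)
    "error_logs": "warning",
    "latency": "warning",
    "resource_pressure": "warning",  # CPU / memory / disk
    "traffic_anomaly": "warning",
}
PROMOTE_CUES = ("page me", "wake me up")
DEMOTE_CUES = ("just fyi",)
LEVELS = ["info", "warning", "critical"]


def default_severity(intent_kind: str, user_text: str = "") -> str:
    """Intrinsic urgency first; an explicit user cue overrides it."""
    text = user_text.lower()
    base = INTRINSIC_SEVERITY.get(intent_kind, "warning")  # ambiguous intent: warning
    if any(cue in text for cue in PROMOTE_CUES):
        return "critical"
    if any(cue in text for cue in DEMOTE_CUES):
        # Assumption: demote by one level, never below info.
        return LEVELS[max(LEVELS.index(base) - 1, 0)]
    return base
```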
OTel attribute names — always use semantic conventions:
service.name, host.name, k8s.namespace.name, deployment.environment or deployment.environment.name. Never service, host, or env.
Annotation templates — the on-call engineer sees the notification, not the alert config. A notification that says “Pod crash detected” with no service name, no count, and no value is nearly useless at 3am. Always include the moving values:
- summary: single-line headline. Include the resource scope and the numeric value: "checkoutservice error rate {{$value}}% above 3%".
- description: longer message. Include {{$value}}, {{$threshold}}, the groupBy values (e.g. {{$labels.service_name}}), and a sentence on what to do or where to look. For count-based alerts include the count explicitly: "{{$value}} crash log lines in the last 5 minutes from service {{$labels.service_name}}".
Use {{$value}} for the breaching value, {{$threshold}} for the target,
and {{$labels.<key>}} for groupBy values (note SigNoz substitutes the
dotted attribute name with underscores: service.name → service_name).
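The dot-to-underscore substitution is easy to get wrong by hand, so a tiny helper can build the label references. Both function names are hypothetical; the output strings follow the template forms quoted above.

```python
def label_ref(attribute: str) -> str:
    """Turn a dotted attribute name into its notification template reference."""
    return "{{$labels." + attribute.replace(".", "_") + "}}"


def crash_description(group_by: str, window: str = "5 minutes") -> str:
    """Build a count-based description that carries the moving values."""
    return ("{{$value}} crash log lines in the last " + window
            + " from service " + label_ref(group_by))
```

For example, label_ref("k8s.namespace.name") produces {{$labels.k8s_namespace_name}}.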
Common query shapes — conventions
Read signoz://alert/examples for the authoritative JSON of all
patterns (error rate, p99 latency, log volume, absent-data, anomaly,
PromQL, ClickHouse SQL). The conventions that don’t live in the
schema:
- Error-rate formula: set disabled: true on the component queries A and B so only the formula F1 renders in the alert chart and notification. The raw counts are intermediate, not the alert signal; forgetting this clutters the preview with three series and confuses the on-call engineer reading the notification.
- p99 latency: threshold target is in nanoseconds (2s → 2000000000), targetUnit: "ns".
- Log volume spike: prefer groupBy: service.name over a hard filter when the user said “any service”; groupBy provides the scoping AND keeps the notification useful per-service.
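The arithmetic behind two of these conventions, as a sketch. The helper names are hypothetical. Reading A as the error count and B as the total count is an assumption about how the formula's component queries are defined, and returning 0.0 when there is no traffic is likewise a choice made for this example.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def latency_target_ns(seconds: float) -> int:
    """Convert a human latency target into the nanosecond threshold target."""
    return round(seconds * NANOSECONDS_PER_SECOND)


def error_rate_percent(errors: float, total: float) -> float:
    """F1 = A * 100 / B, with A = error count and B = total count (assumed)."""
    if total == 0:
        return 0.0  # assumption: no traffic is reported as 0%
    return errors * 100 / total
```

So a "p99 above 2s" intent becomes a target of 2000000000 with targetUnit "ns", and 5 errors out of 100 requests evaluates to 5.0 against a percentage threshold.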
Step 6: Dry-run the full query and validate the threshold
Step 4 confirmed data flows. Step 6 does two things:
1. Validate query shape. Run the full builder spec (with groupBy, formulas, disabled component queries, and non-string filters); Step 4’s bare count() probe doesn’t exercise these. The create-alert schema accepts queries that error at evaluation (numeric groupBy, unquoted bool filter, mismatched aggregation). Any HTTP 5xx or “filter type mismatch” = fix the config before proceeding to (2). disabled: true on formula component queries (A, B in A * 100 / B) is the recommended pattern, not a failure (see Step 5).
2. Calibrate the threshold. Given the validated query, would the alert have fired a sensible number of times in the last hour?
Run the full primary query (or formula) over the last hour:
- signoz:signoz_execute_builder_query for all builder, formula, and PromQL queries: set compositeQuery.queries[].type to builder_query / builder_formula / promql as appropriate. For PromQL put the query string in spec.query and read signoz://promql/instructions for the UTF-8 quoted-selector form SigNoz requires ({"metric.name.with.dots"}, not the underscored or bare-dotted forms).
- signoz:signoz_aggregate_logs / signoz:signoz_aggregate_traces when those fit better.
- signoz:signoz_query_metrics when the alert query targets a single known metric by metricName: the tool auto-applies aggregation defaults and accepts filter, groupBy, and formula alongside. PromQL is not supported here; use signoz:signoz_execute_builder_query for that.
Compute how many evaluation points breached the proposed threshold. Surface in the preview as “would have fired N times in the last 1h”. A 1h window is too short to grade most alerts — only the upper extreme is actionable:
- N is large (e.g. > 30) → likely alert storm. Surface and recommend tightening or adding hysteresis (recoveryTarget).
- N = 0 → expected for a healthy system; do not nudge the user to loosen. Only flag if the user said they’d expect the alert firing right now (e.g. during an active incident).
- N is small and non-zero → report the count; the user decides whether the threshold is right. One hour can’t distinguish “tuned well” from “barely caught a transient”.
- Exceptions:
  - Anomaly alerts: skip the breach count entirely (Z-scores aren’t directly comparable to raw values). Step 4 already verified the underlying metric × service has data; nothing more to validate here.
  - Log-based crash / panic / OOMKilled / FATAL alerts: these intentionally have zero matches in a healthy system. Step 4 has already surfaced the zero-match result and obtained user confirmation; skip the breach count.
If Step 4 was somehow skipped (e.g. a downstream skill is invoking this flow mid-stream), the no-data stop rule applies here too: empty result → stop and ask the user (see Required inputs above) instead of saving an alert that will never fire.
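The breach count can be approximated as below. This sketch compares each returned point to the target on its own, so it ignores matchType windowing (on_average, all_the_times, and so on) and should be read as a rough estimate. The function names and the wording of the grades are hypothetical; the cutoff of 30 is the example value given above.

```python
import operator
from typing import Iterable

OPS = {
    "above": operator.gt,
    "below": operator.lt,
    "equals": operator.eq,
    "not_equals": operator.ne,
}


def count_breaches(values: Iterable[float], op: str, target: float) -> int:
    """Count evaluation points that breach the proposed threshold."""
    compare = OPS[op]
    return sum(1 for value in values if compare(value, target))


def grade_breach_count(n: int, storm_cutoff: int = 30) -> str:
    """Only the upper extreme is actionable on a 1h window."""
    if n > storm_cutoff:
        return "likely alert storm: recommend tightening or recoveryTarget"
    if n == 0:
        return "no breaches: expected for a healthy system"
    return "report the count and let the user decide"
```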
Step 7: Resolve notification channels
The skill must resolve at least one channel before save. An alert with no channels saves successfully and silently never notifies anyone — the second most common silent failure after bad queries. Channel resolution runs after the dry-run so any threshold-driven severity changes (warning → critical) are settled before the user is asked to pick routing, and so we never create a notification channel inline for an alert that fails validation.
- Call signoz:signoz_list_notification_channels to enumerate existing channels.
- If the user named a channel (“send to slack-infra”), use it if it exists; if not, fall through.
- Otherwise present the user with two options:
  - Pick from existing: list channels with their type (Slack, PagerDuty, email, webhook) so the user can choose.
  - Create new inline: call signoz:signoz_create_notification_channel with channel parameters the user provides (name, type, type-specific config like Slack webhook URL or PagerDuty integration key).
- If neither path resolves a channel, stop and ask the user for a notification channel (see Required inputs above).
For multi-severity alerts, attach channels per threshold:
thresholds.spec[N].channels is an array — typically warning → Slack only,
critical → Slack + PagerDuty.
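Because an alert with no channels saves without error, a pre-save check along these lines can catch the gap. Only the thresholds.spec[N].channels path comes from the text above; the "name" key and the function name are assumptions made for this sketch, so confirm the real shape against signoz://alert/instructions.

```python
from typing import Dict, List


def thresholds_without_channels(config: Dict) -> List[str]:
    """Return the thresholds whose channels array is missing or empty."""
    specs = config.get("thresholds", {}).get("spec", [])
    return [
        spec.get("name", f"spec[{index}]")   # "name" key is an assumption
        for index, spec in enumerate(specs)
        if not spec.get("channels")
    ]
```

A non-empty result means Step 7 is not finished and the save should not proceed.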
Handling secret-bearing channel config
Slack webhook URLs, PagerDuty integration keys, and similar webhook tokens are secrets. When the user supplies them inline, treat them as opaque inputs and follow these rules:
- Do not echo the secret back. Never include the webhook URL, integration key, or any password-like token in chat output, previews, confirmation messages, summaries, or the <navigation_suggestions> payload. Refer to the channel by its name only (“Slack channel slack-infra created”) and omit the value entirely.
- Do not stash secrets in clarification context. If you need to ask the user a follow-up question after they pasted a secret, do not include the secret value in the clarification message, discovered_context, or any other field that the host may persist for resume. Refer to it symbolically (e.g. “the webhook you just provided”).
- One-pass only. Pass the secret directly to signoz:signoz_create_notification_channel and do not retain it in any intermediate prose. After the create call succeeds, refer to the channel by name; after a failure, ask the user to re-paste rather than echoing what they sent.
- If the user instead asks “how do I set up a Slack channel?”, that is a docs question, not a create-channel request. Answer with the docs flow (the SigNoz UI’s Notification Channels page) and do not solicit the secret in chat at all. Prefer the UI path when the user seems uncertain about exposing the token.
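As a defensive backstop for the rules above, a host could scrub outgoing text against the secrets supplied in the current turn. This is a hypothetical guard and not part of the SigNoz MCP tools; the function name and placeholder are made up for the example, and it complements the rules rather than replacing them.

```python
from typing import Iterable


def redact(text: str, secrets: Iterable[str], placeholder: str = "[redacted]") -> str:
    """Replace every occurrence of each secret before text leaves the agent."""
    # Longest first, so a secret that contains a shorter one is removed whole.
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        text = text.replace(secret, placeholder)
    return text
```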
Step 8: Preview the prepared config
Emit a one-paragraph plain-language summary of what will be created — no raw JSON dump. The user-facing facts (what fires, on what scope, at what threshold, where it routes) are captured by the summary; clicking through the JSON does not catch query-shape errors (Step 6’s dry-run does).
Summary: This alert fires when [condition] for [resource scope], evaluated every [frequency] over the last [window]. Thresholds: warning at X, critical at Y. Notifications go to [channels]. Dry-run on the last hour: would have fired N times.
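Filling the summary template is mechanical, as the sketch below shows. The function and parameter names are hypothetical, and the sample values in the usage note are illustrative only.

```python
from typing import Dict, List


def render_preview(condition: str, scope: str, frequency: str, window: str,
                   thresholds: Dict[str, str], channels: List[str],
                   fired: int) -> str:
    """Render the one-paragraph plain-language preview."""
    levels = ", ".join(f"{name} at {value}" for name, value in thresholds.items())
    routed = ", ".join(channels)
    return (
        f"This alert fires when {condition} for {scope}, "
        f"evaluated every {frequency} over the last {window}. "
        f"Thresholds: {levels}. "
        f"Notifications go to {routed}. "
        f"Dry-run on the last hour: would have fired {fired} times."
    )
```

Calling it with a CPU condition, a service.name scope, the default 1m0s / 5m0s cadence, two thresholds, and one channel yields a single paragraph in the shape of the template, with no JSON and no channel secrets.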
Step 9: Save and report
- Call signoz:signoz_create_alert with the config from Step 8.
- Name collision: if signoz:signoz_create_alert returns a duplicate-name error, do not suffix-append or call signoz:signoz_update_alert. Stop and tell the user the existing alert blocked creation; offer to use a different name or modify the existing alert (which is out of scope for this skill).
- On success, report:
  - The alert ID and name.
  - What it watches and at what threshold.
  - Which channels are wired up.
  - The dry-run summary (“would have fired N times in last 1h”).
Guardrails
- Strict inputs over guessing. Resource attribute and channel are required. If missing, stop and ask the user (see Required inputs above). Creating an alert on a guessed service is harder to undo than asking.
- Always paginate signoz:signoz_list_alert_rules. Stopping at page 1 misses duplicates and produces noise.
- Dry-run is mandatory. Step 4 (data probe) and Step 6 (full query + threshold calibration) are both required before signoz:signoz_create_alert. A never-firing alert is worse than no alert: it provides a false sense of safety.
- Threshold operators use canonical words. Prefer op: "above" / "below" / "equals" / "not_equals". Numeric codes ("1"–"7") are accepted but discouraged; the same goes for matchType ("on_average" / "at_least_once", not "3" / "1").
- Signal must match alertType. signal: "logs" requires LOGS_BASED_ALERT. Mismatches fail validation.
- Anomaly rules are metrics-only. anomaly_rule + non-metric alertType is rejected.
- Channels must exist. Use names from signoz:signoz_list_notification_channels exactly, or create the channel inline first.
- Never echo channel secrets. Slack webhook URLs, PagerDuty integration keys, and similar webhook tokens are secrets. Pass them to signoz:signoz_create_notification_channel once and never repeat the value in chat output, previews, confirmations, summaries, clarification payloads, or navigation suggestions. Refer to the channel by name only after creation; ask the user to re-paste on failure rather than reproducing what they sent.
Examples
Four canonical alert flows — multi-severity metric threshold,
error-rate formula, log-volume groupBy, anomaly detection — live in
references/examples.md.
Additional resources
- signoz://alert/instructions and signoz://alert/examples MCP resources: full alert config JSON schema, threshold codes, filter expression syntax, and version-current pattern examples. Always preferred over any transcribed copy.
- signoz-generating-queries skill: for authoring PromQL or testing queries before wrapping them in an alert.