Agent Skill · SigNoz

signoz-creating-alerts

Create a new SigNoz alert rule from a natural-language intent — threshold, anomaly, log-volume, error-rate, latency, or absent-data alerts across metrics, logs, traces, and exceptions. Make sure to use this skill whenever the user says "alert me when…", "notify me if…", "set up monitoring for…", "page me on…", "create an alert for…", or asks for a new alert/notification rule, even if they don't say the word "alert" explicitly. Also use it when someone asks to be notified about error rates, latency spikes, log volume, CPU/memory pressure, or anomalous behavior on a service or host.

Provider: SigNoz
Path in repo: plugins/signoz/skills/signoz-creating-alerts/SKILL.md

Skill body

Alert Create

Build a SigNoz alert from a user’s natural-language intent. The skill targets two consumers: an autonomous AI SRE agent that runs without a human in the loop, and a human at a Claude Code / Codex / Cursor prompt. Both go through the same flow.

Prerequisites

This skill calls SigNoz MCP server tools (signoz:signoz_create_alert, signoz:signoz_list_alert_rules, signoz:signoz_get_field_keys, etc.). Before running the workflow, confirm the signoz:signoz_* tools are available. If they are not, run signoz-mcp-setup first to initialize or repair the MCP connection. Do not try to fall back to raw HTTP calls or fabricate alert configs without the MCP tools.

When to use

Use this skill when the user wants to:

Do NOT use when the user wants to:

Required inputs (strict)

Alert creation is a write operation against a shared system. Guessing here creates noisy alerts on the wrong service that someone else has to clean up. The skill enforces a strict input contract:

| Input | Required | Source if missing |
| --- | --- | --- |
| Alert intent (NL goal) | yes | $ARGUMENTS or recent user turn |
| Resource attribute filter (e.g. service.name, k8s.namespace.name, host.name) | yes | discover via signoz:signoz_get_field_keys + signoz:signoz_get_field_values |
| Threshold value(s) | inferred from intent | derive a sensible default and surface in the preview |
| Severity | inferred from intent | default warning; promote to critical only if user said “page”, “wake up”, “critical” |
| Notification channel | yes | signoz:signoz_list_notification_channels + offer “create new” |

If a required input is missing and cannot be discovered, stop before calling any write tool and ask the user. The host application decides how the question is surfaced (a structured clarification tool, inline <assistant_question> tags, an interactive prompt, etc.) — follow the host’s UI rendering rules.

What to include in the question:

In autonomous mode (no human), escalate to the caller or fill the gap from upstream context. Either way, do not proceed to signoz:signoz_create_alert with a guessed value.
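The input contract above can be sketched as a small gate that runs before any write call. This is a minimal illustration, not part of the skill: `AlertInputs`, `missing_required`, and `may_call_write_tool` are hypothetical names, and only the three inputs marked "yes" in the table are treated as blocking.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertInputs:
    """Hypothetical container for the inputs listed in the table above."""
    intent: Optional[str] = None
    resource_filter: Optional[str] = None       # e.g. "service.name = 'checkout'"
    notification_channel: Optional[str] = None
    threshold: Optional[float] = None            # inferred from intent, never blocks
    severity: str = "warning"                    # inferred from intent, defaults to warning


def missing_required(inputs: AlertInputs) -> list[str]:
    """Return the names of required inputs that are still unresolved."""
    required = {
        "alert intent": inputs.intent,
        "resource attribute filter": inputs.resource_filter,
        "notification channel": inputs.notification_channel,
    }
    return [name for name, value in required.items() if not value]


def may_call_write_tool(inputs: AlertInputs) -> bool:
    """Allow the create call only when nothing required is missing."""
    return not missing_required(inputs)
```

The list returned by `missing_required` is what a clarification question (or an escalation in autonomous mode) would enumerate.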

Workflow

Step 1: Parse intent and check what’s missing

Extract from the user’s request:

  1. What to monitor — signal type (metrics / logs / traces / exceptions) and the specific condition (CPU, error rate, p99 latency, log count, …).
  2. Resource scope — which service, host, namespace, or environment.
  3. Threshold — numeric value and comparison (“above 80%”, “below 100/s”).
  4. Severity — implicit from urgency words (“page” → critical, default warning otherwise).
  5. Channel — explicit channel name if the user provided one.

Map signal phrasing to alert type:

| User says | alertType | signal |
| --- | --- | --- |
| metric, CPU, memory, latency, request rate | METRIC_BASED_ALERT | metrics |
| log, error logs, log volume, log pattern | LOGS_BASED_ALERT | logs |
| trace, span, latency p99, slow requests | TRACES_BASED_ALERT | traces |
| exception, stack trace, crash | EXCEPTIONS_BASED_ALERT | (clickhouse_sql) |
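One way to apply the mapping above is a keyword lookup that returns every matching alert type rather than a single guess. The sketch below is an assumption about how such matching could work: the keyword lists are taken from the table, while `candidate_alert_types` and its regex matching are hypothetical.

```python
import re

# Keywords copied from the table above; the matching logic itself is an assumption.
SIGNAL_KEYWORDS = {
    "METRIC_BASED_ALERT": ["metric", "cpu", "memory", "latency", "request rate"],
    "LOGS_BASED_ALERT": ["log", "error log", "log volume", "log pattern"],
    "TRACES_BASED_ALERT": ["trace", "span", "p99", "slow request"],
    "EXCEPTIONS_BASED_ALERT": ["exception", "stack trace", "crash"],
}


def candidate_alert_types(intent: str) -> list[str]:
    """Return every alert type whose keywords appear as whole words in the intent."""
    text = intent.lower()
    matches = []
    for alert_type, keywords in SIGNAL_KEYWORDS.items():
        # Allow a plural suffix so "logs" and "crashes" match "log" and "crash".
        if any(re.search(rf"\b{re.escape(kw)}(s|es)?\b", text) for kw in keywords):
            matches.append(alert_type)
    return matches
```

A phrase such as "p99 latency on checkout" matches both the metric and the trace rows, so the function returns two candidates and the ambiguity can be resolved explicitly instead of silently.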

If resource scope is missing, run discovery (Step 2). If still missing after discovery, stop and ask the user (see Required inputs above).

Step 2: Discover resource attributes and metric names

When the user does not name a service / host / namespace, the SigNoz MCP guideline applies: always prefer a resource-attribute filter. Discover candidates instead of guessing:

  1. Call signoz:signoz_get_field_keys with fieldContext=resource to enumerate resource attributes for the chosen signal.
  2. Call signoz:signoz_get_field_values for the most likely attribute (typically service.name, then host.name, then k8s.namespace.name) to get concrete values.
  3. If the user mentioned a metric by name, call signoz:signoz_list_metrics with a search term to verify the exact OTel metric name. Wrong names create alerts that never fire.

Surface the candidates in your clarification request (see Required inputs above). Do not pick one.
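The attribute preference order in the discovery steps can be sketched as a small pure function. `attribute_to_enumerate` is a hypothetical helper; it only decides which attribute to list values for and never selects a value on the user's behalf.

```python
from typing import Optional

# Preference order taken from the discovery steps above.
PREFERRED_ATTRIBUTES = ["service.name", "host.name", "k8s.namespace.name"]


def attribute_to_enumerate(discovered_keys: list[str]) -> Optional[str]:
    """Pick the first preferred resource attribute that discovery actually returned."""
    for name in PREFERRED_ATTRIBUTES:
        if name in discovered_keys:
            return name
    return None  # nothing usable was discovered: stop and ask the user
```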

Step 3: Check for duplicate alerts

Once the scope is resolved (either provided by the user or discovered in Step 2), check for existing alerts before probing data or authoring a new config — both are wasted work if the user wants to update an existing rule instead.

Call signoz:signoz_list_alert_rules and keep paginating while pagination.hasMore is true, until you have walked the full list. This lists configured alert rules (the durable state); do not use signoz:signoz_list_alerts, which returns currently triggered/active alert instances and will silently miss rules that are configured but not firing right now. Check for existing rules that match the user’s intent (same signal + same scope + similar threshold). If a likely duplicate exists, surface it and ask whether to create a new one anyway, modify the existing one (out of scope here; use signoz:signoz_update_alert), or cancel.
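The pagination walk can be sketched as a generator. Only `pagination.hasMore` comes from the text above; the `rules` key, the offset-based paging, and the `fetch_page` callable (a stand-in for the MCP tool call) are assumptions made for illustration.

```python
from typing import Callable, Iterator


def iter_alert_rules(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every configured rule, requesting pages until hasMore is false."""
    offset = 0
    while True:
        page = fetch_page(offset)            # hypothetical wrapper around the list tool
        rules = page.get("rules", [])        # assumed response key
        yield from rules
        if not page.get("pagination", {}).get("hasMore"):
            break
        if not rules:
            break  # guard against hasMore being reported alongside an empty page
        offset += len(rules)
```

Collecting the generator into a list before the duplicate check ensures rules on later pages are compared too.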

Step 4: Probe data existence for the chosen filter (fail fast)

Before authoring any alert config, confirm the specific combination the alert will watch (metric × service × any other filter) actually emits data. The most common silent failure is “metric exists in the catalog and the service exists in the catalog, but the service doesn’t emit that metric” — each piece checks out in isolation, the alert saves successfully, and it silently never fires.

Run a single probe over the last 1 hour using the same filter the alert will use, but with the simplest aggregation that confirms data exists:

Inspect the result:

Exception — log-based crash / panic / OOMKilled / FATAL alerts. These intentionally have zero matches in a healthy system. The probe will return empty by design. Do not stop; instead, surface the zero-match result and ask the user to confirm before save. Treat this exception narrowly: it applies to “alert me when bad thing happens” log queries, not to alerts that depend on continuous data flow.

This probe is cheap (one query, ~100ms), and catching the no-data case early avoids the worst UX failure mode of this skill — the user reading through a fully-authored JSON payload and only then learning the alert can never fire.
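The decision that follows the probe can be sketched as a three-way outcome. The names `ProbeOutcome` and `decide_after_probe` are hypothetical; the logic follows the stop rule and the narrow log-based exception described in this step.

```python
from enum import Enum


class ProbeOutcome(Enum):
    PROCEED = "proceed"
    STOP_AND_ASK = "stop_and_ask"
    CONFIRM_ZERO_MATCH = "confirm_zero_match"


def decide_after_probe(row_count: int, is_rare_event_log_alert: bool) -> ProbeOutcome:
    """Map the probe result to the next action.

    is_rare_event_log_alert is True only for log queries that watch for a bad
    event (crash, panic, OOMKilled, FATAL), which are expected to be empty
    in a healthy system.
    """
    if row_count > 0:
        return ProbeOutcome.PROCEED
    if is_rare_event_log_alert:
        return ProbeOutcome.CONFIRM_ZERO_MATCH   # surface zero matches, ask before save
    return ProbeOutcome.STOP_AND_ASK             # alert would never fire
```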

Step 5: Build the alert config

The MCP server is the source of truth for the alert JSON schema, threshold codes, and validation rules. Read the signoz://alert/instructions and signoz://alert/examples MCP resources for the canonical, version-current shape.

For most user intents, the config is one of a small number of patterns:

| Pattern | Example intents |
| --- | --- |
| Single-metric threshold | “alert when CPU > 80%”, “p99 latency > 2s” |
| Log volume threshold | “more than N error logs/min” |
| Trace-based count or p-tile | “p99 span duration > 2s on checkout” |
| Error-rate formula (A/B*100); see “Common query shapes” below | “error rate > 5%” |
| Anomaly detection (Z-score) | “alert me on anomalous traffic” |
| Absent-data alert | “alert if data stops arriving” |
| ClickHouse SQL alert; author SQL using the schema in signoz://alert/examples | non-trivial joins, custom aggregations the builder cannot express |
| PromQL alert; delegate to signoz-generating-queries for the query, then return here | when user already has PromQL |

Threshold op and matchType values. v2alpha1 accepts the human-readable strings ("above", "on_average"); the legacy numeric codes ("1", "3") are also accepted but harder to read in the UI. Prefer the words. Anomaly rules only support op: "above" — the engine already treats z-score breaches as two-sided when the threshold is positive, so "above_or_below" is rejected and unnecessary.

| Comparison | op | Evaluation behavior | matchType |
| --- | --- | --- | --- |
| above / exceeds / > | "above" | breach at any point | "at_least_once" |
| below / under / < | "below" | breach for entire window | "all_the_times" |
| equal / = | "equals" | average breaches | "on_average" |
| not equal / != | "not_equals" | sum breaches | "in_total" |
| | | last value breaches | "last" |
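The comparison column of the table above can be normalized to the human-readable op strings with a lookup, including the anomaly restriction. This is a sketch under the assumption that phrasing has already been reduced to one of the listed comparison words; `normalize_op` is a hypothetical helper.

```python
# Aliases and values copied from the table above.
OP_ALIASES = {
    "above": "above", "exceeds": "above", ">": "above",
    "below": "below", "under": "below", "<": "below",
    "equal": "equals", "=": "equals",
    "not equal": "not_equals", "!=": "not_equals",
}
MATCH_TYPES = {"at_least_once", "all_the_times", "on_average", "in_total", "last"}


def normalize_op(comparison: str, is_anomaly: bool = False) -> str:
    """Translate a comparison word or symbol into the op string."""
    op = OP_ALIASES[comparison.strip().lower()]   # KeyError on unknown phrasing
    if is_anomaly and op != "above":
        raise ValueError("anomaly rules only support op 'above'")
    return op
```

Raising on an unknown comparison, rather than defaulting, keeps a misparsed intent from reaching the write call.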

Defaults the skill applies (and surfaces in the preview):

Severity defaults — derive from the intrinsic urgency of the alert, not just the user’s words. The user saying “alert me” doesn’t force warning when the condition itself describes a critical event. Use this table; an explicit user cue overrides it (“just FYI” → demote, “page me” / “wake me up” → promote).

| Alert intent | Default severity |
| --- | --- |
| Pod crash / OOMKilled / CrashLoopBackOff / panic / FATAL log signals | critical |
| Service down / no-data on a production service | critical |
| Error rate above any non-trivial threshold (>1%) | critical |
| Error logs / exception spikes | warning |
| Latency degradation (p95/p99 above target) | warning |
| CPU / memory / disk pressure | warning |
| Request-rate / traffic anomaly | warning |
| SLO budget burn (info-level burn rate) | info / warning |

When the user’s intent is ambiguous on severity (no urgency cue, no clearly-critical condition), default to warning and surface the choice in the preview so they can adjust.
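The severity rules can be sketched as a table lookup followed by an explicit-cue override. The category keys are hypothetical labels for rows of the table above, the SLO row is left out because its default is not a single value, and treating "demote" as one step down is an assumption.

```python
LEVELS = ["info", "warning", "critical"]

# Hypothetical category keys standing in for rows of the severity table.
INTRINSIC_DEFAULTS = {
    "pod_crash": "critical",
    "service_down": "critical",
    "error_rate": "critical",
    "error_logs": "warning",
    "latency": "warning",
    "resource_pressure": "warning",
    "traffic_anomaly": "warning",
}


def default_severity(category: str, user_text: str = "") -> str:
    """Start from the intrinsic default, then apply an explicit user cue."""
    severity = INTRINSIC_DEFAULTS.get(category, "warning")   # ambiguous -> warning
    text = user_text.lower()
    if any(cue in text for cue in ("page me", "wake me up")):
        return "critical"                                     # promote
    if "just fyi" in text:
        # Assumption: demote means one step down, never below info.
        return LEVELS[max(LEVELS.index(severity) - 1, 0)]
    return severity
```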

OTel attribute names — always use semantic conventions: service.name, host.name, k8s.namespace.name, deployment.environment or deployment.environment.name. Never service, host, or env.

Annotation templates — the on-call engineer sees the notification, not the alert config. A notification that says “Pod crash detected” with no service name, no count, and no value is nearly useless at 3am. Always include the moving values:

Use {{$value}} for the breaching value, {{$threshold}} for the target, and {{$labels.<key>}} for groupBy values (note that SigNoz replaces the dots in an attribute name with underscores: service.name becomes service_name).
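As an illustration of the placeholder rules, the sketch below builds an annotation string from a condition and the groupBy attributes. `label_placeholder` and `build_annotation` are hypothetical helpers, and the sentence layout is only one possible wording.

```python
def label_placeholder(attribute: str) -> str:
    """Turn a dotted attribute name into its label placeholder."""
    return "{{$labels." + attribute.replace(".", "_") + "}}"


def build_annotation(condition: str, group_by: list[str]) -> str:
    """Compose a notification line that carries the moving values."""
    labels = ", ".join(attr + "=" + label_placeholder(attr) for attr in group_by)
    return condition + " on " + labels + ": value {{$value}} vs threshold {{$threshold}}"
```

For a groupBy on service.name, the result contains `{{$labels.service_name}}`, so the notification names the affected service.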

Common query shapes — conventions

Read signoz://alert/examples for the authoritative JSON of all patterns (error rate, p99 latency, log volume, absent-data, anomaly, PromQL, ClickHouse SQL). The conventions that don’t live in the schema:

Step 6: Dry-run the full query and validate the threshold

Step 4 confirmed data flows. Step 6 does two things:

  1. Validate query shape. Run the full builder spec (with groupBy, formulas, disabled component queries, and non-string filters) — Step 4’s bare count() probe doesn’t exercise these. The create-alert schema accepts queries that error at evaluation (numeric groupBy, unquoted bool filter, mismatched aggregation). Any HTTP 5xx or “filter type mismatch” = fix the config before proceeding to (2). disabled: true on formula component queries (A, B in A * 100 / B) is the recommended pattern, not a failure — see Step 5.
  2. Calibrate the threshold. Given the validated query, would the alert have fired a sensible number of times in the last hour?

Run the full primary query (or formula) over the last hour:

Compute how many evaluation points breached the proposed threshold. Surface in the preview as “would have fired N times in the last 1h”. A 1h window is too short to grade most alerts — only the upper extreme is actionable:

If Step 4 was somehow skipped (e.g. a downstream skill is invoking this flow mid-stream), the no-data stop rule applies here too: empty result → stop and ask the user (see Required inputs above) instead of saving an alert that will never fire.
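The breach count used for calibration can be sketched as follows, assuming the dry-run result has already been flattened into a list of numeric evaluation points. `count_breaches` and `dry_run_summary` are hypothetical names.

```python
import operator

# op strings from the threshold table, mapped to Python comparisons.
COMPARATORS = {
    "above": operator.gt,
    "below": operator.lt,
    "equals": operator.eq,
    "not_equals": operator.ne,
}


def count_breaches(values: list[float], op: str, threshold: float) -> int:
    """Count evaluation points that breach the proposed threshold."""
    compare = COMPARATORS[op]
    return sum(1 for value in values if compare(value, threshold))


def dry_run_summary(values: list[float], op: str, threshold: float) -> str:
    """Format the count in the wording used by the preview."""
    return f"would have fired {count_breaches(values, op, threshold)} times in the last 1h"
```

This counts point-wise breaches only; a matchType such as "all_the_times" or "on_average" would need the window semantics applied before counting.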

Step 7: Resolve notification channels

The skill must resolve at least one channel before save. An alert with no channels saves successfully and silently never notifies anyone — the second most common silent failure after bad queries. Channel resolution runs after the dry-run so any threshold-driven severity changes (warning → critical) are settled before the user is asked to pick routing, and so we never create a notification channel inline for an alert that fails validation.

  1. Call signoz:signoz_list_notification_channels to enumerate existing channels.
  2. If the user named a channel (“send to slack-infra”), use it if it exists; if not, fall through.
  3. Otherwise present the user with two options:
    • Pick from existing — list channels with their type (Slack, PagerDuty, email, webhook) so the user can choose.
    • Create new inline — call signoz:signoz_create_notification_channel with channel parameters the user provides (name, type, type-specific config like Slack webhook URL or PagerDuty integration key).
  4. If neither path resolves a channel, stop and ask the user for a notification channel (see Required inputs above).

For multi-severity alerts, attach channels per threshold: thresholds.spec[N].channels is an array — typically warning → Slack only, critical → Slack + PagerDuty.
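Per-threshold routing can be sketched as below. Only the `channels` array comes from the text above; the `name` key on each threshold entry and the helper names are assumptions, so check signoz://alert/instructions for the actual shape.

```python
def attach_channels(threshold_specs: list[dict], routing: dict[str, list[str]]) -> list[dict]:
    """Set the channels array on each threshold entry from a severity -> channels map."""
    for spec in threshold_specs:
        # "name" is an assumed key holding the severity label of the threshold.
        spec["channels"] = list(routing.get(spec["name"], []))
    return threshold_specs


def unrouted_thresholds(threshold_specs: list[dict]) -> list[str]:
    """Names of thresholds that would save without notifying anyone."""
    return [spec["name"] for spec in threshold_specs if not spec.get("channels")]
```

A non-empty result from `unrouted_thresholds` is the signal to stop and ask for a channel before saving.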

Handling secret-bearing channel config

Slack webhook URLs, PagerDuty integration keys, and similar webhook tokens are secrets. When the user supplies them inline, treat them as opaque inputs and follow these rules:

Step 8: Preview the prepared config

Emit a one-paragraph plain-language summary of what will be created — no raw JSON dump. The user-facing facts (what fires, on what scope, at what threshold, where it routes) are captured by the summary; clicking through the JSON does not catch query-shape errors (Step 6’s dry-run does).

Summary: This alert fires when [condition] for [resource scope], evaluated every [frequency] over the last [window]. Thresholds: warning at X, critical at Y. Notifications go to [channels]. Dry-run on the last hour: would have fired N times.
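The summary template can be filled in with a small formatter. `render_summary` and its parameters are hypothetical; the wording follows the template above.

```python
def render_summary(condition: str, scope: str, frequency: str, window: str,
                   thresholds: dict[str, float], channels: list[str], fired: int) -> str:
    """Render the plain-language preview from already-resolved values."""
    threshold_text = ", ".join(f"{level} at {value}" for level, value in thresholds.items())
    return (
        f"This alert fires when {condition} for {scope}, evaluated every {frequency} "
        f"over the last {window}. Thresholds: {threshold_text}. "
        f"Notifications go to {', '.join(channels)}. "
        f"Dry-run on the last hour: would have fired {fired} times."
    )
```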

Step 9: Save and report

  1. Call signoz:signoz_create_alert with the config from Step 8.
  2. Name collision — if signoz:signoz_create_alert returns a duplicate-name error, do not suffix-append or call signoz:signoz_update_alert. Stop and tell the user the existing alert blocked creation; offer to use a different name or modify the existing alert (which is out of scope for this skill).
  3. On success, report:
    • The alert ID and name.
    • What it watches and at what threshold.
    • Which channels are wired up.
    • The dry-run summary (“would have fired N times in last 1h”).

Guardrails

Examples

Four canonical alert flows — multi-severity metric threshold, error-rate formula, log-volume groupBy, anomaly detection — live in references/examples.md.

Additional resources

Skill frontmatter

argument-hint: