Agent Skill · Microsoft Azure

amg-check-cosmosdb-mongo-ru

Fleet-wide Cosmos DB for MongoDB (RU) health check. Scans NormalizedRU consumption, service availability, server-side latency, throttling (429s), and replication metrics across all accounts, then deep-dives into abnormal accounts with resource logs and correlation analysis. Tracks known issues across sessions in a persistent report. Uses the AMG-MCP pulse check for Tier 1 triage, then batched Azure Monitor queries for Tier 2 investigation. On first run, auto-discovers the datasource UID and prompts for a subscription ID.

Provider: Microsoft Azure
Path in repo: plugins/amg-toolkit/skills/amg-check-cosmosdb-mongo-ru/SKILL.md

Skill body

Runtime Context

Known Issues: Before presenting findings, cross-reference results against memory/amg-check-cosmosdb-mongo-ru/report.md.

Cosmos DB for MongoDB (RU) Health Check

Critical Constraints

Progress Tracking

Update checkboxes as you complete each phase:

Configuration

If Config shows NOT_CONFIGURED: Run First-Run Setup at the bottom of this file, then return here.

If Config is populated: Extract the datasource UID and subscription ID from the pre-loaded Runtime Context above and use them for all queries. Use $1 as the subscription override if provided.

Time Range

Default: 7 days for metrics, 24 hours for logs. Override with $0 (e.g., 3d). Keep log queries to 1-2 days to avoid timeouts.
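The time-range rules above can be sketched as a small helper. This is only an illustration: the function names are hypothetical, the sketch accepts only the `Nd` form shown in the example, and capping an overridden log window at 2 days is one reading of "keep log queries to 1-2 days".

```python
import re
from typing import Optional


def parse_time_range(arg: Optional[str], default_days: int = 7) -> int:
    """Return the metric look-back window in days for an argument such as '3d'."""
    if not arg:
        return default_days
    match = re.fullmatch(r"(\d+)d", arg.strip().lower())
    if match is None or int(match.group(1)) == 0:
        raise ValueError(f"Unsupported time range: {arg!r} (expected e.g. '3d')")
    return int(match.group(1))


def log_window_days(override_days: Optional[int], cap_days: int = 2) -> int:
    """Logs default to 1 day; an override is capped (assumed cap: 2 days)."""
    if override_days is None:
        return 1
    return min(override_days, cap_days)
```

With no argument, metrics use 7 days and logs use 1 day; with `3d`, metrics use 3 days while logs stay capped at 2.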


Workflow

Phase 1a: Validate Datasource

Call amgmcp_datasource_list (no parameters). Find the entry with type == "grafana-azure-monitor-datasource".

Phase 1b: Discover All Cosmos DB for MongoDB (RU) Accounts

azureMonitorDatasourceUid: {DATASOURCE_UID}
query: |
  resources
  | where type == 'microsoft.documentdb/databaseaccounts'
  | where kind == 'MongoDB'
  | project name, resourceGroup, location, subscriptionId, id, properties.provisioningState
  | order by location asc, name asc

If the config specifies subscription IDs (not “all”), add | where subscriptionId in ('{ID1}', '{ID2}'). Derive region summary by counting accounts per location. Flag accounts not in “Succeeded” state. Stop if zero accounts found.
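As a rough sketch of the post-processing described above, the helpers below build the optional subscription filter and derive the region summary. The function names are hypothetical, and the row shape (a list of dicts keyed by column name, with the state in a `properties_provisioningState` column) is an assumption about how the query result is returned, so adjust `state_key` to the actual column name.

```python
from collections import Counter


def subscription_filter(subscription_ids):
    """Return the extra KQL clause, or '' when the config says 'all'."""
    if not subscription_ids or "all" in subscription_ids:
        return ""
    quoted = ", ".join(f"'{sid}'" for sid in subscription_ids)
    return f"| where subscriptionId in ({quoted})"


def summarize_accounts(rows, state_key="properties_provisioningState"):
    """Count accounts per location and list accounts not in 'Succeeded' state."""
    if not rows:
        # Mirrors the "stop if zero accounts found" rule.
        raise RuntimeError("No Cosmos DB for MongoDB (RU) accounts found")
    per_region = dict(Counter(row["location"] for row in rows))
    not_succeeded = [row["name"] for row in rows if row.get(state_key) != "Succeeded"]
    return per_region, not_succeeded
```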

Why kind == 'MongoDB'? Filters for RU-based MongoDB API accounts. vCore-based MongoDB uses microsoft.documentdb/mongoclusters.

Phase 1c: Activity Log for Non-Succeeded Accounts

If any accounts are not in “Succeeded” state, query the activity log for up to 3 of them:

azureMonitorDatasourceUid: {DATASOURCE_UID}
scope: {account's full ARM resource ID}
startTime: now-3d
endTime: now
select: eventTimestamp,operationName,status,caller,subStatus

If the response exceeds 500 KB, retry with startTime: now-1d. Summarize: operations performed, caller type, success/in-progress status, likely cause.

Phase 2: Validate Available Metrics

Call amgmcp_query_resource_metric_definition on the first account from Phase 1. Confirm expected metrics exist. Run only once — definitions are the same across all accounts.

Phase 3: Tier 1 — Fleet-Wide Pulse Check

azureMonitorDatasourceUid: {DATASOURCE_UID}
pastDays: 7
scenarios: cosmosdb_mongo

Scans all accounts across 3 scenarios: cosmosdb_mongo_ru, cosmosdb_mongo_throttling, cosmosdb_mongo_availability.

Before moving to Phase 4, verify:

  1. scanSummary.totalResourcesScanned matches Phase 1 account count.
  2. All 3 scenarios show status: "completed" in scenarioResults.
  3. If errors is non-empty, retry the affected scenarios individually.
  4. If more than 10% of accounts are missing from the scan, fall back to batched amgmcp_query_resource_metric for the unscanned accounts.
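The four checks above could be expressed roughly as follows. The response shape is an assumption: only `scanSummary.totalResourcesScanned`, `scenarioResults`, `status`, and `errors` are named in this skill, so treating `scenarioResults` as a list of objects with a `scenario` field is a guess, and `verify_pulse_check` is a hypothetical helper.

```python
EXPECTED_SCENARIOS = {
    "cosmosdb_mongo_ru",
    "cosmosdb_mongo_throttling",
    "cosmosdb_mongo_availability",
}


def verify_pulse_check(response, expected_accounts):
    """Summarize whether the pulse check result is complete enough to trust."""
    scanned = response.get("scanSummary", {}).get("totalResourcesScanned", 0)
    completed = {
        result.get("scenario")
        for result in response.get("scenarioResults", [])  # assumed: list of dicts
        if result.get("status") == "completed"
    }
    missing = max(expected_accounts - scanned, 0)
    return {
        "count_matches": scanned == expected_accounts,
        "incomplete_scenarios": sorted(EXPECTED_SCENARIOS - completed),
        "has_errors": bool(response.get("errors")),
        # Fall back to batched metric queries when more than 10% are unscanned.
        "fallback_needed": expected_accounts > 0 and missing / expected_accounts > 0.10,
    }
```

A result with `incomplete_scenarios` or `has_errors` set would trigger the individual scenario retries; `fallback_needed` would trigger the batched metric queries.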

Accounts in the findings array are abnormal. Also flag any non-Succeeded accounts from Phase 1.

Note: Sustained-high detection (>50% for 6+ hours), RU spike pattern detection (>30pp jump in 1h), and latency analysis require hourly time-series data and are performed in Phase 4 on flagged accounts only.
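To make the two hourly detections concrete, here is a minimal sketch. It assumes the input is a gap-free list of hourly NormalizedRU percentages in time order, reads "6+ hours" as six or more consecutive samples above the threshold, and reads the spike rule as a rise of more than 30 percentage points between adjacent samples. The function names are hypothetical; the authoritative procedure is in reference/phase4-deep-metrics.md.

```python
def has_sustained_high(hourly, threshold=50.0, min_hours=6):
    """True if at least min_hours consecutive hourly samples exceed threshold."""
    run = 0
    for value in hourly:
        run = run + 1 if value > threshold else 0
        if run >= min_hours:
            return True
    return False


def find_spikes(hourly, jump_pp=30.0):
    """Return indexes of samples that rose more than jump_pp over the previous hour."""
    return [
        i for i in range(1, len(hourly))
        if hourly[i] - hourly[i - 1] > jump_pp
    ]
```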

Phase 4: Tier 2 — Deep Metrics for Abnormal Accounts

Read reference/phase4-deep-metrics.md before starting Phase 4. It contains:

Phase 5: Resource Logs for Abnormal Accounts

Read reference/phase5-resource-logs.md before starting Phase 5. It contains:


Output

Present the report using the structure in reference/output-format.md.

Classification:

| Severity | Criteria |
|---|---|
| CRITICAL | NormalizedRU = 100% sustained, OR ServiceAvailability < 99.9%, OR latency avg > 50ms |
| HIGH | NormalizedRU max 85-100% with frequent spikes, OR ReplicationLatency > 1000ms |
| WARNING | NormalizedRU max 70-85% sustained, OR sustained RU > 50% for 6h+, OR RU spike >30pp in 1h, OR ServiceAvailability < 99.99%, OR latency avg > 10ms, OR ReplicationLatency > 100ms |
| MODERATE | NormalizedRU max 50-70% |
| HEALTHY | All metrics within normal ranges (NormalizedRU < 50%) |
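The table can be read as a first-match cascade from most to least severe. The sketch below is one possible encoding, not the skill's own logic: `classify` and its parameter names are hypothetical, judgments such as "sustained" and "frequent spikes" are passed in as booleans decided elsewhere, and, as a simplification, any peak NormalizedRU of 70% or more is treated as at least WARNING.

```python
def classify(
    ru_max,
    *,
    pinned_at_100=False,       # NormalizedRU held at 100% (judged elsewhere)
    frequent_spikes=False,     # repeated spikes in the 85-100% band
    sustained_over_50=False,   # >50% for 6h+ (Phase 4)
    spike_over_30pp=False,     # >30pp jump within 1h (Phase 4)
    availability=100.0,        # ServiceAvailability, percent
    latency_avg_ms=0.0,
    replication_latency_ms=0.0,
):
    """Return the first severity whose criteria match, checked top-down."""
    if pinned_at_100 or availability < 99.9 or latency_avg_ms > 50:
        return "CRITICAL"
    if (ru_max >= 85 and frequent_spikes) or replication_latency_ms > 1000:
        return "HIGH"
    if (
        ru_max >= 70  # simplification: ignores the "sustained" qualifier
        or sustained_over_50
        or spike_over_30pp
        or availability < 99.99
        or latency_avg_ms > 10
        or replication_latency_ms > 100
    ):
        return "WARNING"
    if ru_max >= 50:
        return "MODERATE"
    return "HEALTHY"
```

Checking the most severe tier first keeps an account that matches several rows from being reported at a lower severity.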

Update Known Issues

After presenting findings, update memory/amg-check-cosmosdb-mongo-ru/report.md:

  1. Read the current file.
  2. Rebuild the Resource Inventory table at the end: every account, full ARM ID, region, subscription, state. Group by region, sorted alphabetically.
  3. Update existing bug status from today’s telemetry (resolved / improving / worsening / still active).
  4. Add new bugs with: severity, account name, region, metric evidence, log evidence, root cause, recommended action.
  5. Update the “Updated” date header.

Only add genuine issues: sustained throttling, availability drops, high latency patterns, or replication problems. Skip transient single-hour spikes or expected maintenance windows.

Error Handling

See reference/error-handling.md for the full recovery table.

Analysis Guidance

Reference


First-Run Setup

Run only when Config shows NOT_CONFIGURED. After completing, return to the Workflow above.

1. Discover Datasource UID: Call amgmcp_datasource_list. Filter type == "grafana-azure-monitor-datasource". Prefer uid == "azure-monitor-oob" if multiple match. Abort if zero match.

2. Discover Subscription ID: Run this Resource Graph query to list all subscriptions that contain Cosmos DB for MongoDB (RU) accounts:

resources
| where type == 'microsoft.documentdb/databaseaccounts'
| where kind == 'MongoDB'
| join kind=inner (
    resourcecontainers
    | where type == 'microsoft.resources/subscriptions'
    | project subscriptionId, subscriptionName=name
) on subscriptionId
| summarize AccountCount=count() by subscriptionId, subscriptionName
| order by AccountCount desc

Present the results as a table with columns: Subscription Name, Subscription ID, Account Count. Then ask the user: “Which subscription ID(s) should I configure for this health check? Or type ‘all’ to scan all subscriptions.”

3. Write config: Write memory/amg-check-cosmosdb-mongo-ru/config.md:

# amg-check-cosmosdb-mongo-ru Configuration

User-specific values for the Cosmos DB for MongoDB (RU) health check skill.
This file is auto-generated on first run and can be edited manually.

## Azure Monitor Datasource
- **UID**: {discovered_uid}
- **Name**: {discovered_name}

## Subscription
- {subscription_id_or_"all"}

4. Confirm: Show the resolved config and ask for confirmation before proceeding.
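The selection rule in step 1 can be sketched as below. It assumes the amgmcp_datasource_list result is available as a list of dicts with `type` and `uid` keys; `pick_datasource` is a hypothetical helper, and falling back to the first match when the preferred UID is absent is an assumption, since the step only says which UID to prefer.

```python
AZURE_MONITOR_TYPE = "grafana-azure-monitor-datasource"
PREFERRED_UID = "azure-monitor-oob"


def pick_datasource(datasources):
    """Choose the Azure Monitor datasource, preferring the out-of-box UID."""
    matches = [d for d in datasources if d.get("type") == AZURE_MONITOR_TYPE]
    if not matches:
        # Step 1 says to abort when nothing matches.
        raise LookupError("No Azure Monitor datasource found; aborting setup")
    for datasource in matches:
        if datasource.get("uid") == PREFERRED_UID:
            return datasource
    return matches[0]  # assumed tie-break when the preferred UID is absent
```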

Skill frontmatter

argument-hint: [time-range, e.g. 7d, 1d, 3d] [subscription-id]
disable-model-invocation: true
effort: max
allowed-tools: mcp__amg__amgmcp_pulse_check mcp__amg__amgmcp_query_resource_graph mcp__amg__amgmcp_query_resource_metric mcp__amg__amgmcp_query_resource_metric_definition mcp__amg__amgmcp_query_resource_log mcp__amg__amgmcp_datasource_list mcp__amg__amgmcp_query_activity_log Bash(node *) Glob Read Write Edit