Agent Skill · Amplitude

analyze-experiments

Designs A/B tests with proper metrics and variants, analyzes running or completed experiments, and interprets results with statistical rigor. Use when setting up experiments, checking experiment status, analyzing results, or making ship decisions.

Provider: Amplitude Path in repo: analytics-skills/skills/analyze-experiment/SKILL.md

Skill body

Experiment Analyst

Perform comprehensive, detailed deep-dive analysis of experiments to make data-driven ship/no-ship decisions. This is NOT a quick summary - provide thorough insights with specific numbers and business implications.

When to Use

Analyzing completed experiment results for ship decisions
Checking on running experiment progress and early signals
Understanding why an experiment succeeded or failed
Investigating unexpected results or segment-level effects

Analysis Philosophy

Be comprehensive, not brief:

Include specific numbers, percentages, and data points
Explain statistical meaning AND business implications in plain language
Cover all metrics (primary, secondary, guardrails) with actual values
This is a single comprehensive analysis - do not rush or provide superficial summaries

Instructions

Step 0: Identify Experiment

If user provides a specific experiment:

Accept experiment URL or experiment ID
If URL: use Amplitude:get_from_url to extract details
If ID: proceed to Step 1

If user asks about experiments generally:

Use Amplitude:search with entityTypes: ["EXPERIMENT"] and relevant query terms
Present top 3-5 matches with names, IDs, and states
Ask user which experiment to analyze

If no experiment specified:

Ask explicitly for experiment URL, ID, or search terms and stop

Step 1: Retrieve and Validate Setup

Use Amplitude:get_experiments with experiment ID to capture:

Experiment name, key, description, and state
Start/end dates and duration
Variants: names, traffic allocation
Attached metrics: primary (recommendation=true), secondary, guardrails (stored as IDs)
Bucketing strategy

Get metric names:

Extract metric IDs from the experiment response (e.g., “c4pn8fkv”)
CRITICAL: Amplitude MCP cannot retrieve metric names by ID directly
Workaround options:
1. Search for experiment-related charts using Amplitude:search with entityTypes: ["CHART"] and experiment name
2. Use Amplitude:get_charts on related charts to examine their definitions for metric references
3. Check if experiment description contains links to metric documentation
If metric names cannot be found, report as descriptive placeholders:
- Primary metric: “Primary Goal Metric (ID: {id})”
- Secondary metrics: “Secondary Metric {index} (ID: {id})”
- Include metric IDs so users can look them up in Amplitude UI
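
The placeholder convention above can be sketched as a small helper. This is an illustration only: the function name, the `known_names` mapping, and the example IDs other than “c4pn8fkv” are hypothetical and not part of the Amplitude MCP.

```python
def metric_labels(primary_id, secondary_ids, known_names=None):
    """Map metric IDs to display labels, falling back to placeholders.

    known_names: optional {metric_id: name} recovered from charts or docs
    (hypothetical input; the MCP does not return this mapping directly).
    """
    known_names = known_names or {}
    labels = {
        primary_id: known_names.get(
            primary_id, f"Primary Goal Metric (ID: {primary_id})"
        )
    }
    for index, metric_id in enumerate(secondary_ids, start=1):
        labels[metric_id] = known_names.get(
            metric_id, f"Secondary Metric {index} (ID: {metric_id})"
        )
    return labels
```

Names recovered through the workarounds take precedence; every other metric keeps its ID visible so users can look it up in the Amplitude UI.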

Validation:

Is experiment running or completed? (not draft)
Has it run for 1+ weeks?
Are variants and metrics clearly defined?

If incomplete, explain what’s missing and stop.

Step 2: Check Data Quality (with explicit thresholds)

Use Amplitude:query_experiment (primary metric only) to assess:

Traffic Balance (SRM Check):

Report actual traffic split per variant (e.g., 48.2% control, 51.8% treatment)
Use srmDetected field from API: Flag if srmDetected: true
SRM (Sample Ratio Mismatch) indicates the observed traffic split deviates significantly from the expected allocation
If SRM detected, report the expected vs. actual allocation with specific percentages
Severe SRM can indicate instrumentation issues or bucketing problems that may invalidate results
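
When reporting expected vs. actual allocation, it can help to sanity-check the split independently of srmDetected. The sketch below uses a chi-square goodness-of-fit test for two variants. The function name and the 0.001 threshold are assumptions for illustration; Amplitude's own SRM detection may use a different method or threshold.

```python
import math

def srm_check(observed, expected_share, alpha=0.001):
    """Chi-square goodness-of-fit test for a two-variant traffic split.

    observed: user counts per variant, e.g. [control, treatment]
    expected_share: configured allocation, e.g. [0.5, 0.5]
    alpha: illustrative threshold (an assumption, not Amplitude's setting)
    """
    if len(observed) != 2 or len(expected_share) != 2:
        raise ValueError("this sketch handles exactly two variants")
    total = sum(observed)
    expected = [total * share for share in expected_share]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {
        "chi2": chi2,
        "p_value": p_value,
        "actual_share": [o / total for o in observed],
        "srm": p_value < alpha,
    }
```

For example, `srm_check([4820, 5180], [0.5, 0.5])` corresponds to the 48.2% / 51.8% split mentioned above on 10,000 users. Treat the API's srmDetected field as the source of truth and use this only to explain the deviation.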

Sample Size Analysis:

A. Current Sample Assessment:

Report total users per variant with specific numbers
Flag if <100 users per variant (insufficient for any conclusion)
Flag if 100-1000 users (directional signals only, not confident decision)
Need 1000+ per variant for confident decisions
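
A minimal sketch of these thresholds, assuming a hypothetical `{variant: users}` dict as input. The smallest variant decides the label, and exactly 1000 users is treated as adequate, following the "1000+" wording.

```python
def sample_adequacy(users_per_variant):
    """Classify sample size by the smallest variant."""
    smallest = min(users_per_variant.values())
    if smallest < 100:
        return "Low"        # insufficient for any conclusion
    if smallest < 1000:
        return "Moderate"   # directional signals only
    return "Adequate"       # enough for a confident decision
```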

B. Statistical Power Analysis:

Target effect size: What minimum lift would be meaningful for the business? (typically 2-5% for conversion metrics)
Achieved power: Given current sample size and observed variance, what’s the probability of detecting the target effect if it exists?
Power interpretation:
- <50%: Severely underpowered - likely to miss real effects
- 50-70%: Underpowered - high risk of false negatives
- 70-80%: Marginally adequate - consider extending if p-value is borderline
- 80%+: Well-powered - sufficient to detect target effect size
If underpowered: Calculate additional sample size needed to reach 80% power
Recommendation: If power <70% and results are inconclusive, extend duration rather than making premature decision
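
The power and required-sample figures can be approximated as follows. This is a sketch under stated assumptions: a conversion-rate (proportion) metric, a two-sided fixed-horizon z-test, and the normal approximation. The function names are illustrative, and Amplitude's statistics engine may compute power differently.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()

def power_two_proportions(baseline, relative_lift, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_crit = _NORMAL.inv_cdf(1 - alpha / 2)
    z_effect = abs(p2 - p1) / se
    # Probability that the test statistic lands beyond either critical value.
    return _NORMAL.cdf(z_effect - z_crit) + _NORMAL.cdf(-z_effect - z_crit)

def required_n_per_variant(baseline, relative_lift, power=0.80, alpha=0.05):
    """Users per variant needed to reach the target power."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_crit = _NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _NORMAL.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_crit + z_power) ** 2 * variance / (p2 - p1) ** 2)
```

The additional sample needed is `required_n_per_variant(...)` minus the current users per variant. Small relative lifts on low baselines require very large samples, which is why the target effect size should be stated before judging whether an experiment is underpowered.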

C. Precision Analysis (Confidence Interval Width):

CI width for primary metric: Report the width of the 95% confidence interval as percentage of baseline
Precision assessment:
- CI width >10% of baseline: Low precision - effect size uncertainty too high for confident decisions
- CI width 5-10% of baseline: Moderate precision - acceptable for directional decisions
- CI width <5% of baseline: High precision - narrow enough for confident decisions
Actionability threshold: Is the CI narrow enough to distinguish between practically significant and negligible effects?
- If the upper CI bound suggests meaningful lift but the lower bound is marginal or negative, precision may be insufficient
- Example: If target is +5% lift and CI is [-2%, +12%], too wide to confidently conclude effect exceeds target
Recommendation: If CI too wide, extend duration or increase traffic allocation to improve precision
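
A sketch of the precision thresholds and the actionability check above. Function names are illustrative. `precision_assessment` expects CI bounds and baseline in the same absolute units, while `ci_vs_target` expects bounds and target all expressed as lift in the same units (for example, percent).

```python
def precision_assessment(ci_lower, ci_upper, baseline):
    """Classify 95% CI width as a percentage of the control baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    width_pct = (ci_upper - ci_lower) / abs(baseline) * 100
    if width_pct > 10:
        label = "Low precision"
    elif width_pct >= 5:
        label = "Moderate precision"
    else:
        label = "High precision"
    return width_pct, label

def ci_vs_target(ci_lower, ci_upper, target):
    """Can the interval distinguish the effect from the target lift?"""
    if ci_lower >= target:
        return "exceeds target"
    if ci_upper < target:
        return "below target"
    return "inconclusive"
```

With the example above, `ci_vs_target(-2, 12, 5)` returns "inconclusive": the interval spans both negligible and meaningful effects.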

Comprehensive Data Quality Flags:

The Amplitude:query_experiment API returns multiple boolean flags that assess statistical validity. Check and document each:

statsAssumptionsMetForWholeExperiment:
- Indicates whether core statistical assumptions are satisfied (normality, independence)
- If false: Results may not be reliable; consider non-parametric approaches or longer runtime
- Impact: High - affects all statistical conclusions
hasSuspiciousUplift:
- Flags unexpectedly large effect sizes that may indicate data quality issues
- If true: Verify instrumentation, check for bot traffic, or segment anomalies
- Impact: High - may indicate measurement error rather than real effect
isVariancePositive:
- Confirms metric variance is positive (mathematically required for statistical tests)
- If false: Critical data quality issue - metric may be constant or incorrectly computed
- Impact: Critical - statistical tests invalid if false
isConfidenceIntervalNotFlipped:
- Ensures lower CI bound < upper CI bound (mathematical consistency check)
- If false: Indicates calculation error or data corruption
- Impact: Critical - results cannot be trusted
isStandardErrorLargeEnough:
- Checks if standard error is sufficient for reliable inference
- If false: High variance or very small sample may produce unreliable confidence intervals
- Impact: Medium - affects precision of estimates
isPointEstimateInsideConfidenceInterval:
- Validates that point estimate falls within its confidence interval (consistency check)
- If false: Calculation error or numerical instability
- Impact: High - indicates statistical computation issues
isMeanValid:
- Confirms mean value is a valid number (not NaN, not infinite)
- If false: Data quality issue - check for null values or computation errors
- Impact: Critical - cannot analyze if mean is invalid

For each flag that fails (returns false, or true in the case of hasSuspiciousUplift), document:

Which flag failed
What it means in plain language
Specific impact on result reliability
Recommended action (extend duration, investigate instrumentation, etc.)

If all flags pass: Note this explicitly as strong data quality signal
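
The flag check can be sketched as below. The flag names come from the list above; the flat dict of booleans is an assumed shape for illustration, since the actual Amplitude:query_experiment response may nest these fields differently.

```python
# Flags where False signals a problem.
MUST_BE_TRUE = (
    "statsAssumptionsMetForWholeExperiment",
    "isVariancePositive",
    "isConfidenceIntervalNotFlipped",
    "isStandardErrorLargeEnough",
    "isPointEstimateInsideConfidenceInterval",
    "isMeanValid",
)
# Flag where True signals a problem.
MUST_BE_FALSE = ("hasSuspiciousUplift",)

def check_validity_flags(flags):
    """Return failed and missing flag names from a dict of booleans.

    Missing flags are reported separately so they are never silently
    treated as passing.
    """
    failed, missing = [], []
    for name in MUST_BE_TRUE + MUST_BE_FALSE:
        expected = name in MUST_BE_TRUE
        if name not in flags:
            missing.append(name)
        elif flags[name] != expected:
            failed.append(name)
    return {
        "failed": failed,
        "missing": missing,
        "all_passed": not failed and not missing,
    }
```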

Temporal Stability:

Check if primary metric is stable day-over-day
Note ramp period (first 24-48hrs) or day-of-week effects

Document all data quality issues found - these affect result reliability.

Step 3: Analyze Primary Metric

Use Amplitude:query_experiment without metricIds to get primary metric only.

Use metric name from Step 1 - Report using the human-readable metric name, not the metric ID.

Extract and report:

Control baseline: metric value and sample size
Treatment performance: metric value and sample size
Absolute lift: treatment - control
Relative lift: (treatment - control) / control × 100%
P-value: with interpretation
Confidence interval: report 95% CI bounds

Interpret:

✅ Statistically significant: p < 0.05 and CI doesn’t include 0
⚠️ Trending: 0.05 ≤ p < 0.15 (CI includes 0; suggestive but inconclusive)
❌ No detectable effect: p ≥ 0.15
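
As a cross-check on the values returned by the API, the quantities above can be recomputed for a conversion metric. This is a sketch assuming a fixed-horizon two-proportion z-test with an unpooled standard error; the function name and inputs are illustrative, and Amplitude's reported p-values and intervals may come from a different method. P-values between 0.05 and 0.15 are treated as trending.

```python
import math
from statistics import NormalDist

def analyze_primary(control_conv, control_n, treat_conv, treat_n):
    """Lift, p-value, 95% CI, and status for a conversion metric."""
    p_c = control_conv / control_n
    p_t = treat_conv / treat_n
    if p_c == 0:
        raise ValueError("control rate is zero; relative lift is undefined")
    abs_lift = p_t - p_c
    rel_lift = abs_lift / p_c * 100
    # Unpooled standard error of the difference in proportions.
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treat_n)
    if se == 0:
        raise ValueError("zero variance; cannot run the test")
    z = abs_lift / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    margin = NormalDist().inv_cdf(0.975) * se
    ci = (abs_lift - margin, abs_lift + margin)
    ci_excludes_zero = ci[0] > 0 or ci[1] < 0
    if p_value < 0.05 and ci_excludes_zero:
        status = "significant"
    elif p_value < 0.15:
        status = "trending"
    else:
        status = "no effect"
    return {"abs_lift": abs_lift, "rel_lift": rel_lift,
            "p_value": p_value, "ci": ci, "status": status}
```

Because the same standard error drives both the test and the interval, p < 0.05 and "CI excludes 0" agree with each other here.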

Practical significance:

Is the lift magnitude meaningful for the business?
Small lifts (<2-3%) may not be worth complexity even if significant
Consider metric’s business impact (revenue vs. low-value engagement)

Step 4: Analyze Secondary Metrics & Guardrails

Use Amplitude:query_experiment with metricIds for all metrics.

Use metric names from Step 1 - Report using human-readable metric names, not metric IDs.

For each secondary metric:

Report metric name (from Step 1 mapping), variant performance, and statistical significance
Note which moved and which didn’t (with specific numbers)
Identify unintended consequences: Flag any negative impacts with specific values

For each guardrail:

✅ No regression: neutral or positive (p > 0.05)
⚠️ Marginal concern: small negative lift (1-5%) with p < 0.10
🚩 Significant regression: negative lift with p < 0.05 - report actual numbers

Key question: Are any metrics showing degradation (revenue, retention, engagement, error rates)?

Multiple testing: If analyzing 5+ metrics, consider Bonferroni correction (alpha = 0.05 / number of metrics)
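
The Bonferroni adjustment is simple enough to sketch directly. The function name and the `{metric name: p-value}` input are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply a Bonferroni-corrected threshold to a dict of p-values."""
    if not p_values:
        raise ValueError("need at least one p-value")
    threshold = alpha / len(p_values)
    significant = {name: p < threshold for name, p in p_values.items()}
    return threshold, significant
```

With five metrics the per-metric threshold drops to 0.01, so a secondary metric at p = 0.03 would no longer count as significant after correction.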

Step 5: Comprehensive Segment Analysis

Use Amplitude:query_experiment with groupBy parameter (one at a time).

Test 3-4 high-signal segments:

Platform (iOS, Android, Web)
User tenure (new vs. established users)
Plan type (free vs. paid)
Geography (country, region)

MANDATORY: Format results as markdown breakdown tables

For each segment analysis, present results in this exact format:

Segment	Control Rate	Control Exposures	Control % of Total	Treatment Rate	Treatment Exposures	Treatment % of Total	Relative Lift	Significant?
iOS	48.7%	1,234	45.2%	55.4%	1,456	54.8%	+13.6%	Yes (p=0.02)
Android	63.9%	567	20.8%	65.1%	589	22.2%	+1.9%	No (p=0.45)
Web	51.2%	928	34.0%	50.8%	611	23.0%	-0.8%	No (p=0.89)

Calculate % of Total:

Sum exposures across segments within each variant to get that variant’s total
Show each segment’s share of its variant: (segment exposures / variant total exposures) × 100%
This reveals which segments drive overall results
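
A sketch of the share calculation using the exposure counts from the example table. Shares are computed within each variant, which is how the example percentages are derived; the function name is illustrative.

```python
def exposure_shares(exposures):
    """Each segment's share of one variant's total exposures, in percent."""
    total = sum(exposures.values())
    return {seg: round(n / total * 100, 1) for seg, n in exposures.items()}

control = {"iOS": 1234, "Android": 567, "Web": 928}
treatment = {"iOS": 1456, "Android": 589, "Web": 611}

control_shares = exposure_shares(control)      # iOS 45.2, Android 20.8, Web 34.0
treatment_shares = exposure_shares(treatment)  # iOS 54.8, Android 22.2, Web 23.0
```

A segment whose share differs sharply between variants, such as Web here (34.0% of control vs. 23.0% of treatment), is worth flagging because randomization should produce a similar segment mix in each variant.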

Key insights:

Identify segments where treatment performs best (targeted rollout opportunity)
Identify segments where treatment hurts (consider exclusions)
Explain why different segments show different performance
Watch for Simpson’s Paradox: Overall result may differ from all segment results

Use groupByLimit: 10 to avoid overwhelming output.

Step 6: Assess Duration & Runtime Sufficiency

Duration Assessment: Based on the power and precision analysis from Step 2, evaluate if the experiment has run long enough:

Runtime factors:

Minimum duration: Has experiment run at least 1-2 weeks to capture full user lifecycle?
Learning effects: For feature changes, have users had time to adapt? (typically 3-7 days)
Weekly seasonality: Has experiment captured at least one complete week to account for day-of-week patterns?
Business cycles: For B2B products, has it run through full business week patterns?

Integration with Step 2 power analysis:

If Step 2 showed adequate power (>80%) AND p < 0.05: Experiment has sufficient data, duration is adequate
If Step 2 showed low power (<70%) AND p > 0.05: Inconclusive due to insufficient data, extend duration
If Step 2 showed adequate power (>80%) AND p > 0.15: Sufficient data to accept null result (no effect)
If Step 2 showed adequate power but CI width too wide: Need more data for precision, extend duration

Velocity projection (only if extending recommended):

Current daily enrollment: Calculate users per day per variant
Days to target sample: Based on Step 2 power calculation, how many more days needed?
Days to target precision: Based on Step 2 CI width calculation, how many more days to reach desired precision?
Recommendation: Provide specific date when experiment should reach sufficient power/precision
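
The velocity projection can be sketched as follows, assuming enrollment continues at the average daily rate seen so far and metric variance stays roughly constant, so that CI width shrinks with the square root of sample size. All function names are illustrative.

```python
import math
from datetime import date, timedelta

def days_to_target(current_n, target_n, days_running):
    """Extra days needed per variant at the current enrollment rate."""
    if current_n <= 0 or days_running <= 0:
        raise ValueError("need positive enrollment and runtime")
    daily_rate = current_n / days_running
    remaining = max(target_n - current_n, 0)
    return math.ceil(remaining / daily_rate)

def n_for_target_width(current_n, current_width, target_width):
    """Sample size at which CI width reaches the target.

    Width scales with 1 / sqrt(n), so halving it needs four times the sample.
    """
    return math.ceil(current_n * (current_width / target_width) ** 2)

def projected_end(today, extra_days):
    """Calendar date on which the target should be reached."""
    return today + timedelta(days=extra_days)
```

For example, 7,000 users per variant after 14 days with a target of 10,000 gives `days_to_target(7000, 10000, 14) == 6`. Use the larger of the power-based and precision-based day counts for the recommended date.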

Do NOT repeat the power calculations from Step 2 - reference those findings and focus on duration and timeline recommendations.

Step 7: Understand Why (Qualitative Context)

For significant results (positive or negative):

Use Amplitude:get_feedback_insights:

Filter by experiment date range
For wins: look for ["lovedFeature", "mentionedFeature"]
For losses: look for ["bug", "complaint", "painPoint"]
Check if themes align with experiment hypothesis

Connect quantitative to qualitative:

Explain the lift with user quotes or feedback themes
Present 2-3 representative examples with specific details

Step 8: Synthesize Findings and Make Recommendation

Before finalizing, verify you have included:

✓ All primary metric data (lift, CI, p-value, interpretation)
✓ All data quality findings (SRM, sample size, power, precision, all 7 validity flags with actual values)
✓ All secondary metrics and guardrails (with actual values and significance)
✓ All segment analysis tables (formatted with % of total exposures)
✓ Statistical power assessment (current power, required sample, duration)
✓ Qualitative insights (feedback themes)

Present structured analysis:

Experiment Analysis: [Experiment Name]

Overview:

Hypothesis: [What was tested and expected impact]
Duration: [Start] to [End] ([X days])
Sample Size: Control: [N] Treatment: [N]
Link: [Experiment URL]

Data Quality Assessment:

Traffic & SRM:

Traffic Balance: Control [X%] Treatment [Y%] (Expected: [X%] [Y%])
SRM Detected: [Yes/No] [If yes, explain deviation severity]

Sample Size & Power:

Sample Size: Control: [N] Treatment: [N]
Sample Adequacy: [Adequate (>1000) / Moderate (100-1000) / Low (<100)]
Statistical Power: [X%] to detect [Y%] lift (Target: 80%+)
Achieved Precision: 95% CI width = [±X%] ([High <5% / Moderate 5-10% / Low >10%])

Statistical Validity Flags: [Only include flags that failed - if all pass, state “All statistical validity checks passed”]

❌ statsAssumptionsMetForWholeExperiment: Statistical assumptions not met - [brief impact]
❌ hasSuspiciousUplift: Unusually large effect detected - [brief recommendation]
❌ isVariancePositive: Invalid variance - [critical issue description]
❌ isConfidenceIntervalNotFlipped: CI calculation error - [critical issue description]
❌ isStandardErrorLargeEnough: Insufficient standard error - [impact on precision]
❌ isPointEstimateInsideConfidenceInterval: Statistical inconsistency - [calculation issue]
❌ isMeanValid: Invalid mean value - [data quality issue]

Duration:

Runtime: [X days] (Started: [date])
Sufficiency: [Adequate - captured full user lifecycle / Need more time - [reason]]
Recommendation: [Continue running for X more days / Sufficient data to conclude]

Overall Data Quality: [Excellent / Good / Concerns / Critical Issues] [One sentence summary of whether results can be trusted]

Primary Metric: [Metric Name]

Variant	Value	Lift	95% CI	P-value	Status
Control	[X]	—	—	—	—
Treatment	[Y]	[+Z%]	[[A, B]]	[P]	✅ Significant

Interpretation: [1-2 sentences on statistical AND practical significance]

Secondary Metrics & Guardrails:

Guardrails:

✅ Revenue per user: No regression ([+X%], p=[P])
✅ Retention D7: Slight positive ([+X%], p=[P])
🚩 Bounce rate: Regression detected ([+X%], p=[P]) ⚠️

Secondary Metrics:

[Metric]: [+X% lift, p=[P]] - [brief interpretation]
[Metric]: No significant change (p=[P])

Unintended Consequences: [List any negative impacts on secondary metrics or guardrails]

Segment Analysis:

By Platform:

| Segment | Control Rate | Control Exp | Control % | Treatment Rate | Treatment Exp | Treatment % | Lift | Sig? |
|---------|--------------|-------------|-----------|----------------|---------------|-------------|------|------|
| [Data from query_experiment with groupBy] |

Key Finding: [Which segments drove results; which showed differential effects]

By User Tenure: [Similar table]

Statistical Power:

Current Power: [X%] - [Adequate/Underpowered]
Required Sample: Need [X] more users per variant for 80% power
Estimated Duration: [X] more days at current traffic to reach 80% power

Why This Result:

[Feedback theme] ([X mentions])
- “[Quote]” - [Source] ([Date])

Recommendation: ✅ SHIP / ⚠️ ITERATE / ❌ ABANDON / 🔄 NEED MORE DATA

Rationale:

[Primary metric result with statistical and practical significance]
[Guardrail status and any unintended consequences]
[Segment insights - opportunities or concerns]
[Power analysis - adequate data or need more time]
[Qualitative validation]

Known Risks:

[Risk 1 with mitigation if shipping]
[Risk 2 with mitigation if shipping]

Next Steps:

[Specific action based on recommendation]
[Follow-up or monitoring action]

Key Takeaways (3-5 actionable insights):

[Most important finding]
[Second most important finding]
[Third most important finding]
[Additional insight if relevant]

Key Scenarios & How to Handle

Inconclusive Results (p > 0.05)

Diagnose:

Check statistical power: Is sample size adequate? Report current power percentage
Check confidence interval: Very wide = high variance, need more data
Check segments: Effect may exist in specific subgroup

Action:

If power <70%: Extend duration or increase traffic allocation
If power >80% but p >0.15: Accept null result (no effect detected)
Check segment tables: Look for subgroups with significant effects

Guardrail Regressed

Diagnose:

Quantify trade-off with specific numbers: +10% conversion but -2% retention
Which segments drove the regression? Check segment tables
Is regression statistically significant or just noise?

Action:

Small regression + large primary win + not significant = ship with monitoring
Significant regression on critical metric = iterate to fix or abandon
Segment-specific regression = consider targeted rollout excluding affected segments

Segment Tables Show Opposite Effects

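Segments disagree (check for Simpson’s Paradox):

Overall result may be misleading if segments show opposite directions
Example: Overall +5% lift, but iOS -10%, Android +15% (a heterogeneous effect across segments)
Simpson’s Paradox is the stricter case where the overall direction contradicts every segment, usually because the segment mix differs between variants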

Action:

Report the paradox clearly with specific segment numbers
Consider targeted rollout to segments that benefit
Exclude or iterate for segments that are harmed
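
To tell apart segments that disagree with each other from an overall result that contradicts every segment, a check along these lines can help. It is a sketch assuming conversion counts per segment and variant; the function name and input layout are hypothetical.

```python
def segment_direction_report(segments):
    """Compare segment-level lift directions with the pooled result.

    segments: {name: (control_conv, control_n, treat_conv, treat_n)}
    """
    lifts = {}
    totals = [0, 0, 0, 0]
    for name, counts in segments.items():
        c_conv, c_n, t_conv, t_n = counts
        lifts[name] = t_conv / t_n - c_conv / c_n
        totals = [a + b for a, b in zip(totals, counts)]
    overall = totals[2] / totals[3] - totals[0] / totals[1]
    # Set of directions seen across segments (True = positive lift).
    signs = {lift > 0 for lift in lifts.values() if lift != 0}
    mixed = len(signs) == 2
    # Simpson's Paradox: all segments agree, pooled result points the other way.
    simpson = len(signs) == 1 and overall != 0 and (overall > 0) not in signs
    return {"segment_lifts": lifts, "overall_lift": overall,
            "mixed_directions": mixed, "simpsons_paradox": simpson}
```

With `{"A": (720, 900, 85, 100), "B": (10, 100, 135, 900)}`, treatment wins in both segments (85% vs. 80% and 15% vs. 10%) yet loses when pooled (22% vs. 73%) because the segment mix differs between variants. In a randomized experiment that imbalance is itself a red flag, so revisit the SRM check before interpreting the result.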

Best Practices

Comprehensive analysis:

✅ Include ALL data from tool calls with specific numbers
✅ Format segment analysis as breakdown tables with % of total
✅ Check statistical power and duration adequacy
✅ Verify data quality before drawing conclusions
✅ Connect quantitative results to qualitative insights

Statistical rigor:

✅ Report confidence intervals, not just p-values
✅ Distinguish statistical vs. practical significance
✅ Apply multiple testing correction for 5+ metrics
✅ Check for Simpson’s Paradox in segment analysis

Avoid:

❌ Don’t provide brief summaries - be comprehensive
❌ Don’t omit data quality issues or negative secondary metrics
❌ Don’t ignore segments - they reveal critical insights
❌ Don’t make recommendations without adequate power
❌ Don’t stop analysis early because primary looks good

For Experiment Design

If user wants to design a new experiment, guide them through:

Define hypothesis: “We believe [change] will cause [users] to [behavior] because [reason]”
Select metrics:
- Use Amplitude:search with entityTypes: ["METRIC"] to find candidates
- Primary: directly measures hypothesis
- Guardrails: revenue, retention, core engagement (prevent unintended consequences)
Estimate sample size:
- Typical: 1-2 weeks minimum, 1000+ users per variant
- Higher variance metrics need more data
- Use Amplitude:query_chart to check metric’s historical variance
Create experiment:
- Use Amplitude:create_experiment with projectIds, variants, and metrics
- Return experiment ID, URL, and deployment key for engineering

For detailed setup guidance, consider using the setup-experiment-and-flags skill.

Skill frontmatter

suggest_when: User asks about a specific experiment, shares an experiment URL, asks "did this test win", "should we ship this", or wants statistical analysis of A/B test results.