Agent Skill · Amplitude

analyze-experiments

Designs A/B tests with proper metrics and variants, analyzes running or completed experiments, and interprets results with statistical rigor. Use when setting up experiments, checking experiment status, analyzing results, or making ship decisions.

Provider: Amplitude

Path in repo: analytics-skills/skills/analyze-experiment/SKILL.md

Skill body

Experiment Analyst

Perform a comprehensive deep-dive analysis of experiments to support data-driven ship/no-ship decisions. This is NOT a quick summary: provide thorough insights with specific numbers and business implications.

When to Use


Analysis Philosophy

Be comprehensive, not brief:


Instructions

Step 0: Identify Experiment

If user provides a specific experiment:

If user asks about experiments generally:

If no experiment specified:


Step 1: Retrieve and Validate Setup

Use Amplitude:get_experiments with experiment ID to capture:

Get metric names:

Validation:

If incomplete, explain what’s missing and stop.


Step 2: Check Data Quality (with explicit thresholds)

Use Amplitude:query_experiment (primary metric only) to assess:

Traffic Balance (SRM Check):
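
A minimal sketch of a sample ratio mismatch (SRM) check, assuming a two-variant experiment with a planned 50/50 split. The function name `srm_check`, the exposure counts, and the 0.001 alert threshold are illustrative assumptions; they are not taken from this skill or from the Amplitude API.

```python
import math


def srm_check(control_exposures: int, treatment_exposures: int,
              expected_control_share: float = 0.5,
              alpha: float = 0.001) -> dict:
    """Chi-square goodness-of-fit test (1 degree of freedom) for SRM."""
    total = control_exposures + treatment_exposures
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    if expected_control <= 0 or expected_treatment <= 0:
        raise ValueError("expected exposure counts must be positive")

    chi2 = ((control_exposures - expected_control) ** 2 / expected_control
            + (treatment_exposures - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)), so no external library is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"chi2": chi2, "p_value": p_value, "srm_detected": p_value < alpha}


# Hypothetical exposure counts, for illustration only.
print(srm_check(5000, 5500))
```

A low alert threshold is a common choice for SRM checks because the test is run on every experiment, and a mismatch should stop the analysis rather than merely be noted.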

Sample Size Analysis:

A. Current Sample Assessment:

B. Statistical Power Analysis:
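
As a rough illustration of the power analysis, the sketch below estimates the per-variant sample size needed to detect a given relative lift on a conversion-rate metric with a two-sided, fixed-horizon z-test. The function name, the default alpha of 0.05, and the default power of 0.8 are assumptions; Amplitude's own calculation (for example, under sequential testing) may differ.

```python
import math
from statistics import NormalDist


def required_sample_per_variant(baseline_rate: float, relative_mde: float,
                                alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # rate under the hoped-for lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Hypothetical inputs: 10% baseline conversion, 10% relative lift.
print(required_sample_per_variant(0.10, 0.10))
```

Comparing this figure with the current exposures per variant gives a quick read on whether the experiment is underpowered for the effect size of interest.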

C. Precision Analysis (Confidence Interval Width):
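
One way to reason about precision is to compute the confidence interval for the absolute difference in conversion rates and look at its width. The sketch below uses a Wald interval; the function name and the input counts are hypothetical, and the interval reported by Amplitude may be computed differently.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(control_conversions: int, control_exposures: int,
                             treatment_conversions: int, treatment_exposures: int,
                             confidence: float = 0.95) -> dict:
    """Wald confidence interval for the absolute difference in rates."""
    p_c = control_conversions / control_exposures
    p_t = treatment_conversions / treatment_exposures
    diff = p_t - p_c
    standard_error = math.sqrt(p_c * (1 - p_c) / control_exposures
                               + p_t * (1 - p_t) / treatment_exposures)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower, upper = diff - z * standard_error, diff + z * standard_error
    return {"diff": diff, "lower": lower, "upper": upper, "width": upper - lower}


# Hypothetical counts: quadrupling the sample roughly halves the width.
print(lift_confidence_interval(100, 1000, 120, 1000)["width"])
print(lift_confidence_interval(400, 4000, 480, 4000)["width"])
```

Because the width scales with the inverse square root of the sample size, halving the interval takes about four times as many exposures.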

Comprehensive Data Quality Flags:

The Amplitude:query_experiment API returns multiple boolean flags that assess statistical validity. Check and document each:

  1. statsAssumptionsMetForWholeExperiment:
    • Indicates whether core statistical assumptions are satisfied (normality, independence)
    • If false: Results may not be reliable; consider non-parametric approaches or longer runtime
    • Impact: High - affects all statistical conclusions
  2. hasSuspiciousUplift:
    • Flags unexpectedly large effect sizes that may indicate data quality issues
    • If true: Verify instrumentation, check for bot traffic, or segment anomalies
    • Impact: High - may indicate measurement error rather than real effect
  3. isVariancePositive:
    • Confirms metric variance is positive (mathematically required for statistical tests)
    • If false: Critical data quality issue - metric may be constant or incorrectly computed
    • Impact: Critical - statistical tests invalid if false
  4. isConfidenceIntervalNotFlipped:
    • Ensures lower CI bound < upper CI bound (mathematical consistency check)
    • If false: Indicates calculation error or data corruption
    • Impact: Critical - results cannot be trusted
  5. isStandardErrorLargeEnough:
    • Checks if standard error is sufficient for reliable inference
    • If false: High variance or very small sample may produce unreliable confidence intervals
    • Impact: Medium - affects precision of estimates
  6. isPointEstimateInsideConfidenceInterval:
    • Validates that point estimate falls within its confidence interval (consistency check)
    • If false: Calculation error or numerical instability
    • Impact: High - indicates statistical computation issues
  7. isMeanValid:
    • Confirms mean value is a valid number (not NaN, not infinite)
    • If false: Data quality issue - check for null values or computation errors
    • Impact: Critical - cannot analyze if mean is invalid

For each flag that fails (returns false, or returns true in the case of hasSuspiciousUplift), document:

If all flags pass: Note this explicitly as a strong data quality signal.
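
The flag review can be sketched as a small lookup. The flag names and impact levels come from the list above; the flat dictionary shape of `flags` is an assumption, since the actual layout of the Amplitude:query_experiment response is not shown here.

```python
# Healthy value and impact level for each flag, as described in the list above.
EXPECTED_FLAGS = {
    "statsAssumptionsMetForWholeExperiment": (True, "High"),
    "hasSuspiciousUplift": (False, "High"),  # the only flag where True is bad
    "isVariancePositive": (True, "Critical"),
    "isConfidenceIntervalNotFlipped": (True, "Critical"),
    "isStandardErrorLargeEnough": (True, "Medium"),
    "isPointEstimateInsideConfidenceInterval": (True, "High"),
    "isMeanValid": (True, "Critical"),
}


def failed_flags(flags: dict) -> list:
    """Return (name, impact, reason) for every flag that is missing or unhealthy."""
    failures = []
    for name, (healthy_value, impact) in EXPECTED_FLAGS.items():
        if name not in flags:
            failures.append((name, impact, "missing"))
        elif flags[name] != healthy_value:
            failures.append((name, impact, "failed"))
    return failures


# Hypothetical response fragment (assumed shape): everything healthy
# except a suspicious uplift.
example = {name: healthy for name, (healthy, _) in EXPECTED_FLAGS.items()}
example["hasSuspiciousUplift"] = True
print(failed_flags(example))
```

An empty result corresponds to the "all flags pass" case, which should be stated explicitly in the write-up.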

Temporal Stability:

Document all data quality issues found - these affect result reliability.


Step 3: Analyze Primary Metric

Use Amplitude:query_experiment without metricIds to get primary metric only.

Use metric name from Step 1 - Report using the human-readable metric name, not the metric ID.

Extract and report:

Interpret:

Practical significance:
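
Statistical and practical significance can be separated by comparing the confidence interval with a minimum lift that would matter to the business. The sketch below is one possible classification; the function name, the category labels, and the `min_meaningful_lift` threshold are assumptions to be agreed with stakeholders, not part of the Amplitude API.

```python
def classify_lift(ci_lower: float, ci_upper: float,
                  min_meaningful_lift: float) -> str:
    """Classify a lift by where its confidence interval sits."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if ci_upper < 0:
        return "significant negative"
    if ci_lower <= 0:
        return "inconclusive"  # the interval includes zero
    if ci_lower >= min_meaningful_lift:
        return "significant and practically meaningful"
    if ci_upper < min_meaningful_lift:
        return "significant but below practical threshold"
    return "significant, practical value uncertain"


# Hypothetical interval: lift between +3% and +8%, with a 2% bar.
print(classify_lift(0.03, 0.08, 0.02))
```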


Step 4: Analyze Secondary Metrics & Guardrails

Use Amplitude:query_experiment with metricIds for all metrics.

Use metric names from Step 1 - Report using human-readable metric names, not metric IDs.

For each secondary metric:

For each guardrail:

Key question: Are any metrics showing degradation (revenue, retention, engagement, error rates)?

Multiple testing: If analyzing 5+ metrics, consider a Bonferroni correction (adjusted alpha = 0.05 / number of metrics) to limit false positives across the metric set.
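
A minimal sketch of the Bonferroni adjustment, assuming the p-values have already been collected per metric. The function name, the metric names, and the p-values are hypothetical.

```python
def bonferroni_significant(p_values: dict, alpha: float = 0.05) -> dict:
    """Map each metric to True if it stays significant after Bonferroni."""
    if not p_values:
        return {}
    adjusted_alpha = alpha / len(p_values)  # e.g. 0.05 / 5 = 0.01
    return {name: p < adjusted_alpha for name, p in p_values.items()}


# Hypothetical p-values for five metrics.
p_values = {
    "checkout_conversion": 0.004,
    "revenue_per_user": 0.02,
    "retention_d7": 0.30,
    "error_rate": 0.60,
    "sessions_per_user": 0.045,
}
print(bonferroni_significant(p_values))
```

With five metrics the bar drops to 0.01, so a metric at p = 0.02 that looks significant in isolation no longer qualifies.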


Step 5: Comprehensive Segment Analysis

Use Amplitude:query_experiment with groupBy parameter (one at a time).

Test 3-4 high-signal segments:

  1. Platform (iOS, Android, Web)
  2. User tenure (new vs. established users)
  3. Plan type (free vs. paid)
  4. Geography (country, region)

MANDATORY: Format results as markdown breakdown tables

For each segment analysis, present results in this exact format:

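| Segment | Control Rate | Control Exposures | Control % of Total | Treatment Rate | Treatment Exposures | Treatment % of Total | Relative Lift | Significant? |
|---------|--------------|-------------------|--------------------|----------------|---------------------|----------------------|---------------|--------------|
| iOS | 48.7% | 1,234 | 45.2% | 55.4% | 1,456 | 54.8% | +13.6% | Yes (p=0.02) |
| Android | 63.9% | 567 | 20.8% | 65.1% | 589 | 22.2% | +1.9% | No (p=0.45) |
| Web | 51.2% | 928 | 34.0% | 50.8% | 611 | 23.0% | -0.8% | No (p=0.89) |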

Calculate % of Total:
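
The "% of Total" columns are each segment's share of its own variant's exposures. A short sketch using the exposure counts from the example table above; the function name and the dictionary keys are assumptions.

```python
def add_percent_of_total(segments: list) -> list:
    """Compute each segment's share of exposures within its variant."""
    control_total = sum(s["control_exposures"] for s in segments)
    treatment_total = sum(s["treatment_exposures"] for s in segments)
    rows = []
    for s in segments:
        rows.append({
            "segment": s["segment"],
            "control_pct_of_total":
                round(100 * s["control_exposures"] / control_total, 1),
            "treatment_pct_of_total":
                round(100 * s["treatment_exposures"] / treatment_total, 1),
        })
    return rows


# Exposure counts copied from the example table.
segments = [
    {"segment": "iOS", "control_exposures": 1234, "treatment_exposures": 1456},
    {"segment": "Android", "control_exposures": 567, "treatment_exposures": 589},
    {"segment": "Web", "control_exposures": 928, "treatment_exposures": 611},
]
for row in add_percent_of_total(segments):
    print(row)
```

A segment whose share differs sharply between control and treatment is worth a second look, since it can point to an assignment or exposure-logging problem within that segment.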

Key insights:

Use groupByLimit: 10 to avoid overwhelming output.


Step 6: Assess Duration & Runtime Sufficiency

Duration Assessment: Based on the power and precision analysis from Step 2, evaluate if the experiment has run long enough:

Runtime factors:

Integration with Step 2 power analysis:

Velocity projection (only if extending recommended):
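
A velocity projection can be as simple as dividing the remaining sample by the recent daily exposure rate. The sketch below assumes steady traffic; the function name and all input numbers are hypothetical.

```python
import math


def days_to_target(current_per_variant: int, required_per_variant: int,
                   daily_per_variant: float) -> int:
    """Estimate additional days needed to reach the required sample size."""
    if daily_per_variant <= 0:
        raise ValueError("daily_per_variant must be positive")
    remaining = max(0, required_per_variant - current_per_variant)
    return math.ceil(remaining / daily_per_variant)


# Hypothetical: 6,000 of 14,749 users collected, about 500 new users per day.
print(days_to_target(6000, 14749, 500))
```

Rounding the estimate up to whole weeks helps keep day-of-week effects balanced across the extended run.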

Do NOT repeat the power calculations from Step 2 - reference those findings and focus on duration and timeline recommendations.


Step 7: Understand Why (Qualitative Context)

For significant results (positive or negative):

Use Amplitude:get_feedback_insights:

Connect quantitative to qualitative:


Step 8: Synthesize Findings and Make Recommendation

Before finalizing, verify you have included:

Present structured analysis:


Experiment Analysis: [Experiment Name]

Overview:


Data Quality Assessment:

Traffic & SRM:

Sample Size & Power:

Statistical Validity Flags: [Only include flags that failed - if all pass, state “All statistical validity checks passed”]

Duration:

Overall Data Quality: [Excellent / Good / Concerns / Critical Issues] [One sentence summary of whether results can be trusted]


Primary Metric: [Metric Name]

| Variant | Value | Lift | 95% CI | P-value | Status |
|---------|-------|------|--------|---------|--------|
| Control | [X] | | | | |
| Treatment | [Y] | [+Z%] | [[A, B]] | [P] | ✅ Significant |

Interpretation: [1-2 sentences on statistical AND practical significance]


Secondary Metrics & Guardrails:

Guardrails:

Secondary Metrics:

Unintended Consequences: [List any negative impacts on secondary metrics or guardrails]


Segment Analysis:

By Platform:

| Segment | Control Rate | Control Exp | Control % | Treatment Rate | Treatment Exp | Treatment % | Lift | Sig? |
|---------|--------------|-------------|-----------|----------------|---------------|-------------|------|------|
| [Data from query_experiment with groupBy] |

Key Finding: [Which segments drove results; which showed differential effects]

By User Tenure: [Similar table]


Statistical Power:


Why This Result:


Recommendation: ✅ SHIP / ⚠️ ITERATE / ❌ ABANDON / 🔄 NEED MORE DATA

Rationale:

  1. [Primary metric result with statistical and practical significance]
  2. [Guardrail status and any unintended consequences]
  3. [Segment insights - opportunities or concerns]
  4. [Power analysis - adequate data or need more time]
  5. [Qualitative validation]

Known Risks:

Next Steps:

  1. [Specific action based on recommendation]
  2. [Follow-up or monitoring action]

Key Takeaways (3-5 actionable insights):

  1. [Most important finding]
  2. [Second most important finding]
  3. [Third most important finding]
  4. [Additional insight if relevant]


Key Scenarios & How to Handle

Inconclusive Results (p > 0.05)

Diagnose:

Action:


Guardrail Regressed

Diagnose:

Action:


Segment Tables Show Opposite Effects

Simpson’s Paradox detected:
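
Simpson's Paradox arises when the pooled result points in the opposite direction from every segment, usually because the segment mix differs between variants. The sketch below checks for that pattern; the function name, the dictionary keys, and the counts are constructed for illustration and are not real experiment data.

```python
def aggregate_reverses_segments(segments: list) -> bool:
    """True when every segment's lift has the opposite sign of the pooled lift."""
    keys = ("control_conversions", "control_exposures",
            "treatment_conversions", "treatment_exposures")
    pooled = {k: sum(s[k] for s in segments) for k in keys}

    def lift(row: dict) -> float:
        control_rate = row["control_conversions"] / row["control_exposures"]
        treatment_rate = row["treatment_conversions"] / row["treatment_exposures"]
        return treatment_rate - control_rate

    pooled_lift = lift(pooled)
    return all(lift(s) * pooled_lift < 0 for s in segments)


# Constructed example: treatment wins in both segments but loses overall,
# because treatment traffic is concentrated in the low-converting segment.
segments = [
    {"control_conversions": 810, "control_exposures": 900,
     "treatment_conversions": 95, "treatment_exposures": 100},
    {"control_conversions": 30, "control_exposures": 100,
     "treatment_conversions": 360, "treatment_exposures": 900},
]
print(aggregate_reverses_segments(segments))
```

Segments that merely differ from one another, without the aggregate reversing, indicate heterogeneous treatment effects rather than Simpson's Paradox.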

Action:


Best Practices

Comprehensive analysis:

Statistical rigor:

Avoid:


For Experiment Design

If user wants to design a new experiment, guide them through:

  1. Define hypothesis: “We believe [change] will cause [users] to [behavior] because [reason]”

  2. Select metrics:
    • Use Amplitude:search with entityTypes: ["METRIC"] to find candidates
    • Primary: directly measures hypothesis
    • Guardrails: revenue, retention, core engagement (prevent unintended consequences)
  3. Estimate sample size:
    • Typical: 1-2 weeks minimum, 1000+ users per variant
    • Higher variance metrics need more data
    • Use Amplitude:query_chart to check metric’s historical variance
  4. Create experiment:
    • Use Amplitude:create_experiment with projectIds, variants, and metrics
    • Return experiment ID, URL, and deployment key for engineering

For detailed setup guidance, consider using the setup-experiment-and-flags skill.

Skill frontmatter

suggest_when: User asks about a specific experiment, shares an experiment URL, asks "did this test win", "should we ship this", or wants statistical analysis of A/B test results.