Experiment Flow Mini-Game
Why this calculator makes a good mini-game
Sample size planning is about balancing signal vs. noise. This mini-game turns that tension into a tactile exercise—every visitor you route teaches how allocation, effect size, and confidence interact.
• Arrow keys ←/→ nudge bias ±5%. Space pauses.
• Conversions trigger bursts and scoring; stay within ±5% of the target split to earn streak multipliers.
Provide parameters, then click to play.
Finish with high signal (Z-score) and balanced traffic to max your score.
Understanding A/B Test Sample Size
Why Sample Size Matters
Many A/B tests fail to detect real improvements because sample sizes are too small. With insufficient data, you risk "false negatives" (not detecting an improvement that exists) or "false positives" (claiming an improvement that's just statistical noise). This calculator determines how many visitors you need in each variant to reliably detect a meaningful improvement with specified confidence and statistical power.
Key Concepts
Baseline Conversion Rate: Your current control performance (e.g., 2.5% converts)
Minimum Detectable Effect: Smallest improvement worth detecting (e.g., 20% improvement = 2.5% → 3%)
Significance level (α): Acceptable risk of a false positive (typically 5%, i.e., 95% confidence)
Power (1-β): Probability of detecting a true effect of the specified size (typically 80%); the sketch below shows how these inputs combine
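For readers who want to see how these inputs combine, here is a minimal sketch of the standard two-proportion sample-size formula (normal approximation, two-sided test). The function name and defaults are illustrative; the calculator itself uses a simplified formula, so its outputs can differ.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, rel_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline                      # control conversion rate
    p2 = baseline * (1 + rel_mde)      # variant rate at the minimum detectable effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)            # 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Example: 2% baseline, 25% relative improvement, default 95% confidence / 80% power
print(sample_size_per_variant(0.02, 0.25))
```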
Sample Size Requirements by Baseline & Effect
| Baseline CR | 10% Improvement | 20% Improvement | 50% Improvement |
|---|---|---|---|
| 1% (E-commerce) | ~39,000 per variant | ~9,900 per variant | ~1,600 per variant |
| 5% (SaaS free trial) | ~7,750 per variant | ~1,950 per variant | ~320 per variant |
| 10% (Newsletter signup) | ~3,900 per variant | ~980 per variant | ~160 per variant |
| 50% (High engagement) | ~385 per variant | ~96 per variant | ~16 per variant |
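As a cross-check, the sample_size_per_variant sketch above can generate a similar grid. Because it includes the power term while the calculator uses a simplified formula, its figures tend to be larger; treat both as rough planning estimates rather than exact requirements.

```python
baselines = [0.01, 0.05, 0.10, 0.50]       # rows of the table above
improvements = [0.10, 0.20, 0.50]          # relative minimum detectable effects

for p in baselines:
    cells = [f"{sample_size_per_variant(p, lift):,}" for lift in improvements]
    print(f"{p:.0%} baseline: " + " | ".join(cells) + " per variant")
```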
Worked Example: E-Commerce Landing Page
Scenario: Your landing page converts 2% of visitors and you want to test a new CTA button. You want to be able to reliably detect a 25% improvement (2% → 2.5%).
- Baseline: 2%
- Minimum Effect: 25% improvement
- Confidence: 95% (standard)
- Power: 80% (standard)
- Result: 3,100 per variant, 6,200 total
- At 1,000 daily visitors split across both variants: 6.2 days test duration (see the duration sketch below)
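Turning a sample size into a test duration is simple division: total sample divided by daily traffic. A small sketch using the figures from the example above (a 50/50 traffic split is assumed):

```python
def test_duration_days(per_variant, variants, daily_visitors):
    """Days to reach the required sample when daily traffic is split across all variants."""
    return per_variant * variants / daily_visitors

# Worked example above: 3,100 per variant, two variants, 1,000 total visitors per day
print(test_duration_days(3_100, 2, 1_000))   # 6.2 days
```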
How to Reduce Sample Size
- Accept larger effect sizes: If you only need to detect 50% improvements (not 10%), sample size drops dramatically; required samples scale with the inverse square of the effect size (the sketch after this list compares these levers)
- Reduce confidence level: 90% instead of 95% cuts sample size ~25%
- Reduce power: 70% instead of 80% cuts sample size ~20%
- Increase traffic or test higher-converting steps: More daily traffic reaches the required sample size sooner, and higher baseline conversion rates need smaller samples
- Test only what matters: Don't test trivial changes; focus on high-impact variations
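To see how much each lever matters, vary one input at a time. A sketch reusing the sample_size_per_variant function from earlier (all parameter values are illustrative):

```python
base = sample_size_per_variant(0.02, 0.25)   # 95% confidence, 80% power, 25% MDE
print("baseline settings:      ", base)
print("90% confidence instead: ", sample_size_per_variant(0.02, 0.25, alpha=0.10))
print("70% power instead:      ", sample_size_per_variant(0.02, 0.25, power=0.70))
print("50% MDE instead of 25%: ", sample_size_per_variant(0.02, 0.50))
```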
Important Limitations & Assumptions
- This calculator uses simplified formulas; the exact sample size depends on the statistical test used
- Assumes the normal approximation to the binomial; very low conversion rates (<1%) may need adjustment or an exact test
- Does not account for multiple testing corrections or sequential testing
- Assumes constant traffic throughout test duration; actual variation may increase required time
- Does not account for novelty effects that wear off after initial exposure
Understanding Statistical Power and Type II Errors
Statistical power (typically 80%) is the probability of detecting a true effect of the targeted size if it exists. In A/B testing, insufficient power means you risk missing real improvements (a Type II error). If a real improvement of the targeted size exists and you run a test with only 60% power, you will fail to detect it 40% of the time. This is why 80% is the industry standard: it balances the cost of longer tests against the risk of missing improvements. Using this calculator with 80% power gives you an 80% chance of detecting an improvement at least as large as your minimum detectable effect. High-value tests (those where detecting an improvement protects significant revenue) may justify 90% power, at the cost of larger samples.
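The relationship also runs in reverse: given a sample size, you can estimate the power you would achieve for a given effect. A minimal sketch under the same normal approximation (the function and the example inputs are illustrative):

```python
from statistics import NormalDist

def achieved_power(n_per_variant, baseline, rel_mde, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test with n visitors per variant."""
    p1, p2 = baseline, baseline * (1 + rel_mde)
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant) ** 0.5
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance the test statistic clears the critical value when the true lift really is rel_mde
    return 1 - NormalDist().cdf(z_alpha - abs(p2 - p1) / se)

# Illustrative: roughly 83% power with 15,000 visitors per variant at 2% baseline, 25% MDE
print(achieved_power(15_000, 0.02, 0.25))
```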
Multiple Testing and Statistical Corrections
A common mistake in A/B testing is running many tests without correcting for multiple comparisons. If you run 10 independent tests at 95% confidence and none of the changes has any real effect, the chance of seeing at least one false positive is about 40% (1 - 0.95^10). This is why multiple-comparison corrections (such as Bonferroni) and sequential testing methods exist. This calculator, however, assumes a single hypothesis test. If you are conducting multiple analyses (peeking at results daily, testing multiple variants, etc.), you need stricter significance thresholds or explicit corrections. In practice, this is a primary source of A/B testing failures.
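One simple (and conservative) correction is Bonferroni: divide α by the number of comparisons and plan the sample size at that stricter threshold. A sketch reusing sample_size_per_variant from earlier (the number of comparisons is illustrative):

```python
def bonferroni_alpha(alpha, num_comparisons):
    """Stricter per-comparison significance level that keeps the overall false-positive budget."""
    return alpha / num_comparisons

# Planning one of 5 simultaneous comparisons against an overall 5% false-positive budget
adjusted = bonferroni_alpha(0.05, 5)                         # 0.01
print(sample_size_per_variant(0.02, 0.25, alpha=adjusted))   # noticeably larger than the uncorrected plan
```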
Common A/B Testing Pitfalls
- Peeking at results: Checking test results repeatedly before reaching the planned sample size inflates the false positive rate. The test was designed for a specific sample size; looking early breaks those guarantees (the simulation after this list shows how quickly the error rate inflates).
- Stopping early for a winner: Even if results look positive early, you must reach the calculated sample size. "Early winners" are often statistical noise that disappears with more data.
- Testing too many variants: Each additional variant requires more total traffic. Testing five variants against a control means six arms instead of two, roughly tripling the total traffic of a simple A vs B test before any multiple-comparison correction.
- Ignoring seasonal effects: Conversion rates vary by day of week, season, and holiday. A test running only on weekends may not reflect overall performance.
- Neglecting external events: Press coverage, competitor launches, and market changes affect conversion rates. Control for external factors.
- Low baseline conversion rates: Tests with <0.5% baseline conversion rates require enormous sample sizes. Consider testing on higher-traffic pages or accepting larger minimum detectable effects.
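The peeking pitfall is easy to demonstrate with a small simulation: run A/A tests (no real difference), check a z-test after every traffic batch, and stop at the first "significant" result. The false-positive rate climbs well above the nominal 5%. A minimal sketch with made-up traffic numbers:

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(p=0.05, batch=1_000, peeks=10, sims=500, alpha=0.05):
    """Fraction of A/A tests declared 'significant' when results are checked after every batch."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(sims):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(peeks):
            conv_a += sum(random.random() < p for _ in range(batch))
            conv_b += sum(random.random() < p for _ in range(batch))
            n_a += batch
            n_b += batch
            pooled = (conv_a + conv_b) / (n_a + n_b)
            se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
            if se > 0 and abs(conv_a / n_a - conv_b / n_b) / se > z_crit:
                false_positives += 1
                break
    return false_positives / sims

print(peeking_false_positive_rate())   # typically well above the nominal 0.05
```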
Real-World A/B Testing Scenarios
SaaS Free Trial Sign-up: 8% baseline conversion, want to detect a 15% improvement (8% → 9.2%). At 95% confidence and 80% power: ~3,850 per variant, ~7,700 total. At 1,000 daily visitors to the sign-up flow (500 per variant): roughly 8 days of test duration. Reasonable for feature validation.
High-Traffic E-commerce Checkout: 2% baseline conversion, want to detect a 20% improvement (2% → 2.4%). With 50,000 daily visitors, you reach the required sample size in under a day. This justifies testing small improvements on high-traffic pages.
Low-Traffic Email Campaign: 0.5% baseline click rate, want to detect a 50% improvement (0.5% → 0.75%). At 95% confidence and 80% power: ~3,200 per variant, ~6,400 total. If each weekly send reaches 2,000 subscribers (1,000 per variant), that is roughly 3-4 sends, or 3+ weeks. Consider accepting larger effect sizes or running longer campaigns.
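The same duration arithmetic from the worked example applies to these scenarios; the per-variant figures below are the ones quoted above, and the traffic assumptions are illustrative:

```python
scenarios = [
    # (name, per-variant sample size quoted above, daily units split across both variants)
    ("SaaS free trial sign-up", 3_850, 1_000),   # 1,000 visitors/day to the sign-up flow
    ("Email campaign",          3_200, 2_000),   # one weekly send reaching 2,000 subscribers
]

for name, per_variant, daily in scenarios:
    periods = 2 * per_variant / daily
    print(f"{name}: about {periods:.1f} days or sends to reach sample size")
```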
Moving Beyond Binary Comparisons
This calculator focuses on detecting differences between two variants. Modern A/B testing increasingly uses multivariate testing (testing multiple elements simultaneously) and continuous experimentation platforms (always-on testing infrastructure). In multivariate tests, sample size requirements grow quickly with the number of element combinations tested. Continuous experimentation requires infrastructure to track test assignments and results, but it shortens the time between hypothesis and decision.
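A rough way to see how traffic needs scale with more arms: each arm needs approximately the per-variant sample from a pairwise comparison (before any multiple-comparison correction), so total traffic grows with the number of arms. A sketch reusing sample_size_per_variant (arm counts are illustrative):

```python
per_arm = sample_size_per_variant(0.02, 0.25)

for arms in (2, 4, 8):   # e.g. A/B, a 2x2 multivariate test, a 2x2x2 multivariate test
    print(f"{arms} arms: about {arms * per_arm:,} total visitors (before corrections)")
```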
Bayesian vs. Frequentist A/B Testing
This calculator uses frequentist statistics (classical hypothesis testing). Bayesian A/B testing is an alternative approach that treats effect size as a probability distribution and allows stopping decisions based on posterior probability of superiority. Bayesian methods can enable earlier stopping when results are clear or stopping when the difference is too small to matter. However, Bayesian tests still require sample size planning for proper calibration—you can't simply stop whenever results look good. Both approaches require disciplined methodology to avoid false positives.
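For contrast with the frequentist planning above, here is a minimal sketch of the Bayesian comparison described in this section: model each variant's conversion rate with a Beta posterior and estimate the probability that B beats A by Monte Carlo sampling. The priors, counts, and sample count are all illustrative assumptions, not the calculator's method.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    wins = 0
    for _ in range(samples):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / samples

# Illustrative counts only: 200/10,000 vs 245/10,000 conversions
print(prob_b_beats_a(200, 10_000, 245, 10_000))   # posterior probability that B is better
```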
Summary
Proper sample size planning is foundational to valid A/B testing. This calculator helps you determine the sample size needed to detect meaningful improvements with high probability. Remember: reach your calculated sample size before making decisions, avoid peeking at results, correct for multiple testing, and account for external factors. A/B testing failures are often not due to the math being wrong, but due to violations of test design assumptions—respecting those assumptions ensures your tests deliver actionable insights. Use this tool to plan your tests rigorously, then execute without deviation from the plan.
