A/B Testing Demystified - Part 2: Beyond the Basics
5 min read • July 2, 2025
#A/B Testing #Statistics #Experiments #Data Science
TL;DR
- Plan for power: choose a minimum detectable effect (MDE) and compute sample size before you launch.
- Don’t peek without protection: use group-sequential designs or alpha-spending functions.
- Correct multiplicity: when testing many variants/metrics, control FDR (BH) or FWER (Holm/Bonferroni).
- Bayesian A/B gives sequentially coherent decisions using posterior win probability and credible intervals.
- Reduce variance to win time: CUPED, stratification, and covariate adjustment shrink required sample sizes.
- Defend validity: check SRM, instrument stability, novelty/day effects, and interference/SUTVA.
Power & Sample Size - turning business goals into math
(a) Set your MDE first
- The Minimum Detectable Effect (MDE) is the smallest lift worth detecting, based on business value (e.g., +0.4pp CTR or +1.5% revenue per user).
- Smaller MDEs sound great, but they require much bigger samples (since $n \propto 1/\Delta^2$).
(b) Two-proportion z-test (conversion metrics)
For equal group sizes and baseline rate $p$, the approximate per-group sample size to detect an absolute lift $\Delta$ (the MDE) is:
$$ n \approx \frac{2p(1-p)(z_{1-\alpha/2}+z_{1-\beta})^2}{\Delta^2} $$
- α (significance level): the maximum false positive rate you’re willing to tolerate.
Example: α=0.05 means that if there’s no real effect, you’ll still (falsely) see a “significant” win about 5% of the time.
- Power (1−β): the probability of catching a real effect of size Δ when it exists.
Example: 80% power means you’ll detect a true win 8 times out of 10.
Example calculation: baseline $p=0.05$, MDE $\Delta=0.005$ (0.5pp), α=0.05, power=0.8:
$$ n \approx \frac{2\cdot0.05\cdot0.95\cdot(1.96+0.84)^2}{0.005^2} \approx 29{,}800 \text{ users per arm.} $$
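A minimal Python sketch of this calculation (the helper name is illustrative; it just evaluates the formula above using SciPy's normal quantiles):

```python
from scipy.stats import norm

def two_proportion_sample_size(p, mde, alpha=0.05, power=0.80):
    """Approximate per-arm sample size to detect an absolute lift `mde`
    on a baseline conversion rate `p` with a two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    return 2 * p * (1 - p) * (z_alpha + z_beta) ** 2 / mde ** 2

print(f"{two_proportion_sample_size(0.05, 0.005):,.0f}")  # ≈ 29,800 users per arm
```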
(c) Continuous metrics (e.g., revenue/user)
With outcome standard deviation $\sigma$ and desired mean lift $\Delta$:
$$ n \approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\Delta^2} $$
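The same sketch adapts directly to continuous metrics; the σ and lift values below are made up purely for illustration:

```python
from scipy.stats import norm

def continuous_sample_size(sigma, mde, alpha=0.05, power=0.80):
    """Approximate per-arm sample size to detect a mean lift `mde`
    on a metric with standard deviation `sigma` (two-sided test)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * sigma ** 2 * z ** 2 / mde ** 2

# Hypothetical revenue/user metric: sigma = $20, MDE = $0.50
print(f"{continuous_sample_size(20, 0.5):,.0f}")  # ≈ 25,100 users per arm
```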
How Parameters Drive Sample Size (Sensitivity Intuition)
Equations are useful, but let’s build intuition: how do changes in α, power, baseline rate, and MDE affect the required sample size?
(a) Significance level (α)
Lower α = stricter evidence bar = larger sample.
- α=0.05 (5% false alarm tolerance) → ~29,800 per arm.
- α=0.01 (1% false alarm tolerance) → ~44,400 per arm.
→ ~50% increase just from lowering α.
(b) Power (1−β)
Higher power = lower chance of missing a true effect = larger sample.
- 80% power → ~29,800 per arm.
- 90% power → ~39,900 per arm.
→ ~35% more traffic.
(c) Baseline conversion rate (p)
Variance of a Bernoulli is p(1−p).
- p=0.5 is maximum variance → hardest to detect a given absolute lift.
- p very low or very high → smaller variance → easier, for the same absolute MDE.
(d) MDE
Halving MDE = 4× required sample size.
Takeaway Table
Change | Effect on Required Sample Size (per arm, vs. ≈29,800 baseline) |
---|---|
Lower α (0.05 → 0.01) | ≈ +50% |
Higher power (80% → 90%) | ≈ +35% |
Baseline p from 5% → 50% (same absolute MDE) | ≈ ×5.3 |
MDE halved (1.0pp → 0.5pp) | ×4 |
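The table can be reproduced by sweeping the same formula; the helper below simply restates the earlier sketch under illustrative names:

```python
from scipy.stats import norm

def n_per_arm(p, mde, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * p * (1 - p) * z ** 2 / mde ** 2

base = n_per_arm(0.05, 0.005)  # the running example: ≈ 29,800 per arm
scenarios = {
    "alpha = 0.01": n_per_arm(0.05, 0.005, alpha=0.01),
    "power = 0.90": n_per_arm(0.05, 0.005, power=0.90),
    "p = 0.50 (same 0.5pp MDE)": n_per_arm(0.50, 0.005),
    "MDE halved to 0.25pp": n_per_arm(0.05, 0.0025),
}
for label, n in scenarios.items():
    print(f"{label}: n ≈ {n:,.0f} per arm ({n / base:.2f}x baseline)")
```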
Sequential Testing - peeking without p-hacking
(a) The “peeking” trap
If you repeatedly check p-values mid-experiment and stop at the first p<0.05, your false positive rate balloons far above 5%.
Example: It’s like opening the oven too early — the cake looks done but collapses.
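A quick Monte Carlo sketch makes the inflation concrete: simulate A/A tests (no true effect at all), peek ten times, and count how often any peek looks “significant”. The simulation parameters below are arbitrary:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_sims=2_000, n_per_arm=10_000, looks=10, p=0.05):
    """Simulate A/A experiments and stop at the first peek with p < 0.05."""
    z_crit = norm.ppf(0.975)
    checkpoints = np.linspace(n_per_arm // looks, n_per_arm, looks, dtype=int)
    false_positives = 0
    for _ in range(n_sims):
        a = rng.random(n_per_arm) < p   # control conversions
        b = rng.random(n_per_arm) < p   # "treatment" is identical
        for n in checkpoints:
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(b[:n].mean() - a[:n].mean()) / se > z_crit:
                false_positives += 1
                break
    return false_positives / n_sims

print(peeking_false_positive_rate())  # typically ~0.19 with 10 looks, not 0.05
```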
(b) Group-sequential designs
Pre-plan “interim looks” with corrected thresholds.
- O’Brien–Fleming: conservative early, looser late.
- Pocock: same threshold at each look.
(c) Alpha-spending functions
Treat α as a budget spent over time.
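One common choice is the Lan–DeMets spending function that approximates O’Brien–Fleming boundaries; the sketch below shows how little of the α budget it spends early (t is the fraction of the planned sample collected so far):

```python
from math import sqrt
from scipy.stats import norm

def obf_alpha_spent(t, alpha=0.05):
    """Cumulative two-sided alpha spent at information fraction t (0 < t <= 1)
    under the O'Brien-Fleming-type spending function of Lan & DeMets."""
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / sqrt(t)))

for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t = {t:.2f}: cumulative alpha spent ≈ {obf_alpha_spent(t):.4f}")
# Roughly 0.0001, 0.006, 0.024, and 0.050: almost everything is saved for the end.
```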
(d) Always-valid inference (advanced)
Confidence sequences and e-values let you monitor continuously without inflating error.
Multiple Testing: many variants & many metrics
(a) The problem
Test more things → more chances of false positives.
(b) FWER (Family-Wise Error Rate)
Probability of at least one false discovery among all tests.
- Controlled by Bonferroni, Holm.
- Example: With 20 tests at α=0.05, Bonferroni adjusts each to α=0.0025.
(c) FDR (False Discovery Rate)
Expected proportion of false positives among the declared “wins.”
- Controlled by Benjamini–Hochberg.
- Example: If 5 out of 20 “wins” are actually false, FDR=25%.
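A hand-rolled sketch of both procedures on hypothetical p-values (in practice `statsmodels.stats.multitest.multipletests` does the same):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down procedure: controls FWER."""
    p = np.asarray(pvals)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                      # stop at the first failure
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up procedure: controls FDR."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    passes = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.nonzero(passes)[0].max()    # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.012, 0.030, 0.047, 0.210]   # hypothetical per-variant results
print("Holm rejects:", holm(pvals))                  # first three only
print("BH rejects:  ", benjamini_hochberg(pvals))    # first four: FDR is more lenient
```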
Bayesian A/B Testing: decisions as probabilities
(a) Beta–Binomial model
With a Beta(a, b) prior on each arm’s conversion rate, observing s conversions in n users gives a Beta(a+s, b+n−s) posterior: the data simply updates your beliefs.
(b) Posterior probability vs. p-value
- Frequentist p-value: “If H₀ were true, how surprising is this data?”
- Bayesian posterior: “Given the data, what’s the probability B is better than A?”
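A minimal sketch with made-up counts and a uniform Beta(1, 1) prior, estimating P(B > A) and a credible interval by sampling from the two posteriors:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical results: conversions out of users per arm
conv_a, n_a = 480, 10_000
conv_b, n_b = 530, 10_000

# Beta(1, 1) prior; posterior is Beta(1 + conversions, 1 + non-conversions)
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

print("P(B > A) ≈", (post_b > post_a).mean())        # ≈ 0.95 for these counts
print("95% credible interval for the lift:",
      np.percentile(post_b - post_a, [2.5, 97.5]))
```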
Variance Reduction: get the same power faster
(a) CUPED (Controlled Using Pre-Experiment Data)
Adjust outcomes using pre-period behavior to reduce variance.
Intuition: If users who converted in the past are also more likely to convert now, using that info removes noise.
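A sketch of the usual adjustment, Y_adj = Y − θ·(X − mean(X)) with θ = Cov(X, Y) / Var(X), on simulated data where pre-period spend predicts in-experiment spend:

```python
import numpy as np

def cuped_adjust(y, x):
    """Remove the part of metric y that is predictable from pre-period covariate x."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(7)
pre = rng.gamma(shape=2.0, scale=10.0, size=50_000)    # pre-experiment spend
y = 0.8 * pre + rng.normal(0, 5, size=50_000)          # in-experiment spend
y_adj = cuped_adjust(y, pre)

print(f"variance before: {y.var():.1f}")
print(f"variance after:  {y_adj.var():.1f}")           # far smaller → same power, less traffic
```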
(b) Stratification / blocking
Randomize within buckets (e.g., device, geo).
(c) Regression adjustment
Regress the outcome on a treatment indicator plus pre-experiment covariates; the treatment coefficient estimates the lift with tighter standard errors (see the sketch below).
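A sketch on simulated data using plain least squares (a production analysis would typically use a statistics package with robust standard errors); the coefficients and noise levels are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
treat = rng.integers(0, 2, size=n)                  # randomized 0/1 assignment
pre = rng.normal(100, 20, size=n)                   # pre-experiment covariate
y = 0.5 * treat + 0.8 * pre + rng.normal(0, 10, n)  # outcome, true lift = 0.5

# Unadjusted estimate: simple difference in means (noisy)
print("diff in means:", y[treat == 1].mean() - y[treat == 0].mean())

# Adjusted estimate: regress y on [intercept, treatment, covariate]
X = np.column_stack([np.ones(n), treat, pre])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("adjusted treatment effect:", beta[1])        # same target, much smaller variance
```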
Validity Threats: what breaks tests in the wild
(a) SRM (Sample Ratio Mismatch)
When actual traffic split drifts from intended (e.g., expected 50/50, got 53/47). Usually indicates routing/logging bugs.
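A standard detector is a chi-square goodness-of-fit test on the assignment counts; the numbers below are made up to mirror the 53/47 example:

```python
from scipy.stats import chisquare

observed = [53_120, 46_880]      # users actually assigned to A and B
expected = [50_000, 50_000]      # intended 50/50 split of the same total

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"SRM check p-value: {p_value:.2e}")
# A vanishingly small p-value means the split itself is broken:
# investigate routing/logging before reading any metric results.
```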
(b) Non-stationarity & novelty
Behavior changes over time → cover full cycles (week, month).
(c) Interference & SUTVA
SUTVA = Stable Unit Treatment Value Assumption. One user’s treatment shouldn’t affect another’s outcome.
Example: In a social network, treating one user may affect their friends too.
Decision Frameworks: when to ship, hold, or stop
- Frequentist: Significant effect at pre-set α, MDE met, no guardrail regressions.
- Sequential: Boundary crossing at planned interim look.
- Bayesian: Posterior win prob ≥ 95%.
- Risk-aware: Weigh cost of false launch vs delaying a true win.
Mini Example
Baseline $p=0.06$, MDE $\Delta=0.006$ (0.6pp), α=0.05, power=0.8:
$$ n \approx \frac{2\cdot0.06\cdot0.94\cdot(1.96+0.84)^2}{0.006^2} \approx 24{,}600 \text{ users per arm.} $$
At 100k daily uniques, you collect the ≈49,000 users needed across both arms in well under a day; in practice you would still run at least one full weekly cycle to cover day-of-week effects.
Part 3 Preview
Running A/B in the Wild: Heavy-tail metrics, confidence sequences, cluster randomization, and case studies.