A/B Testing Demystified - Part 2: Beyond the Basics

5 min read  •  July 2, 2025

#A/B Testing #Statistics #Experiments #Data Science

TL;DR

  • Plan for power: choose a minimum detectable effect (MDE) and compute sample size before you launch.
  • Don’t peek without protection: use group-sequential designs or alpha-spending functions.
  • Correct multiplicity: when testing many variants/metrics, control FDR (BH) or FWER (Holm/Bonferroni).
  • Bayesian A/B gives sequentially coherent decisions using posterior win probability and credible intervals.
  • Reduce variance to win time: CUPED, stratification, and covariate adjustment shrink required sample sizes.
  • Defend validity: check SRM, instrument stability, novelty/day effects, and interference/SUTVA.

Power & Sample Size - turning business goals into math

(a) Set your MDE first

  • The Minimum Detectable Effect (MDE) is the smallest lift worth detecting, based on business value (e.g., +0.4pp CTR or +1.5% revenue per user).
  • Smaller MDEs sound great, but they require much bigger samples (since $n \propto 1/\Delta^2$).

(b) Two-proportion z-test (conversion metrics)
For equal group sizes and a baseline rate $p$, the approximate per-group sample size to detect an absolute lift $\Delta$ (the MDE) is:

$$ n \approx \frac{2p(1-p)(z_{1-\alpha/2}+z_{1-\beta})^2}{\Delta^2} $$

  • α (significance level): the maximum false positive rate you’re willing to tolerate.
    Example: α=0.05 means that if there’s no real effect, you’ll still (falsely) see a “significant” win about 5% of the time.
  • Power (1−β): the probability of catching a real effect of size Δ when it exists.
    Example: 80% power means you’ll detect a true win 8 times out of 10.

Example calculation: baseline $p=0.05$, MDE = 0.005 (0.5pp). With α = 0.05 and power = 0.8:

$$ n \approx \frac{2\cdot0.05\cdot0.95\cdot(1.96+0.84)^2}{0.005^2} \approx 29{,}800 \text{ users per arm.} $$

(c) Continuous metrics (e.g., revenue/user)
With standard deviation $\sigma$ and desired mean lift $\Delta$:

$$ n \approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\Delta^2} $$
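Both formulas are easy to sanity-check in code. Here is a minimal sketch (function names are mine; the z-quantiles come from SciPy):

```python
from scipy.stats import norm

def n_per_arm_binary(p, mde, alpha=0.05, power=0.80):
    """Per-arm sample size to detect an absolute lift `mde` on a baseline rate `p`."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)   # z_{1-alpha/2} + z_{1-beta}
    return 2 * p * (1 - p) * z**2 / mde**2

def n_per_arm_continuous(sigma, mde, alpha=0.05, power=0.80):
    """Per-arm sample size to detect a mean lift `mde` on a metric with std dev `sigma`."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * sigma**2 * z**2 / mde**2

print(round(n_per_arm_binary(p=0.05, mde=0.005)))   # ~29,800: the worked example above
```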

[Figure: Power vs. sample size for different MDEs]


How Parameters Drive Sample Size (Sensitivity Intuition)

Equations are useful, but let’s build intuition: how do changes in α, power, baseline rate, and MDE affect the required sample size?

(a) Significance level (α)
Lower α = stricter evidence bar = larger sample.

  • α=0.05 (5% false-alarm tolerance) → ~29,800 per arm.
  • α=0.01 (1% false-alarm tolerance) → ~44,400 per arm.
    → roughly a 50% increase just from lowering α.

(b) Power (1−β)
Higher power = lower chance of missing a true effect = larger sample.

  • 80% power → ~29,800 per arm.
  • 90% power → ~39,900 per arm.
    → ~34% more traffic.

(c) Baseline conversion rate (p)
Variance of a Bernoulli is p(1−p).

  • p=0.5 is maximum variance → hardest to detect a given absolute difference.
  • p very low or very high → less variance → easier.

(d) MDE
Halving MDE = 4× required sample size.


Takeaway Table

Change                                          Effect on sample size (n)
Lower α (0.05 → 0.01)                           ≈ +50%
Higher power (80% → 90%)                        ≈ +34%
Baseline p from 5% → 50% (same absolute MDE)    ≈ ×5.3
MDE halved (1.0pp → 0.5pp)                      ×4
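
The table can be reproduced by plugging each scenario into the binary sample-size formula; a quick self-contained sketch:

```python
from scipy.stats import norm

def n_binary(p, mde, alpha, power):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * p * (1 - p) * z**2 / mde**2

base = n_binary(0.05, 0.005, alpha=0.05, power=0.80)            # ~29,800 per arm
print(n_binary(0.05, 0.005, alpha=0.01, power=0.80) / base)     # ~1.49: lower alpha
print(n_binary(0.05, 0.005, alpha=0.05, power=0.90) / base)     # ~1.34: higher power
print(n_binary(0.50, 0.005, alpha=0.05, power=0.80) / base)     # ~5.3:  baseline 50%
print(n_binary(0.05, 0.0025, alpha=0.05, power=0.80) / base)    # 4.0:   MDE halved
```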

Sequential Testing - peeking without p-hacking

(a) The “peeking” trap
If you repeatedly check p-values mid-experiment and stop at the first p<0.05, your false positive rate balloons far above 5%.
Analogy: it’s like opening the oven too early; the cake looks done but collapses.
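
To see the inflation concretely, here is a small A/A simulation (all parameters below are made up for illustration): there is no true effect, yet stopping at the first interim look with p < 0.05 fires well above 5% of the time.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_sims=2000, n_per_arm=10_000, n_looks=20, p=0.05):
    """A/A tests: both arms convert at rate p, but we stop at the first look with p < 0.05."""
    looks = np.linspace(n_per_arm / n_looks, n_per_arm, n_looks).astype(int)
    hits = 0
    for _ in range(n_sims):
        a = rng.random(n_per_arm) < p                  # control conversions
        b = rng.random(n_per_arm) < p                  # treatment conversions (same rate!)
        ca, cb = np.cumsum(a), np.cumsum(b)
        for n in looks:
            pa, pb = ca[n - 1] / n, cb[n - 1] / n
            pooled = (pa + pb) / 2
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if 2 * norm.sf(abs(pb - pa) / se) < 0.05:  # two-sided z-test at this look
                hits += 1
                break
    return hits / n_sims

print(peeking_false_positive_rate())   # well above the nominal 0.05
```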

(b) Group-sequential designs
Pre-plan “interim looks” with corrected thresholds.

  • O’Brien–Fleming: conservative early, looser late.
  • Pocock: same threshold at each look.

(c) Alpha-spending functions
Treat α as a budget spent over time.

[Figure: Alpha-spending functions (O’Brien–Fleming vs. Pocock)]
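
As a sketch of what “spending” means, here are the cumulative α budgets implied by the two classic Lan–DeMets spending functions at four evenly spaced looks (this only shows the budget curve, not the full boundary calculation):

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
t = np.array([0.25, 0.50, 0.75, 1.00])   # information fraction at each planned look

# O'Brien-Fleming-type spending: almost no alpha early, most of it saved for the end
obf = 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t)))

# Pocock-type spending: the budget is used much more evenly across looks
pocock = alpha * np.log(1 + (np.e - 1) * t)

for frac, a_obf, a_poc in zip(t, obf, pocock):
    print(f"t={frac:.2f}  OBF spent={a_obf:.4f}  Pocock spent={a_poc:.4f}")
```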

(d) Always-valid inference (advanced)
Confidence sequences and e-values let you monitor continuously without inflating error.


Multiple Testing: many variants & many metrics

(a) The problem
Test more things → more chances of false positives.

(b) FWER (Family-Wise Error Rate)
Probability of at least one false discovery among all tests.

  • Controlled by Bonferroni, Holm.
  • Example: With 20 tests at α=0.05, Bonferroni adjusts each to α=0.0025.

(c) FDR (False Discovery Rate)
Expected proportion of false positives among the declared “wins.”

  • Controlled by Benjamini–Hochberg.
  • Example: If 5 out of 20 “wins” are actually false, FDR=25%.
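
A quick sketch of both families of corrections on the same (made-up) p-values, using statsmodels:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from six variant/metric comparisons
pvals = [0.001, 0.008, 0.020, 0.041, 0.049, 0.300]

for method in ["bonferroni", "holm", "fdr_bh"]:
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:10s} reject: {reject.tolist()}")
# FWER methods (bonferroni, holm) are stricter; BH typically declares more wins at the same alpha.
```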

Bayesian A/B Testing: decisions as probabilities

(a) Beta–Binomial model
With a Beta prior on each arm’s conversion rate, the binomial data yield a Beta posterior, so beliefs about the rates update in closed form as data arrive.

(b) Posterior win probability vs. p-value

  • Frequentist p-value: “If H₀ were true, how surprising is this data?”
  • Bayesian posterior: “Given the data, what’s the probability B is better than A?”
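
A minimal Beta–Binomial sketch with made-up counts and uniform Beta(1, 1) priors: sample from each arm’s posterior, then read off the win probability and a credible interval for the lift.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical results: conversions out of users in each arm
conv_a, n_a = 520, 10_000
conv_b, n_b = 570, 10_000

# Beta(1, 1) prior + binomial data -> Beta posterior for each arm's conversion rate
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

lift = post_b - post_a
print("P(B > A)             :", (lift > 0).mean())
print("95% credible interval:", np.percentile(lift, [2.5, 97.5]))
```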

Variance Reduction: get the same power faster

(a) CUPED (Controlled-experiment Using Pre-Experiment Data)
Adjust outcomes using pre-period behavior to reduce variance.
Intuition: If users who converted in the past are also more likely to convert now, using that info removes noise.
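
A minimal CUPED sketch on synthetic data (the pre-experiment metric, coefficients, and noise level below are all made up): the adjusted metric keeps the same treatment effect but has noticeably less variance.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

pre = rng.gamma(2.0, 5.0, size=n)                      # pre-experiment metric per user
treat = rng.integers(0, 2, size=n)                     # 0 = control, 1 = treatment
y = 0.8 * pre + 0.5 * treat + rng.normal(0, 5, n)      # in-experiment metric, true lift +0.5

theta = np.cov(y, pre)[0, 1] / np.var(pre)             # theta = cov(Y, X) / var(X)
y_cuped = y - theta * (pre - pre.mean())               # CUPED-adjusted outcome

print("variance before:", round(y.var(), 1), " after:", round(y_cuped.var(), 1))
print("lift estimate  :", round(y_cuped[treat == 1].mean() - y_cuped[treat == 0].mean(), 3))
```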

(b) Stratification / blocking
Randomize within buckets (e.g., device, geo).

(c) Regression adjustment
Add covariates in a regression framework.
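
The same idea in a regression framework, sketched with statsmodels on the kind of synthetic data used above (column names are mine): adding the pre-period covariate shrinks the standard error on the treatment coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 50_000
pre = rng.gamma(2.0, 5.0, size=n)
treat = rng.integers(0, 2, size=n)
y = 0.8 * pre + 0.5 * treat + rng.normal(0, 5, n)
df = pd.DataFrame({"y": y, "treat": treat, "pre": pre})

unadjusted = smf.ols("y ~ treat", data=df).fit()
adjusted = smf.ols("y ~ treat + pre", data=df).fit()
print("std. error, no covariate  :", round(unadjusted.bse["treat"], 4))
print("std. error, with covariate:", round(adjusted.bse["treat"], 4))
```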


Validity Threats: what breaks tests in the wild

(a) SRM (Sample Ratio Mismatch)
When actual traffic split drifts from intended (e.g., expected 50/50, got 53/47). Usually indicates routing/logging bugs.

[Figure: Sample ratio mismatch (SRM) diagnostic]
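
A standard SRM check is a chi-square goodness-of-fit test of the observed assignment counts against the intended split (counts below are hypothetical):

```python
from scipy.stats import chisquare

observed = [53_000, 47_000]      # users actually logged in arms A and B
expected = [50_000, 50_000]      # intended 50/50 split of the same total
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(p_value)   # a tiny p-value means: fix the routing/logging bug before reading any metrics
```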

(b) Non-stationarity & novelty
Behavior changes over time → cover full cycles (week, month).

(c) Interference & SUTVA
SUTVA = Stable Unit Treatment Value Assumption. One user’s treatment shouldn’t affect another’s outcome.
Example: In a social network, treating one user may affect their friends too.


Decision Frameworks: when to ship, hold, or stop

  • Frequentist: Significant effect at pre-set α, MDE met, no guardrail regressions.
  • Sequential: Boundary crossing at planned interim look.
  • Bayesian: Posterior win prob ≥ 95%.
  • Risk-aware: Weigh cost of false launch vs delaying a true win.

Mini Example

Baseline $p=6\%$, MDE = 0.6pp, α = 0.05, power = 0.8:

$$ n \approx \frac{2\cdot0.06\cdot0.94\cdot(1.96+0.84)^2}{0.006^2} \approx 24{,}600 \text{ users per arm.} $$

At 100k daily uniques, the ~49,000 users you need arrive in well under a day; in practice you’d still run at least a full week to cover weekly cycles.
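
Plugging the numbers into the same formula:

```python
from scipy.stats import norm

p, mde, alpha, power = 0.06, 0.006, 0.05, 0.80
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
n_per_arm = 2 * p * (1 - p) * z**2 / mde**2
print(round(n_per_arm))            # ~24,600 users per arm
print(2 * n_per_arm / 100_000)     # ~0.5 days of traffic at 100k daily uniques
```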


Part 3 Preview

Running A/B in the Wild: Heavy-tail metrics, confidence sequences, cluster randomization, and case studies.

Where do you go from here?