A/B Testing Demystified - Part 2: Beyond the Basics

5 min read  •  July 2, 2025

#A/B Testing #Statistics #Experiments #Data Science

TL;DR

  • Plan for power: choose a minimum detectable effect (MDE) and compute sample size before you launch.
  • Don’t peek without protection: use group-sequential designs or alpha-spending functions.
  • Correct multiplicity: when testing many variants/metrics, control FDR (BH) or FWER (Holm/Bonferroni).
  • Bayesian A/B gives sequentially coherent decisions using posterior win probability and credible intervals.
  • Reduce variance to win time: CUPED, stratification, and covariate adjustment shrink required sample sizes.
  • Defend validity: check SRM, instrument stability, novelty/day effects, and interference/SUTVA.

Power & Sample Size - turning business goals into math

(a) Set your MDE first

  • The Minimum Detectable Effect (MDE) is the smallest lift worth detecting, based on business value (e.g., +0.4pp CTR or +1.5% revenue per user).
  • Smaller MDEs sound great, but they require much bigger samples (since $n \propto 1/\Delta^2$).

(b) Two-proportion z-test (conversion metrics)
For equal group sizes and a baseline rate $p$, the approximate per-group sample size to detect an absolute lift $\Delta$ (the MDE) is:

$$ n \approx \frac{2p(1-p)(z_{1-\alpha/2}+z_{1-\beta})^2}{\Delta^2} $$

  • α (significance level): the maximum false positive rate you’re willing to tolerate.
    Example: α=0.05 means that if there’s no real effect, you’ll still (falsely) see a “significant” win about 5% of the time.
  • Power (1−β): the probability of catching a real effect of size Δ when it exists.
    Example: 80% power means you’ll detect a true win 8 times out of 10.

Example calculation: baseline $p=0.05$, MDE = 0.005 (0.5pp). With α = 0.05 and power = 0.8:

$$ n \approx \frac{2\cdot0.05\cdot0.95\cdot(1.96+0.84)^2}{0.005^2} \approx 29{,}800 \text{ users per arm.} $$

(c) Continuous metrics (e.g., revenue/user)
With standard deviation $\sigma$ and desired mean lift $\Delta$:

$$ n \approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\Delta^2} $$
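Both formulas are easy to sanity-check in code. Here is a minimal sketch (function names are mine; the z-quantiles come from SciPy):

```python
from scipy.stats import norm

def n_per_arm_binary(p, mde, alpha=0.05, power=0.80):
    """Per-arm sample size to detect an absolute lift `mde` on a baseline rate `p`."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)   # z_{1-alpha/2} + z_{1-beta}
    return 2 * p * (1 - p) * z**2 / mde**2

def n_per_arm_continuous(sigma, mde, alpha=0.05, power=0.80):
    """Per-arm sample size to detect a mean lift `mde` on a metric with std dev `sigma`."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * sigma**2 * z**2 / mde**2

print(round(n_per_arm_binary(p=0.05, mde=0.005)))   # ~29,800: the worked example above
```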

[Figure: Power vs. sample size for different MDEs]


How Parameters Drive Sample Size (Sensitivity Intuition)

Equations are useful, but let’s build intuition: how do changes in α, power, baseline rate, and MDE affect the required sample size?

(a) Significance level (α)
Lower α = stricter evidence bar = larger sample.

  • α=0.05 (5% false-alarm tolerance) → ~29,800 per arm.
  • α=0.01 (1% false-alarm tolerance) → ~44,400 per arm.
    → roughly a 50% increase just from lowering α.

(b) Power (1−β)
Higher power = lower chance of missing a true effect = larger sample.

  • 80% power → ~29,800 per arm.
  • 90% power → ~39,900 per arm.
    → ~34% more traffic.

(c) Baseline conversion rate (p)
Variance of a Bernoulli is p(1−p).

  • p=0.5 is maximum variance → hardest to detect a given absolute difference.
  • p very low or very high → less variance → easier.

(d) MDE
Halving MDE = 4× required sample size.


Takeaway Table

Change                                          Effect on sample size (n)
Lower α (0.05 → 0.01)                           ≈ +50%
Higher power (80% → 90%)                        ≈ +34%
Baseline p from 5% → 50% (same absolute MDE)    ≈ ×5.3
MDE halved (1.0pp → 0.5pp)                      ×4
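
The table can be reproduced by plugging each scenario into the binary sample-size formula; a quick self-contained sketch:

```python
from scipy.stats import norm

def n_binary(p, mde, alpha, power):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * p * (1 - p) * z**2 / mde**2

base = n_binary(0.05, 0.005, alpha=0.05, power=0.80)            # ~29,800 per arm
print(n_binary(0.05, 0.005, alpha=0.01, power=0.80) / base)     # ~1.49: lower alpha
print(n_binary(0.05, 0.005, alpha=0.05, power=0.90) / base)     # ~1.34: higher power
print(n_binary(0.50, 0.005, alpha=0.05, power=0.80) / base)     # ~5.3:  baseline 50%
print(n_binary(0.05, 0.0025, alpha=0.05, power=0.80) / base)    # 4.0:   MDE halved
```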

Sequential Testing - peeking without p-hacking

(a) The “peeking” trap
If you repeatedly check p-values mid-experiment and stop at the first p<0.05, your false positive rate balloons far above 5%.
Analogy: it’s like opening the oven too early; the cake looks done but collapses.
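
To see the inflation concretely, here is a small A/A simulation (all parameters below are made up for illustration): there is no true effect, yet stopping at the first interim look with p < 0.05 fires well above 5% of the time.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_sims=2000, n_per_arm=10_000, n_looks=20, p=0.05):
    """A/A tests: both arms convert at rate p, but we stop at the first look with p < 0.05."""
    looks = np.linspace(n_per_arm / n_looks, n_per_arm, n_looks).astype(int)
    hits = 0
    for _ in range(n_sims):
        a = rng.random(n_per_arm) < p                  # control conversions
        b = rng.random(n_per_arm) < p                  # treatment conversions (same rate!)
        ca, cb = np.cumsum(a), np.cumsum(b)
        for n in looks:
            pa, pb = ca[n - 1] / n, cb[n - 1] / n
            pooled = (pa + pb) / 2
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if 2 * norm.sf(abs(pb - pa) / se) < 0.05:  # two-sided z-test at this look
                hits += 1
                break
    return hits / n_sims

print(peeking_false_positive_rate())   # well above the nominal 0.05
```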

(b) Group-sequential designs
Pre-plan “interim looks” with corrected thresholds.

  • O’Brien–Fleming: conservative early, looser late.
  • Pocock: same threshold at each look.

(c) Alpha-spending functions
Treat α as a budget spent over time.

[Figure: Alpha-spending functions (O’Brien–Fleming vs. Pocock)]
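
As a sketch of what “spending” means, here are the cumulative α budgets implied by the two classic Lan–DeMets spending functions at four evenly spaced looks (this only shows the budget curve, not the full boundary calculation):

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
t = np.array([0.25, 0.50, 0.75, 1.00])   # information fraction at each planned look

# O'Brien-Fleming-type spending: almost no alpha early, most of it saved for the end
obf = 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t)))

# Pocock-type spending: the budget is used much more evenly across looks
pocock = alpha * np.log(1 + (np.e - 1) * t)

for frac, a_obf, a_poc in zip(t, obf, pocock):
    print(f"t={frac:.2f}  OBF spent={a_obf:.4f}  Pocock spent={a_poc:.4f}")
```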

(d) Always-valid inference (advanced)
Confidence sequences and e-values let you monitor continuously without inflating error.


Multiple Testing: many variants & many metrics

(a) The problem
Test more things → more chances of false positives.

(b) FWER (Family-Wise Error Rate)
Probability of at least one false discovery among all tests.

  • Controlled by Bonferroni, Holm.
  • Example: With 20 tests at α=0.05, Bonferroni adjusts each to α=0.0025.

(c) FDR (False Discovery Rate)
Expected proportion of false positives among the declared “wins.”

  • Controlled by Benjamini–Hochberg.
  • Example: If 5 out of 20 “wins” are actually false, FDR=25%.
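
A quick sketch of both families of corrections on the same (made-up) p-values, using statsmodels:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from six variant/metric comparisons
pvals = [0.001, 0.008, 0.020, 0.041, 0.049, 0.300]

for method in ["bonferroni", "holm", "fdr_bh"]:
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:10s} reject: {reject.tolist()}")
# FWER methods (bonferroni, holm) are stricter; BH typically declares more wins at the same alpha.
```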

Bayesian A/B Testing: decisions as probabilities

(a) Beta–Binomial model
With a Beta prior on each arm’s conversion rate, the binomial data yield a Beta posterior, so beliefs about the rates update in closed form as data arrive.

(b) Posterior win probability vs. p-value

  • Frequentist p-value: “If H₀ were true, how surprising is this data?”
  • Bayesian posterior: “Given the data, what’s the probability B is better than A?”
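
A minimal Beta–Binomial sketch with made-up counts and uniform Beta(1, 1) priors: sample from each arm’s posterior, then read off the win probability and a credible interval for the lift.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical results: conversions out of users in each arm
conv_a, n_a = 520, 10_000
conv_b, n_b = 570, 10_000

# Beta(1, 1) prior + binomial data -> Beta posterior for each arm's conversion rate
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

lift = post_b - post_a
print("P(B > A)             :", (lift > 0).mean())
print("95% credible interval:", np.percentile(lift, [2.5, 97.5]))
```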

Variance Reduction: get the same power faster

(a) CUPED (Controlled-experiment Using Pre-Experiment Data)
Adjust outcomes using pre-period behavior to reduce variance.
Intuition: If users who converted in the past are also more likely to convert now, using that info removes noise.
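
A minimal CUPED sketch on synthetic data (the pre-experiment metric, coefficients, and noise level below are all made up): the adjusted metric keeps the same treatment effect but has noticeably less variance.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

pre = rng.gamma(2.0, 5.0, size=n)                      # pre-experiment metric per user
treat = rng.integers(0, 2, size=n)                     # 0 = control, 1 = treatment
y = 0.8 * pre + 0.5 * treat + rng.normal(0, 5, n)      # in-experiment metric, true lift +0.5

theta = np.cov(y, pre)[0, 1] / np.var(pre)             # theta = cov(Y, X) / var(X)
y_cuped = y - theta * (pre - pre.mean())               # CUPED-adjusted outcome

print("variance before:", round(y.var(), 1), " after:", round(y_cuped.var(), 1))
print("lift estimate  :", round(y_cuped[treat == 1].mean() - y_cuped[treat == 0].mean(), 3))
```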

(b) Stratification / blocking
Randomize within buckets (e.g., device, geo).

(c) Regression adjustment
Add covariates in a regression framework.
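
The same idea in a regression framework, sketched with statsmodels on the kind of synthetic data used above (column names are mine): adding the pre-period covariate shrinks the standard error on the treatment coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 50_000
pre = rng.gamma(2.0, 5.0, size=n)
treat = rng.integers(0, 2, size=n)
y = 0.8 * pre + 0.5 * treat + rng.normal(0, 5, n)
df = pd.DataFrame({"y": y, "treat": treat, "pre": pre})

unadjusted = smf.ols("y ~ treat", data=df).fit()
adjusted = smf.ols("y ~ treat + pre", data=df).fit()
print("std. error, no covariate  :", round(unadjusted.bse["treat"], 4))
print("std. error, with covariate:", round(adjusted.bse["treat"], 4))
```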


Validity Threats: what breaks tests in the wild

(a) SRM (Sample Ratio Mismatch)
When actual traffic split drifts from intended (e.g., expected 50/50, got 53/47). Usually indicates routing/logging bugs.

[Figure: Sample ratio mismatch (SRM) diagnostic]
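
A standard SRM check is a chi-square goodness-of-fit test of the observed assignment counts against the intended split (counts below are hypothetical):

```python
from scipy.stats import chisquare

observed = [53_000, 47_000]      # users actually logged in arms A and B
expected = [50_000, 50_000]      # intended 50/50 split of the same total
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(p_value)   # a tiny p-value means: fix the routing/logging bug before reading any metrics
```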

(b) Non-stationarity & novelty
Behavior changes over time → cover full cycles (week, month).

(c) Interference & SUTVA
SUTVA = Stable Unit Treatment Value Assumption. One user’s treatment shouldn’t affect another’s outcome.
Example: In a social network, treating one user may affect their friends too.


Decision Frameworks: when to ship, hold, or stop

  • Frequentist: Significant effect at pre-set α, MDE met, no guardrail regressions.
  • Sequential: Boundary crossing at planned interim look.
  • Bayesian: Posterior win prob ≥ 95%.
  • Risk-aware: Weigh cost of false launch vs delaying a true win.

Mini Example

Baseline $p=6\%$, MDE = 0.6pp, α = 0.05, power = 0.8:

$$ n \approx \frac{2\cdot0.06\cdot0.94\cdot(1.96+0.84)^2}{0.006^2} \approx 24{,}600 \text{ users per arm.} $$

At 100k daily uniques, the ~49,000 users you need arrive in well under a day; in practice you’d still run at least a full week to cover weekly cycles.
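
Plugging the numbers into the same formula:

```python
from scipy.stats import norm

p, mde, alpha, power = 0.06, 0.006, 0.05, 0.80
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
n_per_arm = 2 * p * (1 - p) * z**2 / mde**2
print(round(n_per_arm))            # ~24,600 users per arm
print(2 * n_per_arm / 100_000)     # ~0.5 days of traffic at 100k daily uniques
```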


Part 3 Preview

Running A/B in the Wild: Heavy-tail metrics, confidence sequences, cluster randomization, and case studies.

Where do you go from here?