Chapter 2

Designing Experiments

Sample sizes, randomization schemes, and how not to fool yourself

Most experiments fail before a single data point is collected. They fail because of small sample sizes that can't detect real effects, randomization schemes that introduce bias, outcome metrics that weren't specified upfront, or stopping rules applied at the wrong time.

This chapter is about building experiments that work.


How Many Participants Do You Need?

This is the question everyone asks and most people answer too casually. "Let's run it for two weeks and see what happens" is not a sample size calculation.

You need to think about three things before you start:

1. Effect size: How big of an effect are you looking for?

If you're testing whether a button color change improves click-through rate, a realistic effect might be a 1–2 percentage point improvement. If you're testing a complete redesign, maybe 10 percentage points. These require very different sample sizes.

2. Statistical power: How confident do you want to be that you'll detect the effect if it's real?

The conventional standard is 80% power: if your change truly has the effect you hypothesized, you want an 80% chance of detecting it (i.e., getting a statistically significant result). Some people use 90% for high-stakes decisions.

3. Significance level: How often are you willing to be wrong when there's no real effect?

The conventional standard is α = 0.05: you're willing to accept a 5% false positive rate.

The Formula

For a two-sample comparison with equal group sizes, the required sample per group is approximately:

n \approx \frac{2(z_{\alpha/2} + z_{\beta})^2 \sigma^2}{\delta^2}

where:

  • δ = the minimum effect size you want to detect (in the same units as your outcome)
  • σ = the standard deviation of your outcome
  • z_{α/2} = 1.96 for α = 0.05
  • z_β = 0.84 for 80% power

This simplifies to something like: you need ~16 × (σ/δ)² observations per group.
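The formula is easy to turn into code. A minimal sketch in plain Python, with the z-values for α = 0.05 and 80% power hardcoded as defaults (swap them out for other choices):

```python
from math import ceil

# Default z-values are the conventional choices:
# z_{alpha/2} = 1.96 for alpha = 0.05, z_beta = 0.84 for 80% power.
def n_per_group(delta, sigma, z_a=1.96, z_b=0.84):
    """Approximate sample size per group for a two-sample comparison."""
    return ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

# Halving the detectable effect quadruples the required sample:
print(n_per_group(delta=4, sigma=8))  # 63
print(n_per_group(delta=2, sigma=8))  # 251
```

Note how the required n scales with (σ/δ)²: cutting the effect size in half costs you four times the observations.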

Cohen's d

In practice, effect sizes are often expressed as Cohen's d: the difference between means divided by the pooled standard deviation.

d = \frac{\mu_1 - \mu_2}{\sigma}

A rule of thumb: d = 0.2 is small, d = 0.5 is medium, d = 0.8 is large. For A/B testing on consumer products, effects of d = 0.2–0.3 are common. This is why you need hundreds or thousands of users — not dozens.

Power is how hard you squint before believing the bars are different.
Small effects are shy. To spot them reliably, you need more observations instead of more optimism.
Mini-Sim: Power Calculator Toy

Smaller effects need more observations. Slide the effect size down and watch the sample size grow teeth.
🧪 Caffeine experiment power calculation

You're running a 14-day crossover experiment on caffeine timing. Your sleep score has a standard deviation of about 8 points (on a 0–100 scale). You believe the true effect is around 4 points (half a standard deviation, i.e., d ≈ 0.5).

Required n per group ≈ 2 × (1.96 + 0.84)² × 64 / 16 ≈ 63 trials

With 14 trials per block in a crossover design, you'd need about 4–5 full blocks to have adequate power. Steady Practice calculates this for you.


Types of Randomization

Not all randomization is equal. There are several ways to assign people to conditions, each with trade-offs.

Simple Randomization

Flip a coin for each participant. Easy, but can produce imbalanced groups by chance — especially in small studies.

Block Randomization

Divide participants into blocks and randomize within each block. For example, in blocks of 4, exactly 2 get treatment and 2 get control. This guarantees balance.

Example: If you're doing a 14-day caffeine experiment with blocks of 7, days 1–7 have 3–4 days of each condition (randomly arranged), and so do days 8–14.
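A block randomizer is only a few lines. A sketch (hypothetical helper, using blocks of 4 with two conditions; the block size must be a multiple of the number of conditions) that guarantees the balance described above:

```python
import random

def block_randomize(n_blocks, block_size=4, conditions=("treatment", "control")):
    """Assign conditions in shuffled blocks so every block is exactly balanced."""
    assignments = []
    for _ in range(n_blocks):
        # Each block contains equal copies of every condition...
        block = list(conditions) * (block_size // len(conditions))
        random.shuffle(block)  # ...in a random order within the block.
        assignments.extend(block)
    return assignments

schedule = block_randomize(n_blocks=3)  # 12 participants, exactly 6 per arm
print(schedule)
```

Unlike a coin flip per participant, this can never produce a 9-vs-3 split in a 12-person study.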

Stratified Randomization

If you know that a variable will strongly affect your outcome (e.g., age, baseline health), you can stratify: randomize separately within each stratum. This ensures balance on that variable.

Adaptive Randomization

Instead of fixing the allocation upfront, you update the allocation probability based on incoming data. We'll cover this extensively in the Multi-Armed Bandits chapter — it's powerful but comes with statistical complications.

Which to Use?

For small experiments (< 50 participants), use block randomization — the balanced allocation prevents unlucky imbalance. For large experiments, simple randomization is fine. For web A/B tests with thousands of users, simple randomization is standard.

⚠️ Don't use convenience splits

"The first 100 users vs. the next 100" is not randomization. Early adopters behave differently from later users. Time-of-week effects matter. Platform differences matter. Always use a proper randomization mechanism — even just hashing user IDs.
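Hashing user IDs gives a deterministic, reproducible split with none of the convenience-sample bias above. A sketch using Python's standard hashlib; the salt string is a made-up example (use one salt per experiment so the same user can land in different arms of different experiments):

```python
import hashlib

def assign(user_id, salt="cta-test-v1"):
    """Deterministic 50/50 split: hash the salted user ID, take parity."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

print(assign("user_42"))  # same arm every time for the same user
```

No coordination or stored assignment table needed: any server can compute the arm on the fly, and the split stays stable across sessions.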


Crossover vs. Parallel Designs

There are two main experimental structures:

Parallel (Between-Subject) Design

Each participant gets exactly one condition. Control group and treatment group are separate people.

Pros: Simple. No carryover effects.
Cons: Needs more participants. Between-person variation dilutes your signal.

Example: Website A/B test — each visitor sees either the old or new design.

Crossover (Within-Subject) Design

Each participant gets multiple conditions in sequence. You're comparing the same person across conditions.

Pros: More efficient. Eliminates between-person variation (every person is their own control).
Cons: Carryover effects. Order effects.

Example: You spend 7 days drinking coffee before noon (treatment), then 7 days with no restriction (control), then repeat.

Carryover Effects

The main risk with crossover designs: the effect of condition A "carries over" into condition B. If you sleep better during the caffeine-restricted condition and it takes a week to return to baseline, your "normal caffeine" condition will be artificially elevated.

Solutions:

  • Washout periods: add buffer days between conditions (the system handles this in Steady Practice)
  • Check for carryover: test whether the order (A-then-B vs B-then-A) matters
A washout is the awkward pause between conditions where yesterday's intervention stops photobombing today's data.
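The order check above amounts to comparing the within-person effect between people who got A first and people who got B first. A sketch on simulated data (stdlib only, using a normal-approximation test, which is reasonable at 30 per group):

```python
import math
import random

random.seed(42)

# Hypothetical within-person differences (treatment minus control), grouped by
# which condition came first. With no carryover, both groups estimate the same
# effect, so their means should agree up to noise.
diff_a_first = [random.gauss(4, 8) for _ in range(30)]
diff_b_first = [random.gauss(4, 8) for _ in range(30)]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Welch-style z-statistic, two-sided p via the normal CDF (math.erf).
z = (mean(diff_a_first) - mean(diff_b_first)) / math.sqrt(
    var(diff_a_first) / 30 + var(diff_b_first) / 30)
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"order-effect p = {p:.2f}")  # a large p: no evidence the order matters
```

A small p here would be a warning sign that carryover is contaminating the crossover comparison, and that you need longer washout periods.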

Stopping Rules and p-Hacking

Here's a trap that catches almost everyone:

You run your experiment. After two weeks, the p-value is 0.07. Slightly above 0.05. You run it for one more week. Now it's 0.04. You declare victory.

You've been p-hacking, even if unintentionally. Here's why that's a problem:

If you repeatedly test your data and stop when p < 0.05, your actual false positive rate is much higher than 5%. With five equally spaced peeks, the real false positive rate is roughly 14%, and it keeps climbing with every additional peek.
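You can see the inflation directly by simulating experiments where there is no true effect and peeking after every batch of data. A self-contained sketch (plain Python, z-test via the normal CDF):

```python
import math
import random

def z_test_p(a, b):
    """Two-sided z-test p-value for a difference in means (unit variance)."""
    na, nb = len(a), len(b)
    z = (sum(a) / na - sum(b) / nb) / math.sqrt(1 / na + 1 / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(0)
n_sims, peeks, batch = 2000, 5, 20

false_positives = 0
for _ in range(n_sims):
    a = [random.gauss(0, 1) for _ in range(peeks * batch)]  # null: no effect
    b = [random.gauss(0, 1) for _ in range(peeks * batch)]
    # Peek after each batch and "stop" the first time p < 0.05.
    for k in range(1, peeks + 1):
        if z_test_p(a[:k * batch], b[:k * batch]) < 0.05:
            false_positives += 1  # declared victory under the null
            break

rate = false_positives / n_sims
print(rate)  # well above the nominal 0.05
```

Every extra peek is another chance for noise to dip below 0.05, so the cumulative error rate only goes up.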

Solutions

Pre-register your analysis: Before collecting data, write down your sample size, outcome metric, and stopping rule. Then stick to it.

Sequential testing with corrections: Methods like the O'Brien-Fleming boundary or always-valid p-values let you peek at data without inflating false positives. These are built into modern A/B testing tools.

Bayesian stopping rules: Instead of a p-value cutoff, you can stop when you're sufficiently confident that one variant is better. We'll cover this in the Bayesian chapter.

💡 The one valid exception

You can stop early if your interim result is so clear that continuing would be wasteful or unethical — but this needs to be specified in advance with an appropriate statistical correction (Bonferroni, Pocock, or O'Brien-Fleming boundaries).


What to Measure (and What Not To)

Choosing the right outcome metric is harder than it sounds.

Primary outcome: One pre-specified metric that determines success or failure. Don't use multiple primary metrics — if you test 10 metrics, you'll find a significant result just by chance.

Secondary outcomes: Things you want to understand but won't use to make the primary decision.

Proxy metrics vs. real outcomes: Click-through rate is a proxy for purchase intent. Purchase rate is a proxy for lifetime value. Try to measure what you actually care about, or at least validate that your proxy moves in the direction of the real outcome.

Steady Practice Examples

Experiment Type | Primary Outcome | What to Watch
Caffeine timing | Sleep quality score | Subjective energy, mood
Exercise timing | HRV / recovery score | Sleep quality, soreness
Website CTA change | Conversion rate | Bounce rate, time on page
Pricing test | Revenue per visitor | Conversion rate, avg order value

Compliance and Intent-to-Treat

In the real world, people don't always follow their assigned condition. You randomize someone to the "no caffeine after noon" group and they have a coffee at 3pm. What do you do?

There are two approaches:

Per-Protocol Analysis (PP): Only include participants who actually followed their assigned condition.

Problem: this re-introduces selection bias. People who comply with treatment might be systematically different.

Intent-to-Treat Analysis (ITT): Analyze everyone based on what they were assigned, regardless of what they actually did.

The ITT estimate is the effect of "being assigned to treatment" — which may be less than the effect of "actually taking treatment." But it's unbiased, and it's the right answer to the practical question of "what happens if I roll this out to all users?"

In Steady Practice, when you log a trial where your condition_followed ≠ condition_assigned, this gets recorded and the analysis notes the compliance rate.
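The two analyses differ only in which rows you keep. A toy illustration on made-up trial logs (the field names are hypothetical, not Steady Practice's actual schema):

```python
# Hypothetical trial log: assignment, whether it was followed, and outcome.
trials = [
    {"assigned": "treatment", "followed": True,  "outcome": 78},
    {"assigned": "treatment", "followed": False, "outcome": 66},
    {"assigned": "treatment", "followed": True,  "outcome": 81},
    {"assigned": "control",   "followed": True,  "outcome": 70},
    {"assigned": "control",   "followed": True,  "outcome": 65},
    {"assigned": "control",   "followed": False, "outcome": 72},
]

def mean(xs):
    return sum(xs) / len(xs)

# Intent-to-treat: group by *assignment*, keep everyone.
itt = (mean([t["outcome"] for t in trials if t["assigned"] == "treatment"])
       - mean([t["outcome"] for t in trials if t["assigned"] == "control"]))

# Per-protocol: keep only compliers — re-introduces selection bias.
pp = (mean([t["outcome"] for t in trials
            if t["assigned"] == "treatment" and t["followed"]])
      - mean([t["outcome"] for t in trials
              if t["assigned"] == "control" and t["followed"]]))

compliance = mean([t["followed"] for t in trials])
print(itt, pp, compliance)  # 6.0 12.0 0.666...
```

In this toy data the per-protocol estimate is twice the ITT estimate, which is exactly the kind of gap that should make you suspicious of who the non-compliers were, rather than confident in the larger number.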


Latin Square and Other Advanced Designs

For experiments with multiple factors (e.g., caffeine timing AND exercise timing), you have options beyond a simple 2×2 factorial:

Factorial design: Test all combinations. 2 factors × 2 levels = 4 conditions. Efficient but requires more trials.

Latin square: A clever design where each treatment appears exactly once per row and column. Useful when you have time effects and multiple conditions.

Response surface designs: Used in engineering and chemistry to find optimal settings. Beyond scope here, but Box, Hunter & Hunter is excellent if you want to go deep.


Summary

  • Calculate required sample size before starting — use power analysis with a realistic effect size
  • Use block randomization for small experiments; simple randomization for large ones
  • Crossover designs are more efficient but require washout periods to handle carryover
  • Pre-register your analysis to avoid p-hacking
  • Use intent-to-treat analysis to preserve the benefits of randomization

Next up: Causal Graphs — the visual language for reasoning about what to control for, and what to leave alone.