You're a product manager at a software company. You notice that users who watch your tutorial video have a 3× higher retention rate than users who don't. Should you force everyone to watch the video?
Hold on. Before you do anything, ask yourself: what kind of person watches a 10-minute tutorial video voluntarily?
Probably someone who's more motivated, more patient, and more likely to stick around anyway. The video might not be doing anything at all — you might just be measuring who shows up.
This is confounding, and it's the core problem that all of causal inference is trying to solve.
The Fundamental Problem of Causal Inference
Here's an uncomfortable truth: we can never directly observe a causal effect.
Imagine you drink coffee before giving a presentation. Your presentation goes well. Did the coffee help? You'll never know for certain, because you can't go back in time and give the same presentation without coffee.
The philosopher David Hume articulated this in 1748: causation isn't something we observe directly — it's a story we tell to explain regularities. We see A happen before B, repeatedly, and we infer that A causes B. But that inference can be wrong.
What we really want to know is the counterfactual: what would have happened in the other world where you didn't drink coffee?
Formally, let's say:
- $Y_i(1)$ = your outcome if you receive the treatment
- $Y_i(0)$ = your outcome if you don't
The causal effect for person $i$ is $Y_i(1) - Y_i(0)$.
The problem: you can only ever observe one of these. Either you drank coffee or you didn't. The other value is forever missing. This is called the Fundamental Problem of Causal Inference.
The notation above, $Y_i(1)$ and $Y_i(0)$, comes from Donald Rubin's potential outcomes framework (1974). It's the foundation of most modern causal inference. The idea: every unit has potential outcomes under every possible treatment, but we can only observe the one that actually happened. Everything else is counterfactual.
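To make the missing-data nature of the problem concrete, here's a minimal simulation (all variable names and effect sizes are invented for illustration). In the simulation we can generate both potential outcomes for every unit; in real data, only one of the two columns is ever visible.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10

# In a simulation we get to see both potential outcomes for every unit.
# In real data, only one of these columns is ever observed.
y0 = rng.normal(50, 10, size=n)        # outcome without treatment
y1 = y0 + rng.normal(5, 2, size=n)     # outcome with treatment (true effect around 5)

treated = rng.integers(0, 2, size=n)          # which world actually happened
observed = np.where(treated == 1, y1, y0)     # the only column we ever see

individual_effects = y1 - y0  # knowable here only because we simulated both worlds
print("True individual effects:", np.round(individual_effects, 1))
print("Observed outcomes:      ", np.round(observed, 1))
```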
Why Correlation Isn't Causation
You've heard "correlation isn't causation" a thousand times. Let's actually understand why.
The ice cream example
Ice cream sales and drowning deaths are strongly correlated. Does eating ice cream cause drowning?
Of course not. Both are driven by a third variable: hot weather. People eat more ice cream in summer. People swim more in summer. Hot weather causes both.
Hot weather is a confounder: a variable that influences both the exposure you're studying (ice cream sales) and the outcome (drowning, via swimming), making them look causally related when they're not.
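A quick simulation (all numbers made up) shows how a common cause manufactures a strong correlation between two variables that have no causal link to each other:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 365

temperature = rng.normal(20, 8, size=n)                      # daily temperature (confounder)
ice_cream = 100 + 5 * temperature + rng.normal(0, 10, n)     # sales driven by heat
drownings = 2 + 0.3 * temperature + rng.normal(0, 1, n)      # drownings driven by heat

# Ice cream has no causal effect on drowning, yet the two correlate strongly:
print(np.corrcoef(ice_cream, drownings)[0, 1])  # roughly 0.9
```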
The hospital example
Patients who go to the hospital are more likely to die than people who don't. Should we close the hospitals?
Again, no. Sick people go to hospitals. Sick people are more likely to die. Being sick is the confounder.
The caffeine example
You run a study and find that people who drink coffee before noon have better sleep scores than people who drink it in the afternoon. Does cutting off caffeine before noon improve sleep?
Maybe. But people who voluntarily restrict their caffeine intake might also be more health-conscious in other ways — better diet, more exercise, consistent bedtimes. All of those affect sleep too.
This is exactly the scenario in Steady Practice. When you set up an experiment like "No caffeine after noon vs. caffeine any time," you're trying to isolate the causal effect of caffeine timing on sleep — not just find people who already sleep well and happen to restrict caffeine. The experiment handles this by randomly assigning the condition each day, so the confounders average out.
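Here's a sketch of what that per-day assignment might look like in code. The condition labels, dates, and data layout are hypothetical illustrations, not Steady Practice's actual implementation:

```python
import random
from datetime import date, timedelta

# Hypothetical condition labels for illustration only.
CONDITIONS = ["no caffeine after noon", "caffeine any time"]

def assign_conditions(start: date, n_days: int, seed: int = 7) -> dict[date, str]:
    """Randomly assign one condition per day, like flipping a coin each morning."""
    rng = random.Random(seed)
    return {start + timedelta(days=d): rng.choice(CONDITIONS) for d in range(n_days)}

schedule = assign_conditions(date(2024, 6, 1), n_days=14)
for day, condition in schedule.items():
    print(day.isoformat(), "->", condition)
```

Because the coin flip, not the participant's health habits, decides each day's condition, being health-conscious can't systematically line up with one condition over many days.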
The Solution: Randomization
The genius of randomized experiments is that they solve the confounding problem without requiring you to measure or even know what all the confounders are.
Here's how it works:
- Take 200 users.
- Flip a coin for each one: heads gets the tutorial video, tails doesn't.
- Measure retention after 30 days.
Now the "watches tutorial" group and the "doesn't watch" group are identical in expectation — same age distribution, same motivation levels, same prior behavior — because you assigned them randomly. The only systematic difference is whether they saw the video.
So when you compare their retention rates, you're actually measuring the effect of the video.
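A small simulation of the tutorial-video setup (invented numbers) makes the contrast visible: when users self-select, the watchers are systematically more motivated; when a coin flip assigns them, the two groups look alike on the confounder.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

motivation = rng.normal(0, 1, size=n)   # unobserved confounder

# Self-selection: more motivated users are more likely to watch the tutorial.
p_watch = 1 / (1 + np.exp(-2 * motivation))
self_selected = rng.random(n) < p_watch

# Randomization: a coin flip that ignores motivation entirely.
randomized = rng.integers(0, 2, size=n).astype(bool)

print("Motivation gap (self-selected):",
      round(motivation[self_selected].mean() - motivation[~self_selected].mean(), 2))
print("Motivation gap (randomized):   ",
      round(motivation[randomized].mean() - motivation[~randomized].mean(), 2))
```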
Formally, randomization ensures:

$$(Y_i(1), Y_i(0)) \perp T_i$$

which means the potential outcomes are independent of the treatment assignment $T_i$. Whether you got treatment was decided by a coin flip, not by anything about you. This is called ignorability or unconfoundedness.
Average Treatment Effects
Since we can never observe both $Y_i(1)$ and $Y_i(0)$ for the same person, we settle for the Average Treatment Effect (ATE):

$$\text{ATE} = \mathbb{E}[Y_i(1) - Y_i(0)]$$

With randomization, we can estimate this by comparing group averages:

$$\widehat{\text{ATE}} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}$$
This is just the difference in means — one of the simplest statistics there is. The entire power of randomized experiments is that this simple formula gives us an unbiased estimate of a causal effect.
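As a sketch (simulated data with an invented true effect of 0.08), the estimator really is two means and a subtraction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

treated = rng.integers(0, 2, size=n).astype(bool)
# Simulated 30-day retention score with a true treatment effect of 0.08.
retention = rng.normal(0.40, 0.10, size=n) + 0.08 * treated

ate_hat = retention[treated].mean() - retention[~treated].mean()
print(f"Estimated ATE: {ate_hat:.3f}")   # should land close to 0.08
```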
When Randomization Isn't Possible
Sometimes you can't randomize. You can't randomly assign people to smoke or not smoke. You can't randomly assign people to grow up in poverty. You can't randomly assign countries to different economic policies.
In those cases, you need other tools:
- Causal graphs (Chapter 3) to reason about what to control for
- Regression discontinuity — exploiting arbitrary cutoffs (Chapter 8)
- Instrumental variables — using a "nudge" to mimic randomization (Chapter 9)
- Difference-in-differences — comparing changes over time (Chapter 10)
These are the tools of observational causal inference — and they're genuinely clever. But they all come with assumptions, and none of them are as clean as a good randomized experiment.
In medicine, there's a "hierarchy of evidence": randomized controlled trials at the top, observational studies in the middle, expert opinion at the bottom. The same hierarchy applies everywhere. A randomized experiment is always preferable when feasible — but feasibility matters, and clever observational designs can get you surprisingly far.
The Two Kinds of Causal Questions
Before running any experiment, it's worth being clear about what kind of causal question you're asking.
Causal discovery: Does X cause Y at all? "Does caffeine affect sleep quality?"
Causal estimation: How much does X affect Y? "If I cut off caffeine at noon, how much better will my sleep score be?"
Most practical questions are about estimation, not discovery. You probably already believe caffeine affects sleep — the question is by how much, under what conditions, and for which people.
The distinction matters because the tools are different. For discovery, you might use DAG-based methods or interventional studies. For estimation, you need experiments or clever observational designs that give you quantitative effect sizes with confidence intervals.
What Makes a Good Experiment?
Not all experiments are created equal. Here's what separates rigorous experiments from ones that waste your time:
- Clear treatment and control: what exactly is the difference between conditions?
- Pre-specified outcome metric: decide what you're measuring before you look at the data
- Adequate sample size: too small and you can't detect real effects (see the power sketch after this list)
- Proper randomization: not just "the first 100 users vs. the next 100"
- Blinding where possible: people behave differently when they know which group they're in
- Protocol compliance: what happens when people don't follow their assigned condition?
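To see why sample size matters, here's a minimal power simulation. All effect sizes, noise levels, and group sizes are invented for illustration; the point is how sharply the chance of detecting a real effect depends on $n$.

```python
import numpy as np

rng = np.random.default_rng(5)

def power(n_per_group: int, effect: float = 0.08, sd: float = 0.2,
          n_sims: int = 2000, alpha_z: float = 1.96) -> float:
    """Fraction of simulated experiments in which a real effect is detected."""
    detections = 0
    for _ in range(n_sims):
        control = rng.normal(0.40, sd, n_per_group)
        treatment = rng.normal(0.40 + effect, sd, n_per_group)
        diff = treatment.mean() - control.mean()
        se = np.sqrt(control.var(ddof=1) / n_per_group
                     + treatment.var(ddof=1) / n_per_group)
        detections += abs(diff / se) > alpha_z   # two-sided z-test at ~5%
    return detections / n_sims

for n in (25, 100, 400):
    print(f"n = {n:4d} per group -> power ~ {power(n):.2f}")
```

With these made-up numbers, 25 users per group detects the effect only about a third of the time; 400 per group detects it almost always.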
We'll spend the rest of this book going deeper on all of these.
Summary
- Correlation doesn't imply causation because of confounders — variables that affect both the cause and the effect
- We can never directly observe a causal effect; we can only see one potential outcome per person
- The Average Treatment Effect is what we estimate: $\text{ATE} = \mathbb{E}[Y_i(1) - Y_i(0)]$
- Randomization solves the confounding problem by making treatment assignment independent of everything else
- When randomization isn't possible, we need smarter methods — which is most of what this book is about
In the next chapter, we'll look at how to actually design an experiment: how many participants do you need, how do you assign conditions, and how do you analyze the results without fooling yourself.