Chapter 3

Causal Graphs

Drawing the invisible: how to see confounders, mediators, and colliders

What if there were a visual language for causation? A way to draw a picture that captures everything you believe about how variables influence each other — and then use that picture to figure out exactly what to control for in an analysis?

There is. It's called a DAG (Directed Acyclic Graph), and it was largely developed by Judea Pearl starting in the 1980s. It's one of the most powerful tools in modern causal inference.


What Is a DAG?

A DAG is a graph where:

  • Nodes represent variables (age, caffeine intake, sleep quality, exercise)
  • Arrows represent direct causal effects (X → Y means "X directly causes Y")
  • Directed means the arrows point in one direction
  • Acyclic means there are no loops (you can't follow arrows back to where you started)

Here's the simplest possible DAG:

X ——→ Y

X causes Y. That's it. With randomization, you can estimate this effect cleanly.

Now add a confounder:

    Z
   ↙ ↘
  X   Y

Z causes both X and Y. If you observe that X and Y are correlated, you can't tell if it's because X→Y or because Z→X and Z→Y (the "backdoor path").

The interactive visualization below lets you switch between scenarios:

[Interactive visualization: Z (confounder) → X (treatment), Z (confounder) → Y (outcome). Z (e.g. age) affects both X (caffeine) and Y (sleep); a naive comparison of X→Y is biased.]

[Figure: the classic fork. Z causes both X and Y, creating a spurious correlation between them.]

Three DAG Structures You Must Know

1. Fork (Confounder)

Z → X
Z → Y
X → Y (maybe)

Z is a confounder: it causes both X and Y, creating a spurious correlation between them.

What to do: Control for Z (include it in your regression, stratify by it, or match on it). This "blocks" the backdoor path X ← Z → Y.

Real example: Income confounds the relationship between education and health. Higher-income families both invest more in education AND have better health access. Controlling for income is essential.
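The fork's bias is easy to see in a quick simulation. The sketch below uses made-up linear coefficients (not real data): Z drives both X and Y, the true direct effect of X on Y is zero, and yet the naive regression finds a strong "effect" that disappears once Z is controlled for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical fork: Z causes both X and Y; the true X -> Y effect is ZERO
z = rng.normal(size=n)               # confounder (think: income)
x = 2.0 * z + rng.normal(size=n)     # treatment, driven by Z
y = 3.0 * z + rng.normal(size=n)     # outcome, driven by Z only

# Naive slope of Y on X: biased by the backdoor path X <- Z -> Y
naive = np.polyfit(x, y, 1)[0]       # converges to 6/5 = 1.2, not 0

# Adjusted: regress Y on X *and* Z; the X coefficient is ~0, as it should be
A = np.column_stack([x, z, np.ones(n)])
adjusted = np.linalg.lstsq(A, y, rcond=None)[0][0]
```

Including Z in the regression blocks the backdoor path, and the spurious slope vanishes.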

2. Chain (Mediator)

X → M → Y

M is a mediator: X causes M, which causes Y. X affects Y through M.

What to do: Don't control for M if you want the total effect of X on Y. Controlling for M blocks the path you're trying to measure.

Real example: Exercise (X) improves mood (M) which improves productivity (Y). If you control for mood when estimating the effect of exercise on productivity, you'll block the main pathway and underestimate the effect.

[Figure: the collider trap. Quality (X) and Marketing (Y) both point into "Famous?"; conditioning on that shared effect opens a spurious path.]

3. Collider

X → Z
Y → Z

Z is a collider: both X and Y cause Z. There's no backdoor path between X and Y here — they're genuinely independent.

What to do: Do NOT control for Z. Controlling for a collider opens a spurious association between X and Y. This is the collider bias trap.

Real example (the restaurant puzzle): Imagine quality (X) and marketing (Y) both affect whether a restaurant becomes famous (Z). Among famous restaurants, you might observe that high-quality restaurants tend to have low marketing budgets (and vice versa) — not because quality causes less marketing, but because conditioning on "fame" (the collider) induces a spurious negative correlation. This is called Berkson's paradox.
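The restaurant puzzle is easy to reproduce in simulation (the threshold and noise scale below are arbitrary): quality and marketing are independent by construction, yet among the "famous" subset they become clearly negatively correlated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Quality and marketing are INDEPENDENT by construction
quality = rng.normal(size=n)
marketing = rng.normal(size=n)

# Fame is a collider: either high quality or heavy marketing earns it
famous = quality + marketing + rng.normal(scale=0.5, size=n) > 1.5

# Correlation in the full population vs. within the selected sample
r_all = np.corrcoef(quality, marketing)[0, 1]                      # ~0
r_famous = np.corrcoef(quality[famous], marketing[famous])[0, 1]   # negative
```

Nothing causal connects quality and marketing; the negative correlation is manufactured entirely by selecting on their common effect.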

⚠️ The Collider Trap Is Everywhere

Collider bias is underappreciated and causes many research errors. Any time you're analyzing a selected sample (people who got into college, patients who were hospitalized, users who churned), you're conditioning on a collider. The correlations you observe in that selected sample can be opposite to what's true in the general population.


The Backdoor Criterion

Judea Pearl formalized the conditions under which a set of variables Z is sufficient to identify the causal effect of X on Y. It's called the backdoor criterion:

A set of variables Z satisfies the backdoor criterion relative to (X, Y) if:

  1. No node in Z is a descendant of X
  2. Z blocks every "backdoor path" from X to Y (paths that enter X through an arrow pointing into X)

If Z satisfies the backdoor criterion, then:

P(Y | \text{do}(X)) = \sum_z P(Y | X, Z=z) \cdot P(Z=z)

In plain English: condition on Z, and your observational data gives you the same answer as a randomized experiment would.

The do() Operator

The notation \text{do}(X = x) means "we intervene to set X to the value x" — as opposed to merely observing that X = x. This is Pearl's way of distinguishing correlation from causation.

P(Y | X = x) = probability of Y given we observe X = x
P(Y | \text{do}(X = x)) = probability of Y given we force X to be x (randomize)

These are different. The first includes backdoor paths. The second doesn't.
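The difference is concrete on a tiny discrete fork Z → X, Z → Y, X → Y (all probabilities below are invented for illustration). The backdoor adjustment formula weights strata by P(Z=z), while naive conditioning weights them by P(Z=z | X=x), and the two answers disagree:

```python
# Made-up distributions for the fork Z -> X, Z -> Y, X -> Y
p_z = {0: 0.5, 1: 0.5}                     # P(Z=z)
p_x_z = {0: 0.2, 1: 0.8}                   # P(X=1 | Z=z)
p_y_xz = {(0, 0): 0.1, (0, 1): 0.5,        # P(Y=1 | X=x, Z=z)
          (1, 0): 0.3, (1, 1): 0.7}

# Backdoor adjustment: P(Y=1 | do(X=1)) = sum_z P(Y=1|X=1,z) * P(z)
p_do = sum(p_y_xz[(1, z)] * p_z[z] for z in (0, 1))        # -> 0.5

# Naive conditioning: P(Y=1 | X=1) weights by P(z | X=1) instead
p_x1 = sum(p_x_z[z] * p_z[z] for z in (0, 1))              # P(X=1)
p_z_x1 = {z: p_x_z[z] * p_z[z] / p_x1 for z in (0, 1)}
p_obs = sum(p_y_xz[(1, z)] * p_z_x1[z] for z in (0, 1))    # -> 0.62
```

Observation inflates the answer (0.62 vs. 0.5) because people with X = 1 tend to come from the high-Z stratum, where Y is likelier anyway.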


d-Separation

To determine whether two variables are independent given a set of conditioning variables, there's a graphical algorithm called d-separation (directional separation).

The rules:

  1. A fork (Z → X, Z → Y): Z is a common cause. X and Y are dependent unless you condition on Z. Conditioning on Z makes them independent.
  2. A chain (X → M → Y): M mediates. X and Y are dependent unless you condition on M. Conditioning on M blocks the path.
  3. A collider (X → Z ← Y): Z is a common effect. X and Y are independent unless you condition on Z. Conditioning on Z makes them dependent.

The collider rule is counterintuitive — conditioning on something can create dependence where none existed. But it follows logically: if you know a restaurant is famous (conditioning on the collider), then learning it has low marketing tells you it must be high quality.
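These rules can be checked mechanically. A standard way to decide d-separation is moralization: X is d-separated from Y given Z iff X and Y are disconnected in the moralized ancestral graph once the nodes in Z are deleted. Below is a minimal sketch of that algorithm (unoptimized, assumes the edge list really is a DAG):

```python
# Minimal d-separation check via moralization: x and y are d-separated
# given z iff they are disconnected in the moralized ancestral graph
# after deleting z.  `edges` is a list of (parent, child) pairs.
def d_separated(edges, x, y, z):
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
        parents.setdefault(a, set())
    # 1. Keep only ancestors of {x, y} plus z (including those nodes)
    keep, stack = set(), [x, y, *z]
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents.get(n, ()))
    # 2. Moralize: drop arrow directions and "marry" co-parents
    und = {n: set() for n in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            und[child].add(p)
            und[p].add(child)
        for p in ps:
            for q in ps:
                if p != q:
                    und[p].add(q)
    # 3. Delete z, then test whether x can still reach y
    blocked = set(z)
    seen, stack = set(), [x]
    while stack:
        n = stack.pop()
        if n in blocked or n in seen:
            continue
        seen.add(n)
        stack.extend(und[n])
    return y not in seen

# The three canonical structures from the rules above:
fork     = [("Z", "X"), ("Z", "Y")]
chain    = [("X", "M"), ("M", "Y")]
collider = [("X", "Z"), ("Y", "Z")]
```

Note how the collider case falls out: with nothing conditioned, Z is not an ancestor of {X, Y} and the graph is disconnected; condition on Z and the moralization step marries X and Y, reconnecting them.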


Building a DAG in Practice

How do you construct a DAG for your problem? You can't run an algorithm — you have to think carefully about your domain.

Step 1: List all variables that might be relevant: the treatment, the outcome, and everything that might affect either.

Step 2: For each pair of variables, ask: "Does A directly cause B?" Draw an arrow if yes.

Step 3: Check for cycles. If you find a cycle (X → Y → X), you've either made an error or need to add time subscripts (X_t \to Y_t \to X_{t+1}).

Step 4: Apply the backdoor criterion. Which variables do you need to control for? Which should you not control for (descendants, colliders)?

🧪 Building a caffeine DAG

Variables: caffeine timing (T), sleep score (Y), exercise (E), stress (S), chronotype (C — morning person vs. night owl)

My beliefs:

  • Chronotype → caffeine timing (morning people stop caffeine earlier)
  • Chronotype → sleep score (directly affects sleep quality)
  • Stress → caffeine timing (stressed people drink more late coffee)
  • Stress → sleep score (stress disrupts sleep)
  • Exercise → sleep score (exercise improves sleep)
  • Exercise → stress (exercise reduces stress)
  • Caffeine timing → sleep score (what we want to estimate)

DAG: C → T, C → Y, S → T, S → Y, E → Y, E → S, T → Y

Backdoor paths from T to Y: T ← C → Y (through chronotype), T ← S → Y (through stress), and T ← S ← E → Y (also blocked at stress).

Adjustment set: {C, S} blocks all three. Controlling for chronotype and stress gives us the causal effect.

Or better: randomize T. Then no adjustment needed.
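To sanity-check the adjustment set, here's a simulation of this DAG with made-up linear coefficients, where the true effect of caffeine timing on sleep is fixed at −0.5. The naive regression is badly biased; adjusting for {C, S} recovers the truth.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Simulate the caffeine DAG with invented coefficients; the true effect
# of caffeine timing (T) on sleep score (Y) is set to -0.5.
c = rng.normal(size=n)                        # chronotype
e = rng.normal(size=n)                        # exercise
s = -0.6 * e + rng.normal(size=n)             # stress (reduced by exercise)
t = 0.8 * c + 0.7 * s + rng.normal(size=n)    # caffeine timing
y = -0.5 * t + 0.4 * c - 0.9 * s + 0.5 * e + rng.normal(size=n)  # sleep

def t_coefficient(*controls):
    """OLS coefficient on T when regressing Y on T plus the controls."""
    A = np.column_stack([t, *controls, np.ones(n)])
    return np.linalg.lstsq(A, y, rcond=None)[0][0]

naive    = t_coefficient()       # biased: backdoor paths via C and S are open
adjusted = t_coefficient(c, s)   # {C, S} blocks every backdoor path: ~ -0.5
```

Note that E doesn't need to be in the adjustment set: its only backdoor route to T runs through S, which is already blocked.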


Causal Graphs in Reinforce OS

The app has a built-in DAG editor. You can:

  1. Draw your causal graph (treatment node, outcome node, confounders, mediators)
  2. Let the system analyze it and identify the minimum adjustment set
  3. Record confounder values as you log trials
  4. Let the analysis engine apply regression adjustment automatically

This turns a philosophical framework (DAGs) into something practical: you specify what you believe about the data-generating process, and the system does the rest.


The Front-Door Criterion

What if the backdoor criterion can't be satisfied? Sometimes you can't measure all the confounders (imagine: genetics affects both treatment and outcome, and you don't have genetic data).

There's an alternative: the front-door criterion. If you can identify a mediator M such that:

  • All paths from X to Y go through M (X affects Y only via M)
  • There are no unblocked backdoor paths from X to M
  • All backdoor paths from M to Y are blocked by conditioning on X

Then you can still identify the causal effect, even with unmeasured confounders.

The canonical example: smoking (X) → tar in lungs (M) → cancer (Y). Even if we can't measure all the genetic confounders between smoking and cancer, we can identify the causal effect through the tar-deposition pathway.
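The mechanics can be checked on a tiny discrete model (all probabilities below are invented): U confounds X and Y and is never touched by the estimator, yet the front-door formula P(Y | do(x)) = Σ_m P(m | x) Σ_x′ P(Y | x′, m) P(x′) reproduces the true interventional quantity.

```python
# Toy structural model with an UNMEASURED confounder U:
# U -> X, U -> Y, X -> M, M -> Y.  All probabilities are made up.
p_u = {0: 0.5, 1: 0.5}
p_x_u = {0: 0.3, 1: 0.8}                    # P(X=1 | U=u)
p_m_x = {0: 0.2, 1: 0.9}                    # P(M=1 | X=x)
p_y_mu = {(0, 0): 0.1, (0, 1): 0.4,
          (1, 0): 0.5, (1, 1): 0.8}         # P(Y=1 | M=m, U=u)

def px(x):                                   # P(X=x), observational
    return sum((p_x_u[u] if x else 1 - p_x_u[u]) * p_u[u] for u in (0, 1))

def pm(m, x):                                # P(M=m | X=x)
    return p_m_x[x] if m else 1 - p_m_x[x]

def py_xm(x, m):
    # Observational P(Y=1 | X=x, M=m).  Derived from the structural model
    # here; in practice it would be estimated from data, with no U needed.
    # Since M dep. only on X, P(u | x, m) = P(u | x).
    pu_x = {u: (p_x_u[u] if x else 1 - p_x_u[u]) * p_u[u] / px(x)
            for u in (0, 1)}
    return sum(p_y_mu[(m, u)] * pu_x[u] for u in (0, 1))

# Ground truth from the structural model: P(Y=1 | do(X=1))
truth = sum(p_u[u] * sum(pm(m, 1) * p_y_mu[(m, u)] for m in (0, 1))
            for u in (0, 1))

# Front-door estimate, using only observational quantities (no U!)
front_door = sum(pm(m, 1) * sum(py_xm(xp, m) * px(xp) for xp in (0, 1))
                 for m in (0, 1))
```

The estimator only ever sees P(M | X), P(Y | X, M), and P(X), all of which are observable, yet it matches the interventional truth exactly.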


Summary

  • A DAG is a picture of your causal assumptions
  • Three key structures: forks (confounders), chains (mediators), colliders
  • Control for confounders; do not control for mediators (when you want the total effect) or colliders
  • The backdoor criterion tells you which variables to condition on
  • d-separation lets you read conditional independences directly from the graph
  • In observational data, conditioning on the right set of variables gives you the same answer as an experiment

Next: Counterfactuals and Potential Outcomes — the mathematical machinery behind what we've been reasoning about.