Here's the fundamental problem with causal inference, stated plainly: you can never observe the same person in two states at the same time.
Did caffeine improve your focus today? To know for certain, you'd need two versions of today — one where you had caffeine, one where you didn't — with everything else identical. You can't have both, and that impossibility is why this field is hard.
The framework that makes sense of this impossibility is called potential outcomes (also known as the Rubin Causal Model). It's the mathematical language of "what would have happened."
Potential Outcomes
For any individual $i$ and binary treatment $T_i \in \{0, 1\}$, we define two potential outcomes:
- $Y_i(1)$: the outcome individual $i$ would experience if treated
- $Y_i(0)$: the outcome individual $i$ would experience if not treated

The individual treatment effect is:

$$\tau_i = Y_i(1) - Y_i(0)$$

But here's the catch: we only ever observe one of these. If you took caffeine today ($T_i = 1$), we observe $Y_i(1)$. The value $Y_i(0)$ — what would have happened without caffeine — is a counterfactual. It doesn't exist in the data.
The unobserved potential outcome is called the missing counterfactual, and estimating it is what causal inference is all about.
From Individual to Average
Since we can't know $\tau_i$ for any individual, we aim for averages across a population.

The Average Treatment Effect (ATE) is:

$$\text{ATE} = \mathbb{E}[Y_i(1) - Y_i(0)]$$
This is the average effect across everyone — treated and untreated alike.
Sometimes we care about a more specific quantity:
- ATT (Average Treatment Effect on the Treated): the effect among people who actually received treatment. Useful when treatment is selective.
- ATC (Average Treatment Effect on the Control): the effect among those who weren't treated. Useful for policy targeting — "would these people benefit if we gave them treatment?"
- CATE (Conditional Average Treatment Effect): the ATE within a subgroup defined by covariates $X$. More on this below.
Why randomization solves the problem
Under random assignment, $T_i$ is independent of potential outcomes:

$$(Y_i(1), Y_i(0)) \perp T_i$$

This means the average outcome in the treatment group equals $\mathbb{E}[Y_i(1)]$, and the average in the control group equals $\mathbb{E}[Y_i(0)]$. The ATE is simply the difference in observed means:

$$\text{ATE} = \mathbb{E}[Y_i \mid T_i = 1] - \mathbb{E}[Y_i \mid T_i = 0]$$
Randomization makes the missing counterfactual problem disappear — not by observing the counterfactual, but by creating a group whose average stands in for it.
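Simulation is the one place where both potential outcomes are visible, so it's a good way to see why the difference in means works. A minimal sketch (all numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulate BOTH potential outcomes. This is possible only in simulation,
# never in real data: that's the fundamental problem.
y0 = rng.normal(50, 10, n)        # focus score without caffeine
y1 = y0 + 12                      # with caffeine: a constant +12 effect
true_ate = np.mean(y1 - y0)       # exactly 12 by construction

# Random assignment: T is independent of (y0, y1).
t = rng.integers(0, 2, n)
y_obs = np.where(t == 1, y1, y0)  # we only ever see one outcome per unit

# Difference in observed means recovers the ATE.
ate_hat = y_obs[t == 1].mean() - y_obs[t == 0].mean()
print(f"true ATE = {true_ate:.2f}, estimate = {ate_hat:.2f}")
```

The estimate matches the true ATE up to sampling noise, even though no individual's counterfactual was ever observed.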
Violations: When the Estimate Goes Wrong
The potential outcomes framework also makes it clear exactly what can go wrong.
Confounding
In observational data, people who choose treatment differ from those who don't, so the two groups' potential outcomes differ too: their average outcomes would diverge even if nobody were treated. The naive difference in means picks up both the treatment effect and these pre-existing differences.
Example: people who take vitamins tend to exercise more and eat better. If you compare vitamin-takers to non-takers, you're partly measuring the effect of a healthier lifestyle, not just vitamins.
The solution: control for confounders (see Chapter 3 on DAGs) or use a design that achieves effective randomization (instrumental variables, regression discontinuity, difference-in-differences).
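The vitamin story can be made concrete with a small simulation (names and effect sizes invented): a lifestyle variable drives both treatment uptake and outcomes, biasing the naive comparison, while stratifying on the confounder recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

healthy = rng.integers(0, 2, n)   # lifestyle confounder
# Healthy people are far more likely to take vitamins...
t = (rng.random(n) < np.where(healthy == 1, 0.8, 0.2)).astype(int)
# ...and have better outcomes regardless of treatment.
y0 = 50 + 10 * healthy + rng.normal(0, 5, n)
y1 = y0 + 2                       # true effect of vitamins: +2
y = np.where(t == 1, y1, y0)

# Naive comparison picks up the lifestyle gap too (roughly 2 + 6 = 8).
naive = y[t == 1].mean() - y[t == 0].mean()

# Compare within strata of the confounder, then average the stratum
# effects weighted by how common each stratum is.
adjusted = sum(
    (y[(t == 1) & (healthy == h)].mean()
     - y[(t == 0) & (healthy == h)].mean()) * np.mean(healthy == h)
    for h in (0, 1)
)
print(f"naive = {naive:.2f}, adjusted = {adjusted:.2f}")
```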
Non-compliance
In an experiment, some people assigned to treatment don't take it, and some assigned to control find treatment anyway. This is non-compliance.
- ITT (Intent-to-Treat): compare groups as assigned, ignoring compliance. Estimates the effect of offering treatment, which is often what you care about in policy.
- LATE (Local Average Treatment Effect): the effect specifically for compliers — people who take treatment if and only if assigned to it. Requires an instrumental variable.
LATE is often more policy-relevant than ATE for programs with imperfect uptake: "what's the effect on people who respond to the nudge?"
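Under standard instrumental-variable assumptions, the LATE is the ITT scaled up by the compliance rate (the Wald estimator). A simulated sketch with invented compliance shares and effects:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

z = rng.integers(0, 2, n)              # random assignment: the instrument
# Compliance types: 60% compliers, 20% always-takers, 20% never-takers.
u = rng.random(n)
complier = u < 0.6
always = (u >= 0.6) & (u < 0.8)
t = np.where(always, 1, np.where(complier, z, 0))   # actual uptake

# Heterogeneous effects: +5 for compliers, +1 for always-takers.
effect = np.where(complier, 5.0, np.where(always, 1.0, 0.0))
y = 10 + effect * t + rng.normal(0, 3, n)

itt = y[z == 1].mean() - y[z == 0].mean()      # effect of the offer
uptake = t[z == 1].mean() - t[z == 0].mean()   # first stage: share of compliers
late = itt / uptake                            # Wald estimator, targets compliers
print(f"ITT = {itt:.2f}, uptake = {uptake:.2f}, LATE = {late:.2f}")
```

The ITT is diluted (about 0.6 × 5 = 3), while the Wald ratio recovers the compliers' effect of 5; the always-takers never enter either estimate because assignment doesn't change their behavior.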
Spillovers (SUTVA violations)
The potential outcomes framework assumes SUTVA: the Stable Unit Treatment Value Assumption. This has two parts:
- No interference — your outcome doesn't depend on whether your neighbor was treated
- No hidden versions of treatment — treatment is the same for everyone who receives it
Violations matter in: social networks (information spreads), marketplaces (treating one seller affects buyers and other sellers), and herd immunity (vaccinating enough people protects even the unvaccinated).
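A sketch of why interference matters, with an invented spillover structure: units sit in groups, and each unit's outcome also rises with the fraction of its groupmates who are treated.

```python
import numpy as np

rng = np.random.default_rng(6)
n_groups, size = 5_000, 10
t = rng.integers(0, 2, (n_groups, size))   # randomize within each group

# Direct effect +5, plus a spillover: +3 times the fraction of OTHER
# group members treated. This violates SUTVA's no-interference clause.
frac_others = (t.sum(axis=1, keepdims=True) - t) / (size - 1)
y = 50 + 5 * t + 3 * frac_others + rng.normal(0, 2, (n_groups, size))

# Within-group randomization gives treated and control units the same
# average exposure to treated neighbors, so the difference in means
# recovers only the direct effect...
naive = y[t == 1].mean() - y[t == 0].mean()
# ...but treating everyone vs. no one would move outcomes by 5 + 3 = 8.
print(f"difference in means = {naive:.2f}, full-rollout effect = 8")
```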
Treatment Effect Heterogeneity
The ATE is an average. But the effect might be large for some people and zero (or even negative) for others. This is heterogeneous treatment effects (HTE).
Example: caffeine might substantially improve focus for people who normally have low alertness, have no effect on those already alert, and disrupt sleep for those who are sensitive to it. Same treatment, three different effects.
Understanding heterogeneity matters for:
- Personalization: who should receive treatment?
- Subgroup analysis: does the treatment work better for men vs. women, younger vs. older?
- Policy design: is a blanket policy right, or should we target?
The Conditional Average Treatment Effect (CATE) captures this:

$$\tau(x) = \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i = x]$$

Here $X_i$ is a vector of individual characteristics. The CATE as a function of $x$ tells you the heterogeneous treatment effect surface.
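When $X$ is a single binary trait and treatment is randomized, the CATE is just a difference in means within each subgroup. A toy version of the caffeine example (effect sizes invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

low_alertness = rng.integers(0, 2, n)  # the covariate X
t = rng.integers(0, 2, n)              # randomized treatment
# Caffeine helps only the low-alertness group: +15 vs. +0.
y = 60 + 15 * t * low_alertness + rng.normal(0, 8, n)

cate = {}
for x in (0, 1):
    grp = low_alertness == x
    cate[x] = y[grp & (t == 1)].mean() - y[grp & (t == 0)].mean()
    print(f"CATE(x={x}) = {cate[x]:.1f}")
```

The overall ATE here is about 7.5, which describes nobody: the effect is roughly 0 for one group and 15 for the other.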
Machine Learning for CATE Estimation
Estimating CATE is where machine learning connects to causal inference. The standard approaches:
X-Learner
- Fit a model $\hat{\mu}_1$ predicting $Y$ for treated units
- Fit a model $\hat{\mu}_0$ predicting $Y$ for control units
- For each treated unit, estimate their counterfactual: $\hat{Y}_i(0) = \hat{\mu}_0(X_i)$
- Fit a model to predict $\tilde{D}_i = Y_i - \hat{\mu}_0(X_i)$ from $X_i$ — this is your CATE estimate

(The full X-Learner also runs the symmetric steps for control units, using $\hat{\mu}_1$ to impute their counterfactuals, and blends the two CATE models with propensity-score weights.) The X-Learner handles imbalance between treated and control group sizes well (useful when treatment is rare).
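The treated-side recipe above can be sketched with a plain least-squares fit standing in for the ML models (data-generating process invented; only the control-side outcome model is needed for this half, and a real X-Learner would use flexible learners and both sides):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
x = rng.uniform(0, 1, n)
t = rng.integers(0, 2, n)
y = 2 + 3 * x + (4 * x) * t + rng.normal(0, 1, n)  # true CATE = 4x

def linfit(xv, yv):
    """Least-squares line yv ~ a + b*xv; returns a prediction function."""
    A = np.column_stack([np.ones_like(xv), xv])
    coef, *_ = np.linalg.lstsq(A, yv, rcond=None)
    return lambda q: coef[0] + coef[1] * q

mu0 = linfit(x[t == 0], y[t == 0])   # outcome model on control units

# Impute each treated unit's effect: observed Y minus modeled counterfactual.
d1 = y[t == 1] - mu0(x[t == 1])

cate_model = linfit(x[t == 1], d1)   # regress imputed effects on X
print(f"CATE(0.25) = {cate_model(0.25):.2f}, CATE(0.75) = {cate_model(0.75):.2f}")
```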
Causal Forest
Wager & Athey (2018) extended random forests to directly estimate CATE. Causal forests split on features that maximize treatment effect heterogeneity (not just prediction accuracy). They're honest (use separate data for splitting and estimation), which gives valid confidence intervals.
Causal forests are now practical through the grf R package and econml Python library.
T-Learner, S-Learner, DR-Learner
- T-Learner: fit separate models for treatment and control, then take the difference. Simple but can overfit to noise.
- S-Learner: fit a single model with treatment as a feature. Can underfit if the model doesn't capture interactions.
- DR-Learner: combines outcome modeling and propensity weighting for doubly-robust CATE estimation (see Chapter 8).
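T- and S-Learners in the same least-squares spirit (invented linear data; real uses would plug in any regression model). Note that the S-Learner only sees heterogeneity if its feature set includes treatment interactions, which is exactly its underfitting risk:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
x = rng.uniform(0, 1, n)
t = rng.integers(0, 2, n)
y = 1 + 2 * x + (3 * x) * t + rng.normal(0, 1, n)  # true CATE = 3x

def design(xv):
    return np.column_stack([np.ones_like(xv), xv])

# T-Learner: separate outcome models per arm, difference of predictions.
b0, *_ = np.linalg.lstsq(design(x[t == 0]), y[t == 0], rcond=None)
b1, *_ = np.linalg.lstsq(design(x[t == 1]), y[t == 1], rcond=None)
t_cate = lambda q: (b1[0] - b0[0]) + (b1[1] - b0[1]) * q

# S-Learner: one model with treatment as a feature, here WITH an x*t
# interaction; without it, this model could only express a constant effect.
S = np.column_stack([np.ones(n), x, t, x * t])
bs, *_ = np.linalg.lstsq(S, y, rcond=None)
s_cate = lambda q: bs[2] + bs[3] * q

print(f"T-Learner CATE(0.5) = {t_cate(0.5):.2f}, S-Learner = {s_cate(0.5):.2f}")
```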
Counterfactuals in Steady Practice
When Steady Practice shows you "estimated effect: +12 minutes of deep focus," that's an estimate of your personal ATE — averaged across your trials, not any single one.
The app can't observe both conditions simultaneously on the same day. Instead it uses the crossover design (alternating treatment and control periods) to make your control days a reasonable counterfactual for your treatment days, while controlling for time trends and carryover effects.
As you run more trials, the posterior over your individual treatment effect narrows. With enough data, heterogeneous treatment effects become visible: "caffeine timing matters more on high-stress days."
Summary
- The potential outcomes framework defines causation precisely: $\tau_i = Y_i(1) - Y_i(0)$
- We can never observe both potential outcomes for the same person — this is the fundamental problem of causal inference
- Randomization solves it by making the control group a valid stand-in for the counterfactual
- ATE is an average; CATE captures how effects vary across people
- Machine learning (causal forests, X-Learner, DR-Learner) can estimate heterogeneous treatment effects
- ITT and LATE handle the real-world messiness of non-compliance
Next: Bayesian Thinking — updating beliefs from evidence, and why it's the natural language for experiments.