Chapter 8

Effect Estimation

Getting the numbers right: regression, matching, and doubly-robust methods

You have a treatment, an outcome, and some confounders you want to control for. How do you actually compute the causal effect?

This chapter covers the main estimation strategies — from simple regression to the doubly-robust AIPW estimator that Reinforce OS uses under the hood.


Regression Adjustment

The simplest approach: run a regression of the outcome on treatment and confounders.

Y_i = \alpha + \tau T_i + \beta X_i + \varepsilon_i

The coefficient τ̂ is the estimated ATE, holding X fixed. This works well when:

  • The confounders X are measured correctly
  • The relationship between X and Y is roughly linear
  • You haven't omitted any important confounders

Regression is fast, interpretable, and the workhorse of empirical research. Its weakness: it relies on correctly specifying the outcome model. If the true relationship between X and Y is nonlinear, a linear regression will give biased estimates.
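A minimal sketch of regression adjustment, using a hypothetical data-generating process where the confounder X drives both treatment and outcome (all coefficients here are invented for illustration):

```python
import numpy as np

# Synthetic example: X raises both the chance of treatment and the outcome,
# so the naive difference in means overstates the true ATE of 2.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=n)
T = (X + rng.normal(size=n) > 0).astype(float)  # treatment likelier when X is high
Y = 2.0 * T + 3.0 * X + rng.normal(size=n)      # true ATE = 2

naive = Y[T == 1].mean() - Y[T == 0].mean()     # confounded: well above 2

# Regression adjustment: OLS of Y on an intercept, T, and X.
design = np.column_stack([np.ones(n), T, X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
tau_hat = coef[1]                               # close to the true ATE of 2
```

Running this shows the naive contrast absorbing the confounder's effect, while the adjusted coefficient recovers the truth.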

Limitations

The most common mistake with regression adjustment: using it with high-dimensional confounders without regularization. If you have 100 confounders and 200 observations, OLS will overfit. Use regularized regression (Lasso, Ridge) or a machine learning model instead.


Propensity Score Methods

The propensity score is the probability of receiving treatment given covariates:

e(x) = P(T = 1 \mid X = x)

Rosenbaum & Rubin (1983) showed a remarkable result: conditioning on the propensity score is sufficient to remove confounding, even though e(x) is a single number rather than the full covariate vector X.

This dimension reduction property makes propensity scores powerful when you have many confounders.

Propensity Score Matching

Match each treated unit to a control unit with a similar propensity score. Then estimate the ATE as the average difference in outcomes between matched pairs.

Treated unit: e(x) = 0.73  →  match with  Control unit: e(x) = 0.71
Treated unit: e(x) = 0.41  →  match with  Control unit: e(x) = 0.40
...

Matching removes imbalance in observed confounders. After matching, the treated and control groups should look similar on all covariates — similar to what randomization achieves.
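The matching step can be sketched in a few lines. This assumes the propensity scores are already estimated; the toy data mirrors the pairs shown above (0.73 matches 0.71, 0.41 matches 0.40):

```python
import numpy as np

def match_att(e, T, Y):
    """1:1 nearest-neighbor propensity-score matching (with replacement).

    e: estimated propensity scores (assumed already fit), T: 0/1 treatment,
    Y: outcomes. Returns the ATT as the mean gap over matched pairs.
    """
    treated = np.flatnonzero(T == 1)
    control = np.flatnonzero(T == 0)
    gaps = [Y[i] - Y[control[np.argmin(np.abs(e[control] - e[i]))]]
            for i in treated]          # each treated unit gets its closest control
    return float(np.mean(gaps))

# Tiny hypothetical illustration: both matched gaps equal 1.
e = np.array([0.73, 0.41, 0.71, 0.40, 0.90])
T = np.array([1, 1, 0, 0, 0])
Y = np.array([5.0, 3.0, 4.0, 2.0, 6.0])
att = match_att(e, T, Y)
```

Note this estimates the ATT (effect on the treated); production implementations add calipers and balance diagnostics.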

Inverse Probability Weighting (IPW)

Instead of matching, weight each observation by the inverse of its probability of receiving the treatment it actually received:

\hat{\text{ATE}}_{\text{IPW}} = \frac{1}{n}\sum_i \left[\frac{T_i Y_i}{e(X_i)} - \frac{(1-T_i) Y_i}{1 - e(X_i)}\right]

Treated units with low propensity scores (they were unlikely to be treated) get high weight — they're informative precisely because they were treated despite the odds. Control units with high propensity scores (they "could have been" treated) also get high weight.

IPW creates a pseudo-population where treatment is independent of covariates — mimicking randomization.

Weakness of IPW: extreme propensity scores (e(x) near 0 or 1) create extreme weights, inflating variance. Common fixes: weight trimming, stabilized weights.
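The IPW formula above, with clipping as the weight-trimming fix, fits in a few lines. The check below uses a hypothetical setup where the true propensity score is known:

```python
import numpy as np

def ipw_ate(e, T, Y, clip=0.01):
    """IPW estimate of the ATE. Clipping e away from 0 and 1 is the
    weight-trimming fix for extreme propensity scores."""
    e = np.clip(e, clip, 1 - clip)
    return float(np.mean(T * Y / e - (1 - T) * Y / (1 - e)))

# Synthetic check with the true propensity score (invented coefficients).
rng = np.random.default_rng(1)
n = 10000
X = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-X))        # P(T = 1 | X)
T = (rng.random(n) < e_true).astype(float)
Y = 2.0 * T + X + rng.normal(size=n)     # true ATE = 2
ate = ipw_ate(e_true, T, Y)              # close to 2
```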


The Doubly-Robust AIPW Estimator

The cleanest solution to the limitations of both regression and IPW is to combine them. The Augmented Inverse Probability Weighted (AIPW) estimator does exactly this:

\hat{\text{ATE}}_{\text{AIPW}} = \frac{1}{n}\sum_i \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i(Y_i - \hat{\mu}_1(X_i))}{e(X_i)} - \frac{(1-T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - e(X_i)}\right]

This looks complex. Let's break it apart:

  • μ̂₁(Xᵢ) − μ̂₀(Xᵢ): the regression-adjusted estimate (outcome model)
  • The remaining terms: IPW corrections for how well the outcome model fits

Why "Doubly Robust"?

AIPW has a remarkable property: it gives a consistent estimate of the ATE if either the outcome model or the propensity model is correctly specified — not necessarily both.

  • If you get the outcome model right but the propensity model wrong → consistent
  • If you get the propensity model right but the outcome model wrong → consistent
  • If you get both right → efficient (lowest possible variance)
  • If you get both wrong → biased

You get two chances to be right. This "double robustness" is why AIPW is now standard in modern causal inference.
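Double robustness is easy to see numerically. This sketch reuses a hypothetical simulation with a known ATE of 2, then deliberately breaks one model at a time:

```python
import numpy as np

def aipw_ate(mu1, mu0, e, T, Y):
    """AIPW estimate of the ATE from fitted outcome predictions (mu1, mu0)
    and propensity scores e, all aligned with T and Y."""
    e = np.clip(e, 0.01, 0.99)
    return float(np.mean(mu1 - mu0
                         + T * (Y - mu1) / e
                         - (1 - T) * (Y - mu0) / (1 - e)))

# Hypothetical setup where the truth (ATE = 2) is known.
rng = np.random.default_rng(2)
n = 10000
X = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-X))
T = (rng.random(n) < e_true).astype(float)
Y = 2.0 * T + X + rng.normal(size=n)
mu1_true, mu0_true = 2.0 + X, X                 # correct outcome model

# Break one model at a time: the estimate survives either way.
bad_outcome = aipw_ate(np.zeros(n), np.zeros(n), e_true, T, Y)
bad_propensity = aipw_ate(mu1_true, mu0_true, np.full(n, 0.5), T, Y)
```

With a useless outcome model (all zeros) the correction terms reduce to IPW; with a useless propensity (constant 0.5) the outcome model carries the estimate. Both land near 2.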

[Figure: the effect estimate hangs from two ropes, the outcome model and the propensity model; either one may wobble. Two ways to be right.]
Doubly robust estimation: two statistical ropes, one slightly nervous effect estimate.

Why Reinforce OS uses AIPW

Reinforce OS uses AIPW as its primary estimator for observational analysis. When you run an experiment without randomization — or when you want to adjust for confounders measured during a randomized experiment — the AIPW estimator gives you the most reliable effect estimates.

The outcome model μ̂(X) is fit using regularized regression. The propensity model e(X) uses logistic regression. Both models are cross-fit (trained on held-out data) to avoid overfitting bias.


Cross-Fitting: Making ML-Based Estimation Valid

When you use machine learning for the outcome model or propensity model inside AIPW, a subtle problem arises: if you use the same data to fit the model and evaluate the estimator, the in-sample fit is too good, creating bias.

The solution is cross-fitting (also called sample splitting):

  1. Split data into K folds
  2. For each fold k: fit μ̂ and e on the other K−1 folds
  3. Evaluate the AIPW estimator using the predictions for fold k
  4. Average across all folds

This ensures predictions are always out-of-sample. The resulting estimator is called DML (Double Machine Learning, Chernozhukov et al. 2018) and achieves the semiparametric efficiency bound — the lowest possible variance for this class of estimators.

Fold 1: train on folds 2-5, evaluate on fold 1
Fold 2: train on folds 1,3-5, evaluate on fold 2
...
Fold 5: train on folds 1-4, evaluate on fold 5
Average the estimates
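The fold schedule above can be sketched end to end. This is a dependency-free illustration, not the engine's implementation: per-arm OLS stands in for the regularized outcome model, and a clipped linear-probability fit stands in for logistic regression:

```python
import numpy as np

def crossfit_aipw(X, T, Y, K=5, seed=0):
    """K-fold cross-fit AIPW sketch. Per-arm OLS outcome models; a clipped
    linear-probability fit stands in for logistic regression. Predictions
    used in the estimator are always out-of-fold."""
    n = len(Y)
    fold = np.random.default_rng(seed).permutation(n) % K
    D = np.column_stack([np.ones(n), X])
    psi = np.zeros(n)                            # per-observation AIPW terms
    for k in range(K):
        tr, te = fold != k, fold == k
        # fit all nuisance models on the other K-1 folds
        b1, *_ = np.linalg.lstsq(D[tr & (T == 1)], Y[tr & (T == 1)], rcond=None)
        b0, *_ = np.linalg.lstsq(D[tr & (T == 0)], Y[tr & (T == 0)], rcond=None)
        bp, *_ = np.linalg.lstsq(D[tr], T[tr], rcond=None)
        mu1, mu0 = D[te] @ b1, D[te] @ b0        # out-of-fold outcome predictions
        e = np.clip(D[te] @ bp, 0.05, 0.95)      # out-of-fold propensity scores
        psi[te] = (mu1 - mu0 + T[te] * (Y[te] - mu1) / e
                   - (1 - T[te]) * (Y[te] - mu0) / (1 - e))
    return float(psi.mean())

# Hypothetical simulated data with true ATE = 2.
rng = np.random.default_rng(3)
n = 10000
X = rng.normal(size=n)
T = (rng.random(n) < 1.0 / (1.0 + np.exp(-X))).astype(float)
Y = 2.0 * T + X + rng.normal(size=n)
ate = crossfit_aipw(X, T, Y)
```

Here the outcome model is correctly specified, so double robustness covers for the crude propensity stand-in, and the cross-fit estimate lands near 2.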

Regression Discontinuity

When you can't randomize and don't have enough covariates to use AIPW, regression discontinuity (RD) is a powerful alternative — if your assignment has a sharp threshold.

The idea: people just above and just below the threshold are essentially comparable. The discontinuity in outcome at the threshold estimates the causal effect.

Classic example: students scoring just above vs. just below the cutoff for a scholarship program. Near the cutoff, assignment is essentially random, even though globally it's determined by score.

In an RD design, the estimator is:

\hat{\tau}_{\text{RD}} = \lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]

The two limits are the expected outcomes just above and just below the cutoff c, each estimated by local linear regression near the threshold.
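A minimal sharp-RD sketch with a uniform kernel, on invented scholarship-style data with a jump of 1.5 at a cutoff of 0.5:

```python
import numpy as np

def rd_estimate(x, y, cutoff, bandwidth):
    """Sharp RD sketch: a local linear fit on each side of the cutoff,
    each evaluated at the cutoff; the gap between the two intercepts
    is the estimated effect. Uniform kernel for simplicity."""
    def limit_at_cutoff(mask):
        D = np.column_stack([np.ones(mask.sum()), x[mask] - cutoff])
        b, *_ = np.linalg.lstsq(D, y[mask], rcond=None)
        return b[0]                          # fitted value at x = cutoff
    above = (x >= cutoff) & (x < cutoff + bandwidth)
    below = (x < cutoff) & (x >= cutoff - bandwidth)
    return float(limit_at_cutoff(above) - limit_at_cutoff(below))

# Hypothetical running variable and outcome with a jump of 1.5 at 0.5.
rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=4000)
y = x + 1.5 * (x >= 0.5) + rng.normal(scale=0.2, size=4000)
effect = rd_estimate(x, y, cutoff=0.5, bandwidth=0.1)
```

Real RD tooling also handles bandwidth selection and kernel weighting; the bandwidth here is fixed for illustration.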


Difference-in-Differences

When you have panel data (the same units observed over time), difference-in-differences (DiD) is a workhorse design.

Setup: some units receive treatment at time t₀, others never do. Compare the change in outcomes for treated units to the change for control units:

\hat{\tau}_{\text{DiD}} = (\bar{Y}_{\text{treated, after}} - \bar{Y}_{\text{treated, before}}) - (\bar{Y}_{\text{control, after}} - \bar{Y}_{\text{control, before}})

The key assumption is parallel trends: absent treatment, treated and control units would have moved in parallel. The assumption itself is untestable, but you can check whether pre-treatment trends were parallel.

DiD is everywhere in economics: evaluating minimum wage laws (Card & Krueger 1994), policy changes, and natural experiments of all kinds.
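The DiD formula is a direct plug-in. A tiny sketch with invented panel numbers:

```python
import numpy as np

def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Plug-in difference-in-differences: the treated group's change
    minus the control group's change."""
    return float((np.mean(treat_post) - np.mean(treat_pre))
                 - (np.mean(ctrl_post) - np.mean(ctrl_pre)))

# Hypothetical panel: treated outcomes rise by 5 on average, controls by 2,
# so DiD attributes a change of 3 to the treatment.
effect = did_estimate([10.0, 12.0], [15.0, 17.0], [8.0, 9.0], [10.0, 11.0])
```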


Choosing Your Estimator

| Situation | Recommended approach |
| --- | --- |
| Randomized experiment | Simple difference in means (or regression for precision) |
| Randomized + covariates | AIPW for efficiency gain |
| Observational + rich covariates | AIPW with ML outcome and propensity models |
| Sharp assignment threshold | Regression discontinuity |
| Panel data, parallel trends plausible | Difference-in-differences |
| Instrument available | Instrumental variables (IV) |
| Can't satisfy any of the above | Sensitivity analysis (Chapter 9) |

What Reinforce OS Does

When you run an observational analysis in Reinforce OS — for example, analyzing logged behavioral data rather than a controlled experiment — the engine:

  1. Fits an outcome model μ̂(T, X) using regularized regression
  2. Fits a propensity model e(X) using logistic regression
  3. Applies 5-fold cross-fitting
  4. Computes the AIPW estimate with bootstrap confidence intervals
  5. Reports the result with a plain-English interpretation and the full posterior distribution

For controlled experiments (where you randomized assignment), the simple difference in means is used, with Bayesian updating to give you a posterior over the effect size.


Summary

  • Regression adjustment is simple and interpretable but relies on correct model specification
  • Propensity score methods (matching, IPW) reduce confounding by balancing covariate distributions
  • AIPW combines both: doubly robust (consistent if either model is correct), efficient if both are
  • Cross-fitting lets you use flexible ML models inside AIPW without bias
  • Regression discontinuity and DiD are powerful designs for specific data structures
  • Reinforce OS uses AIPW with cross-fitting as its default observational estimator

Next: When Experiments Fail — what to do when you can't randomize and you're not sure your assumptions hold.