You have a treatment, an outcome, and some confounders you want to control for. How do you actually compute the causal effect?
This chapter covers the main estimation strategies — from simple regression to the doubly-robust AIPW estimator that Reinforce OS uses under the hood.
Regression Adjustment
The simplest approach: regress the outcome Y on the treatment T and the confounders X:

Y = α + τT + βᵀX + ε

The coefficient τ on T is the estimated ATE, holding X fixed. This works well when:
- The confounders are measured correctly
- The relationship between X and Y is roughly linear
- You haven't omitted any important confounders
Regression is fast, interpretable, and the workhorse of empirical research. Its weakness: it relies on correctly specifying the outcome model. If the true relationship between X and Y is nonlinear, a linear regression will give biased estimates.
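A minimal numpy sketch of the contrast (simulated data with one confounder and an assumed true ATE of 2.0; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# One confounder X drives both treatment uptake and the outcome.
X = rng.normal(size=n)
T = (rng.random(n) < 1 / (1 + np.exp(-X))).astype(float)  # confounded assignment
Y = 2.0 * T + 1.5 * X + rng.normal(size=n)                # true ATE = 2.0

# Naive difference in means is biased upward: treated units have higher X.
naive = Y[T == 1].mean() - Y[T == 0].mean()

# Regression adjustment: regress Y on [1, T, X]; the T coefficient estimates the ATE.
design = np.column_stack([np.ones(n), T, X])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
ate_hat = beta[1]  # close to 2.0, while `naive` overshoots
```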
Limitations
The most common mistake with regression adjustment: using it with high-dimensional confounders without regularization. If you have 100 confounders and 200 observations, OLS will overfit. Use regularized regression (Lasso, Ridge) or a machine learning model instead.
Propensity Score Methods
The propensity score is the probability of receiving treatment given covariates:

e(x) = P(T = 1 | X = x)
Rosenbaum & Rubin (1983) showed a remarkable result: conditioning on the propensity score is sufficient to remove confounding, even though e(x) is a single number rather than the full covariate vector X.
This dimension reduction property makes propensity scores powerful when you have many confounders.
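In practice e(x) is unknown and must be estimated. A minimal sketch using logistic regression fit by gradient ascent on simulated data (the true assignment logit is assumed to be 0.8x; any classifier that outputs calibrated probabilities would work):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Simulated covariate and confounded assignment with true logit 0.8 * x.
X = rng.normal(size=n)
T = (rng.random(n) < 1 / (1 + np.exp(-0.8 * X))).astype(float)

# Fit e(x) = P(T = 1 | X = x) by maximizing the logistic log-likelihood.
A = np.column_stack([np.ones(n), X])   # design matrix with intercept
w = np.zeros(2)
for _ in range(3000):
    p = 1 / (1 + np.exp(-A @ w))
    w += 0.5 * A.T @ (T - p) / n       # average log-likelihood gradient step

e_hat = 1 / (1 + np.exp(-A @ w))       # fitted propensity scores, all in (0, 1)
```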
Propensity Score Matching
Match each treated unit to a control unit with a similar propensity score. Then estimate the ATE as the average difference in outcomes between matched pairs.
Treated unit: e(x) = 0.73 → match with Control unit: e(x) = 0.71
Treated unit: e(x) = 0.41 → match with Control unit: e(x) = 0.40
...
Matching removes imbalance in observed confounders. After matching, the treated and control groups should look alike on all covariates, mimicking what randomization achieves.
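A toy sketch of 1:1 nearest-neighbor matching on the propensity score (scores echo the pairs above; the outcomes are invented for illustration):

```python
import numpy as np

# Toy propensity scores and outcomes (in practice e comes from a fitted model).
e = np.array([0.73, 0.41, 0.65, 0.71, 0.40, 0.20, 0.66])
T = np.array([1, 1, 1, 0, 0, 0, 0])
Y = np.array([5.0, 3.0, 4.5, 4.0, 2.2, 1.0, 3.6])

treated = np.where(T == 1)[0]
controls = np.where(T == 0)[0]

# Match each treated unit to the control with the closest propensity score
# (matching with replacement), then average the matched-pair differences.
diffs = [Y[i] - Y[controls[np.argmin(np.abs(e[controls] - e[i]))]]
         for i in treated]
att_hat = float(np.mean(diffs))   # ≈ 0.9, the average effect on the treated
```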
Inverse Probability Weighting (IPW)
Instead of matching, weight each observation by the inverse of its probability of receiving the treatment it actually received. The resulting ATE estimator is:

τ̂_IPW = (1/n) Σᵢ [ Tᵢ Yᵢ / e(Xᵢ) − (1 − Tᵢ) Yᵢ / (1 − e(Xᵢ)) ]
Treated units with low propensity scores (they were unlikely to be treated) get high weight — they're informative precisely because they were treated despite the odds. Control units with high propensity scores (they "could have been" treated) also get high weight.
IPW creates a pseudo-population where treatment is independent of covariates — mimicking randomization.
Weakness of IPW: extreme propensity scores (e(x) near 0 or 1) create extreme weights, inflating variance. Common fixes: weight trimming and stabilized weights.
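A numpy sketch comparing plain (Horvitz-Thompson) IPW with the normalized Hájek variant, plus trimming (simulated data with an assumed true ATE of 2.0; the true propensity score is plugged in for clarity, though in practice it is estimated):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

X = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-X))                  # true propensity score
T = (rng.random(n) < e_true).astype(float)
Y = 2.0 * T + 1.5 * X + rng.normal(size=n)     # true ATE = 2.0

e = np.clip(e_true, 0.01, 0.99)                # trimming caps extreme weights

# Plain IPW (Horvitz-Thompson form).
ate_ht = np.mean(T * Y / e - (1 - T) * Y / (1 - e))

# Hajek variant: normalize by the realized weight sums, a standard
# variance-reduction fix when weights are extreme.
w1, w0 = T / e, (1 - T) / (1 - e)
ate_hajek = np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)
```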
The Doubly-Robust AIPW Estimator
The cleanest solution to the limitations of both regression and IPW is to combine them. The Augmented Inverse Probability Weighted (AIPW) estimator does exactly this:

τ̂_AIPW = (1/n) Σᵢ [ μ̂₁(Xᵢ) − μ̂₀(Xᵢ) + Tᵢ(Yᵢ − μ̂₁(Xᵢ))/ê(Xᵢ) − (1 − Tᵢ)(Yᵢ − μ̂₀(Xᵢ))/(1 − ê(Xᵢ)) ]

where μ̂₁ and μ̂₀ are fitted outcome models for the treated and control arms and ê is the fitted propensity score. This looks complex. Let's break it apart:
- μ̂₁(Xᵢ) − μ̂₀(Xᵢ): the regression-adjusted estimate (outcome model)
- The remaining terms: IPW corrections for how well the outcome model fits
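A compact numpy sketch of an AIPW estimate on simulated data (per-arm OLS outcome models; the true propensity score is used for clarity, though in practice it is also fitted):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000

X = rng.normal(size=n)
e = np.clip(1 / (1 + np.exp(-X)), 0.01, 0.99)  # propensity (known here)
T = (rng.random(n) < e).astype(float)
Y = 2.0 * T + 1.5 * X + rng.normal(size=n)     # true ATE = 2.0

def ols_predict(x_fit, y_fit, x_eval):
    """Fit y = a + b*x by least squares and predict at x_eval."""
    A = np.column_stack([np.ones(len(x_fit)), x_fit])
    b, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return b[0] + b[1] * x_eval

mu1 = ols_predict(X[T == 1], Y[T == 1], X)     # outcome model, treated arm
mu0 = ols_predict(X[T == 0], Y[T == 0], X)     # outcome model, control arm

# AIPW: outcome-model contrast plus IPW corrections of its residuals.
psi = (mu1 - mu0
       + T * (Y - mu1) / e
       - (1 - T) * (Y - mu0) / (1 - e))
ate_aipw = psi.mean()
```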
Why "Doubly Robust"?
AIPW has a remarkable property: it gives a consistent estimate of the ATE if either the outcome model or the propensity model is correctly specified — not necessarily both.
- If you get the outcome model right but the propensity model wrong → consistent
- If you get the propensity model right but the outcome model wrong → consistent
- If you get both right → efficient (lowest possible variance)
- If you get both wrong → biased
You get two chances to be right. This "double robustness" is why AIPW is now standard in modern causal inference.
Reinforce OS uses AIPW as its primary estimator for observational analysis. When you run an experiment without randomization — or when you want to adjust for confounders measured during a randomized experiment — the AIPW estimator gives you the most reliable effect estimates.
The outcome model is fit using regularized regression. The propensity model uses logistic regression. Both models are cross-fit (trained on held-out data) to avoid overfitting bias.
Cross-Fitting: Making ML-Based Estimation Valid
When you use machine learning for the outcome model or propensity model inside AIPW, a subtle problem arises: if you use the same data to fit the model and evaluate the estimator, the in-sample fit is too good, creating bias.
The solution is cross-fitting (also called sample splitting):
- Split the data into K folds
- For each fold k: fit μ̂ and ê on the other K − 1 folds
- Evaluate the AIPW estimator on fold k using those held-out predictions
- Average across all K folds
This ensures predictions are always out-of-sample. The resulting estimator is called DML (Double Machine Learning, Chernozhukov et al. 2018) and achieves the semiparametric efficiency bound — the lowest possible variance for this class of estimators.
Fold 1: train on folds 2-5, evaluate on fold 1
Fold 2: train on folds 1,3-5, evaluate on fold 2
...
Fold 5: train on folds 1-4, evaluate on fold 5
Average the estimates
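Putting the pieces together, a sketch of 5-fold cross-fit AIPW on simulated data, with deliberately simple models assumed for brevity: per-arm OLS for the outcome and a hand-rolled logistic fit for the propensity:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 3000, 5

X = rng.normal(size=n)
T = (rng.random(n) < 1 / (1 + np.exp(-X))).astype(float)
Y = 2.0 * T + 1.5 * X + rng.normal(size=n)      # true ATE = 2.0

def fit_ols(x, y):
    A = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda z: b[0] + b[1] * z

def fit_logit(x, t, iters=3000, lr=0.5):
    A = np.column_stack([np.ones(len(x)), x])
    w = np.zeros(2)
    for _ in range(iters):
        p = 1 / (1 + np.exp(-A @ w))
        w += lr * A.T @ (t - p) / len(x)
    return lambda z: 1 / (1 + np.exp(-(w[0] + w[1] * z)))

folds = rng.integers(0, K, size=n)               # random fold labels 0..K-1
psi = np.empty(n)
for k in range(K):
    train, hold = folds != k, folds == k
    mu1 = fit_ols(X[train & (T == 1)], Y[train & (T == 1)])
    mu0 = fit_ols(X[train & (T == 0)], Y[train & (T == 0)])
    e = np.clip(fit_logit(X[train], T[train])(X[hold]), 0.01, 0.99)
    m1, m0 = mu1(X[hold]), mu0(X[hold])
    # AIPW score, evaluated only with out-of-fold predictions.
    psi[hold] = (m1 - m0
                 + T[hold] * (Y[hold] - m1) / e
                 - (1 - T[hold]) * (Y[hold] - m0) / (1 - e))

ate_dml = psi.mean()
```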
Regression Discontinuity
When you can't randomize and don't have enough covariates to use AIPW, regression discontinuity (RD) is a powerful alternative — if your assignment has a sharp threshold.
The idea: people just above and just below the threshold are essentially comparable. The discontinuity in outcome at the threshold estimates the causal effect.
Classic example: students scoring just above vs. just below the cutoff for a scholarship program. Near the cutoff, assignment is essentially random, even though globally it's determined by score.
In a sharp RD design with cutoff c, the estimator is:

τ̂_RD = lim(x→c⁺) E[Y | X = x] − lim(x→c⁻) E[Y | X = x]

The limits from above and below the cutoff are estimated by local linear regression near the threshold.
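A minimal sketch of a sharp RD estimate via local linear fits on each side of the cutoff (simulated data with an assumed true jump of 1.0 and a hand-picked bandwidth; real analyses use data-driven bandwidth selection):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000

score = rng.uniform(-1, 1, size=n)       # running variable, cutoff c = 0
T = (score >= 0).astype(float)           # sharp assignment at the threshold
Y = 1.0 * T + 2.0 * score + rng.normal(scale=0.5, size=n)  # true jump = 1.0

h = 0.2                                  # bandwidth (chosen by eye here)

def intercept_at_cutoff(mask):
    # Local linear fit Y ~ a + b*score inside the bandwidth; the intercept a
    # is the predicted outcome at score = 0 from that side.
    A = np.column_stack([np.ones(mask.sum()), score[mask]])
    b, *_ = np.linalg.lstsq(A, Y[mask], rcond=None)
    return b[0]

above = intercept_at_cutoff((score >= 0) & (score < h))
below = intercept_at_cutoff((score < 0) & (score > -h))
rd_hat = above - below                   # close to the true jump of 1.0
```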
Difference-in-Differences
When you have panel data (the same units observed over time), difference-in-differences (DiD) is a workhorse design.
Setup: some units receive treatment at time t₀, others never do. Compare the change in outcomes for treated units to the change for control units:

τ̂_DiD = (Ȳ_treated,post − Ȳ_treated,pre) − (Ȳ_control,post − Ȳ_control,pre)
The key assumption is parallel trends: absent treatment, treated and control units would have moved in parallel. The assumption itself is untestable, but you can check whether trends were in fact parallel in the pre-treatment period.
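The two-by-two arithmetic is simple enough to show directly (toy group means, invented for illustration):

```python
# Mean outcomes by group and period (toy numbers).
y_treat_pre, y_treat_post = 10.0, 14.0   # treated group rose by 4
y_ctrl_pre, y_ctrl_post = 9.0, 11.0      # control group rose by 2

# DiD: change for the treated minus change for the controls.
did = (y_treat_post - y_treat_pre) - (y_ctrl_post - y_ctrl_pre)
# did == 2.0: the extra rise among the treated is the effect estimate.
```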
DiD is everywhere in economics: evaluating minimum wage laws (Card & Krueger 1994), policy changes, and natural experiments of all kinds.
Choosing Your Estimator
| Situation | Recommended approach |
|---|---|
| Randomized experiment | Simple difference in means (or regression for precision) |
| Randomized + covariates | AIPW for efficiency gain |
| Observational + rich covariates | AIPW with ML outcome and propensity models |
| Sharp assignment threshold | Regression discontinuity |
| Panel data, parallel trends plausible | Difference-in-differences |
| Instrument available | Instrumental variables (IV) |
| Can't satisfy any of the above | Sensitivity analysis (Chapter 9) |
What Reinforce OS Does
When you run an observational analysis in Reinforce OS — for example, analyzing logged behavioral data rather than a controlled experiment — the engine:
- Fits an outcome model using regularized regression
- Fits a propensity model using logistic regression
- Applies 5-fold cross-fitting
- Computes the AIPW estimate with bootstrap confidence intervals
- Reports the result with a plain-English interpretation and the full posterior distribution
For controlled experiments (where you randomized assignment), the simple difference in means is used, with Bayesian updating to give you a posterior over the effect size.
Summary
- Regression adjustment is simple and interpretable but relies on correct model specification
- Propensity score methods (matching, IPW) reduce confounding by balancing covariate distributions
- AIPW combines both: doubly robust (consistent if either model is correct), efficient if both are
- Cross-fitting lets you use flexible ML models inside AIPW without bias
- Regression discontinuity and DiD are powerful designs for specific data structures
- Reinforce OS uses AIPW with cross-fitting as its default observational estimator
Next: When Experiments Fail — what to do when you can't randomize and you're not sure your assumptions hold.