Chapter 9

When Experiments Fail

Sensitivity analysis, E-values, and what to do when you can't randomize

You ran an observational study. You controlled for everything you could measure. The effect looks real.

But how confident should you be? An unmeasured confounder could explain away the entire result. You can never prove it doesn't exist — but you can quantify how strong it would have to be to matter.

This is sensitivity analysis: the art of stress-testing causal claims.


The Problem: Unmeasured Confounding

Every observational study rests on an assumption: no unmeasured confounders. You've controlled for everything that both predicts treatment and affects the outcome.

This assumption is untestable. You can't prove you've measured everything relevant. What you can do is ask: how bad would unmeasured confounding have to be to change my conclusion?

If the answer is "an implausibly strong unmeasured confounder," your result is robust. If "a very weak confounder would do it," you should be worried.

[Diagram: treatment → outcome with an observed association, a hidden confounder U, and an "E-value stress meter" asking how strong the hidden thing would need to be.]
Sensitivity analysis asks whether a hidden variable would need to be a mouse, a bear, or a bear with a clipboard.

E-Values

The E-value (VanderWeele & Ding, 2017) is the most practical tool for this question. It answers:

What is the minimum strength of association that an unmeasured confounder would need to have with both the treatment and the outcome — on the risk ratio scale — to fully explain away the observed association?

For a relative risk (RR) estimate, the E-value formula is:

\text{E-value} = \text{RR} + \sqrt{\text{RR} \cdot (\text{RR} - 1)}

For a risk difference, odds ratio, or hazard ratio, there are analogous formulas (converting to an approximate RR first).

Interpreting E-values

An E-value of 3.0 means: an unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of at least 3.0 to explain away the result. If the strongest unmeasured confounder you can think of has an RR of 1.5 with each, your result is safe.

An E-value of 1.2 means almost any unmeasured confounder could explain the result. Be very skeptical.

E-values for the confidence interval: you can also compute the E-value for the lower bound of your confidence interval. This tells you how strong confounding would need to be to push the entire confidence interval below 1 (for RR) or 0 (for risk differences).
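The formula and its confidence-interval variant are easy to compute directly. Here is a minimal sketch for a risk-ratio estimate; the function names (`e_value`, `e_value_ci`) are illustrative, not from any particular library:

```python
import math

def e_value(rr: float) -> float:
    """E-value for a risk ratio. For protective effects (RR < 1),
    take the reciprocal first, per VanderWeele & Ding (2017)."""
    if rr < 1:
        rr = 1 / rr
    if rr == 1:
        return 1.0
    return rr + math.sqrt(rr * (rr - 1))

def e_value_ci(lower: float, upper: float) -> float:
    """E-value for the CI limit closest to the null.
    If the CI crosses 1, no confounding is needed: E-value = 1."""
    if lower <= 1 <= upper:
        return 1.0
    limit = lower if lower > 1 else upper
    return e_value(limit)

print(round(e_value(1.20), 2))          # → 1.69
print(round(e_value_ci(1.05, 1.37), 2)) # → 1.28 (driven by the lower bound)
```

Note how the CI-based E-value is always smaller than the point-estimate E-value: the limit closest to the null is the easiest thing for a confounder to explain away.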

🧪 E-values in Reinforce OS

When you analyze observational data in Reinforce OS, the results panel shows the E-value alongside the effect estimate. An estimated 20% improvement in focus (RR ≈ 1.20) has an E-value of about 1.7 — meaning an unmeasured confounder would need to be associated with both the behavior and the outcome by risk ratios of about 1.7 each to explain away the finding. For most personal habit experiments, that's a plausible threshold.

Mini-Sim

E-Value Stress Meter

Increase the observed risk ratio and see how strong a hidden confounder would need to be to erase the finding.


Rosenbaum Sensitivity Analysis

A complementary approach, developed by Paul Rosenbaum for matched observational studies, asks: how much could two matched individuals differ in their odds of treatment due to an unmeasured confounder?

The parameter Γ represents the maximum odds ratio for unmeasured confounding between matched units. Γ = 1 means no unmeasured confounding (equivalent to random assignment). Γ = 2 means matched units could differ by up to a factor of 2 in their treatment odds due to hidden bias.

For each value of Γ, you compute the range of p-values consistent with that level of confounding. If the p-value stays significant even at Γ = 3 or higher, your result is robust to substantial hidden bias.

If significance breaks at Γ = 1.2, a very small amount of unmeasured confounding could explain everything.
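For matched pairs with a sign test, the worst-case p-value at a given Γ has a closed form: under hidden bias up to Γ, the probability that a discordant pair favors treatment is at most Γ/(1 + Γ), so the upper-bound p-value is a binomial tail at that probability. A sketch, with illustrative function names:

```python
from math import comb

def binom_sf(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def rosenbaum_upper_p(t: int, n: int, gamma: float) -> float:
    """Worst-case one-sided sign-test p-value when, among n discordant
    matched pairs, t favor treatment, under hidden bias up to gamma."""
    return binom_sf(t, n, gamma / (1 + gamma))

# At gamma = 1 this is the ordinary sign test; watch significance
# erode as gamma grows.
for gamma in (1.0, 1.5, 2.0, 3.0):
    print(gamma, round(rosenbaum_upper_p(40, 50, gamma), 4))
```

Reading the output as a table of Γ versus worst-case p-value tells you exactly where significance breaks.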


Placebo Tests

A placebo test (or falsification test) checks your causal claim by testing something that should not be affected by treatment.

Temporal placebo: if treatment started in month 6, test whether treatment predicts outcomes in months 1–5. If it does, the "effect" is probably a pre-existing trend, not a causal effect.

Outcome placebo: test an outcome that treatment mechanically cannot affect. If you're studying the effect of a new checkout flow on revenue, test whether it also affects customer service calls (which the checkout flow can't directly cause). An effect suggests your analysis is picking up something confounded.

Treatment placebo: randomly assign a fake treatment to a group that didn't receive the real one. The estimated "effect" of the fake treatment should be zero. If it's not, your control group is contaminated.

Passing placebo tests doesn't prove your causal claim — but failing them is a strong signal that something is wrong.
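The treatment placebo is the easiest to automate: repeatedly assign a fake treatment within the untreated group and check that the distribution of estimated "effects" is centered on zero. A minimal sketch on synthetic data:

```python
import random

random.seed(0)
# Synthetic outcomes for a group that received no real treatment.
controls = [random.gauss(10.0, 2.0) for _ in range(200)]

def fake_effect(outcomes: list[float]) -> float:
    """Randomly label half the units as fake-treated; return the
    difference in mean outcomes (should be noise around zero)."""
    shuffled = random.sample(outcomes, len(outcomes))
    half = len(shuffled) // 2
    fake_treated, fake_control = shuffled[:half], shuffled[half:]
    return (sum(fake_treated) / len(fake_treated)
            - sum(fake_control) / len(fake_control))

effects = [fake_effect(controls) for _ in range(1000)]
mean_fake = sum(effects) / len(effects)
print(round(mean_fake, 3))  # hovers near zero if the control group is clean
```

If the fake-treatment effects are systematically nonzero, or your real estimate sits deep inside the fake-effect distribution, the design has a problem.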


Instrumental Variables

When you have an unmeasured confounder but also have an instrument — a variable that affects treatment but affects the outcome only through treatment — you can still identify the causal effect.

The instrument Z must satisfy:

  1. Relevance: Z is correlated with treatment T
  2. Exclusion restriction: Z affects outcome Y only through T
  3. Independence: Z is independent of unmeasured confounders

The IV estimator is:

\hat{\tau}_{\text{IV}} = \frac{\text{Cov}(Y, Z)}{\text{Cov}(T, Z)}

This is a ratio: the effect of the instrument on outcome, divided by the effect of the instrument on treatment.
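You can see the estimator recover a causal effect that naive regression misses. In this synthetic sketch (all numbers are illustrative), a hidden confounder U biases the OLS slope, while the instrument Z shifts only T; the true effect of T on Y is 2.0:

```python
import random

random.seed(1)
n = 20_000
z = [1 if random.random() < 0.5 else 0 for _ in range(n)]   # binary instrument
u = [random.gauss(0, 1) for _ in range(n)]                  # unmeasured confounder
t = [0.5 * zi + 1.0 * ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [2.0 * ti + 3.0 * ui + random.gauss(0, 1) for ti, ui in zip(t, u)]

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

naive = cov(y, t) / cov(t, t)   # OLS slope, biased upward by U
iv = cov(y, z) / cov(t, z)      # IV ratio, consistent for the true 2.0
print(round(naive, 2), round(iv, 2))
```

The naive slope lands well above 2.0 because U pushes T and Y in the same direction; the covariance ratio cancels that bias because Z is independent of U.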

Classic examples:

  • Distance to college as an instrument for education (Card 1995)
  • Draft lottery as an instrument for military service (Angrist 1990)
  • Rainfall as an instrument for conflict (Miguel et al. 2004)
  • Genetic variants as instruments for exposures in Mendelian randomization

IV gives you the LATE (Local Average Treatment Effect) — the effect for compliers, people whose treatment status is changed by the instrument.

Weak instruments: if Z has only a weak relationship with T, the IV estimate is highly sensitive to even small violations of the exclusion restriction. Check the first-stage F-statistic (rule of thumb: > 10, ideally > 20).
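With a single instrument, the first-stage F-statistic falls out of the simple regression of T on Z: F = (n − 2) · R² / (1 − R²). A sketch on synthetic data (coefficients chosen to make the first stage deliberately weak):

```python
import random

random.seed(2)
n = 5_000
z = [random.gauss(0, 1) for _ in range(n)]
t = [0.03 * zi + random.gauss(0, 1) for zi in z]   # deliberately weak first stage

def first_stage_f(t, z):
    """F-statistic for the simple regression of t on z."""
    mz, mt = sum(z) / n, sum(t) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    szt = sum((zi - mz) * (ti - mt) for zi, ti in zip(z, t))
    stt = sum((ti - mt) ** 2 for ti in t)
    r2 = szt ** 2 / (szz * stt)                    # R^2 of the first stage
    return (n - 2) * r2 / (1 - r2)

print(round(first_stage_f(t, z), 1))  # compare against the > 10 rule of thumb
```

With multiple instruments or covariates you would use the joint F-test from a proper regression package instead, but the single-instrument case makes the rule of thumb concrete.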


Negative Controls

A negative control is a variable that you know is not causally affected by treatment (negative control outcome) or is not causally related to the outcome (negative control exposure).

If your analysis finds an effect on a negative control outcome, something is wrong with your design or assumptions. If a negative control exposure shows an "effect" that mirrors your main finding, it's likely a confounder.

Negative controls are a powerful diagnostic tool that's underused in practice. They're especially valuable in large observational datasets where fishing for significant results is easy.


Difference-in-Differences: Checking Parallel Trends

For DiD designs (Chapter 8), the parallel trends assumption is critical and partially testable. If you have multiple pre-treatment periods:

  • Plot trends for treated and control groups before treatment
  • Run a regression testing whether pre-treatment trends differ
  • A significant pre-trend is evidence against parallel trends

Even if pre-trends look parallel, post-treatment trends might diverge for reasons unrelated to treatment. Event study plots — showing the treatment effect coefficient for each time period relative to treatment — are standard practice for visualizing this.
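A crude version of the pre-trend check compares the pre-treatment slope of each group's mean outcome. The sketch below uses made-up group means; a formal test would run a regression with group × period interactions and report standard errors:

```python
periods = [1, 2, 3, 4, 5]                       # pre-treatment periods
treated_means = [10.0, 10.5, 11.1, 11.4, 12.0]  # illustrative group means
control_means = [8.0, 8.6, 9.0, 9.6, 10.0]

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

gap = slope(periods, treated_means) - slope(periods, control_means)
print(round(gap, 3))  # near zero supports (but cannot prove) parallel trends
```

A slope gap near zero is consistent with parallel trends; a large gap means the DiD estimate is riding on a pre-existing divergence.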


Combining Methods: A Robustness Checklist

For any causal claim from observational data, a thorough analysis includes:

  1. Primary estimate: AIPW with cross-fitting (Chapter 8)
  2. E-value: how strong would unmeasured confounding need to be?
  3. Rosenbaum bounds (for matched designs): at what Γ does significance break?
  4. Placebo tests: temporal, outcome, or treatment placebos
  5. Sensitivity to specification: does the estimate change when you add/remove covariates?
  6. Alternative estimators: do OLS, IPW, and matching give similar answers?
  7. Subgroup analysis: is the effect consistent across subgroups, or concentrated in one suspicious subgroup?

An effect that survives all seven is substantially more credible than one that only passes the primary estimate.


The Limits of Sensitivity Analysis

Sensitivity analysis tells you the minimum strength of confounding needed to overturn a result. It doesn't tell you whether that level of confounding exists.

Ultimately, observational evidence is always uncertain in a way that randomized experiments are not. Sensitivity analysis quantifies the uncertainty and forces intellectual honesty — but it's no substitute for a well-designed experiment when one is possible.

If you can randomize, randomize. If you can't, use sensitivity analysis to know how much to trust what you found.


Summary

  • E-values quantify how strong unmeasured confounding would need to be to explain away a result
  • Rosenbaum bounds extend this to matched designs, parameterizing hidden bias via odds ratios
  • Placebo tests check whether the "effect" appears where it shouldn't — a powerful diagnostic
  • Instrumental variables provide identification under unmeasured confounding, at the cost of estimating LATE only
  • Negative controls are underused but valuable
  • A robust observational claim survives multiple sensitivity checks

Next: Causal AI — how machine learning and causal reasoning are converging.