Multi-armed bandits (Chapter 6) handle one decision repeated many times. But what about decisions that have long-term consequences — where the reward from an action today depends on what you do tomorrow?
This is reinforcement learning (RL): the study of sequential decision-making under uncertainty. An agent takes actions in an environment, receives rewards, and learns a policy that maximizes cumulative reward over time.
RL is the framework behind AlphaGo, ChatGPT's RLHF training, robotics, and increasingly, personalized health and behavior optimization.
The Basic Setup
An RL problem has five components:
- State $s$: a description of the current situation
- Action $a$: what the agent can do
- Transition $P(s' \mid s, a)$: how the state evolves after an action
- Reward $r(s, a)$: immediate feedback from the environment
- Policy $\pi(a \mid s)$: the agent's decision rule — what action to take in each state
The agent's goal: find a policy $\pi$ that maximizes expected cumulative reward:

$$\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

The discount factor $\gamma \in [0, 1)$ downweights future rewards — a dollar today is worth more than a dollar tomorrow, and certainty today is worth more than uncertainty in the future.
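To make the discounted sum concrete, here is a minimal sketch (the reward sequence and $\gamma = 0.9$ are made-up illustration values) that accumulates the return backward through time:

```python
# Discounted return: G = sum_t gamma^t * r_t.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    # Accumulate from the last reward backward: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Ten identical rewards of 1.0: the geometric sum (1 - gamma^10) / (1 - gamma).
print(discounted_return([1.0] * 10, gamma=0.9))
```

The backward recursion is the same trick RL algorithms use when computing returns from a finished trajectory.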
Markov Decision Processes
The formal model is a Markov Decision Process (MDP). The "Markov" property says the future depends only on the current state — not on history:

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$$
This is a simplification, but a powerful one. It means you only need to track the current state, not your entire history.
Example: a personal behavior optimization MDP.
- State: (today's sleep score, stress level, exercise done this week, day of week)
- Actions: take caffeine / no caffeine, exercise / rest, 10pm bedtime / flexible bedtime
- Transition: how these choices affect tomorrow's state
- Reward: focus score, mood, energy level
The goal isn't to maximize today's focus — it's to maximize focus + mood + energy over the next month, accounting for the fact that today's choices constrain tomorrow's options.
State, Action, Reward
Pick an action. The reward is not just what happens now: the action also nudges the next state, so today's choice shapes tomorrow's options.
Value Functions
The core tool for solving MDPs is the value function $V^\pi(s)$: the expected cumulative reward starting from state $s$ under policy $\pi$.

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$

The Q-function (action-value function) extends this to state-action pairs:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a\right]$$

$Q^\pi(s, a)$ answers: "if I'm in state $s$ and take action $a$ now, then follow policy $\pi$ afterwards, what's my expected cumulative reward?"

The optimal policy follows directly from the optimal Q-function:

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$
The Bellman Equation
The Q-function satisfies a recursive relationship called the Bellman equation:

$$Q^*(s, a) = \mathbb{E}_{s'}\left[r(s, a) + \gamma \max_{a'} Q^*(s', a')\right]$$

This says: the value of taking action $a$ in state $s$ equals the immediate reward plus the (discounted) value of the best action in the next state $s'$. It's a self-consistency condition — and the basis for all Q-learning algorithms.
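The self-consistency condition can be solved directly when the MDP is known, by repeatedly applying the Bellman backup until it reaches a fixed point. A minimal sketch, on a hypothetical two-state, two-action MDP with made-up transition probabilities and rewards:

```python
import numpy as np

# Toy MDP (hypothetical numbers): P[s, a, s'] are transition probabilities,
# R[s, a] are immediate rewards.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9

# Q-value iteration: apply the Bellman optimality backup until convergence.
Q = np.zeros((2, 2))
for _ in range(500):
    # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
    Q = R + gamma * P @ Q.max(axis=1)

print(Q)  # at the fixed point, Q satisfies its own Bellman equation
```

Because $\gamma < 1$, the backup is a contraction, so the iteration converges to the unique fixed point $Q^*$ regardless of initialization.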
Q-Learning
Q-learning is an algorithm that learns $Q^*$ directly from experience, without a model of the transition dynamics $P(s' \mid s, a)$.

The update rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$

The term in brackets is the TD error (temporal difference error): the gap between the current estimate $Q(s, a)$ and the bootstrapped target $r + \gamma \max_{a'} Q(s', a')$ formed after seeing the actual reward and next state. The learning rate $\alpha$ controls how fast you update.
Q-learning is off-policy: it learns the optimal Q regardless of how it collected experience. This is useful because you can learn from historical data or an exploratory policy without biasing the value estimate.
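The update rule fits in a few lines. A minimal sketch of tabular Q-learning on a hypothetical 5-state chain (made-up environment: moving right eventually reaches a terminal goal worth reward 1):

```python
import random

# Hypothetical chain environment: states 0..4, goal at state 4 (reward 1.0).
N_STATES, ACTIONS = 5, (0, 1)        # action 0 = left, action 1 = right
alpha, gamma, eps = 0.5, 0.9, 0.1

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):                 # episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy behavior policy (exploration).
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        s2, r, done = step(s, a)
        # Q-learning update: target bootstraps from max over next actions.
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

# Greedy policy in every non-terminal state should be "right" (action 1).
policy = [max(ACTIONS, key=lambda act: Q[s][act]) for s in range(N_STATES - 1)]
print(policy)
```

Note the off-policy flavor: the behavior is epsilon-greedy, but the update target uses $\max_{a'} Q(s', a')$ — the greedy value — so the table converges toward $Q^*$ anyway.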
Deep Q-Networks (DQN)
When the state space is large (like images in Atari games), you can't store a table of Q-values. Deep Q-Networks use a neural network with parameters $\theta$ to approximate $Q(s, a; \theta)$.
DQN added two tricks to stabilize training:
- Experience replay: store transitions in a buffer, sample random batches to break correlation
- Target network: use a slowly updated copy of the network for the Bellman target
These tricks made RL on complex environments practical for the first time (DeepMind's Atari paper, 2013).
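The first trick is mostly a data structure. A minimal sketch of an experience-replay buffer (hypothetical capacity and batch size; a real DQN would pair this with a neural network and a slowly synced target copy of its weights):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, which stabilizes gradient estimates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

random.seed(0)
buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.push(t, 0, 0.0, t + 1, False)
print(len(buf))   # capped at 100: the 50 oldest transitions were evicted
```

During training, each environment step pushes one transition and then samples a random minibatch for the gradient update, so each transition is reused many times.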
Policy Gradient Methods
Q-learning learns the value function and derives the policy from it. Policy gradient methods directly optimize the policy.
The objective is the expected return under the parameterized policy $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

The policy gradient theorem gives the gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t$ is the return from time $t$. The policy is updated by gradient ascent.
REINFORCE is the simplest policy gradient algorithm: collect a trajectory, compute the return for each step, update the policy parameters by gradient ascent weighted by the return.
Actor-Critic methods combine both ideas: a critic estimates $Q^\pi(s, a)$ (reducing variance) while an actor updates the policy $\pi_\theta$. This is the basis for modern methods like PPO (Proximal Policy Optimization), which underlies ChatGPT's RLHF training.
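REINFORCE can be shown end to end on a deliberately tiny problem. A sketch under made-up assumptions: one state, two actions, action 1 pays reward 1.0 and action 0 pays 0.0, and each episode is a single step (so the return is just the reward). The policy is a softmax over two logits:

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]   # policy parameters (logits), one per action
lr = 0.1

def policy():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

for _ in range(2000):                       # one-step episodes
    probs = policy()
    a = 0 if random.random() < probs[0] else 1
    G = 1.0 if a == 1 else 0.0              # return of this episode
    # For a softmax policy, d/d(theta_k) log pi(a) = 1{k == a} - pi(k).
    # REINFORCE: ascend the return-weighted score function.
    for k in range(2):
        grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
        theta[k] += lr * G * grad_log_pi

print(policy())   # probability mass should concentrate on action 1
```

Because the gradient is weighted by the raw return, REINFORCE is high-variance; subtracting a baseline (the critic in Actor-Critic) keeps the gradient unbiased while shrinking that variance.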
Exploration vs. Exploitation (Again)
RL faces the same exploration-exploitation tradeoff as bandits (Chapter 6), but harder: an exploratory action today might have consequences many steps into the future.
Common exploration strategies:
- $\epsilon$-greedy: take the best known action with probability $1 - \epsilon$, a random action with probability $\epsilon$. Decay $\epsilon$ over time.
- Boltzmann exploration: $\pi(a \mid s) \propto \exp(Q(s, a) / \tau)$. High temperature $\tau$ = more random; low temperature = more greedy.
- UCB-style: add exploration bonuses based on uncertainty in Q-estimates. Principled but expensive.
- Intrinsic motivation: reward the agent for visiting novel states — useful when external rewards are sparse.
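The first two strategies are one function each. A minimal sketch over a vector of Q-estimates (the Q-values, $\epsilon$, and temperature below are illustrative numbers, not recommendations):

```python
import math
import random

def epsilon_greedy(q, eps):
    if random.random() < eps:
        return random.randrange(len(q))                 # explore: uniform
    return max(range(len(q)), key=lambda a: q[a])       # exploit: greedy

def boltzmann(q, temperature):
    # Softmax over Q / temperature: high T -> near-uniform, low T -> greedy.
    z = [math.exp(v / temperature) for v in q]
    total = sum(z)
    r, cum = random.random() * total, 0.0
    for a, w in enumerate(z):
        cum += w
        if r <= cum:
            return a
    return len(q) - 1

random.seed(0)
q = [0.1, 0.5, 0.2]
counts = [0, 0, 0]
for _ in range(10_000):
    counts[boltzmann(q, temperature=0.05)] += 1
print(counts)   # at low temperature, nearly all picks go to index 1
```

Unlike $\epsilon$-greedy, Boltzmann exploration spreads its random picks in proportion to the Q-estimates, so clearly bad actions are tried less often than mediocre ones.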
The Connection to Causal Inference
Here's where RL and causal inference meet: both are fundamentally about interventions.
In causal inference, $P(Y \mid \mathrm{do}(X = x))$ asks: what happens when we intervene to set $X = x$? In RL, the policy is an intervention rule: given state $s$, take action $a = \pi(s)$.
The connections run deep:
Counterfactual reasoning in RL: to evaluate a new policy from historical data (offline RL), you need to estimate what would have happened if the agent had acted differently. This is exactly the missing counterfactual problem from Chapter 4. Importance weighting and doubly-robust methods (Chapter 8) are standard tools here.
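The importance-weighting idea can be sketched in the simplest (one-step, bandit-style) case. A hypothetical setup: logged data comes from a uniform behavior policy, and we estimate the value of an always-act-1 target policy without ever running it:

```python
import random

random.seed(0)
true_reward = {0: 0.2, 1: 0.8}    # expected reward per action (unknown to us)
behavior = {0: 0.5, 1: 0.5}       # logging policy: uniform random
target = {0: 0.0, 1: 1.0}         # target policy: always pick action 1

# Simulate a historical log of (action, reward) pairs under the behavior policy.
logs = []
for _ in range(20_000):
    a = random.choice([0, 1])
    r = 1.0 if random.random() < true_reward[a] else 0.0
    logs.append((a, r))

# Importance-sampling estimator:
#   V(target) ~= (1/n) * sum_i [target(a_i) / behavior(a_i)] * r_i
estimate = sum(target[a] / behavior[a] * r for a, r in logs) / len(logs)
print(estimate)   # should be close to 0.8, the target policy's true value
```

The weights re-balance the log toward actions the target policy would have taken; in sequential problems the weights multiply across timesteps, which is why variance explodes and doubly-robust corrections matter.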
Causal models improve generalization: an agent that learns a causal model of the world (how actions cause state transitions) can reason about interventions it's never tried. A purely associative model can't generalize to new environments. This is the core of model-based RL.
Reward shaping: the reward function implicitly encodes the goal. But reward functions can be misspecified — teaching an RL agent to maximize a proxy can lead to unexpected behavior (the agent "gaming" the metric). Causal models of the outcome help specify rewards that reflect what you actually care about.
RL for Personal Optimization
The vision for personal health optimization: an RL agent that learns your personal MDP and suggests actions to maximize long-run wellbeing.
Today's Steady Practice runs individual A/B experiments — estimating the effect of caffeine on focus in isolation. A Reinforce OS extension would:
- Model how your behaviors interact: sleep quality depends on caffeine timing and exercise and stress
- Learn the transition dynamics: how today's choices affect tomorrow's state
- Optimize a long-run objective: not just "does caffeine help?" but "what's the optimal morning routine for sustained focus over a month?"
This is an active research area. The challenges:
- State space: what variables describe your "state"? Hundreds of potential features.
- Sparse rewards: wellbeing metrics take time to manifest; credit assignment is hard.
- Non-stationarity: your physiology changes (stress, age, season) — the MDP isn't fixed.
- Confounding: observational data about your own behavior is confounded by the same person choosing both actions and outcomes.
These are exactly the problems where causal inference and RL need each other.
Summary
- RL formalizes sequential decision-making as a Markov Decision Process: states, actions, transitions, rewards, policy
- The Bellman equation recursively defines the value of a state or state-action pair
- Q-learning learns optimal action values directly from experience; policy gradient methods directly optimize the policy
- RL and causal inference are closely related: both reason about interventions and counterfactuals
- The future of personal optimization combines both: causal models for identifying effects + RL for sequential decision-making
Next: Effect Estimation — getting the numbers right with regression, propensity scores, and doubly-robust methods.