Chapter 7

Reinforcement Learning

Sequential decision-making: when actions have long-term consequences

Multi-armed bandits (Chapter 6) handle one decision repeated many times. But what about decisions that have long-term consequences — where the reward from an action today depends on what you do tomorrow?

This is reinforcement learning (RL): the study of sequential decision-making under uncertainty. An agent takes actions in an environment, receives rewards, and learns a policy that maximizes cumulative reward over time.

RL is the framework behind AlphaGo, ChatGPT's RLHF training, robotics, and increasingly, personalized health and behavior optimization.


The Basic Setup

An RL problem has five components:

  • State $s$: a description of the current situation
  • Action $a$: what the agent can do
  • Transition $P(s' \mid s, a)$: how the state evolves after an action
  • Reward $r(s, a)$: immediate feedback from the environment
  • Policy $\pi(a \mid s)$: the agent's decision rule — what action to take in each state

The agent's goal: find a policy $\pi^*$ that maximizes expected cumulative reward:

$$\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$$

The discount factor $\gamma \in [0, 1)$ downweights future rewards — a dollar today is worth more than a dollar tomorrow, and certainty today is worth more than uncertainty in the future.
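As a quick numerical check, the discounted sum is easy to compute for a short episode (the reward sequence and $\gamma = 0.9$ here are arbitrary illustration values):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# The same reward stream is worth less the further out it sits.
print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```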


Markov Decision Processes

The formal model is a Markov Decision Process (MDP). The "Markov" property says the future depends only on the current state — not on history:

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$$

This is a simplification, but a powerful one. It means you only need to track the current state, not your entire history.

Example: a personal behavior optimization MDP.

  • State: (today's sleep score, stress level, exercise done this week, day of week)
  • Actions: take caffeine / no caffeine, exercise / rest, 10pm bedtime / flexible bedtime
  • Transition: how these choices affect tomorrow's state
  • Reward: focus score, mood, energy level

The goal isn't to maximize today's focus — it's to maximize focus + mood + energy over the next month, accounting for the fact that today's choices constrain tomorrow's options.

RL is not one decision with a trophy. It is a loop: state, action, reward, next state, repeat until the whiteboard looks worried.

(Mini-sim: pick an action — say "walk now" in the state "sleepy, busy". The reward is not just what happens now; it nudges the next state too.)

Value Functions

The core tool for solving MDPs is the value function: the expected cumulative reward from state $s$ under policy $\pi$.

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$$

The Q-function (action-value function) extends this to state-action pairs:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$$

$Q^\pi(s, a)$ answers: "if I'm in state $s$ and take action $a$ now, then follow policy $\pi$ afterwards, what's my expected cumulative reward?"

The optimal policy follows directly:

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$

The Bellman Equation

The Q-function satisfies a recursive relationship called the Bellman equation:

$$Q^*(s, a) = \mathbb{E}\left[r(s, a) + \gamma \max_{a'} Q^*(s', a')\right]$$

This says: the value of taking action $a$ in state $s$ equals the immediate reward plus the (discounted) value of the best action in the next state. It's a self-consistency condition — and the basis for all Q-learning algorithms.
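One way to see the Bellman equation as a self-consistency condition is to iterate it until the Q-table stops changing. A minimal sketch on a toy two-state MDP (the states, transitions, and rewards below are invented for illustration):

```python
# Q-value iteration: repeatedly apply the Bellman optimality update.
# Toy MDP: P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {0: [(1.0, 0, 0.0)],                  # stay in state 0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},  # try to reach state 1
    1: {0: [(1.0, 1, 2.0)],                  # stay in the good state
        1: [(1.0, 0, 0.0)]},                 # fall back to state 0
}
gamma = 0.9
Q = {s: {a: 0.0 for a in P[s]} for s in P}

for _ in range(500):
    for s in P:
        for a in P[s]:
            # Bellman update: expected reward plus discounted best next value.
            Q[s][a] = sum(p * (r + gamma * max(Q[s2].values()))
                          for p, s2, r in P[s][a])

policy = {s: max(Q[s], key=Q[s].get) for s in P}
print(policy)  # the greedy policy with respect to the converged Q
```

At the fixed point, staying in state 1 is worth $2/(1-\gamma) = 20$, so the greedy policy heads for state 1 and stays there.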


Q-Learning

Q-learning is an algorithm that learns $Q^*$ directly from experience, without a model of the transition dynamics $P(s' \mid s, a)$.

The update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$

The term in brackets is the TD error (temporal difference error): the gap between the bootstrapped target (the observed reward plus the discounted value of the best next action) and the current estimate. The learning rate $\alpha$ controls how fast you update.

Q-learning is off-policy: it learns the optimal Q regardless of how it collected experience. This is useful because you can learn from historical data or an exploratory policy without biasing the value estimate.
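The update rule translates almost line-for-line into code. A minimal tabular sketch — the states, actions, and the single transition below are placeholders, not a real environment:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward the bootstrapped target."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]  # temporal-difference error
    Q[(s, a)] += alpha * td_error
    return td_error

Q = defaultdict(float)  # unseen (state, action) pairs default to 0
actions = [0, 1]

# One logged transition: in state 'A', action 1 earned reward 1.0, landed in 'B'.
delta = q_update(Q, 'A', 1, 1.0, 'B', actions)
print(Q[('A', 1)])  # 0.1: the estimate moved alpha * td_error toward the target
```

Because the update only needs logged $(s, a, r, s')$ tuples, the same function works off-policy on historical data.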

Deep Q-Networks (DQN)

When the state space is large (like images in Atari games), you can't store a table of Q-values. Deep Q-Networks use a neural network to approximate $Q(s, a; \theta)$.

DQN added two tricks to stabilize training:

  1. Experience replay: store transitions in a buffer, sample random batches to break correlation
  2. Target network: use a slowly updated copy of the network for the Bellman target
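The first trick is just a bounded buffer sampled uniformly. A minimal sketch (capacity and the dummy transitions are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # A uniform random batch breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for i in range(200):
    buf.push((i, 0, 0.0, i + 1, False))
batch = buf.sample(8)
print(len(buf.buffer), len(batch))  # 100 8 — only the newest 100 survive
```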

These tricks made RL on complex environments practical for the first time (DeepMind's Atari paper, 2013).


Policy Gradient Methods

Q-learning learns the value function and derives the policy from it. Policy gradient methods directly optimize the policy.

The objective is:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t r_t\right]$$

The policy gradient theorem gives the gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot G_t\right]$$

where $G_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_k$ is the return from time $t$. The policy is updated by gradient ascent.

REINFORCE is the simplest policy gradient algorithm: collect a trajectory, compute the return for each step, update the policy parameters by gradient ascent weighted by the return.
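Those three steps can be sketched in a few lines for a tabular softmax policy, where $\nabla_\theta \log \pi_\theta(a \mid s)$ has a closed form. The one-state, two-action example and the learning rate are invented for illustration:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, trajectory, gamma=0.99, lr=0.1):
    """One REINFORCE update on a trajectory of (s, a, r) triples.

    theta[s] holds the action logits for state s. For a softmax policy,
    d log pi(a|s) / d theta[s][j] = 1{j == a} - pi(j|s).
    """
    T = len(trajectory)
    for t, (s, a, r) in enumerate(trajectory):
        # Return from time t: discounted sum of subsequent rewards.
        G = sum((gamma ** (k - t)) * trajectory[k][2] for k in range(t, T))
        probs = softmax(theta[s])
        for j in range(len(theta[s])):
            grad = (1.0 if j == a else 0.0) - probs[j]
            theta[s][j] += lr * G * grad  # gradient ascent on J(theta)
    return theta

# Hypothetical setup: one state, and action 1 always pays reward 1.
theta = {0: [0.0, 0.0]}
for _ in range(200):
    reinforce_step(theta, [(0, 1, 1.0)])
print(softmax(theta[0]))  # the policy now strongly prefers action 1
```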

Actor-Critic methods combine both ideas: a critic estimates V(s)V(s) (reducing variance) while an actor updates the policy. This is the basis for modern methods like PPO (Proximal Policy Optimization), which underlies ChatGPT's RLHF training.


Exploration vs. Exploitation (Again)

RL faces the same exploration-exploitation tradeoff as bandits (Chapter 6), but harder: an exploratory action today might have consequences many steps into the future.

Common exploration strategies:

  • $\varepsilon$-greedy: take the best known action with probability $1 - \varepsilon$, a random action with probability $\varepsilon$. Decay $\varepsilon$ over time.
  • Boltzmann exploration: $\pi(a \mid s) \propto \exp(Q(s,a) / \tau)$. High temperature $\tau$ = more random; low temperature = more greedy.
  • UCB-style: add exploration bonuses based on uncertainty in Q-estimates. Principled but expensive.
  • Intrinsic motivation: reward the agent for visiting novel states — useful when external rewards are sparse.

The Connection to Causal Inference

Here's where RL and causal inference meet: both are fundamentally about interventions.

In causal inference, $P(Y \mid \text{do}(X))$ asks: what happens when we intervene to set $X$? In RL, the policy $\pi(a \mid s)$ is an intervention rule: given state $s$, take action $a$.

The connections run deep:

Counterfactual reasoning in RL: to evaluate a new policy from historical data (offline RL), you need to estimate what would have happened if the agent had acted differently. This is exactly the missing counterfactual problem from Chapter 4. Importance weighting and doubly-robust methods (Chapter 8) are standard tools here.
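The basic importance-weighting estimator is short enough to sketch. The logged trajectories, behavior policy, and target policy below are invented for illustration:

```python
def is_estimate(trajectories, target_pi, behavior_pi):
    """Ordinary importance-sampling estimate of a target policy's value.

    Each trajectory is a list of (s, a, r); each policy maps (s, a) to
    the probability of taking a in s. The product of probability ratios
    reweights returns collected under the behavior policy as if they
    had been collected under the target policy.
    """
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        ret = 0.0
        for s, a, r in traj:
            weight *= target_pi(s, a) / behavior_pi(s, a)
            ret += r
        total += weight * ret
    return total / len(trajectories)

# Hypothetical logs from a uniform-random behavior policy in one state,
# where action 1 always paid reward 1 and action 0 paid nothing.
logs = [[(0, 1, 1.0)], [(0, 0, 0.0)], [(0, 1, 1.0)], [(0, 0, 0.0)]]
behavior = lambda s, a: 0.5                    # logged actions were 50/50
target = lambda s, a: 0.9 if a == 1 else 0.1   # new policy favors action 1

print(is_estimate(logs, target, behavior))  # 0.9, the target policy's true value
```

The weakness is also visible here: with long trajectories the weight is a product of many ratios, so the variance of this estimator can explode — one motivation for the doubly-robust methods of Chapter 8.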

Causal models improve generalization: an agent that learns a causal model of the world (how actions cause state transitions) can reason about interventions it's never tried. A purely associative model can't generalize to new environments. This is the core of model-based RL.

Reward shaping: the reward function implicitly encodes the goal. But reward functions can be misspecified — teaching an RL agent to maximize a proxy can lead to unexpected behavior (the agent "gaming" the metric). Causal models of the outcome help specify rewards that reflect what you actually care about.


RL for Personal Optimization

The vision for personal health optimization: an RL agent that learns your personal MDP and suggests actions to maximize long-run wellbeing.

Today's Steady Practice runs individual A/B experiments — estimating the effect of caffeine on focus in isolation. A Reinforce OS extension would:

  1. Model how your behaviors interact: sleep quality depends on caffeine timing and exercise and stress
  2. Learn the transition dynamics: how today's choices affect tomorrow's state
  3. Optimize a long-run objective: not just "does caffeine help?" but "what's the optimal morning routine for sustained focus over a month?"

This is an active research area. The challenges:

  • State space: what variables describe your "state"? Hundreds of potential features.
  • Sparse rewards: wellbeing metrics take time to manifest; credit assignment is hard.
  • Non-stationarity: your physiology changes (stress, age, season) — the MDP isn't fixed.
  • Confounding: observational data about your own behavior is confounded by the same person choosing both actions and outcomes.

These are exactly the problems where causal inference and RL need each other.


Summary

  • RL formalizes sequential decision-making as a Markov Decision Process: states, actions, transitions, rewards, policy
  • The Bellman equation recursively defines the value of a state or state-action pair
  • Q-learning learns optimal action values directly from experience; policy gradient methods directly optimize the policy
  • RL and causal inference are closely related: both reason about interventions and counterfactuals
  • The future of personal optimization combines both: causal models for identifying effects + RL for sequential decision-making

Next: Effect Estimation — getting the numbers right with regression, propensity scores, and doubly-robust methods.