Imagine you walk into a casino with 10 slot machines. Each machine pays out at a different rate, but you don't know the rates. You have 1,000 coins to spend. How do you maximize your winnings?
This is the multi-armed bandit problem — "armed" because each machine is a "one-armed bandit," and "multi" because there are many of them.
It's also a nearly perfect model for a dozen real problems:
- Which version of a webpage gets more clicks?
- Which drug dosage works best?
- Which ad creative converts?
- Which caffeine strategy gives you the best sleep?
The key insight: you have to make decisions while you're still learning. You don't get to gather all the data first, analyze it, then decide. Every trial costs you something, and the better you learn, the better you do.
Why A/B Tests Leave Money on the Table
Classic A/B testing has a fundamental problem: during the experiment, you know which version is performing better, but you keep splitting traffic 50/50 anyway.
If Version B is clearly winning after 200 visits, you're still sending 50% of users to Version A — and losing conversions — for the next 1,000 visits while you wait for statistical significance.
Bandits fix this. They automatically route more traffic to better-performing variants, so you make better decisions during the experiment, not just after.
The trade-off: you're making decisions based on less data per arm, so your estimates are noisier. Bandits are great when:
- You want to maximize a metric during the learning period (e.g., an ongoing ad campaign)
- The "regret" of sending people to an inferior variant has real costs (e.g., a medical treatment)
- The environment changes over time, making a fixed A/B test outdated
They're worse than A/B tests when:
- You need rigorous statistical inference (confidence intervals, p-values)
- You care about understanding why one variant is better, not just which one
- You're measuring rare events, where shifting traffic early would mean reacting to noise
The Exploration-Exploitation Trade-Off
At the heart of every bandit algorithm is a tension:
Exploitation: play the arm that looks best based on your current data. Maximize short-term reward.
Exploration: try arms you haven't played much, to learn if they might be better. Give up some reward now to potentially gain more later.
Go too far toward exploitation: you get stuck on a mediocre arm because you never explored enough. Go too far toward exploration: you waste time on arms you've already learned are bad.
The perfect balance is mathematically impossible to achieve in general, but several algorithms come close.
ε-Greedy: The Naive Approach
The simplest algorithm: with probability ε (say, 10%), pick a random arm. Otherwise, pick the best arm you've seen so far.
```python
import random

def epsilon_greedy(arms, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(arms)
    return max(arms, key=lambda a: a.estimated_rate)
```
It works but it's dumb: it treats all arms equally during exploration, even if some have been clearly ruled out.
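Here's a minimal end-to-end run, with the function repeated so the block is self-contained. The `Arm` class, the rates, and the trial count are all hypothetical, invented for illustration:

```python
import random

class Arm:
    """Tracks one machine's empirical success rate."""
    def __init__(self, true_rate):
        self.true_rate = true_rate   # hidden from the algorithm
        self.pulls = 0
        self.wins = 0

    @property
    def estimated_rate(self):
        return self.wins / self.pulls if self.pulls else 0.0

def epsilon_greedy(arms, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(arms)
    return max(arms, key=lambda a: a.estimated_rate)

random.seed(0)
arms = [Arm(0.2), Arm(0.5), Arm(0.8)]
for _ in range(2000):
    arm = epsilon_greedy(arms)
    arm.pulls += 1
    arm.wins += random.random() < arm.true_rate

best = max(arms, key=lambda a: a.pulls)
print([a.pulls for a in arms], f"best estimate: {best.estimated_rate:.2f}")
```

Typically the 0.8 arm accumulates the bulk of the pulls, while the other two keep collecting a steady trickle of exploration traffic they don't deserve.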
Upper Confidence Bound (UCB)
A smarter approach: be optimistic about uncertainty. For each arm, compute an upper confidence bound on its true reward rate, and pick the arm with the highest UCB.
$$\mathrm{UCB}_i = \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}}$$

where $\hat{\mu}_i$ is the estimated reward rate for arm $i$, $t$ is the total number of trials so far, and $n_i$ is how many times arm $i$ has been tried.

The term $\sqrt{2 \ln t / n_i}$ is the "exploration bonus" — it's large for arms you haven't tried much and shrinks as you learn more. UCB naturally explores uncertain arms and exploits known good ones.

UCB1 has a nice theoretical property: its cumulative regret grows at most logarithmically with $T$. In other words, over time, the gap between "what UCB gets" and "what an oracle who knew all rates would get" grows slowly.
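The formula translates almost line-for-line into code. This sketch keeps plain count lists rather than an arm object; the rates and trial count are illustrative:

```python
import math
import random

def ucb1_choose(pulls, wins, t):
    """Pick the arm with the highest upper confidence bound.

    pulls[i] / wins[i] are trial and success counts for arm i;
    t is the total number of trials so far.
    """
    # An arm that has never been tried has an effectively infinite bonus
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    ucb = [wins[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])
           for i in range(len(pulls))]
    return max(range(len(pulls)), key=lambda i: ucb[i])

# Illustrative simulation: three arms with hidden rates
random.seed(1)
true_rates = [0.2, 0.5, 0.8]
pulls, wins = [0, 0, 0], [0, 0, 0]
for t in range(1, 3001):
    i = ucb1_choose(pulls, wins, t)
    pulls[i] += 1
    wins[i] += random.random() < true_rates[i]
```

Note how the bad arms are never abandoned outright — their bonus keeps growing slowly with $\ln t$, so they get an occasional sanity-check pull.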
Thompson Sampling: The Bayesian Bandit
Thompson sampling is the algorithm used in Steady Practice's adaptive experiments. It's simple, elegant, and works remarkably well in practice.
The idea: maintain a probability distribution over each arm's true reward rate. Sample from each distribution, and pick the arm that gives the highest sample.
For binary outcomes (success/failure), use a Beta distribution:
- Start with Beta(1, 1) = uniform prior (completely uncertain)
- After each trial: update alpha (number of successes + 1) and beta (number of failures + 1)
- To choose an arm: draw one sample from each arm's Beta distribution; pick the arm with the highest sample
```python
import random
from scipy.stats import beta

def thompson_step(arms):
    # Sample once from each arm's Beta posterior
    samples = [beta.rvs(a.alpha, a.beta) for a in arms]
    # Choose the arm with the highest sample
    chosen = max(range(len(arms)), key=lambda i: samples[i])
    # Simulate an outcome from the arm's hidden true rate
    reward = random.random() < arms[chosen].true_rate
    # Update the chosen arm's posterior
    if reward:
        arms[chosen].alpha += 1
    else:
        arms[chosen].beta += 1
    return chosen, reward
```
Why does this work? Because the probability of sampling arm $i$ as the best equals the probability that arm $i$ actually is the best, given the data so far. The algorithm naturally self-calibrates.
Try It Yourself
The simulation below runs Thompson sampling across three variants in real time. The true conversion rates are hidden — can you figure them out from the data?
Three variants competing for clicks. The algorithm learns which arm converts best and routes more traffic to it — automatically.
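As a stand-in for the interactive widget, here is a text-only sketch of the same loop. The conversion rates are hypothetical, and `random.betavariate` replaces SciPy for the posterior draws:

```python
import random

random.seed(42)
true_rates = [0.05, 0.10, 0.20]   # hidden, illustrative conversion rates
alpha = [1, 1, 1]                 # Beta(1, 1) uniform priors
beta_ = [1, 1, 1]
pulls = [0, 0, 0]

for _ in range(5000):
    # One posterior draw per arm; play the arm with the highest draw
    samples = [random.betavariate(alpha[i], beta_[i]) for i in range(3)]
    i = max(range(3), key=lambda k: samples[k])
    pulls[i] += 1
    if random.random() < true_rates[i]:
        alpha[i] += 1
    else:
        beta_[i] += 1

for i in range(3):
    mean = alpha[i] / (alpha[i] + beta_[i])
    print(f"arm {i}: {pulls[i]} pulls, posterior mean {mean:.3f}")
```

Run it a few times without the seed: the exact split varies, but the highest-rate arm reliably absorbs most of the traffic.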
The Beta Distribution: A Quick Intuition
The Beta distribution represents your beliefs about a probability $p$.
- Beta(1, 1): uniform — you have no idea what $p$ is
- Beta(6, 6): peaked around 0.5 — you've seen 5 successes and 5 failures
- Beta(100, 100): very sharply peaked at 0.5 — strong evidence that $p$ is near 0.5
- Beta(81, 21): peaked around 0.8 — 80 successes, 20 failures, strong evidence that $p$ is high
The mean of Beta(α, β) is α/(α+β). As you collect more data, the distribution tightens around the true rate.
This is also a conjugate prior for the Bernoulli likelihood — meaning that if you start with Beta(α, β) and observe one success, you update to Beta(α+1, β) exactly. No approximation needed.
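The conjugate update and the tightening can be checked numerically with the closed-form Beta moments; the counts echo the examples above:

```python
import math

def beta_mean_std(a, b):
    """Closed-form mean and standard deviation of Beta(a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

rows = []
for successes, failures in [(0, 0), (5, 5), (80, 20), (800, 200)]:
    a, b = 1 + successes, 1 + failures   # conjugate update: just add counts
    mean, std = beta_mean_std(a, b)
    rows.append((a, b, mean, std))
    print(f"Beta({a}, {b}): mean {mean:.3f}, std {std:.3f}")
```

The standard deviation falls from 0.289 at the uniform prior to about 0.013 after 1,000 trials — the belief tightens exactly as advertised.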
Regret: The Currency of Bandit Algorithms
Regret measures how much you lost by not always playing the best arm:

$$R(T) = T\mu^* - \sum_{t=1}^{T} r_t$$

where $\mu^*$ is the true best arm's reward rate and $r_t$ is your actual reward at time $t$.
Good algorithms have sublinear regret: the regret grows more slowly than T. This means the per-trial regret goes to zero as you learn. You're converging on the optimal strategy.
UCB1 achieves $O(\log T)$ regret. Thompson sampling achieves the same, and often performs better in practice even though the theoretical analysis is harder.
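Here's a tiny worked example of the definition, scoring a play history by the arms' true rates rather than noisy realized rewards (this deterministic variant is usually called pseudo-regret; the history and rates are invented):

```python
true_rates = [0.2, 0.5, 0.8]           # hidden arm rates (illustrative)
mu_star = max(true_rates)              # the oracle's per-trial reward
choices = [0, 2, 2, 1, 2, 2, 2, 2]     # arms played at t = 1..8
regret = sum(mu_star - true_rates[i] for i in choices)
print(round(regret, 2))  # → 0.9  (one pull of arm 0 cost 0.6, one of arm 1 cost 0.3)
```

Sublinear regret means entries like those two mistakes become vanishingly rare as $t$ grows.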
Contextual Bandits
The bandits above are context-free: every visitor gets the same treatment probabilities, regardless of who they are.
Contextual bandits incorporate user features. You might show different content to mobile vs. desktop users, first-time visitors vs. returning users, morning vs. evening browsers.
The algorithm:
- Observe context $x$ (user features)
- For each arm $a$, predict its reward with a model $\hat{r}(x, a)$
- Choose an arm via Thompson sampling or UCB over the model's predictions
- Observe reward and update the model
This is halfway to reinforcement learning — which we'll cover in Chapter 7.
Popular algorithms: LinUCB, LinThompson (for linear models), neural contextual bandits (for deep learning).
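A compact LinUCB sketch under strong assumptions: two hypothetical arms whose expected reward is linear in a two-feature context. All weights, names, and counts are invented for illustration:

```python
import numpy as np

class LinUCBArm:
    """One arm of a LinUCB bandit over d-dimensional contexts."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)        # ridge-regularized Gram matrix
        self.b = np.zeros(d)      # reward-weighted sum of contexts
        self.alpha = alpha        # width of the confidence bonus

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                      # ridge estimate of weights
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Hypothetical setup: arm 0 pays off for feature 0, arm 1 for feature 1
rng = np.random.default_rng(0)
true_weights = [np.array([0.8, 0.1]), np.array([0.1, 0.8])]
arms = [LinUCBArm(d=2) for _ in true_weights]

for _ in range(2000):
    x = rng.random(2)                                # observe a context
    i = max(range(2), key=lambda k: arms[k].ucb(x))  # optimistic choice
    reward = float(rng.random() < true_weights[i] @ x)  # Bernoulli reward
    arms[i].update(x, reward)
```

The same optimism-under-uncertainty idea as UCB1, but the confidence bound now lives in weight space, so different contexts can favor different arms.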
When to Stop: Bayesian Stopping Rules
A bandit doesn't need a fixed end date — it can run until you're confident enough.
In Steady Practice's adaptive experiments, the stopping criterion is:
- Run for a minimum number of trials (to stabilize the estimates)
- At each snapshot, check if the leading condition's posterior mean is more than 2σ above the next-best
- If this is true for 3 consecutive snapshots, stop and declare a winner
This is a form of a probability of being best rule: stop once $P(\mu_{\text{lead}} > \mu_j \text{ for all other } j \mid \text{data})$ is high enough to act on.
Alternatively, some systems use the expected loss of choosing a suboptimal arm — stop when the expected loss is below a threshold (e.g., < 1% regret remaining).
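The separation check can be sketched directly from the Beta posteriors. The 2σ threshold is from the rule above, but whose σ to use is ambiguous, so this version (an assumption) measures the gap in units of the leader's posterior standard deviation:

```python
import math

def posterior_mean_std(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) posterior."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, math.sqrt(var)

def leader_separated(arms, sigmas=2.0):
    """arms: list of (alpha, beta) posteriors, one per condition.

    True if the leading posterior mean beats the runner-up by `sigmas`
    standard deviations of the leader's posterior.
    """
    stats = sorted((posterior_mean_std(a, b) for a, b in arms), reverse=True)
    (lead_mean, lead_std), (next_mean, _next_std) = stats[0], stats[1]
    return lead_mean - next_mean > sigmas * lead_std

# e.g. 120/1000 vs 80/1000 conversions on top of Beta(1, 1) priors
print(leader_separated([(121, 881), (81, 921)]))  # → True
```

In a full system this check would run once per snapshot, and only three consecutive `True` results would end the experiment.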
| Situation | Use |
|---|---|
| You want to maximize outcomes during the test | Bandit |
| You need rigorous p-values for a stakeholder report | A/B test |
| The effect size is tiny and you need maximum precision | A/B test |
| The environment changes over time (prices, seasons) | Bandit |
| You have multiple variants (> 2) | Bandit |
| You want to understand treatment effect heterogeneity | A/B test |
Summary
- Multi-armed bandits solve the exploration-exploitation trade-off — learning and maximizing reward at the same time
- ε-greedy is simple but dumb; UCB is principled, with provably logarithmic regret
- Thompson sampling is the practical favorite: sample from posteriors, pick the best sample
- Beta distributions track beliefs about conversion rates and update simply with new data
- Contextual bandits extend this to personalized decisions — a step toward full reinforcement learning
- Bandits aren't always better than A/B tests; choose based on whether you care about regret or statistical inference
Next: Reinforcement Learning — the full generalization of bandits to sequential decisions over time.