Chapter 6

Multi-Armed Bandits

The exploration-exploitation trade-off and why A/B tests waste half your traffic

Imagine you walk into a casino with 10 slot machines. Each machine pays out at a different rate, but you don't know the rates. You have 1,000 coins to spend. How do you maximize your winnings?

This is the multi-armed bandit problem — "armed" because each machine is a "one-armed bandit," and "multi" because there are many of them.

It's also a nearly perfect model for a dozen real problems:

  • Which version of a webpage gets more clicks?
  • Which drug dosage works best?
  • Which ad creative converts?
  • Which caffeine strategy gives you the best sleep?

The key insight: you have to make decisions while you're still learning. You don't get to gather all the data first, analyze it, then decide. Every trial costs you something, and the better you learn, the better you do.


Why A/B Tests Leave Money on the Table

Classic A/B testing has a fundamental problem: during the experiment you can often see which version is performing better, but you keep splitting traffic 50/50 anyway.

If Version B is clearly winning after 200 visits, you're still sending 50% of users to Version A — and losing conversions — for the next 1,000 visits while you wait for statistical significance.

Bandits fix this. They automatically route more traffic to better-performing variants, so you make better decisions during the experiment, not just after.

The trade-off: you're making decisions based on less data per arm, so your estimates are noisier. Bandits are great when:

  • You want to maximize a metric during the learning period (e.g., an ongoing ad campaign)
  • The "regret" of sending people to an inferior variant has real costs (e.g., a medical treatment)
  • The environment changes over time, making a fixed A/B test outdated

They're worse than A/B tests when:

  • You need rigorous statistical inference (confidence intervals, p-values)
  • You care about understanding why one variant is better, not just which one
  • You need to test rare events where fast allocation would be based on noise

The Exploration-Exploitation Trade-Off

At the heart of every bandit algorithm is a tension:

Exploitation: play the arm that looks best based on your current data. Maximize short-term reward.

Exploration: try arms you haven't played much, to learn if they might be better. Give up some reward now to potentially gain more later.

Go too far toward exploitation: you get stuck on a mediocre arm because you never explored enough. Go too far toward exploration: you waste time on arms you've already learned are bad.

The perfect balance is mathematically impossible to achieve in general, but several algorithms come close.


ε-Greedy: The Naive Approach

The simplest algorithm: with probability ε (say, 10%), pick a random arm. Otherwise, pick the best arm you've seen so far.

import random

def epsilon_greedy(arms, epsilon=0.1):
    # Explore: with probability epsilon, try a uniformly random arm
    if random.random() < epsilon:
        return random.choice(arms)
    # Exploit: otherwise play the arm with the best estimate so far
    return max(arms, key=lambda a: a.estimated_rate)

It works but it's dumb: it treats all arms equally during exploration, even if some have been clearly ruled out.


Upper Confidence Bound (UCB)

A smarter approach: be optimistic about uncertainty. For each arm, compute an upper confidence bound on its true reward rate, and pick the arm with the highest UCB.

\text{UCB}_i(t) = \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}}

where μ̂_i is the estimated reward rate for arm i, t is the total number of trials so far, and n_i is how many times arm i has been tried.

The √(2 ln t / n_i) term is the "exploration bonus" — it's large for arms you haven't tried much and shrinks as you learn more. UCB naturally explores uncertain arms and exploits known good ones.

UCB1 has a nice theoretical property: its cumulative regret grows at most logarithmically with T. In other words, over time, the gap between "what UCB gets" and "what an oracle who knew all rates would get" grows slowly.
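Translated into code, the rule is short. A minimal sketch — the `Arm` class here is illustrative scaffolding for keeping counts and running means, not a library API:

```python
import math
from dataclasses import dataclass

@dataclass
class Arm:
    pulls: int = 0      # n_i: times this arm has been tried
    mean: float = 0.0   # estimated reward rate (running average)

    def update(self, reward):
        self.pulls += 1
        self.mean += (reward - self.mean) / self.pulls

def ucb1_choose(arms, t):
    """Return the index of the arm with the highest UCB at trial t."""
    # Play each arm once first: the bonus is undefined when n_i == 0
    for i, arm in enumerate(arms):
        if arm.pulls == 0:
            return i
    # Otherwise pick the arm maximizing estimate + exploration bonus
    return max(
        range(len(arms)),
        key=lambda i: arms[i].mean + math.sqrt(2 * math.log(t) / arms[i].pulls),
    )
```

Note the deterministic warm-up: unlike ε-greedy, UCB1 never flips a coin — the exploration bonus alone decides when an under-sampled arm is worth another look.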


[Figure: bar chart of three arms — Arm A (10%), Arm B (14%), Arm C (8%). Thompson sampling pulls Arm B more often — it has the highest estimated conversion rate.]

Thompson Sampling: The Bayesian Bandit

Thompson sampling is the algorithm used in Steady Practice's adaptive experiments. It's simple, elegant, and works remarkably well in practice.

The idea: maintain a probability distribution over each arm's true reward rate. Sample from each distribution, and pick the arm that gives the highest sample.

For binary outcomes (success/failure), use a Beta distribution:

  • Start with Beta(1, 1) = uniform prior (completely uncertain)
  • After each trial: update alpha (number of successes + 1) and beta (number of failures + 1)
  • To choose an arm: draw one sample from each arm's Beta distribution; pick the arm with the highest sample

import random
from scipy.stats import beta

def thompson_step(arms):
    # Draw one sample from each arm's Beta posterior
    samples = [beta.rvs(a.alpha, a.beta) for a in arms]
    # Choose the arm with the highest sample
    chosen = max(range(len(arms)), key=lambda i: samples[i])
    # Simulate an outcome from the chosen arm's hidden true rate
    reward = random.random() < arms[chosen].true_rate
    # Conjugate update of the chosen arm's posterior
    if reward:
        arms[chosen].alpha += 1
    else:
        arms[chosen].beta += 1
    return chosen, reward

Why does this work? Because the probability of sampling arm i as the best equals the probability that arm i actually is the best, given the data so far. The algorithm naturally self-calibrates.

Try It Yourself

The simulation below runs Thompson sampling across three variants in real time. The true conversion rates are hidden — can you figure them out from the data?

[Interactive simulation: 🎰 Thompson Sampling — three variants (Control, Version B, Version C) compete for clicks; the algorithm learns which arm converts best and routes more traffic to it automatically. Each card tracks traffic share, conversions, and estimated rate; the true rates are 10%, 14%, and 8%.]

The Beta Distribution: A Quick Intuition

The Beta distribution Beta(α, β) represents your beliefs about a probability p ∈ [0, 1].

  • Beta(1, 1): uniform — you have no idea what p is
  • Beta(5, 5): peaked around 0.5 — the uniform prior plus 4 observed successes and 4 failures
  • Beta(50, 50): very sharply peaked at 0.5 — strong evidence that p is near 0.5
  • Beta(80, 20): peaked around 0.8 — roughly 80 successes to 20 failures, strong evidence that p is high

The mean of Beta(α, β) is α / (α + β). As you collect more data, the distribution tightens around the true rate.

\text{Beta}(\alpha, \beta) \propto p^{\alpha-1}(1-p)^{\beta-1}

This is also a conjugate prior for the Bernoulli likelihood — meaning that if you start with Beta(α, β) and observe one success, you update to Beta(α+1, β) exactly. No approximation needed.
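Because the update is just two counters, the whole posterior workflow fits in a few lines. A quick sketch using SciPy's `scipy.stats.beta`:

```python
from scipy.stats import beta

# Start from the uniform prior Beta(1, 1) and observe ten trials
a, b = 1, 1
for success in [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]:
    if success:
        a += 1   # one more success
    else:
        b += 1   # one more failure

posterior = beta(a, b)           # Beta(9, 3): 8 successes, 2 failures
print(posterior.mean())          # 9 / 12 = 0.75
print(posterior.interval(0.95))  # 95% credible interval for p
```

The credible interval is the Bayesian counterpart of a confidence interval: with 10 observations it is still wide, which is exactly the uncertainty Thompson sampling exploits.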


Regret: The Currency of Bandit Algorithms

Regret measures how much you lost by not always playing the best arm:

R(T) = T \cdot \mu^* - \sum_{t=1}^{T} r_t

where μ* is the true best arm's reward rate and r_t is your actual reward at time t.

Good algorithms have sublinear regret: the regret grows more slowly than T. This means the per-trial regret goes to zero as you learn. You're converging on the optimal strategy.

UCB1 achieves O(log T) regret. Thompson sampling achieves the same, and often performs better in practice even though the theoretical analysis is harder.
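You can watch sublinear regret emerge in simulation. A rough sketch (not a rigorous benchmark) that accumulates the expected regret of each Thompson pull:

```python
import random

def thompson_regret(true_rates, T=5000, seed=1):
    """Cumulative expected regret of Thompson sampling on Bernoulli arms."""
    rng = random.Random(seed)
    k = len(true_rates)
    alpha = [1] * k   # successes + 1, per arm
    beta = [1] * k    # failures + 1, per arm
    best = max(true_rates)
    regret = 0.0
    for _ in range(T):
        # Sample each posterior, play the argmax
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        i = max(range(k), key=lambda j: samples[j])
        reward = rng.random() < true_rates[i]
        alpha[i] += reward
        beta[i] += 1 - reward
        # Expected shortfall of this pull versus the oracle's best arm
        regret += best - true_rates[i]
    return regret

# Uniform-random play on these rates would accrue about 167 in
# expectation over 5,000 pulls; Thompson sampling stays well below that.
print(thompson_regret([0.10, 0.14, 0.08]))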


Contextual Bandits

The bandits above are context-free: every visitor gets the same treatment probabilities, regardless of who they are.

Contextual bandits incorporate user features. You might show different content to mobile vs. desktop users, first-time visitors vs. returning users, morning vs. evening browsers.

The algorithm:

  1. Observe context x (user features)
  2. For each arm a, predict reward r̂(a, x) using a model
  3. Choose arm based on Thompson sampling or UCB over model predictions
  4. Observe reward and update the model

This is halfway to reinforcement learning — which we'll cover in Chapter 7.

Popular algorithms: LinUCB, LinThompson (for linear models), neural contextual bandits (for deep learning).
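To make the four steps concrete, here is a minimal sketch of disjoint LinUCB — one ridge-regression model per arm. The class shape and the `alpha` exploration weight are illustrative choices, not a fixed API:

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: an independent linear model per arm (a sketch)."""

    def __init__(self, n_arms, n_features, alpha=1.0):
        self.alpha = alpha
        # Per-arm sufficient statistics for ridge regression:
        self.A = [np.eye(n_features) for _ in range(n_arms)]    # X^T X + I
        self.b = [np.zeros(n_features) for _ in range(n_arms)]  # X^T y

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # reward estimate weights
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # exploration bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

This is the same optimism-under-uncertainty idea as UCB1, with the per-arm count n_i replaced by how well the contexts seen so far pin down that arm's weight vector.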


When to Stop: Bayesian Stopping Rules

A bandit doesn't need a fixed end date — it can run until you're confident enough.

In Steady Practice's adaptive experiments, the stopping criterion is:

  1. Run for a minimum number of trials (to stabilize the estimates)
  2. At each snapshot, check if the leading condition's posterior mean is more than 2σ above the next-best
  3. If this is true for 3 consecutive snapshots, stop and declare a winner

This is a form of probability of being best: P(μ_i > μ_j for all j ≠ i).

Alternatively, some systems use the expected loss of choosing a suboptimal arm — stop when the expected loss is below a threshold (e.g., < 1% regret remaining).
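The "probability of being best" quantity is easy to estimate by Monte Carlo: draw many joint samples from the arms' Beta posteriors and count how often each arm wins. A sketch — the `prob_best` helper is hypothetical, not Steady Practice's actual code:

```python
import numpy as np

def prob_best(posteriors, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(arm i is best) for Beta posteriors.

    posteriors: list of (alpha, beta) pairs, one per arm.
    """
    rng = np.random.default_rng(seed)
    # One column of joint samples per arm
    draws = np.column_stack(
        [rng.beta(a, b, size=n_samples) for a, b in posteriors]
    )
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=len(posteriors)) / n_samples

# E.g. 14/100 vs 10/100 conversions: B leads, but not yet decisively
print(prob_best([(11, 91), (15, 87)]))
```

A stopping rule then reads naturally: stop when some arm's probability of being best exceeds, say, 0.95.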

💡 Bandits vs. A/B Tests: When to Use Each

  Situation                                               Use
  ------------------------------------------------------  --------
  You want to maximize outcomes during the test           Bandit
  You need rigorous p-values for a stakeholder report     A/B test
  The effect size is tiny and you need maximum precision  A/B test
  The environment changes over time (prices, seasons)     Bandit
  You have multiple variants (> 2)                        Bandit
  You want to understand treatment effect heterogeneity   A/B test

Summary

  • Multi-armed bandits solve the exploration-exploitation trade-off — learning and maximizing reward at the same time
  • ε-greedy is simple but dumb; UCB is principled, with provably logarithmic regret
  • Thompson sampling is the practical favorite: sample from posteriors, pick the best sample
  • Beta distributions track beliefs about conversion rates and update simply with new data
  • Contextual bandits extend this to personalized decisions — a step toward full reinforcement learning
  • Bandits aren't always better than A/B tests; choose based on whether you care about regret or statistical inference

Next: Reinforcement Learning — the full generalization of bandits to sequential decisions over time.