Chapter 5

Bayesian Thinking

Updating beliefs with evidence, from Bayes' theorem to posterior distributions

Before we had data, we had beliefs. After we got data, we had better beliefs. Bayesian statistics is the mathematics of that update.

This isn't just a philosophical stance — it's a practical framework that produces more interpretable results, handles small samples gracefully, and lets you stop an experiment as soon as you have enough evidence (without the p-hacking problem that plagues classical statistics).

Reinforce OS uses Bayesian analysis throughout. This chapter explains why.


The Core Idea: Beliefs as Probabilities

In classical (frequentist) statistics, parameters are fixed unknown constants. You either reject or fail to reject hypotheses; there's no probability attached to a hypothesis being true.

Bayesian statistics takes a different view: parameters are uncertain, and uncertainty is measured with probability. Before seeing data, you have a prior distribution over the parameter. After seeing data, you update to a posterior distribution.

The update rule is Bayes' theorem:

P(θ | data) = P(data | θ) · P(θ) / P(data)

In words:

posterior ∝ likelihood × prior

The posterior P(θ | data) is everything you know about the parameter after seeing the data. It's a full probability distribution, not a point estimate.
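Bayes' theorem can be computed numerically even when no conjugacy shortcut exists. A minimal sketch using grid approximation — the data (7 successes in 10 trials) matches the example in the next section, and the grid size is an arbitrary choice:

```python
# Grid approximation of Bayes' theorem for one parameter p in [0, 1].
# posterior[i] ∝ likelihood(p_i) × prior(p_i), then normalize.

def posterior_grid(successes, failures, n_grid=1001):
    """Return (grid, posterior) as discrete approximations on [0, 1]."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    prior = [1.0] * n_grid                                # flat prior
    like = [p**successes * (1 - p)**failures for p in grid]
    unnorm = [l * pr for l, pr in zip(like, prior)]
    z = sum(unnorm)                                       # plays the role of P(data)
    return grid, [u / z for u in unnorm]

grid, post = posterior_grid(7, 3)
mean = sum(p * w for p, w in zip(grid, post))
print(round(mean, 3))   # close to the exact Beta(8, 4) mean, 8/12 ≈ 0.667
```

The normalizer z is why P(data) can usually be ignored in practice: it's just whatever constant makes the posterior sum (or integrate) to 1.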


A Concrete Example: Coin Flipping

Suppose you're testing whether a new morning routine improves your mood. You rate each day as "good" or "not good." You've had 7 good days out of 10 trials.

Let p = the true probability of a good day. You want to know: is p > 0.5?

Prior: before your experiment, you have no strong view about p. You encode this as p ~ Beta(1, 1) — a uniform distribution over [0, 1].

Likelihood: with 7 successes out of 10 Bernoulli trials, the likelihood is proportional to p⁷(1 − p)³.

Posterior: the Beta distribution has a convenient property called conjugacy — a Beta prior combined with Bernoulli data gives a Beta posterior:

p | data ~ Beta(1 + 7, 1 + 3) = Beta(8, 4)

That's it. Your posterior belief about p is Beta(8, 4): mean ≈ 0.67, with uncertainty quantified.
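With the posterior in closed form, summaries can be read off by sampling from it. A sketch using only the Python standard library (the sample count is an arbitrary choice):

```python
# Summarize the Beta(8, 4) posterior by Monte Carlo sampling.
import random

random.seed(0)
samples = sorted(random.betavariate(8, 4) for _ in range(100_000))

mean = sum(samples) / len(samples)
lo = samples[int(0.025 * len(samples))]   # 2.5th percentile
hi = samples[int(0.975 * len(samples))]   # 97.5th percentile
p_above_half = sum(s > 0.5 for s in samples) / len(samples)

print(f"mean ≈ {mean:.2f}, 95% interval ≈ [{lo:.2f}, {hi:.2f}]")
print(f"P(p > 0.5 | data) ≈ {p_above_half:.2f}")
```

Note the last line: the question "is p > 0.5?" gets a direct probabilistic answer, something a p-value never provides.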

🔍 The Beta Distribution

The Beta distribution Beta(α, β) is defined on [0, 1] — perfect for probabilities and rates.

  • α = prior successes + 1
  • β = prior failures + 1
  • Mean = α / (α + β)
  • As α and β grow, the distribution narrows (more certainty)

Starting with Beta(1, 1) is a "flat" prior — you have no strong belief. Beta(3, 3) says "I expect roughly 50%, but with some uncertainty."

[Mini-Sim: Prior Meets Data — an interactive demo of the posterior estimate. A stronger prior moves less when the same new data walks in wearing muddy boots.]

Priors and Posteriors

The prior encodes what you believe before seeing data. People sometimes worry that priors are "subjective" and therefore unscientific. A few responses to this:

  1. Everyone has priors — frequentist analysis implicitly uses flat priors (all parameter values equally likely), which is also a choice.
  2. Priors can be weakly informative — you don't need a strong prior. A prior that rules out physically impossible values (negative probabilities, effects larger than the universe) while being otherwise uncertain is perfectly defensible.
  3. Data dominates with enough samples — with 1,000 trials, a reasonable prior barely matters; the likelihood swamps it. Priors matter most when data is scarce, which is exactly when you need guardrails.
  4. Science accumulates — the posterior from one study becomes the prior for the next. This is how knowledge should build.

Prior sensitivity

A good practice: run your analysis with a few different priors (a flat prior, a weakly informative prior, a skeptical prior) and check whether conclusions change substantially. If they don't, your results are robust. If they do, you need more data before drawing strong conclusions.
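This check is easy to automate. A sketch comparing three hypothetical priors on the 7-of-10 data — the prior parameters here are illustrative choices, not Reinforce OS defaults:

```python
# Prior sensitivity check: same data (7 successes, 3 failures),
# three different Beta priors, same question: P(p > 0.5)?
import random

random.seed(1)
priors = {
    "flat":       (1, 1),
    "weak 50/50": (3, 3),
    "skeptical":  (10, 10),   # strongly expects p near 0.5
}

results = {}
for name, (a, b) in priors.items():
    post_a, post_b = a + 7, b + 3          # conjugate Beta update
    draws = [random.betavariate(post_a, post_b) for _ in range(50_000)]
    results[name] = sum(d > 0.5 for d in draws) / len(draws)
    print(f"{name:11s} -> Beta({post_a}, {post_b}), P(p > 0.5) ≈ {results[name]:.2f}")
```

All three priors point the same direction here, but the skeptical prior is noticeably less convinced — exactly the kind of gap that tells you whether 10 trials is enough.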


Credible Intervals vs. Confidence Intervals

A 95% credible interval contains the true parameter with probability 95%, given your prior and data. You can say: "there is a 95% probability that the true effect is between 2 and 8 minutes."

A 95% confidence interval is different: if you repeated the experiment many times, 95% of the resulting intervals would contain the true parameter. For any single interval, the true parameter either is or isn't in it — there's no probability statement about the specific interval you computed.

The credible interval says what you want it to say. The confidence interval requires a mental contortion that most people skip, leading them to misinterpret confidence intervals as if they were credible intervals.

Reinforce OS reports credible intervals. When it says "HDI: [+2, +8] minutes," that means: given what you've observed, there's a 95% probability the true effect is in that range.
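Given posterior samples, an HDI (highest-density interval) can be computed as the narrowest interval containing 95% of the draws — one common sample-based approximation, sketched here for the Beta(8, 4) posterior from the coin example:

```python
# 95% HDI from posterior samples: the narrowest window covering 95% of draws.
import random

def hdi(samples, mass=0.95):
    """Return the narrowest interval containing `mass` of the samples."""
    s = sorted(samples)
    k = int(mass * len(s))                      # draws inside the interval
    widths = [s[i + k] - s[i] for i in range(len(s) - k)]
    i = widths.index(min(widths))               # start of narrowest window
    return s[i], s[i + k]

random.seed(2)
draws = [random.betavariate(8, 4) for _ in range(20_000)]
lo, hi = hdi(draws)
print(f"95% HDI ≈ [{lo:.2f}, {hi:.2f}]")
```

For skewed posteriors like this one, the HDI sits slightly differently than the equal-tailed percentile interval; for symmetric posteriors the two coincide.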


The Bayesian Approach to Stopping Rules

One of the most practically important advantages of Bayesian analysis: you can look at your data whenever you want.

In classical hypothesis testing, "optional stopping" — peeking at results and stopping when p < 0.05 — inflates your false positive rate dramatically. The p-value is only valid if the sample size was fixed in advance.

In Bayesian analysis, you can update the posterior after every new trial. Stopping when the posterior probability that p > 0.5 exceeds 95% is completely valid. The posterior probability at any moment is an honest statement about your current state of knowledge.

This is why Reinforce OS can show you live updating results without compromising statistical validity. The analysis is always a correct summary of what the data says, not a test that depends on when you decided to look.

⚠️ Bayesian ≠ license for p-hacking

Bayesian optional stopping is valid because the posterior is always a correct summary of evidence. But "stop when you like what you see" is still a bias — you'd be sampling from the distribution of experiments where you got lucky. Good practice: decide your stopping criterion in advance (e.g., "I'll stop when the 95% HDI excludes zero") and stick to it.
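A pre-registered stopping rule like this can be simulated. The sketch below uses a hypothetical true effect and trial cap, updates the Beta posterior after every trial, and stops when P(p > 0.5) crosses 0.95 (or 0.05):

```python
# Sequential Bayesian stopping rule, decided in advance:
# stop when P(p > 0.5) >= 0.95 or <= 0.05, else continue to the trial cap.
import math
import random

def p_greater_half(a, b):
    """Exact P(p > 0.5) under Beta(a, b) with integer a, b, via the
    binomial identity: CDF at 0.5 equals P(Bin(a + b - 1, 0.5) >= a)."""
    n = a + b - 1
    tail = sum(math.comb(n, k) for k in range(a, n + 1)) / 2**n
    return 1 - tail

random.seed(3)
true_p = 0.7                 # hypothetical true rate of good days
a, b = 1, 1                  # flat Beta(1, 1) prior
stopped_at = None
for trial in range(1, 101):  # cap at 100 trials
    a, b = (a + 1, b) if random.random() < true_p else (a, b + 1)
    prob = p_greater_half(a, b)
    if prob >= 0.95 or prob <= 0.05:
        stopped_at = trial
        break
print(f"stopped after {stopped_at} trials, P(p > 0.5) = {prob:.3f}")
```

The criterion (0.95/0.05) and the cap were fixed before "running" the experiment, which is the discipline the warning above asks for.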


Bayesian vs. Frequentist: A Practical Comparison

| Question | Frequentist answer | Bayesian answer |
|---|---|---|
| Is the effect real? | p = 0.03 (reject H₀) | P(effect > 0) = 97% |
| How big is the effect? | Point estimate ± SE | Posterior distribution |
| Can I stop early? | No (inflates Type I error) | Yes, with proper criterion |
| What does the interval mean? | 95% of intervals contain truth | 95% probability truth is here |
| Prior knowledge? | Ignored | Explicit prior |
| Small samples? | Unreliable | Regularized by prior |

Neither framework is always better. Frequentist methods are well-understood, easy to communicate, and required by many regulatory contexts. Bayesian methods are more natural for sequential decision-making, personalization, and communicating uncertainty to non-statisticians.

For personal experiments — where you're making decisions about your own behavior with limited data — Bayesian methods are almost always the right choice.


[Figure: hierarchical models — sparse individual data borrows strength from everyone else. A user with 3 trials is pulled toward the population prior (μ = +5 min focus, borrowed from 5 others); a user with 50 trials is dominated by their own data.]

Hierarchical Models: Borrowing Strength

Here's a powerful extension: suppose you're running the same caffeine experiment across 50 users of Steady Practice. Each user has their own true effect τᵢ. You could estimate each person separately, but with only 10 trials per person, the estimates are noisy.

A hierarchical model says: each person's effect τᵢ is drawn from a population distribution, τᵢ ~ N(μ, σ²). You estimate μ and σ jointly with all the individual effects.

This lets individual estimates borrow strength from the population. Someone with only 3 trials gets pulled toward the population mean — appropriately skeptical. Someone with 50 trials has their own estimate dominate.
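The shrinkage behavior can be illustrated with a simplified normal-normal model (precision weighting), rather than a full hierarchical fit — all of the numbers below are made up for illustration:

```python
# Simplified sketch of hierarchical shrinkage: a user's estimate is a
# precision-weighted blend of their own data and the population mean.
pop_mean, pop_sd = 5.0, 4.0   # hypothetical population prior (minutes of focus)
noise_sd = 10.0               # hypothetical per-trial measurement noise

def shrunk_estimate(user_mean, n_trials):
    """Posterior mean for one user under a normal-normal model."""
    prior_prec = 1 / pop_sd**2
    data_prec = n_trials / noise_sd**2
    w = data_prec / (data_prec + prior_prec)   # weight on the user's own data
    return w * user_mean + (1 - w) * pop_mean

print(shrunk_estimate(12.0, 3))    # 3 trials: pulled strongly toward 5
print(shrunk_estimate(12.0, 50))   # 50 trials: stays near 12
```

The weight w grows with trial count, which is exactly the "borrow strength early, trust your own data later" behavior described above. A real hierarchical fit also estimates pop_mean and pop_sd from everyone's data instead of fixing them.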

Hierarchical models are the statistical engine behind Steady Practice's pooled analysis feature, which uses data from multiple users running similar experiments to improve everyone's estimates — especially early on, when individual data is sparse.


MCMC: When the Math Gets Hard

The Beta-Binomial conjugate model has a clean closed-form posterior. Most real models don't.

For complex models, we sample from the posterior using Markov Chain Monte Carlo (MCMC). The idea: construct a random walk through parameter space that spends more time in high-probability regions. After many steps, the samples approximate the posterior.
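The random-walk idea can be shown with a toy Metropolis sampler targeting the coin example's Beta(8, 4) posterior. This is a teaching sketch — production tools use far more efficient samplers like NUTS:

```python
# Toy random-walk Metropolis sampler for the posterior of the coin example
# (7 successes, 3 failures, flat prior). Works in log space for stability.
import math
import random

def log_post(p):
    """Unnormalized log posterior: log-likelihood plus flat log-prior."""
    if not 0 < p < 1:
        return -math.inf
    return 7 * math.log(p) + 3 * math.log(1 - p)

random.seed(4)
p, samples = 0.5, []
for step in range(60_000):
    prop = p + random.gauss(0, 0.1)                 # random-walk proposal
    delta = log_post(prop) - log_post(p)
    if random.random() < math.exp(min(0.0, delta)):  # Metropolis accept rule
        p = prop
    samples.append(p)

burned = samples[10_000:]                            # discard burn-in
print(round(sum(burned) / len(burned), 2))           # near the Beta(8, 4) mean
```

Uphill moves (higher posterior) are always accepted and downhill moves sometimes, so the chain spends time in each region in proportion to its posterior probability.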

Modern MCMC (especially NUTS — No-U-Turn Sampler, used in Stan and PyMC) is remarkably efficient. Models that would have taken hours in 2000 now run in seconds.

You don't need to understand MCMC in detail to use Bayesian analysis — just know that when you see "posterior samples" or "credible intervals" in Reinforce OS, they came from a sampler running efficiently in the background.


Summary

  • Bayesian statistics treats parameters as uncertain and updates beliefs using Bayes' theorem
  • Priors encode what you knew before; posteriors encode what you know after
  • Credible intervals have the natural interpretation that confidence intervals lack
  • Bayesian optional stopping is valid — look at your data whenever you want
  • Hierarchical models let you borrow strength across users or experiments
  • Reinforce OS uses Bayesian analysis throughout: live updating, credible intervals, and pooled priors

Next: Multi-Armed Bandits — what if you want to learn and earn at the same time?