1 Introduction

It was the beginning of the twentieth century, and Andrey Andreyevich Markov was not happy. The field of mathematical probability was flourishing, with seminal contributions from de Moivre and Fermat through to Gauss, Laplace, Cauchy, Poisson, Chebyshev, and many others. And yet, most of the key results – laws of large numbers, the central limit theorem, and so on – were always about independent random variables, i.e., random quantities which have no effect on each other. This assumption seemed very limiting and unrealistic. However, without this assumption, the random variables could behave however they wanted, and it seemed impossible to make mathematical progress. How could the independence assumption be generalized, while still allowing for valid theorems to be proven?

Markov eventually settled on a model in which each random variable depended on the others, but only through the adjacent ones. That is, a sequence of jointly defined random variables X0, X1, X2,… form a Markov chain if the conditional distribution of Xn, conditional on all the other variables, is completely determined by Xn–1 and Xn+1. This setup was more general than independence, but still tractable enough to analyze mathematically.

Markov published his results in 1906 (Markov 1906), adding a new model to the world of probability theory. He proved a version of the law of large numbers, and later of the central limit theorem, for his new concept of Markov chains. In the decades following, many mathematical papers about Markov chains were published. Moreover, Markov chains had some success as models of card shuffling, diffusions, branching processes, and later stock prices. Even so, genuine applications remained somewhat limited. In fact, Markov’s original example had been the pattern of consonants and vowels in written language, but these surely do not really follow a Markov chain. For example, a vowel is much less likely to occur if the last three letters were all vowels than if just the last letter were a vowel. It was challenging to find real-world examples of random sequences in which Xn was indeed influenced by Xn–1, but conditionally independent of Xn–2 given Xn–1.

Then, in the latter half of the twentieth century, a whole new application of Markov chains emerged, in the burgeoning field of computer simulation. In 1953, Metropolis et al. (1953) used a simple iterative computer simulation to study the behavior of interacting molecular systems, for up to 224 particles. This was an early application of a Monte Carlo algorithm, which uses random simulation to estimate complicated quantities. In doing so, they created the Metropolis algorithm, the world’s first Markov chain Monte Carlo (MCMC) algorithm. Their algorithm was later generalized to the Metropolis-Hastings algorithm (Hastings 1970), the Gibbs sampler (Gelfand and Smith 1990; Geman and Geman 1984; Tanner and Wong 1987), and many other variations; see, e.g., (Tierney 1994). The term “Markov chain Monte Carlo (MCMC),” emphasizing that these new algorithms applied Markov’s model to the Monte Carlo paradigm, originated in (Geyer 1992). MCMC algorithms are now hugely popular in many branches of statistics, especially in Bayesian inference, as well as in many other fields, from molecular dynamics to engineering to machine learning to financial modeling. The volume (Brooks et al. 2011) contains thousands of references, and a casual Google search for “Markov chain Monte Carlo” (in quotation marks) returns over two million web pages, indicating just how far Markov’s little idea has come.

In this paper, we review Markov chains and MCMC algorithms and explain how they have been so useful in modern statistical analysis.

2 Random Variables

A random variable is a mathematical model of an uncertain quantity. It is defined on some underlying probability space \( \left(\varOmega, \mathbf{\mathcal{F}},\mathbf{P}\right) \), where Ω is an underlying sample space, \( \mathbf{\mathcal{F}} \) is a σ-algebra of subsets of Ω, and P is a probability measure on \( \left(\varOmega, \mathbf{\mathcal{F}}\right) \). A random variable is then a measurable function from \( \left(\varOmega, \mathbf{\mathcal{F}}\right) \) to a measurable range space \( \left(\mathcal{X},\mathbf{\mathcal{G}}\right) \), most commonly to \( \left(\mathbf{R},\mathbf{\mathcal{B}}\right) \) (the real numbers with the Borel subsets). The probability that a random variable X takes a value within a subset \( A\in \mathbf{\mathcal{G}} \) is then given by

$$ \mathbf{P}\left[X\in A\right]:= \mathbf{P}\left\{\omega \in \varOmega :X\left(\omega \right)\in A\right\}. $$
(1)

(If Ω is finite or countable, then we can let \( \mathbf{\mathcal{F}} \) be the collection of all subsets of Ω, and simply take \( \mathbf{P}\left[X\in A\right]=\sum \left\{\mathbf{P}\left(\omega \right):X\left(\omega \right)\in A\right\} \), i.e., define probabilities by a simple sum over individual points.)

Individual random variables can be studied in terms of their probability distributions \( \left\{\mathbf{P}\left[X\in A\right]:A\in \mathbf{\mathcal{G}}\right\} \), including such familiar notions as binomial, normal, exponential, Poisson, and other univariate distributions, perhaps extended to distributions of functions h of random variables as \( \left\{\mathbf{P}\left[h(X)\in A\right]:A\in \mathbf{\mathcal{G}}\right\} \).

The situation gets more intricate when there are multiple random variables defined jointly on the same underlying space \( \left(\varOmega, \mathbf{\mathcal{F}},\mathbf{P}\right) \). If X1 and X2 are two such random variables, then (in the countable case, say) we can define their joint probability distribution by

$$ \mathbf{P}\left[{X}_1\in A,{X}_2\in B\right]=\sum \left\{\mathbf{P}\left(\omega \right):{X}_1\left(\omega \right)\in A\ \mathrm{and}\ {X}_2\left(\omega \right)\in B\right\}, $$
(2)

for any \( A,B\in \mathbf{\mathcal{G}} \). This leads to questions about the relationship between the random variables.

The simplest case is independent random variables, meaning random quantities which do not affect each other, or formally for which

$$ \mathbf{P}\left[{X}_1\in A,{X}_2\in B\right]=\mathbf{P}\left[{X}_1\in A\right]\ \mathbf{P}\left[{X}_2\in B\right] $$
(3)

for all \( A,B\in \mathbf{\mathcal{G}} \). More generally, a collection {Xi}i ∈ I of random variables are all independent if for all finite subsets I0 ⊆ I, and all choices \( {\left\{{A}_i\right\}}_{i\in {I}_0} \) of elements of \( \mathbf{\mathcal{G}}, \) we have

$$ \mathbf{P}\left[{X}_i\in {A}_i\ \mathrm{for}\ \mathrm{all}\ i\in {I}_0\right]=\prod \limits_{i\in {I}_0}\mathbf{P}\left[{X}_i\in {A}_i\right]. $$
(4)

Relatedly, a collection {Xi}i ∈ I is identically distributed if for each fixed \( A\in \mathbf{\mathcal{G}}, \) the value P[Xi ∈ A] is the same for all i ∈ I.

If a sequence X1, X2,… of random variables is both independent and identically distributed (i.i.d.), then many results are known. For example, the law of large numbers says that the sample averages \( \frac{1}{n}{\sum}_{i=1}^nh\left({X}_i\right) \) converge to their common mean μ ≔ E[h(Xi)], and the central limit theorem says \( \frac{1}{\sqrt{n}}{\sum}_{i=1}^n\left[h\left({X}_i\right)-\mu \right] \) converges in distribution to a Normal (0, σ2) distribution if σ2 ≔ E[(h(Xi) – μ)2] < ∞. These two fundamental theorems, and many others, give a fairly rich and detailed picture of the behavior of independent random variables.

3 Markov Chains

As mentioned, Markov provided a generalization of the independence assumption. In particular, a sequence X0, X1, X2,… forms a time-homogeneous Markov chain if there is a fixed transition kernel \( {\left\{P\left(x,A\right)\right\}}_{x\in \mathcal{X},A\in \mathbf{\mathcal{G}}} \) with the property that

$$ \mathbf{P}\left[{X}_n\in A|{X}_0,{X}_1,\dots, {X}_{n-1}\right]=P\left({X}_{n-1},A\right),\kern1em A\in \mathbf{\mathcal{G}}. $$
(5)

That is, given the history up to time n – 1, the probabilities for the value at time n depend only on the most recent value Xn–1, in an explicit way given by the transition kernel P.

For a concrete example, suppose \( \mathcal{X}=\left\{1,2,3,\dots \right\} \), with \( \mathbf{\mathcal{G}} \) the collection of all subsets of \( \mathcal{X}, \) and with transition kernel P defined by P(x, {x + 1}) = 1/4 for all \( x\in \mathcal{X} \), P(x, {x – 1}) = 1/2 for all x ≥ 2, P(x, {x}) = 1/4 for all x ≥ 2, and P(1, {1}) = 3/4, together with additivity. This Markov chain has, at each time, probability 1/4 of increasing by 1, probability 1/4 of staying the same, and probability 1/2 of decreasing by 1 (unless it is already at the state 1, in which case the probability 1/2 of decreasing is instead added to the probability of staying at 1); see Fig. 1.

Fig. 1 The transition probabilities of the Running Example Markov chain
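
As a concrete illustration (not part of the original presentation), the following minimal Python sketch simulates this Running Example chain directly from its transition probabilities; the helper names step and run_chain are purely illustrative.

```python
import random

def step(x):
    """One transition of the Running Example chain from state x (x = 1, 2, 3, ...)."""
    u = random.random()
    if u < 0.25:
        return x + 1                       # move up with probability 1/4
    elif u < 0.5:
        return x                           # stay put with probability 1/4
    else:
        return x - 1 if x >= 2 else 1      # move down with probability 1/2 (state 1 stays at 1)

def run_chain(x0, n_steps):
    """Return the trajectory X_0, X_1, ..., X_{n_steps} started at x0."""
    path = [x0]
    for _ in range(n_steps):
        path.append(step(path[-1]))
    return path

print(run_chain(x0=10, n_steps=100))       # a realization like the one shown in Fig. 2
```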

Over time, a Markov chain might settle into a “stable pattern,” or more precisely a stationary distribution. A probability distribution π on \( \left(\mathcal{X},\mathbf{\mathcal{G}}\right) \) is called stationary for the Markov chain {Xn} if it has the property that whenever Xn has the distribution π, then Xn+1 will also have the distribution π. That is,

$$ {\int}_{x\in \mathcal{X}}\pi (dx)P\left(x,A\right)=\pi (A). $$
(6)

On a discrete space, this can be written as

$$ \sum \limits_{x\in \mathcal{X}}\pi \left\{x\right\}P\left(x,A\right)=\pi (A). $$
(7)

Intuitively, the probability distribution π is invariant for the chain, i.e., its probabilities do not change while the chain runs.

The importance of stationary distributions comes from the Markov Chain Convergence Theorem, which says that, under the mild assumptions of irreducibility and aperiodicity (nearly always satisfied and not considered further here), a chain will eventually converge to its stationary distribution. That is, if a Markov chain has a stationary distribution π, and is irreducible and aperiodic, then for all \( A\in \mathbf{\mathcal{G}} \),

$$ \underset{n\to \infty }{\lim}\mathbf{P}\left[{X}_n\in A\right]=\pi (A). $$
(8)

Informally, the chain will settle down into its stationary distribution in the long run.
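
To illustrate this numerically (a sketch added here, not part of the original text), one can push an initial point mass through the Running Example’s transition kernel and watch the probabilities P[Xn = x] stabilize; the dictionary-based update below is exact, since starting from a single state the support after n steps remains finite.

```python
def apply_kernel(dist):
    """One application of the Running Example kernel to a distribution {state: probability}."""
    new = {}
    for x, p in dist.items():
        new[x + 1] = new.get(x + 1, 0.0) + 0.25 * p        # up with probability 1/4
        if x >= 2:
            new[x] = new.get(x, 0.0) + 0.25 * p            # stay with probability 1/4
            new[x - 1] = new.get(x - 1, 0.0) + 0.5 * p     # down with probability 1/2
        else:
            new[1] = new.get(1, 0.0) + 0.75 * p            # state 1 stays with probability 3/4
    return new

dist = {10: 1.0}                                           # start with X_0 = 10
for n in range(1, 201):
    dist = apply_kernel(dist)
    if n % 50 == 0:
        print(n, round(dist.get(1, 0.0), 4), round(dist.get(2, 0.0), 4))
# P[X_n = 1] and P[X_n = 2] settle down to fixed limiting values as n grows, as in (8).
```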

To see this more concretely, consider the above Running Example. Figure 2 shows a typical realization of that Markov chain. Although it starts at larger values, from roughly time n = 40 onwards it appears to be in a stable pattern, visiting the same states with approximately the same probabilities.

Fig. 2 A typical realization of the Running Example Markov chain

This “settling down” Markov chain convergence property is enough to prove a law of large numbers and central limit theorem for Markov chains, similar to the i.i.d. case but under stronger assumptions; for a modern treatment, see, e.g., the monograph (Meyn and Tweedie 1993). Thus, the time-homogeneous Markov assumption helps to re-capture some of the power and utility of i.i.d. sequences, but in a more general context.

4 Markov Chain Monte Carlo (MCMC)

When they were first introduced, Markov chains served as models of certain natural scientific (and even social) phenomena. However, in the latter half of the twentieth century, it was slowly realized that Markov chains could be used in an entirely different way, as computational algorithms.

The basic idea of a Monte Carlo algorithm is simple. Suppose π is a “target” probability distribution on \( \left(\mathcal{X},\mathbf{\mathcal{G}}\right) \), and \( h:\mathcal{X}\to \mathbf{R} \) is some measurable function, and you want to estimate the expected value

$$ \mu := {\mathbf{E}}_{\pi }(h):= {\int}_{x\in \mathcal{X}}h(x)\pi (dx) $$
(9)

(so, on a discrete space, \( \mu ={\mathbf{\sum}}_{x\in \mathcal{X}}h(x)\pi (x) \)). In principle, μ could be calculated directly by an integral or sum, but in a complicated application on a high-dimensional state space that might be quite infeasible. However, if you could simulate random variables X1, X2,…, XM which are i.i.d. with distribution π, then you could estimate μ by

$$ E:= \frac{1}{M}\sum \limits_{i=1}^Mh\left({X}_i\right). $$
(10)

Indeed, the law of large numbers says that as M → ∞, the estimator E will converge to μ.
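
For instance, here is a minimal sketch of the plain Monte Carlo estimator (10); the particular choices π = Exponential(1) and h(x) = x², for which μ = E[X²] = 2 exactly, are illustrative assumptions made here only so the output can be checked.

```python
import random

def monte_carlo_estimate(h, sample_from_pi, M):
    """Estimate E_pi[h(X)] by averaging h over M i.i.d. draws from pi, as in (10)."""
    return sum(h(sample_from_pi()) for _ in range(M)) / M

est = monte_carlo_estimate(h=lambda x: x ** 2,
                           sample_from_pi=lambda: random.expovariate(1.0),  # pi = Exponential(1)
                           M=100_000)
print(est)   # should be close to the true value mu = E[X^2] = 2
```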

The problem is that, if \( \mathcal{X} \) is high-dimensional and π is complicated, there is no feasible computer program to simulate independent random variables X1, X2,… which each have the distribution π. However, it turns out to be surprisingly easy to simulate a Markov chain X0, X1, X2,… which has π as a stationary distribution and will thus converge to π in distribution. This gives rise to Markov chain Monte Carlo (MCMC) algorithms which simulate a Markov chain X0, X1, X2,… which converges to π and then estimate μ by

$$ E:= \frac{1}{M-B}\sum \limits_{i=B+1}^Mh\left({X}_i\right). $$
(11)

If the burn-in period B is large enough that the distribution of XB is approximately π, and M – B is large enough that the Markov chain law of large numbers approximately holds, then this estimator E will again provide a good approximation to μ.
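
In code, the estimator (11) amounts to discarding the first B values and averaging over the rest; the small helper below is an illustrative sketch (its chain argument would come from an MCMC sampler such as the Metropolis algorithm of the next section).

```python
def mcmc_estimate(chain, h, B):
    """Ergodic average of h over the chain values X_{B+1}, ..., X_M, as in (11)."""
    kept = chain[B + 1:]
    return sum(h(x) for x in kept) / len(kept)
```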

5 The Metropolis Algorithm

While there are many ways to construct Markov chains which leave π stationary, the oldest and simplest one is the Metropolis algorithm (Metropolis et al. 1953). Assume the state space \( \left(\mathcal{X},\mathbf{\mathcal{G}}\right) \) has a reference measure ρ (e.g., counting measure on a discrete state space, or Lebesgue measure on R), with respect to which π has a density function f so that π(dx) = f (x)ρ(dx). The algorithm uses proposal distributions Q(x, dy) = q(x, y)ρ(dy) for \( x,y\in \mathcal{X}, \) which are symmetric (i.e., q(x, y) = q(y, x)). (For example, on a discrete state space, Q(x, ·) might give probability 1/2 to each of x + 1 and x – 1. Or, on the real line, Q(x, ·) might correspond to a Normal(x, v) distribution for some fixed variance parameter v > 0. Importantly, Q(x, ·) should be easy to sample from on a computer.)

The algorithm then proceeds as follows. First choose some initial state \( {X}_0\in \mathcal{X} \). Then, iteratively for n = 1, 2,…, given \( {X}_{n\hbox{--} 1}\in \mathcal{X} \), generate Xn as follows. First, generate a proposal state Yn sampled from the probability distribution Q(Xn–1, ·) and independently generate a random variable Un sampled from the uniform (Lebesgue) measure on the interval [0,1]. Then define Xn by

$$ {X}_n=\left\{\begin{array}{ll}{Y}_n& {U}_n<f\left({Y}_n\right)/f\left({X}_{n-1}\right)\\ {}{X}_{n-1}& \mathrm{otherwise}\end{array}\right. $$
(12)

That is, if Un < f(Yn)/f(Xn–1), then the proposed new state Yn is “accepted” so we set Xn = Yn; otherwise Yn is “rejected” so we keep Xn = Xn–1. Incredibly, this simple algorithm, which only barely mentions π via its density f in the accept/reject rule, actually makes π a stationary distribution to which it will converge:
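
As a minimal sketch (assuming, for illustration, a real-valued state space and a symmetric Normal random-walk proposal), the whole algorithm fits in a few lines of Python; note that only the ratios f(y)/f(x) are used, so f need only be known up to a normalizing constant.

```python
import math, random

def metropolis(f, x0, M, proposal_sd=1.0):
    """Run M Metropolis steps targeting the (possibly unnormalized) density f, started at x0."""
    chain = [x0]
    x = x0
    for _ in range(M):
        y = random.gauss(x, proposal_sd)                 # propose Y_n ~ Normal(X_{n-1}, v)
        if random.random() < min(1.0, f(y) / f(x)):      # accept if U_n < f(Y_n) / f(X_{n-1})
            x = y                                        # accepted: X_n = Y_n
        chain.append(x)                                  # (if rejected, X_n = X_{n-1})
    return chain

# Illustrative target: an unnormalized standard normal density, started far from its mode.
chain = metropolis(f=lambda x: math.exp(-x * x / 2), x0=5.0, M=10_000)
```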

Theorem 1

The above Metropolis algorithm defines a Markov chain {Xn} which has π as a stationary distribution. Furthermore, if the chain is irreducible and aperiodic, then it will converge in distribution to π.

Proof. For y ≠ x, the only way for the chain to move from x to y (or to a neighborhood dy of y) is to first propose such a move and then accept it. Thus, for y ≠ x,

$$ {\displaystyle \begin{array}{c}P\left(x, dy\right):= \mathbf{P}\left[{X}_n\in dy|{X}_{n-1}=x\right]\\ {}=\mathbf{P}\left[{Y}_n\in dy|{X}_{n-1}=x\right]\ \mathbf{P}\left[{U}_n<f\left({Y}_n\right)/f(x)\right]\\ {}=Q\left(x, dy\right)\ \min \left[1,f(y)/f(x)\right]\\ {}=q\left(x,y\right)\rho (dy)\ \min \left[1,f(y)/f(x)\right].\end{array}} $$
(13)

It follows that the bivariate measure satisfies

$$ {\displaystyle \begin{array}{c}\pi (dx)P\left(x, dy\right)=\pi (dx)\ \mathbf{P}\left[{X}_n\in dy|{X}_{n-1}=x\right]\\ {}=f(x)\rho (dx)q\left(x,y\right)\rho (dy)\ \min \left[1,f(y)/f(x)\right]\\ {}=q\left(x,y\right)\ \min \left[f(x),f(y)\right]\rho (dx)\rho (dy).\end{array}} $$
(14)

Since q(x, y) = q(y, x), this measure is symmetric upon swapping x and y, i.e., π(dx) P(x, dy) = π(dy) P(y, dx), which says that {Xn} is a reversible Markov chain. (The above argument only shows this for y ≠ x, but it is immediately true for y = x too.) It follows that if Xn–1 has the distribution π, then

$$ {\displaystyle \begin{array}{c}\mathbf{P}\left[{X}_n\in A\right]={\int}_{x\in \mathcal{X}}\pi (dx)\ \mathbf{P}\left[{X}_n\in A|{X}_{n-1}=x\right]\\ {}={\int}_{x\in \mathcal{X}}{\int}_{y\in A}\pi (dx)\ P\left(x, dy\right)\\ {}={\int}_{x\in \mathcal{X}}{\int}_{y\in A}\pi (dy)\ P\left(y, dx\right)\\ {}={\int}_{y\in A}\pi (dy)\ P\left(y,\mathcal{X}\right)\\ {}={\int}_{y\in A}\pi (dy)\kern0.5em (1)=\pi (A),\end{array}} $$
(15)

so Xn also has the distribution π. Therefore, π is a stationary distribution.

The second statement follows immediately from the first one and the Markov Chain Convergence Theorem. □.

In summary, the Metropolis algorithm provides a simple method – propose a new state symmetrically, then accept or reject it according to a simple rule – which guarantees that π is a stationary distribution, to which (under very mild conditions) the chain will converge. This is what makes MCMC possible. For a graphical simulation of the Metropolis algorithm in action, see the web page at (Rosenthal 2020).

6 MCMC in Practice

To see how the Metropolis algorithm actually works, suppose the state space is \( \mathcal{X}=\left\{1,2,3,\dots \right\} \), so ρ is counting measure, and suppose the target probability distribution is given by \( \pi \left(\left\{x\right\}\right)={2}^{-x} \) for \( x\in \mathcal{X} \). Suppose we choose the simple proposal distributions Q(x, {x – 1}) = Q(x, {x + 1}) = 1/2. Then, at each iteration, our algorithm proposes a new state Y = Xn–1 ± 1 which increases or decreases the current state by one (with probability 1/2 each), and then with probability \( \min \left[1,\kern0.5em \pi (Y)/\pi \left({X}_{n-1}\right)\right]=\min \left[1,\kern0.5em {2}^{-Y}/{2}^{-{X}_{n-1}}\right] \), it accepts the proposal and sets Xn = Y; otherwise it rejects the proposal and sets Xn = Xn–1.

So, what are the resulting transition probabilities? Well, since π({0}) = 0, proposed moves from state 1 to state 0 are always rejected. However, for x ≥ 2, since π({x – 1}) > π({x}), proposed moves from x to x – 1 are always accepted. Meanwhile, for any \( x\in \mathcal{X} \), proposed moves from x to x + 1 are accepted with probability \( \min \left[1,{2}^{-\left(x+1\right)}/{2}^{-x}\right]=1/2 \). Further, whenever a move is rejected, the chain stays at x. A careful inspection shows that these transition rules are exactly the same as those in the diagram of Fig. 1. That is, this Metropolis algorithm exactly coincides with our Running Example, which therefore by Theorem 1 has stationary distribution π, given by \( \pi \left(\left\{x\right\}\right)={2}^{-x} \) for \( x\in \mathcal{X} \) as above.

Hence, a typical realization of this Metropolis algorithm will be something like that in Fig. 2. In particular, it might start at some larger state (e.g., X0 = 10), but after some burn-in period B, it will tend to settle into a pattern of sampling the different states with probabilities roughly equal to the target stationary distribution π. So, as above, if the burn-in period B and total run length M are both large enough, then \( \frac{1}{M-B}{\sum}_{i=B+1}^Mh\left({X}_i\right) \) will be a good estimate of the expected value μ:= Eπ (h).
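
Putting the pieces together, the following sketch (with illustrative choices B = 1,000 and M = 100,000, not prescribed by the text) runs exactly this discrete Metropolis algorithm and applies the estimator (11); since here Eπ(X) = Σ x 2–x = 2 exactly, the printed estimate can be checked against the truth.

```python
import random

def metropolis_step(x):
    """One step of the Metropolis chain targeting pi({x}) proportional to 2^(-x) on {1, 2, 3, ...}."""
    f = lambda z: 2.0 ** (-z) if z >= 1 else 0.0         # unnormalized target density
    y = x + random.choice([-1, 1])                       # symmetric proposal Y = x +/- 1
    if random.random() < min(1.0, f(y) / f(x)):
        return y                                         # accept (always, for downward moves)
    return x                                             # reject (e.g., all proposals to state 0)

B, M = 1_000, 100_000
x, chain = 10, []
for n in range(M):
    x = metropolis_step(x)
    chain.append(x)
print(sum(chain[B:]) / (M - B))                          # estimate of E_pi[X]; exact value is 2
```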

For MCMC algorithms to be successful, we need to choose the burn-in value B large enough that the distribution of XB is approximately equal to π. While there have been some attempts to provide precise mathematical bounds on this burn-in time (see, e.g., (Rosenthal 2002) and the references therein), such bounds cannot usually be obtained. Instead, convergence is usually assessed empirically, by monitoring the output from the Markov chain and using statistical or time-series approaches to decide if the chain seems to have reached a steady state. For example, for chain output as in Fig. 2, it might be determined empirically that roughly B = 40 iterations (the orange dotted line) are sufficient to approximately reach stationarity.
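
One crude, purely illustrative check (a sketch added here, not a standard named diagnostic) is to split the post-burn-in output in half and compare the two averages; if they disagree substantially, B or M is probably too small. In practice, more refined tools such as trace plots, autocorrelation plots, or the Gelman-Rubin statistic are used.

```python
def half_chain_check(chain, B, h=lambda x: x):
    """Compare the averages of h over the two halves of the post-burn-in output."""
    kept = [h(x) for x in chain[B:]]
    half = len(kept) // 2
    first, second = kept[:half], kept[half:]
    return sum(first) / len(first), sum(second) / len(second)
```

Applied to the chain produced in the previous sketch, half_chain_check should return two nearly equal values once B and M are large enough.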

Of course, the distribution π in our Running Example is so simple that it can easily be sampled without using MCMC. However, MCMC is indeed very commonly required in more complicated situations. In particular, it is very often needed for Bayesian inference. The basic Bayesian paradigm is as follows. We wish to estimate some unknown parameters θ, given some observed data D = d0, a prior distribution {P[θ ∈ A]}A, and a statistical likelihood model {P[D ∈ A| θ]}θ, A. The all-important posterior distribution, which tells us the final probability distribution of the unknown parameters θ given the observed data and the model, is then given (using Bayes’ Rule) by

$$ \pi \left( d\theta \right):= \mathbf{P}\left[\theta \in d\theta |D={d}_0\right]=\frac{\mathbf{P}\left[\theta \in d\theta \right]\ \mathbf{P}\left[D={d}_0|\theta \right]}{\mathbf{P}\left[D={d}_0\right]}, $$
(16)

which is proportional to the prior P[θ ∈ dθ] times the likelihood P[D = d0 | θ].

In principle, the formula (16) gives precise information about the nature of the posterior distribution, usually expressible as a posterior density function. That should be sufficient to compute posterior probabilities π(A), or more generally posterior expectations Eπ (h), by computing an integral (or sum) with respect to π as in (9). However, in practical statistical applications, the posterior distribution π is often very complicated and high-dimensional, so that these integrals cannot be computed directly, nor even well approximated by numerical integration; their computation was long thought to be infeasible. This challenge presented a major obstacle to the use of Bayesian inference in practical statistical problems and limited it for most of the twentieth century primarily to theoretical, philosophical, and foundational interest rather than genuine application. However, it was realized around 1990 (Gelfand and Smith 1990; Tanner and Wong 1987; Tierney 1994) that MCMC algorithms such as the Metropolis algorithm can provide good estimates of Eπ (h) as in (11), often quite easily with good accuracy and limited computational effort.
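
To make this concrete, here is a toy illustration (an assumed example added here, not one from the original text): a single observation d0 from a Normal(θ, 1) likelihood with a Normal(0, 1) prior on θ. The Metropolis algorithm needs only the numerator of (16), i.e., prior times likelihood, since the normalizing constant P[D = d0] cancels in the acceptance ratio; for this toy model the posterior happens to be Normal(d0/2, 1/2), so the MCMC output can be checked exactly.

```python
import math, random

d0 = 3.0                                                 # the observed data

def unnormalized_posterior(theta):
    prior = math.exp(-theta ** 2 / 2)                    # Normal(0, 1) prior density, up to a constant
    likelihood = math.exp(-(d0 - theta) ** 2 / 2)        # Normal(theta, 1) likelihood at d0
    return prior * likelihood                            # proportional to pi(d theta) in (16)

theta, chain = 0.0, []
for n in range(100_000):
    prop = random.gauss(theta, 1.0)                      # symmetric random-walk proposal
    if random.random() < min(1.0, unnormalized_posterior(prop) / unnormalized_posterior(theta)):
        theta = prop
    chain.append(theta)

B = 1_000
print(sum(chain[B:]) / len(chain[B:]))                   # posterior mean estimate; exact value d0/2 = 1.5
```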

These MCMC algorithms are now routinely used in Bayesian statistics, which accounts for their extreme popularity. They have transformed the entire field of Bayesian statistics from a theoretical curiosity into one of the most influential areas of statistical analysis, with significant applications to physics, finance, machine learning, and much more.

7 Conclusion

This paper has explained how Markov chains, introduced in 1906 as a tractable generalization of independent random variables, found tremendous application many years later as a tool for statistical computation. In particular, they completely transformed statistics by making Bayesian inference computationally feasible, thus allowing it to be applied in so many modern fields of study.

Andrey Andreyevich Markov died one hundred years ago, in 1922. If he were alive today, he would surely be astonished and thrilled by the incredible amount of progress that has been achieved in statistical computation and so many other areas, as a direct result of that little idea that he proposed so long ago.