1 Introduction

It was the beginning of the twentieth century, and Andrey Andreyevich Markov was not happy. The field of mathematical probability was flourishing, with seminal contributions from de Moivre and Fermat through to Gauss, Laplace, Cauchy, Poisson, Chebyshev, and many others. And yet, most of the key results – laws of large numbers, the central limit theorem, and so on – were always about independent random variables, i.e., random quantities which have no effect on each other. This assumption seemed very limiting and unrealistic. However, without this assumption, the random variables could behave however they wanted, and it seemed impossible to make mathematical progress. How could the independence assumption be generalized, while still allowing for valid theorems to be proven?

Markov eventually settled on a model in which each random variable depended on the others, but only through the adjacent ones. That is, a sequence of jointly defined random variables X0, X1, X2,… form a Markov chain if the conditional distribution of Xn, conditional on all the other variables, is completely determined by Xn–1 and Xn+1. This setup was more general than independence, but still tractable enough to analyze mathematically.

Markov published his results in 1906 (Markov 1906), adding a new model to the world of probability theory. He proved a version of the law of large numbers, and later of the central limit theorem, for his new concept of Markov chains. In the decades following, many mathematical papers about Markov chains were published. Moreover, Markov chains had some success as models of card shuffling, diffusions, branching processes, and later stock prices. Even so, genuine applications remained somewhat limited. In fact, Markov’s original example had been the pattern of consonants and vowels in written language, but these surely do not really follow a Markov chain. For example, a vowel is much less likely to occur if the last three letters were all vowels than if just the last letter were a vowel. It was challenging to find real-world examples of random sequences in which Xn was indeed influenced by Xn–1, but conditionally independent of Xn–2 given Xn–1.

Then, in the latter half of the twentieth century, a whole new application of Markov chains emerged, in the burgeoning field of computer simulation. In 1953, Metropolis et al. (1953) used a simple iterative computer simulation to study the behavior of interacting molecular systems, for up to 224 particles. This was an early application of a Monte Carlo algorithm, which uses random simulation to estimate complicated quantities. In doing so, they created the Metropolis algorithm, the world’s first Markov chain Monte Carlo (MCMC) algorithm. Their algorithm was later generalized to the Metropolis-Hastings algorithm (Hastings 1970), the Gibbs sampler (Gelfand and Smith 1990; Geman and Geman 1984; Tanner and Wong 1987), and many other variations; see, e.g., (Tierney 1994). The term “Markov chain Monte Carlo (MCMC),” emphasizing that these new algorithms applied Markov’s model to the Monte Carlo paradigm, originated in (Geyer 1992). MCMC algorithms are now hugely popular in many branches of statistics, especially in Bayesian inference, as well as in many other fields, from molecular dynamics to engineering to machine learning to financial modeling. The volume (Brooks et al. 2011) contains thousands of references, and a casual Google search for “Markov chain Monte Carlo” (in quotation marks) returns over two million web pages, indicating just how far Markov’s little idea has come.

In this paper, we review Markov chains and MCMC algorithms and explain how they have been so useful in modern statistical analysis.

2 Random Variables

A random variable is a mathematical model of an uncertain quantity. It is defined on some underlying probability space \( \left(\varOmega, \mathbf{\mathcal{F}},\mathbf{P}\right) \), where Ω is an underlying sample space, \( \mathbf{\mathcal{F}} \) is a σ-algebra of subsets of Ω, and P is a probability measure on \( \left(\varOmega, \mathbf{\mathcal{F}}\right) \). A random variable is then a measurable function from \( \left(\varOmega, \mathbf{\mathcal{F}}\right) \) to a measurable range space \( \left(\mathcal{X},\mathbf{\mathcal{G}}\right) \), most commonly to \( \left(\mathbf{R},\mathbf{\mathcal{B}}\right) \) (the real numbers with the Borel subsets). The probability that a random variable X takes a value within a subset \( A\in \mathbf{\mathcal{G}} \) is then given by

$$ \mathbf{P}\left[X\in A\right]:= \mathbf{P}\left\{\omega \in \varOmega :X\left(\omega \right)\in A\right\}. $$
(1)

(If Ω is finite or countable, then we can let \( \mathbf{\mathcal{F}} \) be the collection of all subsets of Ω, and simply take \( \mathbf{P}\left[X\in A\right]=\sum \left\{\mathbf{P}\left(\omega \right):X\left(\omega \right)\in A\right\} \), i.e., define probabilities by a simple sum over individual points.)

Individual random variables can be studied in terms of their probability distributions \( \left\{\mathbf{P}\left[X\in A\right]:A\in \mathbf{\mathcal{G}}\right\} \), including such familiar notions as binomial, normal, exponential, Poisson, and other univariate distributions, perhaps extended to distributions of functions h of random variables as \( \left\{\mathbf{P}\left[h(X)\in A\right]:A\in \mathbf{\mathcal{G}}\right\} \).

The situation gets more intricate when there are multiple random variables defined jointly on the same underlying space \( \left(\varOmega, \mathbf{\mathcal{F}},\mathbf{P}\right) \). If X1 and X2 are two such random variables, then (in the countable case, say) we can define their joint probability distribution by

$$ \mathbf{P}\left[{X}_1\in A,{X}_2\in B\right]=\sum \left\{\mathbf{P}\left(\omega \right):{X}_1\left(\omega \right)\in A\ \mathrm{and}\ {X}_2\left(\omega \right)\in B\right\}, $$
(2)

for any \( A,B\in \mathbf{\mathcal{G}} \). This leads to questions about the relationship between the random variables.

The simplest case is independent random variables, meaning random quantities which do not affect each other, or formally for which

$$ \mathbf{P}\left[{X}_1\in A,{X}_2\in B\right]=\mathbf{P}\left[{X}_1\in A\right]\ \mathbf{P}\left[{X}_2\in B\right] $$
(3)

for all \( A,B\in \mathbf{\mathcal{G}} \). More generally, a collection {Xi}i ∈ I of random variables are all independent if for all finite subsets I0 ⊆ I, and all choices \( {\left\{{A}_i\right\}}_{i\in {I}_0} \) of elements of \( \mathbf{\mathcal{G}}, \) we have

$$ \mathbf{P}\left[{X}_i\in {A}_i\ \mathrm{for}\ \mathrm{all}\ i\in {I}_0\right]=\prod \limits_{i\in {I}_0}\mathbf{P}\left[{X}_i\in {A}_i\right]. $$
(4)

Relatedly, a collection {Xi}i ∈ I is identically distributed if for each fixed \( A\in \mathbf{\mathcal{G}}, \) the value P[Xi ∈ A] is the same for all i ∈ I.

If a sequence X1, X2,… of random variables is both independent and identically distributed (i.i.d.), then many results are known. For example, the law of large numbers says that the sample averages \( \frac{1}{n}{\sum}_{i=1}^nh\left({X}_i\right) \) converge to their common mean μ ≔ E[h(Xi)], and the central limit theorem says \( \frac{1}{\sqrt{n}}{\sum}_{i=1}^n\left[h\left({X}_i\right)-\mu \right] \) converges in distribution to a Normal (0, σ2) distribution if σ2 ≔ E[(h(Xi) – μ)2] < ∞. These two fundamental theorems, and many others, give a fairly rich and detailed picture of the behavior of independent random variables.

3 Markov Chains

As mentioned, Markov provided a generalization of the independence assumption. In particular, a sequence X0, X1, X2,… forms a time-homogeneous Markov chain if there is a fixed transition kernel \( {\left\{P\left(x,A\right)\right\}}_{x\in \mathcal{X},A\in \mathbf{\mathcal{G}}} \) with the property that

$$ \mathbf{P}\left[{X}_n\in A|{X}_0,{X}_1,\dots, {X}_{n-1}\right]=P\left({X}_{n-1},A\right),\kern1em A\in \mathbf{\mathcal{G}}. $$
(5)

That is, given the history up to time n – 1, the probabilities for the value at time n depend only on the most recent value Xn–1, in an explicit way given by the transition kernel P.

For a concrete example, suppose \( \mathcal{X}=\left\{1,2,3,\dots \right\} \), with \( \mathbf{\mathcal{G}} \) the collection of all subsets of \( \mathcal{X}, \) and with transition kernel P defined by P(x, {x + 1}) = 1/4 for all \( x\in \mathcal{X} \), P(x, {x – 1}) = 1/2 for all x ≥ 2, P(x, {x}) = 1/4 for all x ≥ 2, and P(1, {1}) = 3/4, together with additivity. This Markov chain has, at each time, probability 1/4 of increasing by 1, probability 1/4 of staying the same, and probability 1/2 of decreasing by 1 (unless it is already at the state 1, in which case the probability 1/2 of decreasing is instead added to the probability of staying at 1); see Fig. 1.

Fig. 1 The transition probabilities of the Running Example Markov chain
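
As a concrete illustration (not part of the original presentation), the following minimal Python sketch simulates this Running Example chain directly from its transition probabilities; the helper names step and run_chain are purely illustrative.

```python
import random

def step(x):
    """One transition of the Running Example chain from state x (x = 1, 2, 3, ...)."""
    u = random.random()
    if u < 0.25:
        return x + 1                       # move up with probability 1/4
    elif u < 0.5:
        return x                           # stay put with probability 1/4
    else:
        return x - 1 if x >= 2 else 1      # move down with probability 1/2 (state 1 stays at 1)

def run_chain(x0, n_steps):
    """Return the trajectory X_0, X_1, ..., X_{n_steps} started at x0."""
    path = [x0]
    for _ in range(n_steps):
        path.append(step(path[-1]))
    return path

print(run_chain(x0=10, n_steps=100))       # a realization like the one shown in Fig. 2
```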

Over time, a Markov chain might settle into a “stable pattern,” or more precisely a stationary distribution. A probability distribution π on \( \left(\mathcal{X},\mathbf{\mathcal{G}}\right) \) is called stationary for the Markov chain {Xn} if it has the property that whenever Xn has the distribution π, then Xn+1 will also have the distribution π. That is,

$$ {\int}_{x\in \mathcal{X}}\pi (dx)P\left(x,A\right)=\pi (A). $$
(6)

On a discrete space, this can be written as

$$ \sum \limits_{x\in \mathcal{X}}\pi \left\{x\right\}P\left(x,A\right)=\pi (A). $$
(7)

Intuitively, the probability distribution π is invariant for the chain, i.e., its probabilities do not change while the chain runs.

The importance of stationary distributions comes from the Markov Chain Convergence Theorem, which says that, under the mild assumptions of irreducibility and aperiodicity (nearly always satisfied and not considered further here), a chain will eventually converge to its stationary distribution. That is, if a Markov chain has a stationary distribution π, and is irreducible and aperiodic, then for all \( A\in \mathbf{\mathcal{G}} \),

$$ \underset{n\to \infty }{\lim}\mathbf{P}\left[{X}_n\in A\right]=\pi (A). $$
(8)

Informally, the chain will settle down into its stationary distribution in the long run.
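
To illustrate this numerically (a sketch added here, not part of the original text), one can push an initial point mass through the Running Example’s transition kernel and watch the probabilities P[Xn = x] stabilize; the dictionary-based update below is exact, since starting from a single state the support after n steps remains finite.

```python
def apply_kernel(dist):
    """One application of the Running Example kernel to a distribution {state: probability}."""
    new = {}
    for x, p in dist.items():
        new[x + 1] = new.get(x + 1, 0.0) + 0.25 * p        # up with probability 1/4
        if x >= 2:
            new[x] = new.get(x, 0.0) + 0.25 * p            # stay with probability 1/4
            new[x - 1] = new.get(x - 1, 0.0) + 0.5 * p     # down with probability 1/2
        else:
            new[1] = new.get(1, 0.0) + 0.75 * p            # state 1 stays with probability 3/4
    return new

dist = {10: 1.0}                                           # start with X_0 = 10
for n in range(1, 201):
    dist = apply_kernel(dist)
    if n % 50 == 0:
        print(n, round(dist.get(1, 0.0), 4), round(dist.get(2, 0.0), 4))
# P[X_n = 1] and P[X_n = 2] settle down to fixed limiting values as n grows, as in (8).
```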

To see this more concretely, consider the above Running Example. Figure 2 shows a typical realization of that Markov chain. Although it starts at larger values, from roughly time n = 40 onwards it appears to be in a stable pattern, visiting the same states with approximately the same probabilities.

Fig. 2 A typical realization of the Running Example Markov chain

This “settling down” Markov chain convergence property is enough to prove a law of large numbers and central limit theorem for Markov chains, similar to the i.i.d. case but under stronger assumptions; for a modern treatment, see, e.g., the monograph (Meyn and Tweedie 1993). Thus, the time-homogeneous Markov assumption helps to re-capture some of the power and utility of i.i.d. sequences, but in a more general context.

4 Markov Chain Monte Carlo (MCMC)

When they were first introduced, Markov chains served as models of certain natural scientific (and even social) phenomena. However, in the latter half of the twentieth century, it was slowly realized that Markov chains could be used in an entirely different way, as computational algorithms.

The basic idea of a Monte Carlo algorithm is simple. Suppose π is a “target” probability distribution on \( \left(\mathcal{X},\mathbf{\mathcal{G}}\right) \), and \( h:\mathcal{X}\to \mathbf{R} \) is some measurable function, and you want to estimate the expected value

$$ \mu := {\mathbf{E}}_{\pi }(h):= {\int}_{x\in \mathcal{X}}h(x)\pi (dx) $$
(9)

(so, on a discrete space, \( \mu ={\mathbf{\sum}}_{x\in \mathcal{X}}h(x)\pi (x) \)). In principle, μ could be calculated directly by an integral or sum, but in a complicated application on a high-dimensional state space that might be quite infeasible. However, if you could simulate random variables X1, X2,…, XM which are i.i.d. with distribution π, then you could estimate μ by

$$ E:= \frac{1}{M}\sum \limits_{i=1}^Mh\left({X}_i\right). $$
(10)

Indeed, the law of large numbers says that as M → ∞, the estimator E will converge to μ.
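
For instance, here is a minimal sketch of the plain Monte Carlo estimator (10); the particular choices π = Exponential(1) and h(x) = x², for which μ = E[X²] = 2 exactly, are illustrative assumptions made here only so the output can be checked.

```python
import random

def monte_carlo_estimate(h, sample_from_pi, M):
    """Estimate E_pi[h(X)] by averaging h over M i.i.d. draws from pi, as in (10)."""
    return sum(h(sample_from_pi()) for _ in range(M)) / M

est = monte_carlo_estimate(h=lambda x: x ** 2,
                           sample_from_pi=lambda: random.expovariate(1.0),  # pi = Exponential(1)
                           M=100_000)
print(est)   # should be close to the true value mu = E[X^2] = 2
```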

The problem is that, if \( \mathcal{X} \) is high-dimensional and π is complicated, there is no feasible computer program to simulate independent random variables X1, X2,… which each have the distribution π. However, it turns out to be surprisingly easy to simulate a Markov chain X0, X1, X2,… which has π as a stationary distribution and will thus converge to π in distribution. This gives rise to Markov chain Monte Carlo (MCMC) algorithms which simulate a Markov chain X0, X1, X2,… which converges to π and then estimate μ by

$$ E:= \frac{1}{M-B}\sum \limits_{i=B+1}^Mh\left({X}_i\right). $$
(11)

If the burn-in period B is large enough that the distribution of XB is approximately π, and M – B is large enough that the Markov chain law of large numbers approximately holds, then this estimator E will again provide a good approximation to μ.
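
In code, the estimator (11) amounts to discarding the first B values and averaging over the rest; the small helper below is an illustrative sketch (its chain argument would come from an MCMC sampler such as the Metropolis algorithm of the next section).

```python
def mcmc_estimate(chain, h, B):
    """Ergodic average of h over the chain values X_{B+1}, ..., X_M, as in (11)."""
    kept = chain[B + 1:]
    return sum(h(x) for x in kept) / len(kept)
```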

5 The Metropolis Algorithm

While there are many ways to construct Markov chains which leave π stationary, the oldest and simplest one is the Metropolis algorithm (Metropolis et al. 1953). Assume the state space \( \left(\mathcal{X},\mathbf{\mathcal{G}}\right) \) has a reference measure ρ (e.g., counting measure on a discrete state space, or Lebesgue measure on R), with respect to which π has a density function f so that π(dx) = f (x)ρ(dx). The algorithm uses proposal distributions Q(x, dy) = q(x, y)ρ(dy) for \( x,y\in \mathcal{X}, \) which are symmetric (i.e., q(x, y) = q(y, x)). (For example, on a discrete state space, Q(x, ·) might give probability 1/2 to each of x + 1 and x – 1. Or, on the real line, Q(x, ·) might correspond to a Normal(x, v) distribution for some fixed variance parameter v > 0. Importantly, Q(x, ·) should be easy to sample from on a computer.)

The algorithm then proceeds as follows. First choose some initial state \( {X}_0\in \mathcal{X} \). Then, iteratively for n = 1, 2,…, given \( {X}_{n\hbox{--} 1}\in \mathcal{X} \), generate Xn as follows. First, generate a proposal state Yn sampled from the probability distribution Q(Xn–1, ·) and independently generate a random variable Un sampled from the uniform (Lebesgue) measure on the interval [0,1]. Then define Xn by

$$ {X}_n=\left\{\begin{array}{ll}{Y}_n& {U}_n<f\left({Y}_n\right)/f\left({X}_{n-1}\right)\\ {}{X}_{n-1}& \mathrm{otherwise}\end{array}\right. $$
(12)

That is, if Un < f(Yn)/f(Xn–1), then the proposed new state Yn is “accepted” so we set Xn = Yn; otherwise Yn is “rejected” so we keep Xn = Xn–1. Incredibly, this simple algorithm, which only barely mentions π via its density f in the accept/reject rule, actually makes π a stationary distribution to which it will converge:
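
As a minimal sketch (assuming, for illustration, a real-valued state space and a symmetric Normal random-walk proposal), the whole algorithm fits in a few lines of Python; note that only the ratios f(y)/f(x) are used, so f need only be known up to a normalizing constant.

```python
import math, random

def metropolis(f, x0, M, proposal_sd=1.0):
    """Run M Metropolis steps targeting the (possibly unnormalized) density f, started at x0."""
    chain = [x0]
    x = x0
    for _ in range(M):
        y = random.gauss(x, proposal_sd)                 # propose Y_n ~ Normal(X_{n-1}, v)
        if random.random() < min(1.0, f(y) / f(x)):      # accept if U_n < f(Y_n) / f(X_{n-1})
            x = y                                        # accepted: X_n = Y_n
        chain.append(x)                                  # (if rejected, X_n = X_{n-1})
    return chain

# Illustrative target: an unnormalized standard normal density, started far from its mode.
chain = metropolis(f=lambda x: math.exp(-x * x / 2), x0=5.0, M=10_000)
```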

Theorem 1

The above Metropolis algorithm defines a Markov chain {Xn} which has π as a stationary distribution. Furthermore, if the chain is irreducible and aperiodic, then it will converge in distribution to π.

Proof. For y ≠ x, the only way for the chain to move from x to y (or to a neighborhood dy of y) is to first propose such a move and then accept it. Thus, for y ≠ x,

$$ {\displaystyle \begin{array}{c}P\left(x, dy\right):= \mathbf{P}\left[{X}_n\in dy|{X}_{n-1}=x\right]\\ {}=\mathbf{P}\left[{Y}_n\in dy|{X}_{n-1}=x\right]\ \mathbf{P}\left[{U}_n<f\left({Y}_n\right)/f(x)\right]\\ {}=Q\left(x, dy\right)\ \min \left[1,f(y)/f(x)\right]\\ {}=q\left(x,y\right)\rho (dy)\ \min \left[1,f(y)/f(x)\right].\end{array}} $$
(13)

It follows that the bivariate measure satisfies

$$ {\displaystyle \begin{array}{c}\pi (dx)P\left(x, dy\right)=\pi (dx)\ \mathbf{P}\left[{X}_n\in dy|{X}_{n-1}=x\right]\\ {}=f(x)\rho (dx)q\left(x,y\right)\rho (dy)\ \min \left[1,f(y)/f(x)\right]\\ {}=q\left(x,y\right)\ \min \left[f(x),f(y)\right]\rho (dx)\rho (dy).\end{array}} $$
(14)

Since q(x, y) = q(y, x), this measure is symmetric upon swapping x and y, i.e., π(dx) P(x, dy) = π(dy) P(y, dx), which says that {Xn} is a reversible Markov chain. (The above argument only shows this for y ≠ x, but it is immediately true for y = x too.) It follows that if Xn–1 has the distribution π, then

$$ {\displaystyle \begin{array}{c}\mathbf{P}\left[{X}_n\in A\right]={\int}_{x\in \mathcal{X}}\pi (dx)\ \mathbf{P}\left[{X}_n\in A|{X}_{n-1}=x\right]\\ {}={\int}_{x\in \mathcal{X}}{\int}_{y\in A}\pi (dx)\ P\left(x, dy\right)\\ {}={\int}_{x\in \mathcal{X}}{\int}_{y\in A}\pi (dy)\ P\left(y, dx\right)\\ {}={\int}_{y\in A}\pi (dy)\ P\left(y,\mathcal{X}\right)\\ {}={\int}_{y\in A}\pi (dy)\kern0.5em (1)=\pi (A),\end{array}} $$
(15)

so Xn also has the distribution π. Therefore, π is a stationary distribution.

The second statement follows immediately from the first one and the Markov Chain Convergence Theorem. □.

In summary, the Metropolis algorithm provides a simple method – propose a new state symmetrically, then accept or reject it according to a simple rule – which guarantees that π is a stationary distribution, to which (under very mild conditions) the chain will converge. This is what makes MCMC possible. For a graphical simulation of the Metropolis algorithm in action, see the web page at (Rosenthal 2020).

6 MCMC in Practice

To see how the Metropolis algorithm actually works, suppose the state space is \( \mathcal{X}=\left\{1,2,3,\dots \right\} \), so ρ is counting measure, and suppose the target probability distribution is given by \( \pi \left(\left\{x\right\}\right)={2}^{-x} \) for \( x\in \mathcal{X} \). Suppose we choose the simple proposal distributions Q(x, {x – 1}) = Q(x, {x + 1}) = 1/2. Then, at each iteration, our algorithm proposes a new state Y = Xn–1 ± 1 which increases or decreases the current state by one (with probability 1/2 each), and then with probability \( \min \left[1,\kern0.5em \pi (Y)/\pi \left({X}_{n-1}\right)\right]=\min \left[1,\kern0.5em {2}^{-Y}/{2}^{-{X}_{n-1}}\right] \), it accepts the proposal and sets Xn = Y; otherwise it rejects the proposal and sets Xn = Xn–1.

So, what are the resulting transition probabilities? Well, since π({0}) = 0, proposed moves from state 1 to state 0 are always rejected. However, for x ≥ 2, since π({x – 1}) > π({x}), proposed moves from x to x – 1 are always accepted. Meanwhile, for any \( x\in \mathcal{X} \), proposed moves from x to x + 1 are accepted with probability \( \min \left[1,{2}^{-\left(x+1\right)}/{2}^{-x}\right]=1/2 \). Further, whenever a move is rejected, the chain stays at x. A careful inspection shows that these transition rules are exactly the same as those in the diagram of Fig. 1. That is, this Metropolis algorithm exactly coincides with our Running Example, which therefore by Theorem 1 has stationary distribution π, given by \( \pi \left(\left\{x\right\}\right)={2}^{-x} \) for \( x\in \mathcal{X} \) as above.

Hence, a typical realization of this Metropolis algorithm will be something like that in Fig. 2. In particular, it might start at some larger state (e.g., X0 = 10), but after some burn-in period B, it will tend to settle into a pattern of sampling the different states with probabilities roughly equal to the target stationary distribution π. So, as above, if the burn-in period B and total run length M are both large enough, then \( \frac{1}{M-B}{\sum}_{i=B+1}^Mh\left({X}_i\right) \) will be a good estimate of the expected value μ:= Eπ (h).
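
Putting the pieces together, the following sketch (with illustrative choices B = 1,000 and M = 100,000, not prescribed by the text) runs exactly this discrete Metropolis algorithm and applies the estimator (11); since here Eπ(X) = Σ x 2–x = 2 exactly, the printed estimate can be checked against the truth.

```python
import random

def metropolis_step(x):
    """One step of the Metropolis chain targeting pi({x}) proportional to 2^(-x) on {1, 2, 3, ...}."""
    f = lambda z: 2.0 ** (-z) if z >= 1 else 0.0         # unnormalized target density
    y = x + random.choice([-1, 1])                       # symmetric proposal Y = x +/- 1
    if random.random() < min(1.0, f(y) / f(x)):
        return y                                         # accept (always, for downward moves)
    return x                                             # reject (e.g., all proposals to state 0)

B, M = 1_000, 100_000
x, chain = 10, []
for n in range(M):
    x = metropolis_step(x)
    chain.append(x)
print(sum(chain[B:]) / (M - B))                          # estimate of E_pi[X]; exact value is 2
```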

For MCMC algorithms to be successful, we need to choose the burn-in value B large enough that the distribution of XB is approximately equal to π. While there have been some attempts to provide precise mathematical bounds on this burn-in time (see, e.g., (Rosenthal 2002) and the references therein), such bounds cannot usually be obtained. Instead, convergence is usually assessed empirically, by monitoring the output from the Markov chain and using statistical or time-series approaches to decide if the chain seems to have reached a steady state. For example, for chain output as in Fig. 2, it might be determined empirically that roughly B = 40 iterations (the orange dotted line) are sufficient to approximately reach stationarity.
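
One crude, purely illustrative check (a sketch added here, not a standard named diagnostic) is to split the post-burn-in output in half and compare the two averages; if they disagree substantially, B or M is probably too small. In practice, more refined tools such as trace plots, autocorrelation plots, or the Gelman-Rubin statistic are used.

```python
def half_chain_check(chain, B, h=lambda x: x):
    """Compare the averages of h over the two halves of the post-burn-in output."""
    kept = [h(x) for x in chain[B:]]
    half = len(kept) // 2
    first, second = kept[:half], kept[half:]
    return sum(first) / len(first), sum(second) / len(second)
```

Applied to the chain produced in the previous sketch, half_chain_check should return two nearly equal values once B and M are large enough.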

Of course, the distribution π in our Running Example is so simple that it can easily be sampled without using MCMC. However, MCMC is indeed very commonly required in more complicated situations. In particular, it is very often needed for Bayesian inference. The basic Bayesian paradigm is as follows. We wish to estimate some unknown parameters θ, given some observed data D = d0, a prior distribution {P[θ ∈ A]}A, and a statistical likelihood model {P[D ∈ A| θ]}θ, A. The all-important posterior distribution, which tells us the final probability distribution of the unknown parameters θ given the observed data and the model, is then given (using Bayes’ Rule) by

$$ \pi \left( d\theta \right):= \mathbf{P}\left[\theta \in d\theta |D={d}_0\right]=\frac{\mathbf{P}\left[\theta \in d\theta \right]\ \mathbf{P}\left[D={d}_0|\theta \right]}{\mathbf{P}\left[D={d}_0\right]}, $$
(16)

which is proportional to the prior P[θ ∈ dθ] times the likelihood P[D = d0 | θ].

In principle, the formula (16) gives precise information about the nature of the posterior distribution, usually expressible as a posterior density function. That should be sufficient to compute posterior probabilities π(A), or more generally posterior expectations Eπ (h), by computing an integral (or sum) with respect to π as in (9). However, in practical statistical applications, the posterior distribution π is often very complicated and high-dimensional, so that these integrals cannot be computed directly, nor even well approximated by numerical integration; their computation was long thought to be infeasible. This challenge presented a major obstacle to the use of Bayesian inference in practical statistical problems and limited it for most of the twentieth century primarily to theoretical, philosophical, and foundational interest rather than genuine application. However, it was realized around 1990 (Gelfand and Smith 1990; Tanner and Wong 1987; Tierney 1994) that MCMC algorithms such as the Metropolis algorithm can provide good estimates of Eπ (h) as in (11), often quite easily with good accuracy and limited computational effort.
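
To make this concrete, here is a toy illustration (an assumed example added here, not one from the original text): a single observation d0 from a Normal(θ, 1) likelihood with a Normal(0, 1) prior on θ. The Metropolis algorithm needs only the numerator of (16), i.e., prior times likelihood, since the normalizing constant P[D = d0] cancels in the acceptance ratio; for this toy model the posterior happens to be Normal(d0/2, 1/2), so the MCMC output can be checked exactly.

```python
import math, random

d0 = 3.0                                                 # the observed data

def unnormalized_posterior(theta):
    prior = math.exp(-theta ** 2 / 2)                    # Normal(0, 1) prior density, up to a constant
    likelihood = math.exp(-(d0 - theta) ** 2 / 2)        # Normal(theta, 1) likelihood at d0
    return prior * likelihood                            # proportional to pi(d theta) in (16)

theta, chain = 0.0, []
for n in range(100_000):
    prop = random.gauss(theta, 1.0)                      # symmetric random-walk proposal
    if random.random() < min(1.0, unnormalized_posterior(prop) / unnormalized_posterior(theta)):
        theta = prop
    chain.append(theta)

B = 1_000
print(sum(chain[B:]) / len(chain[B:]))                   # posterior mean estimate; exact value d0/2 = 1.5
```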

These MCMC algorithms are now routinely used in Bayesian statistics, which accounts for their extreme popularity. They have transformed the entire field of Bayesian statistics from a theoretical curiosity into one of the most influential areas of statistical analysis, with significant applications to physics, finance, machine learning, and much more.

7 Conclusion

This paper has explained how Markov chains, introduced in 1906 as a tractable generalization of independent random variables, found tremendous application many years later as a tool for statistical computation. In particular, they completely transformed statistics by making Bayesian inference computationally feasible, thus allowing it to be applied in so many modern fields of study.

Andrey Andreyevich Markov died one hundred years ago, in 1922. If he were alive today, he would surely be astonished and thrilled by the incredible amount of progress that has been achieved in statistical computation and so many other areas, as a direct result of that little idea that he proposed so long ago.