Definition

Any random process can, in principle, be represented by a computer program using pseudorandom numbers. If near-perfectly independent random numbers between 0 and 1 can be produced, appropriate functions of these numbers can simulate realizations of practically any probability distribution needed in technical and scientific applications. The development of Monte Carlo methods grew out of our ability to find those functions and algorithms.

The literature on Monte Carlo methods is vast, and the methods are many. We cannot cover everything here, but the intention is to provide a brief description of some of the most important techniques and their principles. After having reviewed basic techniques for random simulation, we shall move on to three high-level categories of Monte Carlo methods: testing, inference, and optimization. We shall also present some hybrid methods, as there is often no sharp borderline between the categories.

In the following, unless otherwise stated, we will consider continuous random variables over bounded (measurable) subsets of ℝ^N and their probability densities, although many of the methods described are applicable more generally.

Methods

Basic Simulation Methods

Exact sampling of probability distributions is an important component of all Monte Carlo algorithms. Most important are the following methods (formulated here for a one-dimensional space):

Algorithm 1 The “Inverse Method” for a Continuous Random Variable with a Given, Everywhere Nonzero Probability Density

Let p be a probability density, and P its distribution function:

$$ P(x)={\int}_{-\infty}^x p(s)\, ds, $$
(1)

and let r be a random number chosen uniformly at random between 0 and 1. Then the random number x generated through the formula

$$ x={P}^{-1}(r) $$
(2)

has probability density p.
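As an illustration, the following Python code is a minimal sketch of Algorithm 1 for an exponential density p(x) = λ exp(−λx); the target and the rate λ are choices made for this example, exploiting that P(x) = 1 − exp(−λx) has the closed-form inverse P^{−1}(r) = −ln(1 − r)/λ:

```python
# Minimal sketch of the inverse method (Algorithm 1).
# Assumed example target: exponential density with rate lam.
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(lam, size):
    r = rng.uniform(0.0, 1.0, size)   # uniform random numbers r
    return -np.log(1.0 - r) / lam     # x = P^{-1}(r)

x = sample_exponential(lam=2.0, size=100_000)
print(x.mean())                       # close to the exact mean 1/lam = 0.5
```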

Algorithm 2 The Box-Muller Method for Gaussian Distributions

Let r1 and r2 be random numbers chosen uniformly at random between 0 and 1. Then the random numbers

$$ {x}_1=\sqrt{-2\ln {r}_2}\cos\;\left(2\pi {r}_1\right) $$
(3)
$$ {x}_2=\sqrt{-2\ln {r}_2}\sin\;\left(2\pi {r}_1\right) $$
(4)

are independent and Gaussian distributed with zero mean and unit variance.
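The method translates directly into code; the following sketch is a plain transcription of Eqs. (3) and (4), with numpy's uniform generator standing in for r1 and r2:

```python
# Minimal sketch of the Box-Muller method (Algorithm 2).
import numpy as np

rng = np.random.default_rng(1)

def box_muller(size):
    r1 = rng.uniform(size=size)
    r2 = 1.0 - rng.uniform(size=size)       # shifted into (0, 1] so log(r2) is finite
    radius = np.sqrt(-2.0 * np.log(r2))
    x1 = radius * np.cos(2.0 * np.pi * r1)  # Eq. (3)
    x2 = radius * np.sin(2.0 * np.pi * r1)  # Eq. (4)
    return x1, x2

x1, x2 = box_muller(100_000)
print(x1.mean(), x1.std(), np.corrcoef(x1, x2)[0, 1])  # approx 0, 1, and 0
```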

As building blocks for some of the more sophisticated algorithms we shall discuss in the coming sections, we need to mention the following two algorithms for sampling in high-dimensional spaces:

Algorithm 3 Sequential Simulation

Consider a probability density p over an N-dimensional space decomposed into a product of N univariate conditional probability densities:

$$ p\left({x}_1,{x}_2,\dots, {x}_N\right)=p\left({x}_1\right)p\left({x}_2|{x}_1\right)p\left({x}_3|{x}_2,{x}_1\right)\dots p\left({x}_N|{x}_{N-1},\dots, {x}_1\right). $$
(5)

Exact simulation of realizations from p(x1, x2, …, xN) can now be performed by first generating a realization \( {\hat{x}}_1 \) from p(x1), then a realization \( {\hat{x}}_2 \) from \( p\left({x}_2|{\hat{x}}_1\right) \), then a realization \( {\hat{x}}_3 \) from \( p\left({x}_3|{\hat{x}}_2,{\hat{x}}_1\right) \), and so on until we have a realization \( {\hat{x}}_N \) from \( p\left({x}_N|{\hat{x}}_{N-1},\dots, {\hat{x}}_1\right) \). The point \( \left({\hat{x}}_1,\dots, {\hat{x}}_N\right) \) is now a realization from \( p\left({x}_1,\dots, {x}_N\right) \).

The above type of algorithm is widely used for simulation of image realizations where the conditional probabilities are obtained from training images (see Strebelle 2002).
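As a concrete sketch, consider a case where each conditional in Eq. (5) depends only on the preceding variable (an assumption made for this example): p(x1) = N(0, 1) and p(xk|xk−1) = N(a·xk−1, 1). Sequential simulation then reads:

```python
# Minimal sketch of sequential simulation (Algorithm 3) for an assumed
# Gaussian chain where each conditional depends only on the previous value.
import numpy as np

rng = np.random.default_rng(2)

def sequential_gaussian_chain(N, a=0.8):
    x = np.empty(N)
    x[0] = rng.normal(0.0, 1.0)               # realization from p(x_1)
    for k in range(1, N):
        x[k] = rng.normal(a * x[k - 1], 1.0)  # realization from p(x_k | x_{k-1})
    return x

realization = sequential_gaussian_chain(N=100)  # one exact realization from p
```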

The simplest way to sample a constant probability density is a direct application of Algorithm (3) in which all 1D conditional probability densities are constant functions. However, in the Markov Chain Monte Carlo algorithms we shall describe below, we need to perform uniform sampling via random walks whose design can be adapted to our sampling problem. This can be done in the following way:

Algorithm 4 Sampling a Constant Probability Density by a Random Walk

Given a uniform (constant) probability density, a sequence of random numbers Un (n = 1, 2, …) in [0,1], and a proposal distribution q(x(n+1)|x(n)) defining the probability that the random walk moves to x(n+1) given that it is in x(n) (possibly with q(x(n+1)|x(n)) ≠ q(x(n)|x(n+1))), the points visited by the iterative random function V:

$$ V\left({\mathbf{x}}^{(n)}\right)=\left\{\begin{array}{ll}{\mathbf{x}}^{\left(n+1\right)} & \mathrm{if}\ {U}_n\le \min \left[1,\frac{q\left({\mathbf{x}}^{(n)}|{\mathbf{x}}^{\left(n+1\right)}\right)}{q\left({\mathbf{x}}^{\left(n+1\right)}|{\mathbf{x}}^{(n)}\right)}\right]\\ {\mathbf{x}}^{(n)} & \mathrm{otherwise}\end{array}\right., $$
(6)

asymptotically form a sample from the uniform distribution as n goes to infinity.
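A minimal sketch of Algorithm 4 on the interval [0,1] follows; the truncated-Gaussian proposal is an assumption made for the example, chosen because its x-dependent normalizer makes q(x(n+1)|x(n)) ≠ q(x(n)|x(n+1)), so the correction in Eq. (6) is actually exercised:

```python
# Minimal sketch of Algorithm 4: a random walk sampling the uniform
# density on [0, 1] with an asymmetric (truncated-Gaussian) proposal.
import math
import numpy as np

rng = np.random.default_rng(3)

def Phi(z):                       # standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def Z(x, s):                      # normalizer of the proposal truncated to [0, 1]
    return Phi((1.0 - x) / s) - Phi(-x / s)

def uniform_random_walk(n_steps, s=0.2, x0=0.5):
    x = x0
    chain = np.empty(n_steps)
    for n in range(n_steps):
        while True:               # draw a truncated-Gaussian candidate
            x_prop = x + s * rng.normal()
            if 0.0 <= x_prop <= 1.0:
                break
        # q(x|x') / q(x'|x) = Z(x) / Z(x'), since the Gaussian kernel is symmetric
        if rng.uniform() <= min(1.0, Z(x, s) / Z(x_prop, s)):
            x = x_prop
        chain[n] = x
    return chain

chain = uniform_random_walk(50_000)   # its histogram is flat on [0, 1]
```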

The above algorithms (1)–(4) are easily demonstrated and straightforward to use in practice. Algorithms (1) and (2) can without difficulty be generalized to higher dimensions.

Monte Carlo Testing

If we are interested in testing the validity of a parametric statistical model of, say, an inhomogeneous earth structure, we can select an appropriate function (a “statistic”) summarizing important properties of the structure, for instance, a variogram. We can now generate Monte Carlo realizations of the structure and compute the value of the statistic for each realization. By comparing the distribution of these simulated values with the observed (or theoretical) value of the statistic, we can decide if our choice of model is acceptable (Mrkvicka et al. 2016). The decision relies on a statistical test measuring to what extent the observed statistic falls in a high-probability region of the histogram of simulated values.
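The following Python code is a minimal sketch of this idea in a deliberately simple setting; the hypothesized model N(0, 1), the sample mean as statistic, and the synthetic data are all assumptions made for the example:

```python
# Minimal sketch of a Monte Carlo test.
import numpy as np

rng = np.random.default_rng(4)

observed = rng.normal(0.3, 1.0, size=100)  # data whose hypothesized model we test
stat_obs = observed.mean()                 # the chosen statistic

# Monte Carlo realizations of the statistic under the hypothesized model N(0, 1)
stats = rng.normal(0.0, 1.0, size=(10_000, 100)).mean(axis=1)

# two-sided Monte Carlo p-value: how extreme is the observed statistic?
p_value = np.mean(np.abs(stats) >= np.abs(stat_obs))
print(p_value)                             # a small value speaks against the model
```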

The above testing approach can be seen as a tool to determine acceptable models satisfying given observations. This is related to, but in principle more general than, the Monte Carlo inference methods described in the following, where the statistical model is fixed beforehand.

Monte Carlo Inference

Inference problems are those in which we seek parameters of a given, underlying stochastic model designed to explain observable data d. In physical sciences, technology, and economics, inference problems are often inverse problems which, in a probabilistic formulation, result in the definition of a posterior probability distribution p(x|d) describing our state of information about model parameters x after inference. Integrals of p over the parameter space can then provide estimates of event probabilities, expectations, covariances, etc. Such integrals are of the form

$$ I={\int}_{\chi }h\left(\mathbf{x}\right)p\left(\mathbf{x}\right)d\mathbf{x}, $$
(7)

where p(x) is the posterior probability density. They can be evaluated numerically by generating a large number of random, independent realizations x(1), …, x(N) from p(x). The sum

$$ I\approx \frac{1}{N}\sum \limits_{n=1}^Nh\left({\mathbf{x}}^{(n)}\right) $$
(8)

is then an estimate of the integral. The fact that integrals like (7) over high-dimensional spaces are often calculated more efficiently by Monte Carlo algorithms than by deterministic quadrature has made Monte Carlo integration important in numerical analysis.
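A minimal sketch of the estimator (8) follows, assuming for the example that p(x) is a standard two-dimensional Gaussian we can sample directly and that h(x) = x1² + x2², whose exact integral under p is 2:

```python
# Minimal sketch of Monte Carlo integration, Eqs. (7)-(8).
import numpy as np

rng = np.random.default_rng(5)

N = 100_000
x = rng.normal(size=(N, 2))   # realizations x^(1), ..., x^(N) from p
h = (x ** 2).sum(axis=1)      # h evaluated at each realization
I_hat = h.mean()              # the estimator (8)
print(I_hat)                  # close to the exact value 2.0
```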

For many practical inference problems, the model space is so vast, and evaluation of the probability density p(x) is so computer intensive, that full information about p(x) remains unavailable. In this case, we need algorithms that can make do with values of p(x) at one point at a time. The simplest of these algorithms is the rejection algorithm (von Neumann 1951):

The Rejection Algorithm

Algorithm 5 The Rejection Algorithm

Assume that p(x) is a probability density with an upper bound M ≥ max(p(x)). In the nth step of the algorithm, choose with uniform probability a random candidate point xc. Accept xc with probability

$$ {p}_{accept}=\frac{p\left({\mathbf{x}}_c\right)}{M}. $$
(9)

The set of accepted candidate points is then a sample from the probability distribution p(x).
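A minimal sketch of Algorithm 5 on a bounded domain follows; the unnormalized bell-shaped target on [0, 1] and the bound M = 1 are assumptions made for the example:

```python
# Minimal sketch of the rejection algorithm (Algorithm 5).
import numpy as np

rng = np.random.default_rng(6)

def p(x):
    return np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)  # unnormalized target, max 1 at x = 0.5

M = 1.0                                  # upper bound on p
x_c = rng.uniform(0.0, 1.0, 100_000)     # uniform candidate points
u = rng.uniform(0.0, 1.0, 100_000)
sample = x_c[u <= p(x_c) / M]            # keep each candidate with probability p(x_c)/M
print(sample.size / 100_000)             # acceptance rate (low in high dimensions)
```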

The rejection algorithm is ideal in the sense that it generates independent realizations from bounded probability densities, but for distributions in high-dimensional spaces with vast areas of negligible probability, the number of accepted points is small. This makes the algorithm very inefficient, and for this reason random-walk-based algorithms were developed. The original version of this was the Metropolis Algorithm (Metropolis et al. 1953), but it was later improved by Hastings (1970):

Markov-Chain Monte Carlo (MCMC)

Algorithm 6 (Metropolis-Hastings)

Given a (possibly unnormalized) probability density p(x) > 0, a sequence of random numbers Un (n = 1, 2, …) in [0,1], and an iterative random function V(x) sampling a constant probability density using Algorithm (4):

$$ {\mathbf{x}}^{\left(n+1\right)}=V\left({\mathbf{x}}^{(n)}\right). $$
(10)

Then the distribution of samples produced by the iterative random function W:

$$ {\mathbf{x}}^{\left(n+1\right)}=W\left({\mathbf{x}}^{(n)}\right)=\left\{\begin{array}{ll}V\left({\mathbf{x}}^{(n)}\right) & \mathrm{if}\ {U}_n\le \min \left[1,\frac{p\left(V\left({\mathbf{x}}^{(n)}\right)\right)}{p\left({\mathbf{x}}^{(n)}\right)}\right]\\ {\mathbf{x}}^{(n)} & \mathrm{otherwise}\end{array}\right., $$
(11)

will asymptotically converge to p(x) as n goes to infinity.
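A minimal sketch of Algorithm 6 follows, with a symmetric Gaussian proposal (so the q-ratio of Algorithm 4 equals 1) and an assumed unnormalized bimodal target p(x):

```python
# Minimal sketch of the Metropolis-Hastings algorithm (Algorithm 6).
import numpy as np

rng = np.random.default_rng(7)

def p(x):                                  # unnormalized bimodal target
    return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_steps, step=1.0, x0=0.0):
    x = x0
    chain = np.empty(n_steps)
    for n in range(n_steps):
        x_prop = x + step * rng.normal()   # V(x^(n)) with a symmetric proposal
        if rng.uniform() <= min(1.0, p(x_prop) / p(x)):
            x = x_prop                     # accept the move
        chain[n] = x                       # otherwise stay at x^(n)
    return chain

chain = metropolis_hastings(50_000)        # approximates samples from p after burn-in
```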

An extended form of this algorithm is obtained if V(x) samples an arbitrary probability density r(x), in which case the algorithm will sample the product distribution p(x)r(x). This was suggested by Mosegaard and Tarantola (1995) as a way of sampling the posterior distribution in Bayesian/probabilistic inverse problems, where r(x) is the prior probability density and p(x) is the likelihood. In this way, no closed-form expression for the prior is needed; V may be available only as a computer algorithm.

It is important to realize that the design of the proposal algorithm (4) is critical for the performance of the Metropolis-Hastings algorithm. The proposal distribution q(x(n+1)|x(n)) will work efficiently only if it locally resembles the sampling distribution p(x) (up to a constant), since this minimizes the number of rejected moves.

In practice, any sampling with the Metropolis-Hastings algorithm needs to run for some time to reach equilibrium sampling with statistically stationary output. The time (number of iterations) needed to reach this equilibrium is called the burn-in time, and it is usually found by experimentation. Another problem is that the algorithm is based on a random walk, and hence sample points will not be independent when sampled closely in time. To determine the time spacing between approximately independent samples, a correlation analysis may be needed. Ways of avoiding poor sampling results were discussed by Hastings (1970).

Geman and Geman (1984) introduced the Gibbs Sampler, another variant of the Metropolis algorithm where sample points are picked along coordinate axes in the parameter space according to the conditional distribution p(xj|x1, …, xj−1, xj+1, …, xN), xj being the parameter to be perturbed. Although this strategy gives acceptance probabilities equal to 1, it may not be practical when calculation of the conditional probability is computationally expensive.
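For a case where the conditionals are cheap, the Gibbs sampler is easily written down. The following sketch assumes a bivariate Gaussian target with correlation ρ, whose full conditionals are the known one-dimensional Gaussians N(ρ·xother, 1 − ρ²):

```python
# Minimal sketch of the Gibbs sampler for an assumed bivariate Gaussian target.
import numpy as np

rng = np.random.default_rng(8)

def gibbs_bivariate_gaussian(n_steps, rho=0.9):
    x1, x2 = 0.0, 0.0
    chain = np.empty((n_steps, 2))
    sd = np.sqrt(1.0 - rho ** 2)        # conditional standard deviation
    for n in range(n_steps):
        x1 = rng.normal(rho * x2, sd)   # draw from p(x1 | x2); always accepted
        x2 = rng.normal(rho * x1, sd)   # draw from p(x2 | x1)
        chain[n] = (x1, x2)
    return chain

chain = gibbs_bivariate_gaussian(20_000)
```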

Monte Carlo Inference for Sparse Models

The Metropolis-Hastings algorithm (6) can be formulated to allow variable dimension of the parameter vector x (Green 1995). This version of the algorithm is used in Bayesian inference when the number of parameters N is treated as an unknown (Sambridge et al. 2006), making it possible to work with sparse models, and thereby reducing the computational burden:

Algorithm 7 Reversible-Jump Monte Carlo

Let p(x, N) be a probability density over a subset of ℝ^N, augmented with its dimension parameter N. The Metropolis-Hastings algorithm (6) operating in the extended space with

$$ V\left({\mathbf{x}}^{(n)},{N}^{(n)}\right)=\left\{\begin{array}{ll}\left({\mathbf{x}}^{\left(n+1\right)},{N}^{\left(n+1\right)}\right) & \mathrm{if}\ {U}_n\le \min \left[1,\frac{q\left({\mathbf{x}}^{(n)},{N}^{(n)}|{\mathbf{x}}^{\left(n+1\right)},{N}^{\left(n+1\right)}\right)}{q\left({\mathbf{x}}^{\left(n+1\right)},{N}^{\left(n+1\right)}|{\mathbf{x}}^{(n)},{N}^{(n)}\right)}\left|J\right|\right]\\ \left({\mathbf{x}}^{(n)},{N}^{(n)}\right) & \mathrm{otherwise}\end{array}\right., $$
(12)

where the proposal distributions q(x(q)|x(p)) and q(x(p)|x(q)) are expressed through the invertible, differentiable mappings:

$$ {\displaystyle \begin{array}{l}{\mathbf{x}}^{(q)}=h\left({\mathbf{x}}^{(p)},{\mathbf{u}}^{(p)}\right)\\ {}{\mathbf{x}}^{(p)}={h}^{\prime}\left({\mathbf{x}}^{(q)},{\mathbf{u}}^{(q)}\right),\end{array}} $$
(13)

u(p) and u(q) being random vectors with generally nonuniform distributions, and where ∣J∣ is the Jacobian

$$ \left|J\right|=\left|\frac{\partial \left({\mathbf{x}}^{(p)},{\mathbf{u}}^{(p)}\right)}{\partial \left({\mathbf{x}}^{(q)},{\mathbf{u}}^{(q)}\right)}\right|, $$
(14)

will sample the extended space according to p(x, N).

This algorithm samples not only the space of models x (with varying dimension), but also the distribution of the dimension N itself.

Sequential Monte Carlo Methods

Estimating the evolution of a dynamic system is important in science and technology, for instance, in predicting the movement of an airplane in motion, or in numerical weather prediction. An important analytical tool in this field is the Kalman filter (Kalman 1960), which iteratively provides deterministic, Bayesian, and linear updates to past predictions of system parameters. For large systems, however, the many parameters make the analytical handling of large covariance matrices computationally intractable. This led to the development of the ensemble Kalman filter (EnKF) (Evensen 1994, 2003), a Monte Carlo algorithm that avoids the covariance matrices and instead represents the Gaussian distributions of the noise, the prior, and the posterior by ensembles of realizations. A basic formulation is the following:

Algorithm 8 Ensemble Kalman Filter (EnKF)

Assume that data d at any time are related to the system parameters x through the linear equation d = Hx. Let X be an n × N matrix whose N columns are realizations from the initial (prior) Gaussian distribution of system parameters, and let D be an m × N matrix whose columns are d + em, where d is the noise-free data and each em is a realization of the Gaussian noise with covariance matrix R. It can then be shown that, with C denoting the empirical covariance matrix of the ensemble X and K = CH^T(HCH^T + R)^{−1}, the columns of the matrix

$$ \hat{\mathbf{X}}=\mathbf{X}+\mathbf{K}\left(\mathbf{D}-\mathbf{HX}\right) $$
(15)

are realizations from the posterior probability density. Iterative application of these rules, where the posterior realizations in one time step are used as prior realizations in the next step, will approximately evolve the system in time according to the model equations and the uncertainties.

For more details about the implementation, see Evensen (2003).
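A minimal sketch of the analysis step (15) follows; the two-parameter state, the observation operator H, and the noise covariance R are assumptions made for the example, and C is taken as the empirical covariance of the prior ensemble:

```python
# Minimal sketch of one EnKF analysis step, Eq. (15).
import numpy as np

rng = np.random.default_rng(9)

def enkf_update(X, H, d, R):
    N = X.shape[1]                                 # ensemble size
    C = np.atleast_2d(np.cov(X))                   # empirical covariance of the ensemble
    K = C @ H.T @ np.linalg.inv(H @ C @ H.T + R)   # Kalman gain
    E = rng.multivariate_normal(np.zeros(len(d)), R, size=N).T
    D = d[:, None] + E                             # N perturbed copies of the data
    return X + K @ (D - H @ X)                     # posterior ensemble

# usage: two parameters, one noisy observation of their sum
X = rng.normal(size=(2, 50))                       # prior ensemble
H = np.array([[1.0, 1.0]])                         # linear observation operator
d = np.array([1.2])                                # observed data
R = np.array([[0.1]])                              # noise covariance
X_post = enkf_update(X, H, d, R)
```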

A more recent, nonlinear, and non-Gaussian alternative to EnKF is the Particle Filter method (Reich and Cotter 2015; Nakamura and Potthast 2015; van Leeuwen et al. 2015). The standard particle filter is a bootstrapping technique (see below) operating in the following way:

Algorithm 9 Particle Filter

Assume that data d at any time are related to the system parameters x through the equation d = h(x). After n time steps, let \( {\mathbf{x}}_1^{(n)},{\mathbf{x}}_2^{(n)},\dots, {\mathbf{x}}_N^{(n)} \) be N realizations (“particles”) of the system parameters, representing the prior probability density

$$ p\left({\mathbf{x}}^{(n)}\right)=\sum \limits_{i=1}^N\frac{1}{N}\delta \left({\mathbf{x}}^{(n)}-{\mathbf{x}}_i^{(n)}\right). $$
(16)

Assume that the system is propagated forward in time by the generally nonlinear model f :

$$ {\mathbf{x}}^{(n)}=f\left({\mathbf{x}}^{\left(n-1\right)}\right)+{\mathbf{e}}^{(n)} $$
(17)

where e(n) is modelization noise introduced in the nth time step. If the likelihood function for time step n is

$$ p\left({\mathbf{d}}^{(n)}\mid {\mathbf{x}}^{(n)}\right)={p}_{\varepsilon}\left({\mathbf{d}}^{(n)}-h\left({\mathbf{x}}^{(n)}\right)\right), $$
(18)

we can use Bayes’ rule to obtain

$$ p\left({\mathbf{x}}^{(n)}|{\mathbf{d}}^{(n)}\right)=p\left({\mathbf{d}}^{(n)}|{\mathbf{x}}^{(n)}\right)\frac{p\left({\mathbf{x}}^{(n)}\right)}{p\left({\mathbf{d}}^{(n)}\right)}\approx \sum \limits_{i=1}^N{w}_i^{(n)}\delta \left({\mathbf{x}}^{(n)}-{\mathbf{x}}_i^{(n)}\right) $$
(19)

where

$$ {w}_i^{(n)}=\frac{p\left({\mathbf{d}}^{(n)}|{\mathbf{x}}_i^{(n)}\right)}{\sum_kp\left({\mathbf{d}}^{(n)}|{\mathbf{x}}_k^{(n)}\right)}. $$
(20)

Iterative application of these rules, where the posterior realizations in one time step are used as prior realizations in the next step, will approximately evolve the system in time according to the model equations and the uncertainties.

Resampling (sampling with replacement from the ensemble) is often used before each time step to obtain more equally weighted, but possibly duplicated particles.
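A minimal sketch of one complete particle-filter time step follows; the scalar system, the forward model f(x) = 0.9x, the observation operator h(x) = x, and the noise levels are all assumptions made for the example:

```python
# Minimal sketch of the bootstrap particle filter (Algorithm 9).
import numpy as np

rng = np.random.default_rng(10)

def particle_filter_step(particles, d, sigma_model=0.5, sigma_data=0.3):
    # forward propagation, Eq. (17), with f(x) = 0.9 x as an example
    particles = 0.9 * particles + sigma_model * rng.normal(size=particles.size)
    # likelihood weights, Eq. (20), with h(x) = x and Gaussian data noise
    w = np.exp(-0.5 * ((d - particles) / sigma_data) ** 2)
    w /= w.sum()
    # resampling with replacement returns equally weighted particles
    idx = rng.choice(particles.size, size=particles.size, p=w)
    return particles[idx]

particles = rng.normal(size=1000)     # initial prior ensemble
for d in [0.4, 0.1, -0.2]:            # stream of observations
    particles = particle_filter_step(particles, d)
```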

Empirical Monte Carlo

Assume that N realizations from an unknown probability distribution p(x) are available, and that we, for some reason, are unable to obtain more samples. We are interested in inferring information about a parameter of p(x), for example, the distribution of its mean. This can be accomplished by an empirical Monte Carlo technique, for example, the resampling method Bootstrapping (Efron 1993):

Algorithm 10 Bootstrapping

To infer information about a parameter η of the distribution p(x) from a set of realizations A = {x1, …, xN}, draw N elements from A with replacement (thereby allowing elements to be repeated) and compute η for these N elements. This process is repeated many times, each time generally obtaining a new value of η (if N is large). The normalized histogram of η over all these resampling experiments is an approximation to the distribution of η.

Bootstrapping can in this way be used to evaluate biases, variances, confidence intervals, prediction errors, etc.
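A minimal sketch follows, taking the mean as the parameter η and an arbitrary stand-in for the given sample A:

```python
# Minimal sketch of bootstrapping (Algorithm 10).
import numpy as np

rng = np.random.default_rng(11)

data = rng.exponential(scale=2.0, size=200)   # stand-in for the given sample A

def bootstrap(data, statistic, n_resamples=10_000):
    etas = np.empty(n_resamples)
    for i in range(n_resamples):
        resample = rng.choice(data, size=data.size, replace=True)
        etas[i] = statistic(resample)         # eta computed on the resampled set
    return etas

etas = bootstrap(data, np.mean)
print(np.percentile(etas, [2.5, 97.5]))       # approximate 95% confidence interval
```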

Monte Carlo Optimization

An important application of Monte Carlo algorithms is in the solution of complex optimization problems, for instance, locating the global minimum of an objective function E(x). One of the first examples of a Monte Carlo optimizer was the following modification of the Metropolis-Hastings algorithm (6) (Kirkpatrick et al. 1983):

Simulated Annealing

Algorithm 11 (Simulated Annealing)

Given an objective function E(x), a sequence of random numbers Un (n = 1, 2, …) in [0,1], a decreasing sequence of positive numbers (“temperature” parameters) Tn → 0 for n → ∞, and an iterative random function V(x) sampling a constant probability density:

$$ {\mathbf{x}}^{\left(n+1\right)}=V\left({\mathbf{x}}^{(n)}\right). $$
(21)

Then the sample points produced by the iterative random function A:

$$ {\mathbf{x}}^{\left(n+1\right)}=A\left({\mathbf{x}}^{(n)}\right)=\left\{\begin{array}{ll}V\left({\mathbf{x}}^{(n)}\right) & \mathrm{if}\ {U}_n\le \min \left[1,\frac{\exp \left(-E\left(V\left({\mathbf{x}}^{(n)}\right)\right)/{T}_n\right)}{\exp \left(-E\left({\mathbf{x}}^{(n)}\right)/{T}_n\right)}\right]\\ {\mathbf{x}}^{(n)} & \mathrm{otherwise}\end{array}\right., $$
(22)

asymptotically converge to the global minimum of E(x) as n goes to infinity, provided the temperature Tn decreases sufficiently slowly.
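A minimal sketch of Algorithm 11 follows, with a symmetric Gaussian proposal and a geometric cooling schedule Tn = T0·c^n (one common choice among many); the double-well objective is an assumption made for the example:

```python
# Minimal sketch of simulated annealing (Algorithm 11).
import numpy as np

rng = np.random.default_rng(12)

def E(x):
    return (x ** 2 - 1.0) ** 2 + 0.3 * x       # double well, global minimum near x = -1

def simulated_annealing(n_steps, T0=1.0, cooling=0.999, step=0.5, x0=2.0):
    x, T = x0, T0
    for n in range(n_steps):
        x_prop = x + step * rng.normal()        # V(x^(n)) with a symmetric proposal
        if rng.uniform() <= min(1.0, np.exp(-(E(x_prop) - E(x)) / T)):
            x = x_prop
        T *= cooling                            # T_n -> 0 as n grows
    return x

print(simulated_annealing(20_000))              # close to the global minimum
```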

This algorithm was inspired by the process of chemical annealing, where a crystalline material is slowly cooled (T → 0) from a high temperature through its melting point, resulting in the formation of highly ordered crystals with low lattice energy E. In each step of the algorithm, thermal fluctuations in the system are simulated by the Monte Carlo algorithm (for applications, see Mosegaard and Vestergaard (1991) and Deutsch and Journel (1994)).

It is seen that, for constant temperature parameter T, the above algorithm is actually a Metropolis-Hastings algorithm designed to sample the Gibbs-Boltzmann distribution (see Mosegaard and Vestergaard 1991)

$$ {P}_B\left(\mathbf{x}\right)=\frac{\exp \left(-\frac{E\left(\mathbf{x}\right)}{T}\right)}{Z(T)}, $$
(23)

where Z(T) is a normalizing constant (the partition function).

Monte Carlo Optimization-Sampling Hybrids

The use of MCMC algorithms for inference often runs into problems in high-dimensional parameter spaces. When the target probability/objective function has multiple optima, MCMC algorithms suffer from a critical slowing-down, often making the analysis computationally intractable. A few methods exist to alleviate this problem. Common to them is the use of a “temperature” parameter that artificially increases (or varies) the noise, broadening the sampling density to hinder entrapment in local optima during sampling.

Parallel Tempering

One method is a simple modification of simulated annealing, where E(x) = −ln(p(x)) and the decrease of T is stopped at T = 1 (see Salamon et al. 2002). An inherent problem with simulated annealing is, however, the risk of entrapment in local minima of E(x). The parallel tempering algorithm (Geyer 1991; Sambridge 2013), an improvement of the older simulated tempering algorithm (Marinari and Parisi 1992), seeks to avoid the entrapment problem by abandoning the forced decrease of the “temperature” parameter T. The idea is to operate on an ensemble of models and to include T among the parameters to be sampled:

Algorithm 12 (Parallel Tempering)

Given an ensemble x = (x1, x2, …, xK) of K models and their temperatures T1, T2, …, TK, all distributed according to the same probability density p over the augmented spaces (xk, Tk), k = 1, 2, …, K. The Metropolis-Hastings algorithm (6), with the special rule for V that any perturbation of T consists of swapping the values of T for two random ensemble members, asymptotically samples the joint distribution ∏k p(xk, Tk), and the sample of ensemble members with Tk = 1 approaches a sample from the original target distribution p(x).

If small positive temperatures T* ≈ 0 are allowed in this algorithm, the ensemble members with Tk = T* will be located close to the maxima of p(x). In this way, parallel tempering can be used for optimization in a way similar to simulated annealing.
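A minimal sketch follows, in which K chains run Metropolis steps at fixed temperatures and neighboring chains occasionally swap states; the bimodal target, the temperature ladder, and the swap schedule are assumptions made for the example:

```python
# Minimal sketch of parallel tempering (Algorithm 12).
import numpy as np

rng = np.random.default_rng(13)

def log_p(x):                                    # log of a bimodal target density
    return np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)

def parallel_tempering(n_steps, temps=(1.0, 2.0, 4.0, 8.0), step=1.0):
    K = len(temps)
    x = np.zeros(K)
    samples = np.empty(n_steps)
    for n in range(n_steps):
        for k in range(K):                       # within-chain Metropolis moves
            x_prop = x[k] + step * rng.normal()
            if np.log(rng.uniform()) <= (log_p(x_prop) - log_p(x[k])) / temps[k]:
                x[k] = x_prop
        k = rng.integers(K - 1)                  # attempt one neighbor swap
        log_a = (log_p(x[k + 1]) - log_p(x[k])) * (1.0 / temps[k] - 1.0 / temps[k + 1])
        if np.log(rng.uniform()) <= log_a:
            x[k], x[k + 1] = x[k + 1], x[k]
        samples[n] = x[0]                        # the T = 1 chain samples p
    return samples

chain = parallel_tempering(20_000)               # visits both modes of p
```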

Other Approaches

Biological systems evolve by using a large population to explore many options in parallel, and this is the source of inspiration for evolutionary algorithms. An example is the genetic algorithm (Holland 1975), where an ensemble x1, …, xK of K sample points is assigned a probability p(xk) of survival (“fitness”). In a simple implementation, the population is initially generated randomly, but in each iteration it is altered by the action of three operators: selection, crossover, and mutation. Selection randomly resamples the ensemble according to p(xk), thereby typically duplicating ensemble members with high fitness at the expense of members with low fitness, improving the average fitness of the ensemble. The crossover step exchanges parameter values between two random members, and the mutation step randomly changes a parameter of one member. Crossover and mutation attempts are accepted only with predefined probabilities. Iterative application of these three steps (sketched below) allows the algorithm to asymptotically converge to a population with high fitness values p(xk).
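A minimal sketch of the three operators follows; the fitness function, the population size, and the crossover/mutation rates are assumptions made for the example:

```python
# Minimal sketch of a genetic algorithm with selection, crossover, and mutation.
import numpy as np

rng = np.random.default_rng(14)

def fitness(pop):                                 # higher is better; example choice
    return np.exp(-((pop - 1.5) ** 2).sum(axis=1))

def genetic_algorithm(n_gen=200, K=50, n_params=4, p_cross=0.7, p_mut=0.1):
    pop = rng.normal(size=(K, n_params))          # random initial population
    for g in range(n_gen):
        f = fitness(pop)
        pop = pop[rng.choice(K, size=K, p=f / f.sum())]    # selection (resampling)
        for i in range(0, K - 1, 2):                       # crossover of pairs
            if rng.uniform() < p_cross:
                j = rng.integers(1, n_params)              # crossover point
                pop[[i, i + 1], j:] = pop[[i + 1, i], j:]  # exchange parameter tails
        mask = rng.uniform(size=pop.shape) < p_mut         # mutation
        pop[mask] += 0.1 * rng.normal(size=mask.sum())
    return pop

population = genetic_algorithm()    # concentrates near the fitness maximum at 1.5
```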

Another widely used technique is the neighborhood algorithm (Sambridge 1999a, b) where sample points are iteratively generated from a neighborhood approximation to the target distribution p(x). The approximation is a piecewise constant function over Voronoi cells, centered at each of the previous sample points. The approximation to the target distribution is updated in each iteration, concentrating the sampling in multiple high-probability regions. The method will generate a distribution of points biased toward the maxima of p(x).

Neither of the two algorithms above produces samples of the probability distribution p, but if supplemented with appropriate resampling of the output, an approximate sample from p may be produced.

Summary

We have given an overview of the use of Monte Carlo techniques for inference and optimization. A main theme behind the development of many methods has been to reduce the computational workload required to solve large-scale problems. Several ways of dealing with this challenge were discussed, including Markov-Chain Monte Carlo strategies, the use of sparse models, improving numerical forecast schemes through nonparametric representations of probability distributions, and even improving ideas inherited from early work on simulated annealing. Research in Monte Carlo methods is still a prolific area, and there is no doubt that many interesting developments are waiting ahead.

Cross-References