Markov chain Monte Carlo sampling of a normal distribution from an exponential distribution



Markov chain Monte Carlo sampling of a normal distribution from an exponential distribution

Richard Hugtenburg
December 21, 2000

A PDF version of this file is available here


The following is a simple example to show two important properties of a Markov chain Monte Carlo (MCMC) sampler and to illustrate the basic functionality of the method and issues relating to it's usage.

A Markov chain is a series where the realisation of the next element in the series, Y, is dependent only on the current state, X, and occurs with probability, P(Y|X). So the even number series forms a Markov chain but the Fibonacci sequence does not. A particle trajectory is a Markov chain but the formation of a polymer is not.

MCMC is a technique for sampling from space that might be difficult to sample from because it does not have an analytical form or it may be difficult to integrate, in which case a normalising constant and a probability density function (pdf) cannot be found. The elements might be a numerical result from Monte Carlo sampling within a multivariate system, which can not be sampled directly, but are generated from other sampled variables.

MCMC enables independent sampling of a target distribution, f, by using samples from a integrable function, f* in a similar fashion to importance sampling. The first important property of MCMC samplers is that the weighting function need not be know in advance but is effectively generated by a stochastic process. Samples are taken from the distribution f* and accepted or rejected depending on the relative probabilities of the new state compared to the old state in the target distribution, f and the relative probabilities of selection in the integrable distribution f*. Consequently, the second important property of MCMC samplers is that the probability of a state need never be known to better than a normalising constant.

Hastings [1] introduces a general form for an MCMC sampler that is stationary which requires that given a sample from the target distribution, f, subsequent samples will remain in f. The probability of an arbitrary transition is given by $P(Y\vert X)=Q(Y\vert X)\alpha(X,Y)$ where,

 \begin{displaymath}\alpha(X,Y)=\min\left(1,\frac{P(Y)Q(X\vert Y)}{P(X)Q(Y\vert X)}\right),
\end{displaymath} (1)

is the probability of acceptance of a new sample, Y, of probability (or probability proportional to) P(Y) in f, given the current value X and where Q(Y|X) is the probability of the transition of sample Y given X in the distribution f*.

Stationarity is achieved under conditions of reversibility, where the probability of being in state X and changing to Y is equal to the probability of being in Y and changing to X, that P(X)P(Y|X)=P(Y)P(X|Y). The choice of $\alpha$ gives a symmetric relation that satisfies the reversibility condition.

P(X)P(Y|X) = $\displaystyle P(X)Q(Y\vert X)\alpha(X,Y)$  
  = $\displaystyle \min(P(X)Q(Y\vert X),P(Y)Q(X\vert Y))$  
  = P(Y)P(X|Y) (2)

While his results guarantee stationarity, convergence is only guaranteed if the Markov chain is irreducible, that there is a positive probability of visiting all states in a finite number of transitions. The quality of convergence is therefore dependent on the quality of the visitation of states generated by Q. One condition that ensures convergence is that $Q(X)/P(X)>\beta$, where $\beta$ is a number everywhere greater than zero. In other words this condition dictates that the probability of a transition is never vanishingly small. Intuitively, Q is often chosen to be similar in shape to P. In a Bayesian context, Q encompasses our knowledge of P prior to the calculation, hence Q(X)>P(X).

A typical method of sampling f* is the random walk sampler where the sample Y is dependent only on the separation from X. In which case, Q(Y|X)=Q(|X-Y|). Subsequently, $\alpha(X,Y) = f(Y)/f(X)$. An independence sampler is where the new sample Y is independent of the current value X. So, Q(Y|X)=f*(Y).

An example will be presented of a one-sided normal distribution sampled from an exponential decay distribution. A normal distribution can be sampled so this example is a little trivial but helps to illustrate several important characteristics of the method. We take our normal distribution as being proportional to $f=\exp(-x^2)$. The exponential function is defined $(0,\infty)$ as $f^*=\exp(-x)$. Y is sampled independently.

Equation 1 reduces to the following;

\begin{displaymath}\alpha(X,Y) = \exp(X^2-Y^2+Y-X)
\end{displaymath} (3)

To calculate the sampled pdf, f*(x) is integrated over $(0,\infty)$and normalised to a total probability of one. A value for x is sampled from the inverse of the integral of f* for a random variable, $\xi$, uniformly distributed on (0,1). As, $\xi=1-\xi$,

$\displaystyle \xi(x)$ = $\displaystyle 1-\exp(-x),$  
$\displaystyle \mbox{hence, } x$ = $\displaystyle -\log(\xi).$ (4)

A second random variable, $\zeta$, is used to determine the nature of the transition with respect to the calculated value of $\alpha$.


 
Figure 1: Pseudocode implementation of the algorithm
\begin{figure}
\rule{4cm}{.5mm}
\begin{center}
\begin{tabbing}
xxxx\=xxxx\= \kil...
...$X=Y$\\
\> \>Output $X$\\
\>End of loop
\end{tabbing}\end{center}
\end{figure}

Samples can be collected into groups. The samples, X, been collected into those bounded by $(0,1/\sqrt 2)$, the proportion of which represent the first standard deviation.


  
Figure 2: Estimate from 100 iterations of the proportion of samples lying within the first standard deviation (68%), demonstrating convergence and the influence of a range of starting values. X1 = 1/4 (*), 1/2 (+), 1 (o), 2 (.)
\includegraphics[width=9cm]{conv_norm.eps}

Figure 2 demonstrates the estimates as a function of iteration number and starting value. Note how the choice of starting value introduces a bias into the simulation as it is not an independent sample from the stationary distribution. An initial burn-in period is therefore included in a start-up procedure to minimise the significance of the starting value. Consequently, an MCMC sampler is best incorporated into a code as an object that runs continuously and independently of the main program.

This introductory material provides the background to a problem in radiation physics [2]. Monte Carlo is effective in determining macroscopic properties of radiation sources. A typical result from a Monte Carlo ``simulation'' of a radiation source would be the expectation or average energy or charge deposited within a particular region for a given (large) number of source particles. In many circumstances, the expectation corresponds to an actual measurement. Perhaps the energy or charge deposited is within the sensitive volume of a detector. The information from the detector adds a further constraint to the model and might be used to improve the accuracy of estimates of quantities that are more difficult to measure in the system.

Here is a multivariate system where a target quantity with a distribution defined by the accuracy of the measurement can only be generated from a dependent distributed parameter that may be sampled directly. To do so in such a way that is physically consistent such that unbiased estimates of other quantites can be made requires the Markov chain Monte Carlo method.

The distribution f* is generated from a series of samples from a distribution defined by our knowledge of the dependent parameter, $\psi^*$. Realisation of the Markov chain, from current state X to next state Y which will tend towards the distribution f, is complicated, Q(Y|X) is undefined as Y is not sampled directly from $\psi^*$.

Random walk sampling is useful as it removes the dependence on the transition probability but suffers if the distribution has poor visitation properties such as would be caused by a lot of local probability maxima.

Independence sampling has the advantage of a more robust convergence under these circumstances but can be less efficient if the distribution is highly peaked around probability maxima, leading to a lot of rejections. A simple way of obtaining an unbiased sample from f*, such that Q(Y) can be determined with confidence, is to ensure that the Monte Carlo generation from $\psi^*$ is of high accuracy. In practice, early estimates of f* can be of lesser accuracy and the quality of the Monte Carlo simulation improved as the algorithm proceeds. The method is accurate as long as quantities are examined carefully for signs of convergence before they are accumulated. The method is satifactory in some circumstances but quickly grinds down in problems requiring high resolution such as spectroscopy.

For further applications and extensions of Hastings algorithm the reader is referred to Tierney (1994) [3] and to a growing number of MCMC examples on the web.

Bibliography

1
W K Hastings.
Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57(1):97-109, 1970.

2
R P Hugtenburg.
Markov chain Monte Carlo methods in radiotherapy treatment planning.
In MC2000, Lisbon, 2000.

3
Luke Tierney.
Markov chains for exploring posterior distributions.
Annals of Statistics, 22(4):1701-1762, 1994.

Richard Peter Hugtenburg
2000-12-21