Hamiltonian MCMC


By iwoca on 24/10/2019


A central problem arising in Bayesian probability concerns sampling from the posterior distribution. In this two-part blog we'll discuss a dynamics-inspired twist on the classic Metropolis-Hastings algorithm, known as Hamiltonian Monte Carlo, for solving this and other difficult sampling problems. This blog will focus on understanding the basic algorithm as quickly as possible, but we've included a short discussion of its salient features with the aid of some animation. This is a very rich topic and the reader is referred to the literature for further information (see e.g. Neal (1993), Neal (2010)). Our light informal presentation will cover only the very basics.

The problem setting

An important quantity in probability is the expectation, so suppose we wish to compute one.  The formal definition of the expectation of a function $f$ of a random variable, as an integral, may very well be analytically intractable (if it exists at all).  It can therefore be much easier to approximate it by the empirical mean of a finite random sample and appeal to a law of large numbers:

$$Ef = \int_\Omega d\theta \, p(\theta) \, f(\theta) \approx \frac{1}{N} \sum_{i=1}^N f(\theta^*_i)$$

where $\theta^*_i \sim p$; i.e. $\theta^*_i$ are drawn from a distribution having the probability density function (PDF) $p$.  This is the Monte Carlo method: we simulate a statistical sample and use this to estimate the population parameters of interest.  In the preceding example, $p$ might be the posterior distribution resulting from Bayes' theorem.
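To make this concrete, here is a minimal sketch in Python, assuming for illustration $f(\theta) = \theta^2$ and a standard normal $p$ that we can sample from directly (both choices are ours, made so the estimate can be checked against an exact answer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(theta)] for f(theta) = theta**2 with theta ~ N(0, 1).
# Here the exact answer is Var(theta) = 1, so we can sanity-check
# the Monte Carlo estimate.
N = 100_000
theta = rng.standard_normal(N)   # theta*_1, ..., theta*_N drawn from p
estimate = np.mean(theta ** 2)   # empirical mean approximating Ef

print(f"Monte Carlo estimate: {estimate:.4f} (exact: 1.0)")
```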

Beyond the simplest distributional assumptions for these quantities, sampling becomes more challenging, particularly in high dimensions.  Sampling methods vary from those based on acceptance and rejection and simple transformations of variables (e.g. the Box-Muller transform for Gaussian sampling), through to more specialized techniques such as importance sampling and MCMC.  Our goal here will be to construct a Markov chain whose (long-term) realizations may be said to have been drawn from $p$, and pursue the Monte Carlo technique outlined above, but we begin with some background.

Markov chains and the Metropolis algorithm

Basics of Markov chains

A discrete-time Markov chain is a sequence of random variables $(X_1, X_2, X_3, \ldots)$ (usually representing the state of some system) characterized by the property

$$P(X_{t+1}=x|X_t=x_t, X_{t-1}=x_{t-1}, \ldots, X_1=x_1) = P(X_{t+1}=x|X_t=x_t).$$

That is, the realization of the next state in the sequence depends only on the current state and not on the past.

An example of a Markov chain is the one-dimensional random walk, where $X_t$ represents the position of the walker at discrete time points $t$, taking steps in either direction with fixed probabilities.
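A minimal sketch of such a walk (the step probability and horizon are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# One-dimensional random walk: from position x, step +1 with probability
# p_up and -1 otherwise. The next position depends only on the current
# one, so the sequence X_0, X_1, ... is a Markov chain.
T, p_up = 1_000, 0.5
steps = rng.choice([1, -1], size=T, p=[p_up, 1 - p_up])
X = np.concatenate(([0], np.cumsum(steps)))  # X_0 = 0, then X_1, ..., X_T
```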

For the purpose of this discussion, suppose $X : \Omega \to S$ is a random variable mapping elements of the sample space to a countable set of states $S$.

The probability $p_{ij} = P(X_{t+1}=j|X_t=i)$ is the transition probability from state $i$ to state $j$.  In the case of homogeneous Markov chains, the transition probabilities are time-independent, and hence the same probabilities govern the evolution of the process over all time.

If the probability of being in state $i$ at time $t$ is $\pi_i(t)$ then the probability of being in state $j$ at time $t+1$ is

$$\pi_j(t+1) = \sum_{i \in S} \pi_i(t) \, p_{ij}$$

A stationary distribution $\pi^*$ is one that does not evolve in time. That is, $\pi^*_j(t+1) = \pi^*_j(t)$ for all states $j$ and times $t$, which implies

$$\sum_{i \in S} \pi^*_i \, p_{ij} = \pi^*_j$$

A condition known as detailed balance is satisfied if there exists a distribution $\pi$ for which

$$\pi_i \, p_{ij} = \pi_j \, p_{ji}$$

for all states $i$ and $j$.  Summing both sides over all states $i$, and using $\sum_{i \in S} p_{ji} = 1$, we recover the stationarity equation above, and hence detailed balance implies the existence of a stationary distribution.
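A quick numerical sketch of this fixed-point property (the 3-state transition matrix here is an arbitrary example of ours):

```python
import numpy as np

# An arbitrary 3-state transition matrix; each row sums to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Iterating pi <- pi P from any starting distribution converges to the
# stationary distribution for a chain like this one (all entries positive).
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1_000):
    pi = pi @ P

print(pi)      # the stationary distribution pi*
print(pi @ P)  # equals pi, verifying sum_i pi*_i p_ij = pi*_j
```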

The Metropolis algorithm


Algorithm 1: Metropolis


  1. initialize $x_0$;
  2. foreach iteration $t=0,1,2,\ldots$ do
  3. -- draw $x'$ from $g(x'|x_t)$;
  4. -- $\alpha \leftarrow f(x') / f(x_t)$;
  5. -- draw $u$ uniformly from $[0,1]$;
  6. -- if $u\leq\alpha$ then $x_{t+1} \leftarrow x'$;
  7. -- else $x_{t+1} \leftarrow x_t$;
  8. end

We're now ready to examine the Metropolis algorithm listed in Algorithm 1. The goal is to sample from the distribution having PDF proportional to $f$. We initialize a point $x_0$. At each step of the algorithm, a new point $x'$ is proposed from the distribution $g$, which should be easy to sample from. The proposed point is then accepted with probability $\min\left(1, f(x')/f(x_t)\right)$, i.e. according to the ratio of the target densities at the new and old points.
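Here is a minimal Python sketch of Algorithm 1 (the symmetric Gaussian proposal and all names here are our own illustrative choices):

```python
import numpy as np

def metropolis(f, x0, n_steps, sigma, rng):
    """Random-walk Metropolis: sample from the density proportional to f.

    Uses the symmetric Gaussian proposal g(x'|x) = N(x, sigma^2 I).
    """
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_steps, x.size))
    for t in range(n_steps):
        x_new = x + sigma * rng.standard_normal(x.size)  # draw x' from g
        alpha = f(x_new) / f(x)                          # acceptance ratio
        if rng.uniform() <= alpha:                       # accept w.p. min(1, alpha)
            x = x_new                                    # ... else keep x_t
        chain[t] = x
    return chain
```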

Why does the Metropolis algorithm work? We see that the transition probability is

$$P(X_{t+1}=x'|X_t=x_t) = g(x'|x_t) \cdot \min\left(1, \frac{f(x')}{f(x_t)}\right),$$

i.e. we transition with certainty if the new realization is more probable under $f$, but only with probability $\alpha$ if it is less probable. But $g(x'|x_t)=g(x_t|x')$ by construction (the proposal is symmetric), and $\min\left(1, f(x')/f(x_t)\right) \, f(x_t) = f(x') \, \min\left(f(x_t)/f(x'), 1\right)$, so we can show that detailed balance holds for this transition probability:

$$\begin{align} f(x_t) \cdot p_{x_t,x'} &= g(x'|x_t) \cdot \min\left(1, \frac{f(x')}{f(x_t)}\right) \, f(x_t) \\ &= f(x') \, \min\left(\frac{f(x_t)}{f(x')}, 1\right) \cdot g(x_t|x') \\ &= p_{x',x_t} \cdot f(x') \end{align}$$

Therefore $f$ is a stationary distribution of this chain. The uniqueness of this distribution is a consequence of the ergodicity of the chain, which means the chain converges to this stationary distribution independently of the starting state.

Example

Here is an animation showing how the Metropolis algorithm simulates points from a two-dimensional Gaussian distribution with correlation.

[Animation: Metropolis random-walk sampling of a correlated two-dimensional Gaussian]

This example is trivial to the point of not actually requiring MCMC, but being very simple it exposes the mechanisms clearly.

Initially the chain begins at position $(-4, 4)$, which is a choice we make. The initial string of realizations is not drawn from the stationary distribution $f$ as we would like, since the ergodicity of the chain has not yet taken effect in practice; what is observed is a so-called 'burn-in' period.
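Reusing the `metropolis` sketch from above, a run like the one animated might look as follows (the target's covariance, the step size $\sigma$, and the burn-in cut-off are illustrative guesses of ours; the exact parameters aren't stated here):

```python
import numpy as np

rng = np.random.default_rng(0)

# A correlated 2D Gaussian target; an unnormalized density is enough.
cov_inv = np.linalg.inv(np.array([[1.0, 0.8],
                                  [0.8, 1.0]]))

def f(x):
    """Unnormalized density of the correlated Gaussian target."""
    return np.exp(-0.5 * x @ cov_inv @ x)

chain = metropolis(f, x0=[-4.0, 4.0], n_steps=5_000, sigma=0.5, rng=rng)
samples = chain[500:]  # discard an initial stretch as burn-in
```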

It should be clear from the animation that there is a high degree of serial correlation; i.e. the points being simulated are not i.i.d., insofar as they are not serially independent. This also shows up in the autocorrelation function (ACF) for this chain, which shows that only after some 30 draws does a point become sufficiently decorrelated for practical purposes. The Metropolis algorithm is essentially a random walk, with transition probabilities given by the proposal distribution, only with the extra feature of probabilistically rejecting some proposals in order to bias the random-walk behaviour towards the stationary distribution.
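The empirical ACF is straightforward to compute; here is a standard estimator sketched with NumPy, applied to one coordinate of the `samples` array above:

```python
import numpy as np

def acf(x, max_lag):
    """Empirical autocorrelation of a scalar chain at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / (len(x) * var)
                     for k in range(max_lag + 1)])

# Values near zero by around lag 30 indicate the draws have
# effectively decorrelated.
print(acf(samples[:, 0], 50))
```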

Techniques exist to mitigate this issue to varying degrees. An expensive solution is simply to throw away many of the realizations, keeping only every $k$th; this is known as thinning. We'll discuss another approach, however.
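In code, thinning is a one-liner (with an illustrative $k$, applied to the `samples` array from the sketch above):

```python
# Keep every k-th post-burn-in draw. Simple, but discards (k-1)/k of the
# computed samples, hence expensive.
k = 30
thinned = samples[::k]
```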

In order to avoid random-walk behaviour, one can use a 'wider' proposal distribution. However, this may result in a higher fraction of proposals being rejected, as more of them land in regions of much lower probability. This results in old points being retained and repeated, which is clear in the animation below. The trade-off is summarized in the following table:

| Small $\sigma$ | Large $\sigma$ |
| --- | --- |
| Each proposal is highly correlated with the previous one | Proposals are less serially correlated |
| Small change in $f$ at the proposed point $\implies$ high acceptance probability | Possibly large change in $f$ at the proposed point $\implies$ high rejection probability |
| Random-walk behaviour | Explores the space more; multiple repeated realizations |
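Both regimes can be seen with the `metropolis` and `f` sketches from above (the two $\sigma$ values are illustrative choices):

```python
for sigma in (0.1, 5.0):
    chain = metropolis(f, x0=[-4.0, 4.0], n_steps=5_000, sigma=sigma, rng=rng)
    # A step where the chain did not move means the proposal was rejected,
    # so the fraction of moving steps approximates the acceptance rate.
    moved = np.any(np.diff(chain, axis=0) != 0, axis=1)
    print(f"sigma={sigma}: acceptance rate ~ {np.mean(moved):.2f}")
```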

Part II

One of the main issues with the algorithm outlined above is the necessary trade-off between avoiding random-walk behaviour and ensuring high acceptance rates. The idea behind Hamiltonian MCMC is to use the equations of a dynamical system to propose points far away from the current one while maintaining a high acceptance rate, thereby avoiding both random-walk behaviour and repeats.

References

  1. Neal, R. M. (2010) "MCMC using Hamiltonian dynamics", in the Handbook of Markov Chain Monte Carlo, S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng (editors), Chapman & Hall / CRC Press, pp. 113-162
  2. Neal, R. M. (1993) Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto