Hamiltonian MCMC


By iwoca on 24/10/2019


A central problem arising in Bayesian probability concerns sampling from the posterior distribution. In this two-part blog we'll discuss a dynamics-inspired twist on the classic Metropolis-Hastings algorithm, known as Hamiltonian Monte Carlo, for solving this and other difficult sampling problems. This blog will focus on understanding the basic algorithm as quickly as possible, but we've included a short discussion of its salient features with the aid of some animation. This is a very rich topic and the reader is referred to the literature for further information (see e.g. Neal (1993), Neal (2010)). Our light informal presentation will cover only the very basics.

The problem setting

An important quantity in probability is the expectation, so suppose we wish to compute one.  The formal definition of the expectation of a function $f$ of a random variable, as an integral, may very well be analytically intractable (if it exists at all).  It can therefore be much easier to approximate it by the empirical mean of a finite random sample and appeal to a law of large numbers:

$$Ef = \int_\Omega d\theta \, p(\theta) \, f(\theta) \approx \frac{1}{N} \sum_{i=1}^N f(\theta^*_i)$$

where $\theta^*_i \sim p$; i.e. $\theta^*_i$ are drawn from a distribution having the probability density function (PDF) $p$.  This is the Monte Carlo method: we simulate a statistical sample and use this to estimate the population parameters of interest.  In the preceding example, $p$ might be the posterior distribution resulting from Bayes' theorem.
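To make this concrete, here is a minimal sketch in Python, assuming for illustration $f(\theta) = \theta^2$ and a standard normal $p$ that we can sample from directly (both choices are ours, made so the estimate can be checked against an exact answer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(theta)] for f(theta) = theta**2 with theta ~ N(0, 1).
# Here the exact answer is Var(theta) = 1, so we can sanity-check
# the Monte Carlo estimate.
N = 100_000
theta = rng.standard_normal(N)   # theta*_1, ..., theta*_N drawn from p
estimate = np.mean(theta ** 2)   # empirical mean approximating Ef

print(f"Monte Carlo estimate: {estimate:.4f} (exact: 1.0)")
```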

Beyond the simplest distributional assumptions for these quantities, sampling becomes more challenging, particularly in high dimensions.  Sampling methods vary from those based on acceptance and rejection and simple transformations of variables (e.g. the Box-Muller transform for Gaussian sampling), through to more specialized techniques such as importance sampling and MCMC.  Our goal here will be to construct a Markov chain whose (long-term) realizations may be said to have been drawn from $p$, and pursue the Monte Carlo technique outlined above, but we begin with some background.

Markov chains and the Metropolis algorithm

Basics of Markov chains

A discrete-time Markov chain is a sequence of random variables $(X_1, X_2, X_3, \ldots)$ (usually representing the state of some system) characterized by the property

$$P(X_{t+1}=x|X_t=x_t, X_{t-1}=x_{t-1}, \ldots, X_1=x_1) = P(X_{t+1}=x|X_t=x_t).$$

That is, the realization of the next state in the sequence depends only on the current state and not on the past.

An example of a Markov chain is the one-dimensional random walk, where $X_t$ represents the position of the walker at discrete time points $t$, taking steps in either direction with fixed probabilities.
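A minimal sketch of such a walk (the step probability and horizon are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# One-dimensional random walk: from position x, step +1 with probability
# p_up and -1 otherwise. The next position depends only on the current
# one, so the sequence X_0, X_1, ... is a Markov chain.
T, p_up = 1_000, 0.5
steps = rng.choice([1, -1], size=T, p=[p_up, 1 - p_up])
X = np.concatenate(([0], np.cumsum(steps)))  # X_0 = 0, then X_1, ..., X_T
```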

For the purpose of this discussion, suppose $X : \Omega \to S$ is a random variable mapping elements of the sample space to a countable set of states $S$.

The probability $p_{ij} = P(X_{t+1}=j|X_t=i)$ is the transition probability from state $i$ to state $j$.  In the case of homogeneous Markov chains, the transition probabilities are time-independent, and hence the same probabilities govern the evolution of the process over all time.

If the probability of being in state $i$ at time $t$ is $\pi_i(t)$ then the probability of being in state $j$ at time $t+1$ is

$$\pi_j(t+1) = \sum_{i \in S} \pi_i(t) \, p_{ij}$$

A stationary distribution $\pi^*$ is one that does not evolve in time. That is, $\pi^*_j(t+1) = \pi^*_j(t)$ for all states $j$ and times $t$, which implies

$$\sum_{i \in S} \pi^*_i \, p_{ij} = \pi^*_j$$

A condition known as detailed balance is satisfied if there exists a distribution $\pi$ for which

$$\pi_i \, p_{ij} = \pi_j \, p_{ji}$$

for all states $i$ and $j$.  Summing both sides over all states $i$, and using $\sum_{i \in S} p_{ji} = 1$, we recover the stationarity equation above, and hence detailed balance implies the existence of a stationary distribution.
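A quick numerical sketch of this fixed-point property (the 3-state transition matrix here is an arbitrary example of ours):

```python
import numpy as np

# An arbitrary 3-state transition matrix; each row sums to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Iterating pi <- pi P from any starting distribution converges to the
# stationary distribution for a chain like this one (all entries positive).
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1_000):
    pi = pi @ P

print(pi)      # the stationary distribution pi*
print(pi @ P)  # equals pi, verifying sum_i pi*_i p_ij = pi*_j
```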

The Metropolis algorithm


Algorithm 1: Metropolis


  1. initialize $x_0$;
  2. foreach iteration $t=0,1,2,\ldots$ do
  3. -- draw $x'$ from $g(x'|x_t)$;
  4. -- $\alpha \leftarrow f(x') / f(x_t)$;
  5. -- draw $u$ uniformly from $[0,1]$;
  6. -- if $u\leq\alpha$ then $x_{t+1} \leftarrow x'$;
  7. -- else $x_{t+1} \leftarrow x_t$;
  8. end

We're now ready to examine the Metropolis algorithm listed in Algorithm 1. The goal is to sample from the distribution having PDF proportional to $f$. We initialize a point $x_0$. At each step of the algorithm, a new point $x'$ is proposed from the distribution $g$, which should be easy to sample from. The proposed point is then accepted with probability $\min\left(1, f(x')/f(x_t)\right)$, i.e. according to the ratio of the target densities at the new and old points.
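Here is a minimal Python sketch of Algorithm 1 (the symmetric Gaussian proposal and all names here are our own illustrative choices):

```python
import numpy as np

def metropolis(f, x0, n_steps, sigma, rng):
    """Random-walk Metropolis: sample from the density proportional to f.

    Uses the symmetric Gaussian proposal g(x'|x) = N(x, sigma^2 I).
    """
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_steps, x.size))
    for t in range(n_steps):
        x_new = x + sigma * rng.standard_normal(x.size)  # draw x' from g
        alpha = f(x_new) / f(x)                          # acceptance ratio
        if rng.uniform() <= alpha:                       # accept w.p. min(1, alpha)
            x = x_new                                    # ... else keep x_t
        chain[t] = x
    return chain
```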

Why does the Metropolis algorithm work? We see that the transition probability is

$$P(X_{t+1}=x'|X_t=x_t) = g(x'|x_t) \cdot \min\left(1, \frac{f(x')}{f(x_t)}\right),$$

i.e. we transition with certainty if the new realization is more probable under $f$, but only with probability $\alpha$ if it is less probable. But $g(x'|x_t)=g(x_t|x')$ by construction (the proposal is symmetric), and $\min\left(1, f(x')/f(x_t)\right) \, f(x_t) = f(x') \, \min\left(f(x_t)/f(x'), 1\right)$, so we can show that detailed balance holds for this transition probability:

$$\begin{align} f(x_t) \cdot p_{x_t,x'} &= g(x'|x_t) \cdot \min\left(1, \frac{f(x')}{f(x_t)}\right) \, f(x_t) \\ &= f(x') \, \min\left(\frac{f(x_t)}{f(x')}, 1\right) \cdot g(x_t|x') \\ &= p_{x',x_t} \cdot f(x') \end{align}$$

Therefore $f$ is a stationary distribution of this chain. The uniqueness of this distribution is a consequence of the ergodicity of the chain, which means the chain converges to this stationary distribution independently of the starting state.

Example

Here is an animation showing how the Metropolis algorithm simulates points from a two-dimensional Gaussian distribution with correlation.

[Animation: Metropolis random-walk sampling of a correlated two-dimensional Gaussian]

This example is trivial to the point of not actually requiring MCMC, but being very simple it exposes the mechanisms clearly.

Initially the chain begins at position $(-4, 4)$, which is a choice we make. The initial string of realizations is not drawn from the stationary distribution $f$ as we would like, since the ergodicity of the chain has not yet taken effect in practice; what is observed is a so-called 'burn-in' period.
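Reusing the `metropolis` sketch from above, a run like the one animated might look as follows (the target's covariance, the step size $\sigma$, and the burn-in cut-off are illustrative guesses of ours; the exact parameters aren't stated here):

```python
import numpy as np

rng = np.random.default_rng(0)

# A correlated 2D Gaussian target; an unnormalized density is enough.
cov_inv = np.linalg.inv(np.array([[1.0, 0.8],
                                  [0.8, 1.0]]))

def f(x):
    """Unnormalized density of the correlated Gaussian target."""
    return np.exp(-0.5 * x @ cov_inv @ x)

chain = metropolis(f, x0=[-4.0, 4.0], n_steps=5_000, sigma=0.5, rng=rng)
samples = chain[500:]  # discard an initial stretch as burn-in
```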

It should be clear from the animation that there is a high degree of serial correlation; i.e. the points being simulated are not i.i.d., insofar as they are not serially independent. This also shows up in the autocorrelation function (ACF) for this chain, which shows that only after some 30 draws does a point become sufficiently decorrelated for practical purposes. The Metropolis algorithm is essentially a random walk, with transition probabilities given by the proposal distribution, only with the extra feature of probabilistically rejecting some proposals in order to bias the random-walk behaviour towards the stationary distribution.
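The empirical ACF is straightforward to compute; here is a standard estimator sketched with NumPy, applied to one coordinate of the `samples` array above:

```python
import numpy as np

def acf(x, max_lag):
    """Empirical autocorrelation of a scalar chain at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / (len(x) * var)
                     for k in range(max_lag + 1)])

# Values near zero by around lag 30 indicate the draws have
# effectively decorrelated.
print(acf(samples[:, 0], 50))
```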

Techniques exist to mitigate this issue to varying degrees. An expensive solution is simply to throw away many of the realizations, keeping only every $k$th; this is known as thinning. We'll discuss another approach, however.
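In code, thinning is a one-liner (with an illustrative $k$, applied to the `samples` array from the sketch above):

```python
# Keep every k-th post-burn-in draw. Simple, but discards (k-1)/k of the
# computed samples, hence expensive.
k = 30
thinned = samples[::k]
```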

In order to avoid random-walk behaviour, one can use a 'wider' proposal distribution. However, this may result in a higher fraction of proposals being rejected, as more of them land in regions of much lower probability. This results in old points being retained and repeated, which is clear in the animation below. The trade-off is summarized in the following table:

| Small $\sigma$ | Large $\sigma$ |
| --- | --- |
| Each proposal is highly correlated with the previous one | Proposals are less serially correlated |
| Small change in $f$ at the proposed point $\implies$ high acceptance probability | Possibly large change in $f$ at the proposed point $\implies$ high rejection probability |
| Random-walk behaviour | Explores the space more; multiple repeated realizations |
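Both regimes can be seen with the `metropolis` and `f` sketches from above (the two $\sigma$ values are illustrative choices):

```python
for sigma in (0.1, 5.0):
    chain = metropolis(f, x0=[-4.0, 4.0], n_steps=5_000, sigma=sigma, rng=rng)
    # A step where the chain did not move means the proposal was rejected,
    # so the fraction of moving steps approximates the acceptance rate.
    moved = np.any(np.diff(chain, axis=0) != 0, axis=1)
    print(f"sigma={sigma}: acceptance rate ~ {np.mean(moved):.2f}")
```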

Part II

One of the main issues with the algorithm outlined above is the necessary trade-off between avoiding random-walk behaviour and ensuring high acceptance rates. The idea behind Hamiltonian MCMC is to use the equations of a dynamical system to propose points far away from the current one while maintaining a high acceptance rate, thereby avoiding both random-walk behaviour and repeats.

References

  1. Neal, R. M. (2010) "MCMC using Hamiltonian dynamics", in the Handbook of Markov Chain Monte Carlo, S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng (editors), Chapman & Hall / CRC Press, pp. 113-162
  2. Neal, R. M. (1993) Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto