Our Group’s ArXiv wordcloud

The Sampling Problem

Much of our work is addressed to the “sampling problem” for smooth probability distributions in a high dimensional Euclidean space. In general the problem boils down to the computation of an integral of a nice function \(\varphi\) with respect to a known probability density \(\rho\) on a domain \(\Omega\), that is,
\[ \langle \varphi \rangle = \int_{\Omega} \varphi(x) \rho(x) {\rm d} x, \hspace{0.2in} \int_{\Omega} \rho(x) {\rm d}x = 1. \]
The catch is that our integration happens in a high dimensional space or on a manifold. For example the manifold might be an ellipsoid like this:

The red color in Figure 2 shows regions of high density, whereas the blue would be low values of the density.

In recent years, many Markov Chain Monte Carlo methods have been studied for solving this sort of problem. These can be derived in many different ways–from statistical reasoning, based on probability theory, using stochastic differential equations, etc. This is a very versatile framework for sampling.

We just construct a Markov chain which has the desired distribution as its (unique) invariant distribution.

\[ x_0 \rightarrow_T x_1 \rightarrow_T x_2 \ldots \hspace{0.2in} \rho T = \rho. \] The samples taken from the chain can be used to approximate averages with respect to the invariant measure. Of course, the devil is in the details. If we make a poor choice for our sampling process, we might just make our lives difficult.

The challenge is to find efficient MCMC sampling schemes that scale well to high dimensions and give accurate approximations with as little work as possible.

We typically work with discretized differential equations or stochastic differential equations (SDEs). Even within this setting there are many different schemes available to us, some of which are illustrated in the following diagram (Figure 3).

Some of these might be familiar if you have some experience with stochastic differential equations, but a few of these are not very well known. And this diagram is only a fraction of the methods we work with.

Let us illustrate how the sampling problem can be addressed by solving an SDE. Consider the overdamped Langevin dynamics (or Brownian dynamics) system \[ {\rm d} x = -\nabla U(x) {\rm d} t + \sqrt{2 \beta^{-1}}{\rm d W}. \] It is known that this system, under mild conditions on the potential, has a unique invariant distribution: \[ \rho_{\beta} (x) = Z^{-1} \exp(-\beta U(x)), \] where \(Z\) is a normalizing constant. The paths of the SDE, starting from \(x(0)=0\), are illustrated Figure 3 (right) for the potential shown in Figure 3 (left) (a ‘double well’ potential). The paths fan out from an initial distribution which is in this case a Dirac delta function to a final distribution of the exponential of negative potential form. We can see that the paths tends to accrete in the vicinity of the two minima of the double well potential (which makes sense, if you suppose that the system wants to minimize the potential energy).

</figure> Incidentally, the parameter \(\beta\) is a ‘reciprocal temperature’. When it is large it means that the distribution will be more concentrated near the bottoms of the wells of the energy. Small \(\beta\) corresponds to high temperature which has the effect of flattening out the density.

Many schemes have the potential to generate approximations of dynamical paths, e.g. Hamiltonian trajectories or Brownian dynamics. Numerical methods can be designed to approximate the paths of the SDE or to generate good approximation of the invariant distribution. These two aims can be quite at odds.

To illustrate, consider the two schemes illustrated below in Figure 4. The first is the Nose-Hoover-Langevin method. It produces very smooth paths–they look like the solutions of mildly perturbed Hamiltonian dynamics. The second method is something called Stochastic Line Sampling, which was proposed a few years ago in our group. Both methods produce accurate distributions for the positions of the system, but they get there in different ways.

There are quite a few different directions in research in sampling algorithms. First, we might be interested in the choice of the SDEs and in analysing their properties.

An issue that arises in many schemes has to do with the approximation of the invariant measure using numerical integrators. We cannot solve the SDEs exactly, so we have to introduce some numerical method. For example we might use the Euler-Maruyama method for solving the overdamped Langevin system: \[ x_{n+1} = x_n - \delta t \nabla U(x_n) + \sqrt{2\delta t \beta^{-1}} R_n, \hspace{0.2in} R_n \sim {\cal N}(0,1). \] The result of using this is that we slightly distort the invariant measure, that is, there is an error of size \(O(\delta t)\) that will be introduced in the averages computed using the discrete points.

It turns out that we can do a lot better than this by just thinking a bit about how we approximate the paths. For example we can get a higher order of approximation by using several evaluations of the gradient of \(U\). The only problem is that \(U\) and its gradient might be very expensive to evaluate, in the computational sense, since we might be working in a high dimensional space or the formulas involved might be very long (something that arises when we use SDEs to sample the parameters of a Bayesian posterior defined for a large data set). For such cases we can use a very simple trick to get more mileage from our gradient evaluations, that is, more accuracy. We replace the Euler-Maruyama method described above by a simple modification (referred to here as the BAOAB-limit method or LM): \[ x_{n+1} = x_n - \delta t \nabla U(x_n) + \sqrt{\delta t \beta^{-1}/2} (R_n+R_{n+1}), \hspace{0.2in} R_n \sim {\cal N}(0,1). \] By mixing \(R_n\) and \(R_{n+1}\), we create a discrete iteration in which the random noises introduced at the timesteps are correlated with each other, instead of independent as in the Euler-Maruyama method. This simple trick as a very important consequence: the averages computed using the modified scheme are second order accurate, whereas the Euler-Maruyama method is a first order integrator. For the method above we have \[ \lim_{K\rightarrow \infty} \left | K^{-1} \sum_{k=0}^{K-1} \varphi(x_k) -\langle \varphi \rangle \right | = O(\delta t^2). \]

This scheme originated in our work on the underdamped form of Langevin dynamics, where we examined a number of splitting methods. In fact the LM method is a direct consequence of a related property for a certain splitting method (“BAOAB”) which can be used for Langevin integration. The notation “A”, “B” and “O” which is now so widely used to describe integrators for Langevin dynamics also originated in our work, indeed the same paper where we introduced the LM scheme.

Another interesting issue is to compare the different formulations of Figure 2 with respect their convergence rates. To do so, we need to find a good way to quantify this convergence. We can talk about convergence of the density in some absolute sense (for example, in the Wasserstein norm) or we can think about the convergence of different observable averages.