# Research

### Molecular Dynamics

Much of our work has been on the foundations of numerical methods for simulating the dynamics of molecules. We usually assume that the model can be described in terms of the motions of the nuclei of the system, which turns out to be the natural starting point in many cases. Essentially, we have a large system of atoms that interact through forces described by potential energy functions, and the goal is to work out how the molecules move and arrange themselves over time. For example, the images below show two conformational states of a certain molecule.

These would represent more or less likely states, and it is important that we observe simulated states with a likelihood that is similar to that with which they would be observed in reality; that is, we would wish to have an accurate approximation of the natural distribution of different states.

In the simplest setting one solves equations defined by Newton’s 2nd law (F = ma), but more commonly one has to include additional constraints on the system, make the model relevant for a laboratory environment (where temperature and pressure come into play), or include additional contributions to the forces from a noisy environment or an auxiliary quantum computation. So things get more complicated, but the basic idea is straightforward: we start with a conformation and approximate the motion after a brief instant, then take another step, and another, eventually constructing a series of snapshots that describe the motion of the system. Ultimately we might be interested in how the molecule arranges itself after a (relatively) long time: the so-called “invariant distribution”.
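The stepping procedure just described can be sketched in a few lines. The following is a textbook velocity Verlet integrator, not one of our production methods, and the harmonic force is just a stand-in for a real molecular force field:

```python
import numpy as np

def velocity_verlet(q, p, force, mass, dt, n_steps):
    """Integrate Newton's equations (F = ma) with velocity Verlet,
    returning a trajectory of position snapshots."""
    traj = [q.copy()]
    f = force(q)
    for _ in range(n_steps):
        p = p + 0.5 * dt * f           # half kick
        q = q + dt * p / mass          # drift
        f = force(q)
        p = p + 0.5 * dt * f           # half kick
        traj.append(q.copy())
    return np.array(traj)

# Toy example: a single particle in a harmonic well (not a real force field).
force = lambda q: -q
traj = velocity_verlet(np.array([1.0]), np.array([0.0]), force, 1.0, 0.01, 1000)
```

Each pass through the loop is one "snapshot"; for a conservative force like this one, the scheme approximately conserves energy over long times, which is why it is the default in molecular simulation.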

As examples of our recent successes in this area, I would highlight the understanding of theoretical foundations and the development of numerical methods for Langevin dynamics. In Langevin dynamics, the Newtonian dynamics model is supplemented by dissipative and random perturbations which capture the interactions with suppressed degrees of freedom or compensate for other limitations of the conservative description. The system of interest is thus described by a set of stochastic differential equations. We have considered the formulation of the equations of motion, the properties of those equations (e.g. questions of ergodicity), and the design of accurate and efficient numerical discretisation methods. Some of our work in this area is described in the textbook I have written with Charles Matthews (Molecular Dynamics, Springer 2015).
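A minimal sketch of one step of a BAOAB-type splitting scheme for Langevin dynamics may help fix ideas: the deterministic kick (B) and drift (A) pieces are interleaved with an exact Ornstein–Uhlenbeck update (O) for the friction and noise. Parameter choices below are purely illustrative:

```python
import numpy as np

def baoab_step(q, p, force, mass, dt, gamma, kT, rng):
    """One BAOAB-type step for Langevin dynamics: half kick, half drift,
    exact friction/noise (OU) update, half drift, half kick."""
    p = p + 0.5 * dt * force(q)                      # B: half kick
    q = q + 0.5 * dt * p / mass                      # A: half drift
    c1 = np.exp(-gamma * dt)                         # O: friction + noise
    c2 = np.sqrt(kT * mass * (1.0 - c1**2))
    p = c1 * p + c2 * rng.standard_normal(p.shape)
    q = q + 0.5 * dt * p / mass                      # A: half drift
    p = p + 0.5 * dt * force(q)                      # B: half kick
    return q, p

# Sample a 1D harmonic potential U(q) = q^2/2 at kT = 1 with 1000 walkers.
rng = np.random.default_rng(0)
q, p = np.zeros(1000), np.zeros(1000)
for _ in range(5000):
    q, p = baoab_step(q, p, lambda x: -x, 1.0, 0.1, 1.0, 1.0, rng)
```

After equilibration, the empirical variance of q should approach kT = 1, the value dictated by the invariant (Gibbs) distribution.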

More recent work has generalised the algorithms to constrained systems and systems with memory (generalised Langevin), or looked at the redesign of the stochastic-dynamical model to enhance exploration in the presence of extreme metastability due to energetic or entropic barriers. We also have an important stream of research on mesoscale particle models such as dissipative particle dynamics.

### Bayesian Models for Data Analysis

We have also been progressively exploring algorithms that can be used for Bayesian parameterisation of complex models. The model can be a neural network or a traditional mathematical model. Let us denote the model by $\Phi$ and suppose that it takes input $X$ to output $Y$, given parameters $\theta$:
$Y = \Phi(X,\theta).$
Now, in a common setup, one assumes there is stochastic uncertainty in the relationship, so it is natural to consider a model in which the inputs and outputs are interpreted as random variables and the strict equality above is replaced by a statistical model, say
$\pi \left (x,y|\theta \right ) = Z^{-1}\exp \left ( -\frac{\left [ y-\Phi(x,\theta)\right ]^2}{2\sigma^2} \right ), \qquad (*)$
where, for simplicity, we take $\sigma$ fixed. This is just one (elementary) choice for the statistical model, but it will serve for this exposition.

Now given a collection of samples (data) $D=\{(x_{\alpha},y_{\alpha})\}$, we may ask what is the best choice of the parameter vector $\theta$ to fit the given model to the given data. From the Bayesian perspective, the parameters themselves can be viewed as uncertain, with a distribution defined by Bayesian inversion:
$\pi\left (\theta| D \right) \propto \pi(D|\theta)\pi_0(\theta).$
The factors here are all probability densities. $\pi_0$ is an assumed “prior” for the parameters which might, for example, restrict the parameters to a specified finite domain (or it may be Gaussian). The usual choice for the conditional distribution, $\pi(D|\theta)$, is the likelihood defined by the statistical model (*) under the assumption that all the data points are independent:
$\pi(D|\theta) = \prod_{\alpha=1}^{|D|} \pi(x_{\alpha},y_{\alpha}|\theta).$
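In code, one works with the logarithm of the (unnormalised) posterior, so the product of likelihoods becomes a sum. The sketch below assumes the Gaussian statistical model above with an isotropic Gaussian prior; the scalar linear map used for $\Phi$ is just a stand-in of our own choosing:

```python
import numpy as np

def log_posterior(theta, X, Y, phi, sigma=1.0, prior_var=10.0):
    """Unnormalised log of pi(theta|D) for the Gaussian model
    with an isotropic Gaussian prior pi_0."""
    residuals = Y - phi(X, theta)
    log_lik = -0.5 * np.sum(residuals**2) / sigma**2   # sum over data points
    log_prior = -0.5 * np.sum(theta**2) / prior_var
    return log_lik + log_prior

# Toy model: scalar linear map Phi(x, theta) = theta * x, true theta = 2.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 50)
Y = 2.0 * X + 0.1 * rng.standard_normal(50)
phi = lambda x, th: th * x
vals = [log_posterior(np.array([t]), X, Y, phi) for t in (0.0, 2.0, 4.0)]
```

Evaluated on this synthetic data, the log-posterior is largest near the true parameter value.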

It is now possible to treat the parameterisation problem in various ways. The most common approach is to optimise the parameters by finding the global maximum of the probability density $\pi(\theta|D)$. A more nuanced approach is to embrace the uncertainty in the choice of $\theta$ from the Bayesian perspective and explore, i.e. “sample”, the parameter distribution. The distribution is normally multimodal, and we may attempt to find a number of the modes or to average the parameter values across the distribution. This is the approach we are currently taking in our data analytics work.
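The simplest possible sampler illustrates the second approach: a random-walk Metropolis chain produces draws from the whole posterior rather than a single point estimate. This is only a pedagogical sketch (the toy model and all names are our own), not one of the enhanced schemes discussed later:

```python
import numpy as np

def metropolis(log_post, theta0, n_steps, step=0.2, seed=0):
    """Random-walk Metropolis sampling of an unnormalised log-density."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_post(theta0)
    chain = []
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
            theta, lp = prop, lp_prop
        chain.append(theta)
    return np.array(chain)

# Toy posterior for scalar theta: Gaussian likelihood, broad Gaussian prior.
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, 100)
Y = 2.0 * X + 0.2 * rng.standard_normal(100)
log_post = lambda th: -0.5 * np.sum((Y - th * X)**2) / 0.2**2 - 0.5 * th**2 / 10.0
chain = metropolis(log_post, 0.0, 5000)
```

After a burn-in period the chain concentrates around the true parameter, and averaging over the chain gives the posterior-mean estimate mentioned above.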

Among the specific problems we have looked at (or are currently studying) in this domain, I would mention the following. First, there is a basic problem with the simple technique I mentioned above in the context of large data sets. Every time we compute a new value of the density (or its gradient, as is often needed), we must compute a product of likelihoods across the entire data set, which will be very expensive if $|D|$ is large. One way to reduce the computational burden is to replace the full likelihood by an approximation obtained by subsampling the data set. The subsample might be just a small percentage of the full data set, so the properties of the data set, e.g. its concentration, play a critical role in determining when this approach will be successful in practice. Nonetheless it is widely used in the data science community, where it goes under the name of “stochastic gradients”. A very similar problem arises in multiscale modelling, when the forces are computed by some device which perturbs the conservative structure. This gradient noise creates an additional source of noise in the parameterisation process. It turns out that this noise can be neutralised by use of a “thermostat” method, something we studied extensively in the context of chemical/physical modelling. This remains very much an open problem and we are currently exploring it as part of other work we are doing.
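The subsampling idea can be sketched as follows, assuming a toy linear model of our own choosing: the gradient of the negative log-likelihood over a minibatch, rescaled by $|D|/\text{batch}$, is an unbiased estimator of the full-data gradient, but a noisy one:

```python
import numpy as np

def full_grad(theta, X, Y, sigma=1.0):
    """Gradient of the negative log-likelihood over the whole data set,
    for the toy model Phi(x, theta) = theta * x."""
    return -np.sum((Y - theta * X) * X) / sigma**2

def stochastic_grad(theta, X, Y, batch_size, rng, sigma=1.0):
    """Minibatch estimate: subsample the data, then rescale by
    |D| / batch_size so the estimator is unbiased."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    scale = len(X) / batch_size
    return -scale * np.sum((Y[idx] - theta * X[idx]) * X[idx]) / sigma**2

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, 10000)
Y = 2.0 * X + 0.1 * rng.standard_normal(10000)
g_full = full_grad(2.0, X, Y)
g_est = np.mean([stochastic_grad(2.0, X, Y, 100, rng) for _ in range(200)])
```

Averaged over many minibatches the estimate matches the full gradient, but any single minibatch gradient carries substantial extra noise; it is exactly this noise that the thermostat methods mentioned above are designed to neutralise.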

### Enhanced Sampling

Another important issue is the treatment of multimodality of the probability distribution. In many cases there will be multiple reasonable parameterisations of the model defined by the specific data provided. In such situations it becomes crucial to explore or sample the probability distribution defined by, for example, the molecular equilibrium state or, in the case of data analysis, Bayes’ formula. This task is computationally demanding since standard MCMC schemes will often get stuck in one or another localised region of the parameter space. In order to enhance exploration one may use innovative sampling strategies. In the case of statistical modelling, a very promising method for this purpose is to employ a collection of coupled “walkers” which provide a means of rescaling the dynamics to increase exploration, as in our ensemble quasi-Newton preconditioned MCMC scheme. Another interesting possibility is to use multiple temperatures to enhance the exploration, as in our infinite swap simulated tempering scheme developed with Eric Vanden-Eijnden and Jianfeng Lu. A third approach is based on a diffusion map analysis of simulated data (the data being parameters, in this case).
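To illustrate the multiple-temperature idea, here is a sketch of standard parallel tempering (replica exchange), a simpler relative of the infinite swap scheme, not the scheme itself. Hot replicas cross barriers easily and swap states down to the target temperature:

```python
import numpy as np

def parallel_tempering(log_pi, betas, x0, n_steps, step=0.5, seed=0):
    """Random-walk Metropolis chains at several inverse temperatures,
    with occasional replica swaps between neighbouring levels."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        # Within-chain Metropolis update at every temperature level.
        prop = x + step * rng.standard_normal(x.shape)
        logr = betas * (log_pi(prop) - log_pi(x))
        accept = np.log(rng.uniform(size=x.shape)) < logr
        x = np.where(accept, prop, x)
        # Attempt a swap between a random pair of neighbouring levels.
        i = rng.integers(len(betas) - 1)
        logs = (betas[i] - betas[i + 1]) * (log_pi(x[i + 1]) - log_pi(x[i]))
        if np.log(rng.uniform()) < logs:
            x[i], x[i + 1] = x[i + 1], x[i]
        samples.append(x[0])   # record the beta = 1 (target) chain
    return np.array(samples)

# Double-well target with modes at x = -2 and x = +2.
log_pi = lambda x: -(x**2 - 4.0)**2 / 4.0
betas = np.array([1.0, 0.3, 0.1])
samples = parallel_tempering(log_pi, betas, [2.0, 2.0, 2.0], 20000)
```

Started with all replicas in the right-hand well, the cold chain nonetheless visits both modes, which a single-temperature walker at this barrier height would do only very rarely.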
In this article we proposed algorithms which use a meta-study of the potential energy surface (and canonical distribution) to enhance exploration. The idea is to compute a local diffusion map (a sort of unsupervised machine learning) which allows us to characterise the slowest dynamical processes, and to use this knowledge to guide an accelerated process. Effectively, this method learns the collective variables adaptively. This combination of local and global exploration is probably the best way forward in large-scale molecular modelling. To illustrate what can be gained from this type of analysis, I show two snapshots taken from our paper.

The figures show the states of the molecular model of deca-alanine mapped onto a two-dimensional coarse graining using physical variables (radius of gyration and RMSD); the colouring is based on the “committor” value (obtained using the diffusion map), which colours states according to the likelihood of dynamical transition to one or the other of the two most prevalent final states. The difference between the two diagrams shown here is the temperature at which the simulation is performed (the right diagram corresponds to a higher temperature, 485 K; the left one to 300 K). If we could perform this type of study adaptively for all the primary states of the molecular system, we could map out the global dynamics. As you can imagine, this takes quite a bit of computer power. Molecular dynamics at this scale is a potential application for exascale computing!

### Molecular Dynamics Software

On the subject of computing, it is also worth mentioning our recent work on molecular dynamics software. In this domain there are some very large consortia which have produced the most important products, such as OpenMM, CHARMM, NAMD, GROMACS, Amber, and LAMMPS. These packages generally combine the contributions of dozens of authors from different disciplines. Because all classical molecular dynamics models involve compromises (quantum mechanics would be more accurate, but is computationally astronomically expensive due to the ‘curse of dimensionality’), a lot of work is put into making the force fields as accurate as possible. This means fitting empirical force field models to functions of the pair distances, coordinates of local triples, etc. It is highly nontrivial. Other substantial work goes into making the codes run fast on the latest hardware, for example GPUs.
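As a small concrete example of an empirical pair term of the kind these force fields are built from, here is the classic Lennard-Jones potential, a function of the pair distance only (real force fields combine many such terms plus bonded and electrostatic contributions):

```python
import numpy as np

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Classic empirical pair potential:
    U(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r)**6
    return 4.0 * epsilon * (sr6**2 - sr6)

def lj_force(r, epsilon=1.0, sigma=1.0):
    """Magnitude of the radial pair force, -dU/dr."""
    sr6 = (sigma / r)**6
    return 24.0 * epsilon * (2.0 * sr6**2 - sr6) / r

# The potential minimum sits at r = 2^(1/6) * sigma, where the force vanishes
# and the well depth is -epsilon.
r_min = 2.0**(1.0 / 6.0)
```

Fitting the parameters (here epsilon and sigma, one pair per atom-type combination) against quantum or experimental reference data is precisely the nontrivial part referred to above.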

When we develop new algorithms, it can take a long time to get them implemented in these codes; there are often development queues and priorities that delay implementation. We have nevertheless gotten our methods into some popular packages; for example, OpenMM implements our Langevin integrator. To speed up the transition from algorithm to software, we joined forces with Iain Bethune (originally at Edinburgh Parallel Computing Centre (EPCC); since moved to a software company) to create a software interface called MIST (“Molecular Integrator Software Tools”) that allows us to implement our algorithms very easily. The interface handles all the communication with various molecular simulation packages such as NAMD, GROMACS, etc. In this way we can take advantage of the optimisations built into those codes while also having the ability to test our algorithms rapidly on “real-world” problems. MIST is freely available and documented here.

### Software for Machine Learning

As we learned more about machine learning and sought to implement our methods in code, we decided to develop a software package to facilitate sampling the posterior distribution of complicated models like neural networks. We also wanted to explore various alternative training methods, such as the partitioned methods of this paper. For this reason we have begun to develop software packages that facilitate machine learning computations (e.g. optimisation of, or sampling from, a Bayesian posterior). One such package was developed in a project at the Alan Turing Institute, with Frederik Heber acting as the lead software developer. This package is called TATi (it stands for “Thermodynamic Analytics Toolkit”); it is built on TensorFlow and allows ensemble sampling of neural networks. An example of using TATi to train a neural network on the MNIST data set is shown below. Here we used a simplified technique to visualise the loss landscape of the neural net. For details see our JMLR paper which describes TATi.