## The Metropolis-Hastings algorithm

In this post, we will investigate the Metropolis-Hastings algorithm, which is still one of the most popular algorithms in the field of Markov chain Monte Carlo methods, even though its first appearance (see [1]) dates back to 1953, more than 60 years ago. It does, for instance, appear on the CiSE top ten list of the most important algorithms of the 20th century (I got this and the link from this post on WordPress).

Before we get into the algorithm, let us once more state the problem that the algorithm is trying to solve. Suppose you are given a probability distribution $\pi$ on some state space X (most often this will be a real Euclidean space on which you can do floating point arithmetic). You might want to imagine the state space as describing possible states of a physical system, like spin configurations in a ferromagnetic medium similar to what we looked at in my post on the Ising model. The distribution $\pi$ then describes the probability for the system to be in a specific state. You then have some quantity, given as a function f on the state space. Theoretically, this is a quantity that you can calculate for each individual state. In most applications, however, you will never be able to observe an individual state. Instead, you will observe an average, weighted by the probability of occurrence. In other words, you observe the expectation value

$\langle f \rangle = \int_X f d\pi$

of the quantity f. Thus to make a prediction that can be verified or falsified by an observation, you will have to calculate integrals of this type.

Now, in practice, this can be very hard. One issue is that in order to naively calculate the integral, you would have to traverse the entire state space, which is not feasible for most realistic problems as this tends to be a very high-dimensional space. Closely related to this is a second problem. Remember, for instance, that a typical distribution like the Boltzmann distribution is given by

$\pi(x) = \frac{1}{Z} e^{-\beta E(x)}$

The term in the numerator is comparatively easy to calculate. However, the term in the denominator is the partition function, and is itself an integral over the state space! This makes even the calculation of $\pi(x)$ for a single point in the state space intractable.

But there is hope – even though calculating the values of $\pi$ for one point might be impossible, in a distribution like this, calculating ratios of probabilities is easy, as the partition function cancels out and we are left with the exponential of an energy difference! The Metropolis-Hastings algorithm leverages this and also solves our state space problem by using a Markov chain to approximate the integral. So the idea is to build a Markov chain $X_t$ that converges and has $\pi$ as an invariant distribution, so that we can approximate the integral by

$\langle f \rangle = \int_X f d\pi \approx \frac{1}{N} \sum_{t=1}^N f(X_t)$

for large values of N.

But how do we construct a Markov chain that converges to a given distribution? The Metropolis-Hastings approach to solving this works as follows.

The first thing that we do is to choose a proposal density q on our state space X, i.e. a measurable function

$q \colon X \times X \rightarrow [0,\infty)$

such that for each x, $\int q(x,y) dy = 1$.

Then q defines a Markov chain, where the probability to transition into a measurable set A when being at a point x is given by the integral

$Q(x,A) = \int_A q(x,y) dy$

Of course this is not yet the Markov chain that we want – it has nothing to do with $\pi$, so there is no reason it should converge to $\pi$. To fix this, we now adjust the kernel to incorporate the behaviour of $\pi$. For that purpose, define

$\alpha(x,y) = \begin{cases} \min \{ 1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)} \} & \text{if } \pi(x) q(x,y) > 0 \\ 1 & \text{if } \pi(x) q(x,y) = 0 \end{cases}$

This number is called the acceptance probability, and, as promised, it only contains ratios of probabilities, so that factors like the partition function cancel and do not have to be computed.

The Metropolis-Hastings algorithm now proceeds as follows. We start with some arbitrary point $x_0$. When the chain has arrived at $x_n$, we first draw a candidate y for the next location from the proposal distribution $q(x_n, \cdot)$. We now calculate $\alpha$ according to the formula above. We then accept the proposal with probability $\alpha$, i.e. we draw a random sample U from a uniform distribution on [0,1] and accept if $U \leq \alpha$. If the proposal is accepted, we set $x_{n+1} = y$, otherwise we set $x_{n+1} = x_n$, i.e. we stay where we are.
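The update rule just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `mh_step`, `log_pi`, `sample_q`, `log_q` and `run_chain` are placeholders chosen here, and working with logarithms of the densities is merely a common numerical convenience. The example target (an unnormalised exponential density on the positive reals) and the log-normal proposal are likewise hypothetical choices, made only to have an asymmetric q to experiment with.

```python
import math
import random


def mh_step(x, log_pi, sample_q, log_q, rng):
    """Perform one Metropolis-Hastings transition starting at x.

    log_pi(x)        -- log of the target density, up to an additive constant
    sample_q(x, rng) -- draws a candidate y from q(x, .)
    log_q(x, y)      -- log of q(x, y), up to a constant independent of x and y
    """
    y = sample_q(x, rng)
    # log of pi(y) q(y, x) / (pi(x) q(x, y)); normalising constants cancel
    log_ratio = (log_pi(y) + log_q(y, x)) - (log_pi(x) + log_q(x, y))
    alpha = 1.0 if log_ratio >= 0 else math.exp(log_ratio)
    # accept the candidate with probability alpha, otherwise stay at x
    return y if rng.random() <= alpha else x


# Hypothetical example: target pi(x) proportional to exp(-x) on x > 0,
# with a multiplicative (log-normal) random walk as asymmetric proposal.
def log_pi(x):
    return -x


def sample_q(x, rng):
    return x * math.exp(rng.gauss(0.0, 1.0))


def log_q(x, y):
    # log-normal density in y centred at log(x), constants dropped
    return -math.log(y) - 0.5 * (math.log(y) - math.log(x)) ** 2


def run_chain(n_steps, x0=1.0, seed=0):
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_steps):
        x = mh_step(x, log_pi, sample_q, log_q, rng)
        chain.append(x)
    return chain
```

Since the example target is the exponential distribution with mean one, the average of a long chain returned by `run_chain` should come out close to one. Note that the proposal density enters only through the ratio q(y,x) / q(x,y), which is where the asymmetry of the proposal is corrected.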

Clearly, the $x_n$ are samples from a Markov chain, as the position at step n only depends on the position at step n-1. But it still appears to be a bit mysterious why this should work. To shed light on this, let us consider a case where the expressions above simplify a bit. So let us assume that the proposal density q is symmetric, i.e. that

$q(x,y) = q(y, x)$

This is the original Metropolis algorithm as proposed in [1]. If we also assume that $\pi$ and q are nowhere zero, the acceptance probability simplifies to

$\alpha(x,y) = \min \{ 1, \frac{\pi(y)}{\pi(x)} \}$

Thus we accept the proposal if $\pi(y) \geq \pi(x)$ with probability one. This is very similar to a random search for a global maximum – we start at some point x, choose a candidate for a point with higher value of $\pi$ at random and proceed to this point. The major difference is that we also accept candidates with $\pi(y) < \pi(x)$ with a non-zero probability. This allows the algorithm to escape a local maximum much better. Intuitively, the algorithm will still try to spend more time in regions with large values of $\pi$, as we would expect from an attempt to sample from the distribution $\pi$.
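As a concrete instance (a worked example added for illustration, assuming $\beta > 0$), take the Boltzmann distribution from above. The partition function Z appears in both $\pi(x)$ and $\pi(y)$ and drops out of the ratio:

$\frac{\pi(y)}{\pi(x)} = \frac{\frac{1}{Z} e^{-\beta E(y)}}{\frac{1}{Z} e^{-\beta E(x)}} = e^{-\beta (E(y) - E(x))}$

For a symmetric proposal, the acceptance probability therefore becomes

$\alpha(x,y) = \min \{ 1, e^{-\beta (E(y) - E(x))} \}$

Moves that lower the energy are always accepted, while moves that raise the energy by $\Delta E > 0$ are accepted with probability $e^{-\beta \Delta E}$.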

The image above illustrates this procedure. The red graph displays the distribution $\pi$. If our algorithm is currently at step $x_n$, the purpose is to move “up-hill”, i.e. to the left in our example. If we draw a point like y from q which already goes in the right direction, we will always accept this proposal and move to y. If, however, we draw a point like y’, at which $\pi$ is smaller, we would accept this point with a non-zero probability. Thus if we have reached a local maximum like the one on the right hand side of the diagram, there is still a chance that we can escape from there and move towards the real maximum to the left.

In this form, the algorithm is extremely easy to implement. All we need is a function propose that creates the next proposal, and a function p that calculates the value of the probability density $\pi$ at some point. Then an implementation in Python is as follows.

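```python
import numpy as np

steps = 10000  # number of steps (example value, replaces args.steps)

def p(x):
    # example target, up to a constant factor: standard Cauchy density
    return 1.0 / (1.0 + x**2)

def propose(x):
    # example symmetric proposal: normal distribution centred at x
    return np.random.normal(loc=x, scale=0.5)

chain = []
X = 0
chain.append(X)
for n in range(steps):
    Y = propose(X)
    U = np.random.uniform()
    # acceptance probability for a symmetric proposal
    alpha = min(1.0, p(Y) / p(X))
    if U <= alpha:
        X = Y
    chain.append(X)
```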


In the diagram below, this algorithm has been applied to a Cauchy distribution with mode zero and scale one, using a normal distribution with mean x and standard deviation 0.5 as a proposal for the next location. The chain was calculated for 500,000 steps. The diagram in the upper part shows the values of the chain during the simulation.

Then the first 100,000 steps were discarded and considered as "burn-in" time for the chain to stabilize. Out of the remaining 400,000 sample points, points were chosen with a distance of 500 time steps to obtain a sample which is approximately independent and identically distributed. This is called subsampling and typically not necessary for Monte Carlo integration (see [2], chapter 1 for a short discussion of the need of subsampling), but is done here for the sake of illustration. The resulting subsample is plotted as a histogram in the lower left corner of the diagram. The yellow line is the actual probability density.

We see that after a few thousand steps, the chain converges, but continues to have spikes. However, the sampled distribution is very close to the sample generated by the Python standard method (which is to take the quotient of two independent samples from a standard normal distribution).

In the diagram at the bottom, I have displayed how the integral of two functions ($\sin(x)$ and $\cos(x)$) approximated using the partial sums develops over time. We see that even though we still have huge spikes, the integral remains comparatively stable and converges already after a few thousand iterations. Even if we run the simulation only for 1000 steps, we already get close to the actual values zero (for $\sin(x)$ for symmetry reasons) and $\approx 0.3678$ (for $\cos(x)$, obtained using the scipy.integrate.quad integration routine).
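A minimal sketch of such an experiment (not the code used to produce the diagrams) is shown below. It runs a random-walk Metropolis chain on the standard Cauchy density and tracks the partial averages of $\sin(x)$ and $\cos(x)$. The function names, the chain length of 200,000 and the seed are assumptions made here; the exact value $e^{-1} \approx 0.3679$ for the cosine follows from the characteristic function $e^{-|t|}$ of the Cauchy distribution.

```python
import math
import numpy as np


def cauchy_density(x):
    """Standard Cauchy density (location 0, scale 1)."""
    return 1.0 / (math.pi * (1.0 + x * x))


def metropolis_cauchy(n_steps, sigma=0.5, x0=0.0, seed=0):
    """Random-walk Metropolis chain targeting the standard Cauchy density."""
    rng = np.random.default_rng(seed)
    # pre-draw all random numbers to keep the Python loop cheap
    noise = rng.normal(0.0, sigma, size=n_steps)
    u = rng.uniform(size=n_steps)
    chain = np.empty(n_steps)
    x = x0
    for n in range(n_steps):
        y = x + noise[n]
        # symmetric proposal, so only the ratio of densities is needed
        if u[n] <= cauchy_density(y) / cauchy_density(x):
            x = y
        chain[n] = x
    return chain


def running_mean(values):
    """Partial averages (1/N) * sum of the first N values, for N = 1, 2, ..."""
    return np.cumsum(values) / np.arange(1, len(values) + 1)


chain = metropolis_cauchy(200_000)
print("E[sin X] estimate:", running_mean(np.sin(chain))[-1])
print("E[cos X] estimate:", running_mean(np.cos(chain))[-1],
      "exact:", math.exp(-1))
```

Plotting `running_mean(np.cos(chain))` against the step number gives a curve of the kind described above. Both integrands are bounded, which is one reason why the averages are fairly insensitive to the occasional large excursions of the chain.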

In the second diagram in the middle row, I have plotted the autocorrelation versus the lag, as an indicator for the failure of the sample points to be independent. Recall that for two samples X and Y, the Pearson correlation coefficient is the number

$\frac{E((X-\bar{X})(Y-\bar{Y}))}{\sigma_X \sigma_Y}$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y. In our case, given a lag, i.e. a number l less than the length of the chain, we can form two samples, one consisting of the points $X_0, X_1, X_2, \dots$ and the second one consisting of the points of the shifted series $X_l, X_{l+1}, X_{l+2}, \dots$. The autocorrelation with lag l is then defined to be the correlation coefficient between these two series. In the diagram, we can see how the autocorrelation depends on the lag. We see that for a large lag, the autocorrelation becomes small, supporting our intuition that the series and the shifted series become independent. However, if we execute several simulation runs, we will also find that in some cases, the convergence of the autocorrelation is very slow, so care needs to be taken when trying to obtain a nearly independent sample from the chain.
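The definition translates almost literally into code. The following is a small sketch using `np.corrcoef`; the function name `autocorrelation` is chosen here for illustration, and both series are truncated to the same length so that they can be compared element by element.

```python
import numpy as np


def autocorrelation(chain, lag):
    """Pearson correlation between the chain and its copy shifted by `lag`."""
    chain = np.asarray(chain, dtype=float)
    if not 0 < lag < len(chain) - 1:
        raise ValueError("lag must satisfy 0 < lag < len(chain) - 1")
    original = chain[:-lag]   # X_0, X_1, ..., X_{T-1-lag}
    shifted = chain[lag:]     # X_lag, X_{lag+1}, ..., X_{T-1}
    return np.corrcoef(original, shifted)[0, 1]


# A deliberately extreme toy series: perfectly anti-correlated at lag 1,
# perfectly correlated at lag 2.
toy = [1.0, -1.0] * 5
print(autocorrelation(toy, 1), autocorrelation(toy, 2))
```

For a chain produced by a sampler, evaluating this function for a range of lags gives the kind of curve shown in the diagram.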

In practice, the autocorrelation is probably not a good measure for the convergence of a Markov chain. It is important to keep in mind that obtaining an independent sample is not the point of the Markov chain – the real point is that even though the sample is autocorrelated, we can approximate expectation values fairly well. However, I have included the autocorrelation here for the sake of illustration.

This form of proposal distributions is not the only one that is commonly used. Another choice that appears often is called an independence sampler. Here the proposal distribution is chosen to be independent of the current location x of the chain. This gives us an algorithm that resembles the importance sampling method and also shares some of the difficulties associated with it – in my notes on Markov chain Monte Carlo methods, I have included a short discussion and a few examples. These notes also contain further references and a short discussion of why and when the Markov chain underlying a Metropolis-Hastings sampler converges.

Other variants of the algorithm work by updating – in a high-dimensional space – either only one variable at a time or entire blocks of variables that are known to be independent.

Finally, if we are dealing with a state space that can be split as a product $X_1 \times X_2$, we can use the conditional probability given either $x_1$ or $x_2$ as a proposal distribution. Thus, we first fix $x_2$, and draw a new value for $x_1$ from the conditional probability for $x_1$ given the current value of $x_2$. Then we move to this new coordinate, fix $x_1$, draw from the conditional distribution of $x_2$ given $x_1$ and set the new value of $x_2$ accordingly. It can be shown (see for example [5]) that the acceptance probability is one in this case. So we end up with the Gibbs sampling algorithm that we have already used in the previous post on Ising models.
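The reason why the acceptance probability equals one can be seen in a short calculation (a sketch in density notation, assuming that all densities involved are positive). Updating the first coordinate from $x = (x_1, x_2)$ to $y = (y_1, x_2)$ uses the proposal density $q(x,y) = \pi(y_1 | x_2)$, and the reverse move has $q(y,x) = \pi(x_1 | x_2)$. Writing $\pi(x_1, x_2) = \pi(x_1 | x_2) \pi(x_2)$, where $\pi(x_2)$ denotes the marginal density, we get

$\frac{\pi(y) q(y,x)}{\pi(x) q(x,y)} = \frac{\pi(y_1 | x_2) \pi(x_2) \, \pi(x_1 | x_2)}{\pi(x_1 | x_2) \pi(x_2) \, \pi(y_1 | x_2)} = 1$

so that $\alpha(x,y) = 1$ and every proposal is accepted. The same calculation applies to the update of the second coordinate.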

Monte Carlo sampling methods are a broad field, and even though this has already been a long post, we have only scratched the surface. I invite you to consult some of the references below and / or my notes for more details. As always, you will also find the sample code on GitHub and might want to play with this to reproduce the examples above and see how different settings impact the result.

In a certain sense, this post is the last post in the series on restricted Boltzmann machines, as it provides (at least some of) the mathematical background behind the Gibbs sampling approach that we used there. Boltzmann machines are examples of stochastic neural networks that can be applied to unsupervised learning, i.e. they allow a model to learn from a sample distribution without the need for labeled data. In the next few posts on machine learning, I will take a closer look at some other algorithms that can be used for unsupervised learning.

### References

1. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys. Vol. 21, No. 6 (1953), pp. 1087-1092
2. S. Brooks, A. Gelman, G.L. Jones, X.L. Meng (eds.), Handbook of Markov chain Monte Carlo, Chapman & Hall / CRC Press, Boca Raton 2011
3. W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, Vol. 57 No. 1 (1970), pp. 97-109
4. R.M. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993
5. C.P. Robert, G. Casella, Monte Carlo Statistical Methods, Springer, New York 1999

## Recurrent and ergodic Markov chains

Today, we will look in more detail into convergence of Markov chains – what does it actually mean and how can we tell, given the transition matrix of a Markov chain on a finite state space, whether it actually converges.

So suppose that we are given a Markov chain on a finite state space, with transition probabilities described by a matrix K as in the previous post on this topic.

We have seen that the distribution of $X_n$ is given by $\mu K^n$ where $\mu$ is the initial distribution. So if $K^n$ converges, we can expect the distribution of $X_n$ to converge as well. If there is a matrix $K^\infty$ such that

$\lim_{n \rightarrow \infty} K^n = K^\infty$

then of course

$K^\infty K = (\lim_{n \rightarrow \infty} K^n) K = \lim_{n \rightarrow \infty} K^{n+1} = K^\infty$

In other words, each row $\pi$ of $K^\infty$ will have the property that $\pi K = \pi$. Interpreting K as transition probabilities, this implies that if the distribution of $X_n$ is $\pi$, the distribution of $X_{n+1}$ will again be $\pi$. Any distribution, described by a row vector $\pi$ with row sum 1, for which this holds is called an invariant distribution and traditionally denoted by $\pi$. In other words, invariant distributions correspond to eigenvectors of the transposed matrix $K^T$ with eigenvalue 1, and our argument has shown that if $K^n$ converges to $K^\infty$, then every row of $K^\infty$ will be an invariant distribution.

So convergence implies the existence of an invariant distribution. Let us next try to understand whether this invariant distribution is unique. There are obvious examples where this is not the case, the most trivial one being the unit matrix

$K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

Obviously $K^n = K$ and the chain converges, but every row vector with non-negative entries and row sum one is an invariant distribution. This chain is too rigid, because in whatever state we start, we will stay in this state forever. It turns out that in order to ensure uniqueness of an invariant distribution, we need a certain property that makes sure that the states can move around freely, which is called irreducibility.

Intuitively, we say that a Markov chain is irreducible if any state can be reached from any other state in finitely many steps. For a finite state space, this is rather easy to formalize. A finite Markov chain is called irreducible if, given two states i and j, we can find some power n such that $K^n_{ij} > 0$. In other words, given a row index i and a column index j, we can find a power n such that the element of $K^n$ at (i,j) is not zero. It turns out that if a chain is not irreducible, we can split the state space into smaller areas that the chain – once it has entered one of them – does not leave again and on which it is irreducible. So irreducible Markov chains are the building blocks of more general Markov chains, and the study of many properties of Markov chains can be reduced to the irreducible case.
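For a small transition matrix, this condition can be checked mechanically. The sketch below only looks at which entries are non-zero and tracks which states can be reached from which; it relies on the fact that in a chain with N states, any state that can be reached at all can be reached in fewer than N steps. The function name `is_irreducible` is an assumption made here, not part of any library.

```python
import numpy as np


def is_irreducible(K):
    """Check whether every state can be reached from every other state."""
    K = np.asarray(K, dtype=float)
    N = K.shape[0]
    # step[i, j] = 1 if we can get from i to j in at most one step
    step = ((np.eye(N) + K) > 0).astype(int)
    reach = step.copy()
    for _ in range(N - 1):
        # extend all known paths by at most one more step
        reach = ((reach @ step) > 0).astype(int)
    return bool(reach.all())


print(is_irreducible([[1.0, 0.0], [0.0, 1.0]]))   # unit matrix
print(is_irreducible([[0.0, 1.0], [1.0, 0.0]]))   # deterministic flip
```

The unit matrix from above fails the test, whereas a chain that deterministically flips between two states passes it.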

Now let us assume that our chain is in fact irreducible. What else do we have to ask for to make sure that it converges? Again, let us look at an obvious example.

$K = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

All even powers of this matrix are the unit matrix and all odd powers are K itself, so the Markov chain described by this transition matrix is clearly irreducible. However, there is still something wrong. If, for instance, we are in state 1 at the time t, then the probability to still be in state 1 at time t+1 is zero. Thus we are actually forced to move to state 2. In a certain sense, there is still an additional constraint on the behaviour of the chain – the state space can be split in two pieces and we cycle through these pieces with every step. Such a chain is called periodic. If a chain is periodic, it is again obvious that we can think of it as two different chains, one chain given by the random variables $X_0, X_2, \dots$ at even times and the other chain given by the random variables $X_1, X_3, \dots$ at odd times. Thus again, there is a way to split the chain into parts.
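The oscillation of the powers is easy to reproduce numerically. The snippet below is a small illustration; the "lazy" variant `K_lazy`, in which the chain stays put with probability 0.5, is an example added here to show how the periodicity disappears once the chain is allowed to remain in its current state.

```python
import numpy as np

K = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# the powers alternate between K and the unit matrix, so they have no limit
for n in range(1, 5):
    print(n, np.linalg.matrix_power(K, n).tolist())

# lazy version: with probability 0.5 do nothing, otherwise apply K
K_lazy = 0.5 * np.eye(2) + 0.5 * K
print(np.linalg.matrix_power(K_lazy, 10))
```

For the lazy chain, all entries of the powers are equal to one half, so the powers do converge.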

So let us now focus on chains that are irreducible and aperiodic. Can we tell whether the chain converges? The answer is surprisingly simple and one of the most powerful results in the theory of Markov chains: every finite Markov chain which is irreducible and aperiodic converges. Thus once we have verified aperiodicity and irreducibility, we can be sure that the limit $K^\infty$ exists. We can say even more – we have seen that the rows of the limit are invariant distributions for K. However, one can show that an irreducible Markov chain can only have one invariant distribution. Thus we can conclude that all rows of the matrix $K^\infty$ are identical. In fact, they are all equal to the (unique) invariant distribution that again corresponds to an eigenvector of $K^T$ with eigenvalue one.

This gives us two different approaches to calculating the invariant distribution and the limit $K^\infty$. First, we can form high powers $K^n$ to approximate the limit $K^\infty$. Or, alternatively, we can search for eigenvectors of the transposed matrix $K^T$ with eigenvalue 1 using one of the known algorithms to calculate eigenvectors and eigenvalues.

Let us do this for the example of the random walk on a circle that we have studied earlier. For simplicity, we will use N = 4 this time. The transition matrix with p = 0.8 is

$K = \begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}$

We can then easily implement both approaches in Python using the numpy library.
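A minimal sketch of the two computations, not necessarily identical to the code behind the results discussed in this post, could look as follows. The variable names are chosen here; only the standard numpy routines `np.linalg.matrix_power` and `np.linalg.eig` are used.

```python
import numpy as np

K = np.array([[0.2, 0.4, 0.0, 0.4],
              [0.4, 0.2, 0.4, 0.0],
              [0.0, 0.4, 0.2, 0.4],
              [0.4, 0.0, 0.4, 0.2]])

# Approach 1: approximate the limit by a high matrix power
K_20 = np.linalg.matrix_power(K, 20)
print(K_20)

# Approach 2: eigenvector of the transposed matrix for the eigenvalue 1
eigenvalues, eigenvectors = np.linalg.eig(K.T)
index = np.argmin(np.abs(eigenvalues - 1.0))   # eigenvalue closest to 1
pi = np.real(eigenvectors[:, index])
pi = pi / pi.sum()                             # normalise to a probability vector
print(pi)

# Direct check of invariance: pi K = pi
print(np.allclose(pi @ K, pi))
```

The eigenvectors returned by `np.linalg.eig` are normalised to unit Euclidean length, which is why the vector is rescaled so that its entries sum to one.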


So we see that the approach to take the eigenvalues and eigenvectors gives us – in this case – the exact result (and a direct computation confirms that this is really an eigenvector), whereas the matrix power with n=20 gives a decent approximation. Both results confirm what we have observed visually in the last post – the distribution converges, and it does in fact converge to the uniform distribution. So roughly speaking, after sufficiently many steps, we end up everywhere in the state space with equal probability.

So far, we have only discussed the case of a finite state space. In the general case, the situation is more complicated. First, we need a better definition of irreducibility, as on a general state space, the probability measure of a single point tends to be zero. It turns out that the solution is to define irreducibility with respect to a measure on the state space, which is called $\psi$-irreducibility. Once we have that, we find two potential behaviours that do not show up in the finite case.

First, it might happen that even though the chain is irreducible, it does not properly cycle through the state space and revisit every point with finite probability, but – intuitively speaking – heads off to infinity. Such a chain is called transient and does not converge.

Second, even if the chain is not transient, we might be able to find an invariant distribution (a measure in this case), but this measure might not be a probability distribution as it is not finite. Chains with this property are called null recurrent.

And finally, the concept of convergence needs to be made more precise, and it turns out to have a subtle dependency on the starting point which can be fixed by assuming so-called Harris recurrence. But all this can be done and delivers a very similar result – a convergence theorem for a large class of chains (in this case aperiodic Harris recurrent Markov chains). If you would like to get deeper into this and also see proofs for the claims made in this post and references, I invite you to take a look at my short introduction to Markov chains.

We close this post with one remark which, however, is of crucial importance for applications. Suppose we have a convergent Markov chain $X_t$, and a function f on the state space. If the $X_t$ were independent and identically distributed, we could approximate the expectation value of f as

$\int_X f(x) dP = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)$

where dP is the (common) distribution of the $X_t$. For Markov chains, however, we know that the $X_t$ are not independent – the whole point of having a Markov chain is that $X_{t+1}$ does actually depend on $X_t$. However, in our example above, we have seen that the dependency becomes trivial for large t, as all entries of the powers of the transition matrix become equal in the limit.

Of course this is a special case, but it turns out that even in general, the approximation of the expectation value by averages across the sample is still possible. Intuitively, this is not so surprising at all. If the distributions $K^n$ of the $X_n$ converge to some invariant measure $\pi$, the $X_n$ will, for large n, be approximately identically distributed, namely according to $\pi$. Moreover, heuristically we have for large n, m and a fixed starting point s:

$P(X_{n+m}=i, X_n=j) = K^m_{ji} K^n_{sj} \approx \pi_i \pi_j$

On the other hand, this product is again approximately the product

$\pi_i \pi_j \approx P(X_{n+m}=i) P(X_n=j)$

Thus, intuitively, $X_n$ and $X_{n+m}$ are almost independent if n and m are large, which makes us optimistic that the law of large numbers could actually hold.

In fact, if $X_t$ converges (again I refer to my notes for a more precise definition of convergence in this case), it turns out that the law of large numbers remains valid if we integrate f with respect to the invariant distribution $\pi$:

$\int_X f(x) d\pi = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)$

This is extremely useful in applications, where often the primary purpose is not so much to obtain a sample, but to approximate otherwise intractable integrals! Once we have found a Markov chain that converges to the invariant distribution $\pi$, we can calculate integrals over $\pi$ by running a long Markov chain until convergence and then taking the average value of the function that we want to integrate for a large number of subsequent points from this chain. In fact, this is how most applications work – and this is more or less what we also did when we applied Markov chains to the problem of calculating gradients in the PCD algorithm, where this chain is given by the state of the negative particles in subsequent iterations.
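For the random walk on the circle with four states, this can be tried out directly. The sketch below averages a function along a single trajectory; the function name `ergodic_average`, the burn-in length and the test function f(i) = i are illustrative choices made here. As the invariant distribution is uniform on the states 0, 1, 2, 3, the exact expectation value of f is 1.5.

```python
import numpy as np


def ergodic_average(K, f, n_steps, burn_in=1000, start=0, seed=0):
    """Average f along one trajectory of the chain with transition matrix K."""
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    x = start
    total = 0.0
    for t in range(burn_in + n_steps):
        x = rng.choice(N, p=K[x])      # draw the next state from row x of K
        if t >= burn_in:               # discard the first burn_in states
            total += f(x)
    return total / n_steps


K = np.array([[0.2, 0.4, 0.0, 0.4],
              [0.4, 0.2, 0.4, 0.0],
              [0.0, 0.4, 0.2, 0.4],
              [0.4, 0.0, 0.4, 0.2]])

estimate = ergodic_average(K, lambda i: float(i), n_steps=20000)
print(estimate)
```

Even though consecutive states of the trajectory are clearly dependent, the average should end up close to the expectation value under the invariant distribution.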

So given a distribution $\pi$, we are now led to the question how we can possibly find a Markov chain for which this distribution is invariant and which converges. The most general answer to this question is a class of algorithms known as Metropolis-Hastings algorithms which we study in the next post in this series.

## Finite Markov chains

In this post, we will look in more detail into an important class of Markov chains – Markov chains on finite state spaces. Many of the subtleties that are present when studying Markov chains in general state spaces do not appear in the finite case, while most of the key ideas and features of Markov chains are still visible, so this is a good starting point if you want to grasp the key points.

So let us assume that our state space X is finite. For simplicity, we label our states as $\{1, 2, \dots, N \}$ where N is the number of states. We also assume that all our points are measurable, i.e. we consider our state space as a discrete probability space.

Now consider a sequence of random variables $X_0, X_1, \dots$. How can we formalize the idea that $X_{t+1}$ depends on $X_t$ in a randomized way?

As so often in probability theory, let us model the dependency as a conditional probability. The conditional probability for $X_{t+1}$ to take on a value i given $X_t$

$P(X_{t+1} = i | X_t)$

considered as a function of $X_t$ will assign a conditional probability to each of the states i and for each value of $X_t$. Therefore we can write this as a matrix

$P(X_{t+1} = i | X_t = j) = K_{ji}$

To obtain a time-homogeneous Markov chain, we also assume that the matrix K does not depend on the time t. Therefore we define a Markov chain on X to be a sequence $\{X_t\}_t$ of random variables taking values in X with the Markov property saying that for all times t, the conditional distribution for $X_{t+1}$ given all previous values $X_0, X_1, \dots, X_t$ only depends on $X_t$, i.e.

$P(X_{t+1} | X_t, X_{t-1}, \cdots ) = P(X_{t+1} | X_t)$

We also require that this conditional probability is independent of t and is therefore given by a matrix K as in the formula above.

Markov chains on finite state spaces are often visualized as a graph. Suppose, for instance, that our state space contains only two elements: $X = \{ 1 , 2\}$. We can think of the combined values $X_t$ for all times t as a history of states or as a random walk in the state space. The Markov property then means that the probability to transition into a next state does not depend on the full history, but only on the current state – Markov chains do not have a memory.

In our state space with two elements only, the Markov chain is then described by four transition probabilities: the probability to stay in state 1 when the chain is in state 1, the probability to move to state 2 when being in state 1, the probability to stay in state 2 and the probability to move to state 1 after being in state 2. This can be visualized as follows.

Let us now calculate a few probabilities to get an idea for the relevant quantities in such a model. First, suppose that at time 0, the model is in state j with probability $\mu_j$. What is the probability to be in state i after one step? Of course we can write

$P(X_1 = i) = \sum_j P(X_1 = i, X_0 = j)$

Now, according to the rules of conditional probabilities, we can express the joint probability as follows.

$P(X_1 = i, X_0 = j) = P(X_1 = i | X_0 = j) P(X_0 = j)$

Plugging this into our previous expression and using the definitions of K and $\mu$, we now obtain

$P(X_1 = i) = \sum_j K_{ji} \mu_j$

Thus if we think of $\mu$ as a row vector, then the probability after one step is described by the matrix product $\mu K$.

Now let us calculate a slightly different quantity. Assume that we know that a chain starts at j. What is the probability to be at i after two steps? Using once more the rules of conditional probability and the Markov property, we can write

$P(X_2=i | X_0=j) = \sum_k P(X_2=i | X_1=k)P(X_1=k | X_0=j)$

Intuitively, this is very appealing. To get from j to i in two steps, we can take the way via any intermediate state k. To get the total probability, we simply sum up all these different probabilities! If you have ever seen path integrals in quantum mechanics, this idea will look familiar.

Again, we can write this in matrix notation. Each of the conditional probabilities on the right hand side of the expression above is a matrix element, and we find that

$P(X_2 = i | X_0 = j) = \sum_k K_{ki} K_{jk}$

so that the probability to get in two steps from j to i is simply given by the elements of the matrix $K^2$. Similarly, the n-step transition probabilities are the entries of the matrix $K^n$.
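Both formulas are plain matrix products and can be checked with numpy. The two-state transition matrix below is a hypothetical example with numbers chosen here for illustration, and the states are labelled 0 and 1 to match Python indexing.

```python
import numpy as np

# Hypothetical two-state chain: K[j, i] = P(X_{t+1} = i | X_t = j)
K = np.array([[0.9, 0.1],
              [0.5, 0.5]])

mu = np.array([1.0, 0.0])        # start in the first state with certainty

after_one_step = mu @ K          # distribution of X_1
K2 = K @ K                       # two-step transition probabilities
after_two_steps = mu @ K2        # distribution of X_2

print(after_one_step)
print(K2)
print(after_two_steps)
```

For example, the two-step probability of being back in the first state when starting there is the sum over both intermediate states, 0.9 · 0.9 + 0.1 · 0.5 = 0.86, which is the upper left entry of `K2`.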

Let us look at an example to understand what is going on here, which is known as a finite random walk on a circle. Our state space consists of N distinct points which we place arbitrarily on a circle. We then define a Markov chain as follows. We start at some arbitrary point x. In each step, we move along the circle to a neighbouring point – one point to the left with probability 1/2 and one point to the right with probability 1/2. By the very definition, the transition probabilities do not change over time and depend only on the current state, so this is a finite Markov chain. If we label the points on the circle by $1, 2, \dots, N$, then the transition matrix is given by a matrix of the form (for instance for N=4)

$K = \begin{pmatrix} 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \end{pmatrix}$

More generally, we can also allow the process to stay where it is with probability 1 – p, where p is then the probability to move, which leads us to the matrix

$K = \begin{pmatrix} 1-p & \frac{p}{2} & 0 & \frac{p}{2} \\ \frac{p}{2} & 1-p & \frac{p}{2} & 0 \\ 0 & \frac{p}{2} & 1-p & \frac{p}{2} \\ \frac{p}{2} & 0 & \frac{p}{2} & 1 - p \end{pmatrix}$
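For general N, the matrix can be generated with a few lines of code. The following is a small sketch; the function name `circle_transition_matrix` is chosen here, and the states are numbered 0, ..., N-1 so that the neighbours can be found with modular arithmetic.

```python
import numpy as np


def circle_transition_matrix(N, p):
    """Transition matrix of the random walk on a circle with N points.

    The walker stays put with probability 1 - p and moves to the left or
    right neighbour with probability p / 2 each. States are 0, ..., N-1.
    """
    K = np.zeros((N, N))
    for i in range(N):
        K[i, i] += 1.0 - p
        K[i, (i - 1) % N] += p / 2.0
        K[i, (i + 1) % N] += p / 2.0
    return K


print(circle_transition_matrix(4, 0.8))
```

Using `+=` instead of plain assignment keeps the rows normalised even in the degenerate case N = 2, where the left and the right neighbour coincide.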

Let us try to figure out whether the target distribution, given by the matrix $K^n$ for large n, somehow converges.

To see this, we do two numerical experiments. First, we can easily simulate a random walk. Suppose that we have a function draw which accepts a distribution (given by a vector p whose elements add up to one) and draws a random value according to that distribution, i.e. it returns 1 with probability $p_0$, 2 with probability $p_1$ and so on. We can then simulate a random walk on the circle as follows.

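```python
import numpy as np

def draw(p):
    # one possible implementation of the helper assumed above:
    # returns k in {1, ..., len(p)} with probability p[k-1]
    return np.random.choice(len(p), p=p) + 1

def simulate_chain(N, p, steps=100, start=5):
    chain = []
    x = start % N
    chain.append(x)
    for i in range(steps):
        # draw returns 1, 2 or 3, so x changes by -1, 0 or +1 (modulo N)
        x = (x + draw([p / 2.0, 1.0 - p, p / 2.0]) - 2) % N
        chain.append(x)
    return chain
```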


Here N is the number of points on the circle, steps is the number of simulation steps that we run and start is the starting point. The function draw then returns 1 with probability p/2, 3 with probability p/2 and 2 with probability 1 – p. Thus we move with probability p/2 to the right, with probability p/2 to the left and stay where we are with probability 1 – p. If we set p = 0.8 and N = 10, this results in the following transition matrix.

$\begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}$

We can visualize the results if we execute a large number of runs and draw histograms of the resulting positions after 100, 200, 300, … steps. The result will look roughly like this (for this image, I have used p = 0.8 and N=10):

We see that after a few hundred steps, this seems to converge – actually this starts to look very much like a uniform distribution. As we already know that the distribution after n steps is given by the matrix $K^n$, it makes sense to look at high powers of this matrix as well. To visualize this, I will use a method that I have seen in MacKay's excellent book and that works as follows.

Visualizing one row of a matrix is not difficult. We can plot the entries in a two-dimensional diagram, where the x-axis corresponds to the column index and the y-axis corresponds to the value. A flat line is then a row in which all entries have the same value.

To display a full matrix, we use colors – we assign one color to each row index and plot the individual rows as just described. If we do this for the powers of the matrix K with the parameters p = 0.8 and N = 10 (and only choose some rows, for instance 1, 4, 7, to not run out of colors…) we obtain an image like the one below.

We can clearly see that the powers of the matrix K converge towards a matrix

$K^\infty = \lim_{n \rightarrow \infty} K^n$

where all entries in each row seem to have the same value. We have seen that the distribution after n steps assuming an initial distribution described by a row vector $\mu$ is $\mu K^n$. Therefore the limit distribution is $\mu K^\infty$. If we take $\mu$ to be a unit vector, we find that the rows of the matrix $K^\infty$ do actually represent the limit distribution of all chains that have started at a specific point i of the state space. Therefore the entries in the rows need to sum up to one, and thus, if they are all equal, need to be equal to 1 / N. If you calculate and print out high powers of the matrix K, you will in fact see that they approach the matrix where all entries are 0.1 (as we have chosen N = 10 in this example).
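To try this out, here is a minimal sketch that computes powers of K by repeated multiplication in plain Python and reports how far the entries are from 1 / N. The helper names mat_mul and mat_pow are illustrative assumptions; with NumPy available, a library routine for matrix powers would do the same job.

```python
def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def mat_pow(K, n):
    """n-th power of a square matrix, computed by repeated multiplication."""
    size = len(K)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, K)
    return result


N, p = 10, 0.8
# Lazy random walk on the circle: stay with probability 1 - p,
# move to either neighbour with probability p / 2
K = [[(1.0 - p) if i == j else (p / 2.0 if (i - j) % N in (1, N - 1) else 0.0)
      for j in range(N)] for i in range(N)]

for n in (1, 10, 100, 200):
    P = mat_pow(K, n)
    max_dev = max(abs(entry - 1.0 / N) for row in P for entry in row)
    print(f"n = {n:3d}: largest deviation from 1/N is {max_dev:.3e}")
```

Row i of `mat_pow(K, n)` is the distribution after n steps of a chain started at point i, so the printed deviation measures how far the worst starting point still is from the uniform distribution.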

This completes our short introduction to Markov chains. We have seen that Markov chains model stochastic processes in discrete time in which the state at step t+1 depends only on the state at step t and the dependency is given by a function independent of the current time. In finite state spaces, Markov chains are described by a transition matrix K. The i-th row of the matrix $K^n$ is the distribution of chains starting at point i after n steps. Consequently, convergence properties of the Markov chains can be related to the convergence of high powers $K^n$ and the apparatus of linear algebra can be applied.

In the next post, we will learn more about convergence – how it can be made precise, and how we can tell whether a given Markov chain converges. This will then allow us to construct Markov chains that converge towards a given target distribution and use them for sampling.

## Monte Carlo methods and Markov chains – an introduction

In our short series on machine learning, we have already applied sampling methods several times. We have used and implemented Gibbs sampling, and so far we have simply accepted that the approach works. Time to look at this in a bit more detail in order to understand why it works and what the limitations of the algorithm are.

Regardless of whether you want to simulate ferromagnetic behavior in an Ising model, run a Hopfield network or train a Boltzmann machine, the fundamental problem that we have to solve is always the same. We are given a probability distribution P living on some state space X, and we are trying to create a sample, i.e. a set of points in the state space such that the probability for a point x to appear in this sample is equal to the probability P(x) given by the probability distribution.

The naive approach to this is simple: visit every point x in the state space and include that point with probability P(x). However, it is clear that with a large state space, this approach is not computationally feasible. In the example of a Boltzmann machine trained on handwritten digits with 28 x 28 pixels, our state space has $2^{784}$ elements, and there is no way we can visit them all one by one. Instead, we would need something like a randomized walk through the state space. We could start with some randomly chosen state $X_0$, then – using a randomized transition rule – move on to a point $X_1$ and so forth. Intuitively, we want to select our transition rule in such a way that the state space elements $X_i$ selected in this way form a sample, i.e. such that our chain of state space locations visits regions with large values of P(x) more often than regions with low values of P(x). Thus we would systematically ignore regions of the state space with low probability, which would greatly reduce the number of states that we have to visit to obtain a valid sample.

So, from a mathematical point of view, we consider a sequence of random variables $X_i$ such that $X_{i+1}$ is related to $X_i$ by some randomized transition rule. We also assume that this rule does not depend on the index i, which is usually called the time. Thus we have a sequence of random variables $X_i$ which is not independent, but almost independent – $X_{i+1}$ depends only on $X_i$ and in a way that itself does not vary with i. This is called a Markov chain (more precisely, a time homogeneous Markov chain).

Let us consider an example to illustrate the idea. As our state space, we choose the space of real numbers. We fix a starting value, say $X_0 = 0$, and we obtain the next value by adding a number that we draw from a standard normal distribution. Thus, mathematically, we assume that the $W_n$ are identically distributed and independent random variables, all distributed according to the standard normal distribution, and set

$X_{n+1} = X_n + W_n$

This is a Markov chain: the value $X_{n+1}$ depends only on $X_n$, not on any earlier elements of the chain. The transition rule is randomized, but itself does not depend on the time step n – all $W_n$ have the same distribution. Let us implement this in Python to see how it works (the full notebook can be downloaded here).
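A minimal sketch of such a simulation, using only Python's standard library, might look as follows. The function name random_walk, the seed and the number of steps are illustrative choices and not taken from the notebook.

```python
import random


def random_walk(steps, start=0.0):
    """Return the list [X_0, X_1, ..., X_steps] with X_{n+1} = X_n + W_n."""
    xs = [start]
    for _ in range(steps):
        # W_n is drawn from the standard normal distribution
        xs.append(xs[-1] + random.gauss(0.0, 1.0))
    return xs


random.seed(42)  # fixed seed only to make the run repeatable
walks = [random_walk(1000) for _ in range(3)]
for walk in walks:
    print(f"position after {len(walk) - 1} steps: {walk[-1]:.2f}")
```

Plotting each list against its index, for instance with Matplotlib, shows the individual trajectories.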

Here we have created and displayed three different random walks. All of them start at the same point (zero), and all of them follow the same transition rule, but as the transition rule is stochastic in nature, they all develop differently.

Now let us try to turn the view on this upside down. This time, we execute a larger number – 1000 – of random walks with 5000 steps each. But instead of plotting the sequence of points $X_i$ for every walk, we display the distribution of the last point of each walk, i.e. we plot the distribution of the random variable $X_{4999}$.
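This experiment can be sketched in plain Python as well. The function name final_positions, the seed and the bin width are illustrative assumptions, and the example call uses smaller numbers than in the text so that it finishes quickly; passing `n_walks=1000` and `n_steps=5000` reproduces the setup described above.

```python
import random
import statistics
from collections import Counter


def final_positions(n_walks, n_steps, seed=None):
    """Run n_walks independent random walks and return their last points."""
    rng = random.Random(seed)
    finals = []
    for _walk in range(n_walks):
        x = 0.0
        for _step in range(n_steps):
            x += rng.gauss(0.0, 1.0)  # standard normal increment
        finals.append(x)
    return finals


finals = final_positions(n_walks=200, n_steps=500, seed=1)
print("mean of the end points:", statistics.mean(finals))
print("standard deviation of the end points:", statistics.stdev(finals))

# Coarse text histogram: group the end points into bins of width 10
bin_width = 10
counts = Counter(int(x // bin_width) * bin_width for x in finals)
for left in sorted(counts):
    print(f"{left:5d} {'#' * counts[left]}")
```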

This does in fact look more familiar. We see that most walks end up being close to zero at the end – steps in the positive direction and steps in the negative direction cancel each other. Only very few walks end up at an extreme position close to plus or minus 200 – this is not surprising either: to arrive at such an extreme point, the 5000 draws from the standard normal distribution would have to lean persistently in one direction instead of cancelling out, which is a rather unlikely chain of events.

In this case, the distribution does actually not converge if we increase the number of steps – you can try this out and play with different values, e.g. replace 5000 by 50000 (this will take some time to run) and look at the distribution of $X_{49999}$ – you will see that this is now spread out to roughly plus / minus 750 (in fact, the distribution is obtained as a multiple convolution of the standard normal distribution with itself and thus is again a normal distribution).
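The growing spread can be made explicit with a short calculation. With $X_0 = 0$ and independent standard normal increments $W_k$, we have

$X_n = \sum_{k=0}^{n-1} W_k, \quad E[X_n] = 0, \quad \mathrm{Var}(X_n) = \sum_{k=0}^{n-1} \mathrm{Var}(W_k) = n$

so $X_n$ is normally distributed with mean zero and standard deviation $\sqrt{n}$. After roughly 5000 steps, this standard deviation is about 70.7, and after roughly 50000 steps it is about 223.6. The width of the distribution therefore keeps growing with the number of steps, which is why no limit distribution exists.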

Even though the distribution does not converge, we have been able to sample from a specific distribution – in this case the distribution after 5000 steps – using only the ability to sample from a different distribution – in this case the standard normal distribution. Obviously, in this special case, the result is trivial, but the principle that we have found looks interesting. Can we generalize this approach to obtain sampling methods for target distributions that are otherwise intractable?

Now this is exactly the idea behind the sampling approach that is commonly known as Markov chain Monte Carlo (MCMC) and which has become very popular, with applications to complex simulations in theoretical physics, to machine learning and even asset pricing and value-at-risk calculations.

So let us summarize how the MCMC approach works. Given a target distribution P(x), we first construct a Markov chain that converges to that target distribution. Once we have that, we can simulate a large number of runs and use the resulting points as our sample (in fact, in many cases a single run is sufficient, as we will see later). Thus in order to utilize Markov chains for sampling, we need to understand under what conditions a Markov chain converges and, if it converges, how we can relate the target distribution to the transition rule. We will look into these points in more detail in future posts in this series.
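As a toy illustration of this recipe, we can take a chain whose limit distribution is known: a lazy random walk on a circle of N points, which converges to the uniform distribution. The sketch below runs many independent chains and uses their end points as the sample. The function name run_chain and all parameter values are illustrative assumptions, and a uniform target is of course a case where we would not need MCMC in practice.

```python
import random
from collections import Counter


def run_chain(N, p, steps, start, rng):
    """Run one lazy random walk on a circle of N points, return its end point."""
    x = start % N
    for _ in range(steps):
        u = rng.random()
        if u < p / 2.0:
            x = (x - 1) % N   # step left with probability p / 2
        elif u < p:
            x = (x + 1) % N   # step right with probability p / 2
        # otherwise stay where we are, with probability 1 - p
    return x


rng = random.Random(0)  # fixed seed only to make the run repeatable
N, p, steps, runs = 10, 0.8, 200, 2000
sample = [run_chain(N, p, steps, start=5, rng=rng) for _ in range(runs)]

# Relative frequency of each state in the sample
freq = Counter(sample)
for state in range(N):
    print(state, freq[state] / runs)
```

Even though all chains start at the same point, the relative frequencies of the end points should all come out close to 1 / N.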