In our short series on machine learning, we have already applied sampling methods several times. We have used and implemented Gibbs sampling, and so far we have simply accepted that the approach works. Time to look at this in a bit more detail in order to understand why it works and what the limitations of the algorithm are.
Regardless of whether you want to simulate ferromagnetic behavior in an Ising model, run a Hopfield network or train a Boltzmann machine, the fundamental problem that we have to solve is always the same. We are given a probability distribution P living on some state space X, and we are trying to create a sample, i.e. a set of points in the state space such that the probability for a point x to appear in this sample is equal to the probability P(x) given by the probability distribution.
The naive approach to this is simple: visit every point x in the state space and include that point with probability P(x). However, it is clear that with a large state space, this approach is not computationally feasible. In the example of a Boltzmann machine trained on handwritten digits with 28 x 28 pixels, our state space has 2784 elements, and there is no way we can visit them all one by one. Instead, we would need something like a randomized walk through the state space. We could start with same randomly chosen state X0, then – using a randomized transition rule – move on to a point X1 and so forth. Intuitively, we want to select our transition rule in such a way that the state space elements Xi selected in this way form a sample, i.e. such that our chain of state space locations visits regions with large values for P(x) more often than regions with low values of P(x). Thus we would systematically ignore regions of the state space with low probability which would greatly reduce the number of states that we have to visit to obtain a valid sample.
So, from a mathematical point of view, we consider a sequence of random variables Xi such that Xi+1 is related to Xi by some randomized transition rule. We also assume that this rule does not depend on the index i which is usually called the time. Thus we have a sequence of random variables Xi which is not independent, but almost independent – Xi+1 depends only on Xi and in way that itself does not vary with i. This is called a Markov chain (more precisely, a time homogeneous Markov chain).
Let us consider an example to illustrate the idea. As our state space, we choose the space or real numbers. We fix a starting value, say X0=0, and we obtain the next value by adding a number that we draw from a standard normal distribution. Thus, mathematically, we assume that Wn are identically distributed and independent random variables, all distributed according to the standard normal distribution, and set
This is a Markov chain: the value Xn+1 depends only on Xn, not on any earlier elements of the chain. The transition rule is randomized, but itself does not depend on the time step n – all Wn have the same distribution. Let us implement this in Python to see how it works (the full notebook can be downloaded here).
Here we have created and displayed three different random walks. All of them start at the same point (zero), and all of them follow the same transformation rule, but as the transformation rule is stochastic in nature, they all develop differently.
Now let us try to turn the view on this upside down. This time, we execute a larger number – 1000 – of random walks with 5000 steps each. But instead of plotting the sequence of points Xi for every walk, we display the distribution of the last point of each walk, i.e. we plot the distribution of the random variable X4999.
This does in fact look more familiar. We see that most walks end up being close to zero at the end – steps in the positive direction and steps in the negative direction cancel each other. Only very few walks end up at an extreme position close to plus or minus 200 – this is not surprising as well, to arrive at an extreme point, we would need to draw 5000 times in a row an extreme value from the random normal distribution, which is a rather unlikely chain of events.
In this case, the distribution does actually not converge if we increase the number of steps – you can try this out and play with different values, i.e. replace 5000 by 50000 (this will run some time) and look at the distribution of X49999 – you will see that this is now spread out to roughly plus / minus 750 (in fact, the distribution is obtained as a multiple convolution of the standard normal distribution with itself and thus is again a normal distribution).
Even though the distribution does not converge, we have been able to sample from a specific distribution – in this case the distribution after 5000 steps – using only the ability to sample from a different distribution – in this case the random normal distribution. Obviously, in this special case, the result is trivial, but the principle that we have found looks interesting. Can we generalize this approach to obtain sampling methods for target distributions that are otherwise intractable?
Now this is exactly the idea behind the sampling approach that is commonly known as Markov chain Monte Carlo (MCMC) and which has become very popular, with applications to complex simulations in theoretical physics, to machine learning and even asset pricing and value-at-risk calculations.
So let us summarize how the MCMC approach works. Given a target distribution P(x), we first construct a Markov chain that converges to that target distribution. Once we have that, we can simulate a large number of runs and use the resulting points as our sample (in fact, in many cases we can also do with one run only, as we will see later). Thus in order to utilize Markov chains for sampling, we would need to understand under what conditions a Markov chain converges and if it converges, how we can relate the target distribution to the transformation rule. We will look into these points in more detail in future posts in this series.