Machine learning – Page 4

Recurrent and ergodic Markov chains

Today, we will look in more detail into convergence of Markov chains – what does it actually mean and how can we tell, given the transition matrix of a Markov chain on a finite state space, whether it actually converges.

So suppose that we are given a Markov chain on a finite state space, with transition probabilities described by a matrix K as in the previous post on this topic.

We have seen that the distribution of X_n is given by $\mu K^n$ where $\mu$ is the initial distribution. So if Kⁿ converges, we can expect the distribution of X_n to converge as well. If there is a matrix $K^\infty$ such that

$\lim_{n \rightarrow \infty} K^n = K^\infty$

then of course

$K^\infty K = (\lim_{n \rightarrow \infty} K^n) K = \lim_{n \rightarrow \infty} K^{n+1} = K^\infty$

In other words, each row $\pi$ of $K^\infty$ will have the property that $\pi K = \pi$ . Interpreting K as transition probabilities, this implies that if the distribution of X_n is $\pi$ , the distribution of X_n+1 will again be $\pi$ . Any distribution, described by a row vector $\pi$ with row sum 1, for which this holds is called an invariant distribution and traditionally denoted by $\pi$ . In other words, invariant distributions correspond to eigenvectors of the transposed matrix K^T with eigenvalue 1, and our argument has shown that if $K^n$ converges to $K^\infty$ , then every row of $K^\infty$ will be an invariant distribution.

So convergence implies the existence of an invariant distribution. Let us next try to understand whether this invariant distribution is unique. There are obvious examples where this is not the case, the most trivial one being the unit matrix

$K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

Obviously $K^n = K$ and the chain converges, but every vector with row sum one is an invariant distribution. This chain is too rigid, because in whatever state we start, we will stay in this state forever. It turns out that in order to ensure uniquess of an invariant distribution, we need a certain property that makes sure that the states can move around freely which is called irreducibility.

Intuitively, we say that a Markov chain is irreducible if any state can be reached from any other state in finitely many steps. For a finite state space, this is rather easy to formalize. A finite Markov chain is called irreducible if, given two states i and j, we can find some power n such that $K^n_{ij} > 0$ . In other words, given a row index i and a column index j, we can find a power n such that the element of Kⁿ at (i,j) is not zero. It turns out that if a chain is not irreducible, we can split the state space into smaller areas that the chain – once it has entered one of them – does not leave again and on which it is irreducible. So irreducible Markov chains are the buildings blocks of more general Markov chains, and the study of many properties of Markov chains can be reduced to the irreducible case.

Now let us assume that our chain is in fact irreducible. What else do we have to ask for to make sure that it converges? Again, let us as look at an obvious example.

$K = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

All even powers of this matrix are the unit matrix and all odd powers are K itself, so the Markov chain described by this transition matrix is clearly irreducible. However, there is still something wrong. If, for instance, we are in state 1 at the time t, then the probability to still be in state 1 at time t+1 is zero. Thus we are actually forced to move to state 2. In a certain sense, there is still an additional constraint on the the behaviour of the chain – the state space can be split in two pieces and we cycle through these pieces with every step. Such a chain is called periodic. If a chain is periodic, it is again obvious that we can think of it as two different chains, one chain given by the random variables X₀, X₂,… at even times and the other chain given by the random variables X₁, X₃, … at odd times. Thus again, there is a way to split the chain into parts.

So let us now focus on chains that are irreducible and aperiodic. Can we tell whether the chain converges? The answer is surprisingly simple and one of the most powerful results in the theory or Markov chains: every finite Markov chain which is irreducible and aperiodic converges. Thus once we have verified aperiodicity and irreducibility, we can be sure that the limit $K^\infty$ exists. We can say even more – we have seen that the rows of the limit are invariant distributions for K. However, one can show that an irreducible Markov chain can only have one invariant distribution. Thus we can conclude that all rows of the matrix $K^\infty$ are identical. In fact, they are all equal to the (unique) invariant distribution that again corresponds to an eigenvector of K^T with eigenvalue one.

This gives us two different approaches to calculating the invariant distribution and the limit $K^\infty$ . First, we can form high powers Kⁿ to approximate the limit $K^\infty$ . Or, alternatively, we can search for eigenvectors of the transposed matrix K^T with eigenvalue 1 using one of the known algorithms to calculate eigenvectors and eigenvalues.

Let us do this for the example of the random walk on a circle that we have studied earlier. For simplicity, we will use N = 4 this time. The transition matrix with p = 0.8 is

$K = \begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}$

We can then easily implement both approaches in Python using the numpy library.

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

view raw TransitionMatrixMarkovChain.ipynb hosted with ❤ by GitHub

So we see that the approach to take the eigenvalues and eigenvectors gives us – in this case – the exact result (and a direct computation confirms that this is really an eigenvector), whereas the matrix power with n=20 gives a decent approximation. Both results confirm what we have observed visually in the last post – the distribution converges, and it does in fact converge to the uniform distribution. So roughly speaking, after sufficiently many steps, we end up everyhwere in the state space with equal probability.

So far, we have only discussed the case of a finite state space. In the general case, the situation is more complicated. First, we need a better definition of irreducibility, as on a general state space, the probability measure of a single point tends to be zero. It turns out that the solution is to define irreducibility with respect to a measure on the state space, which is called $\psi$ -irreducibility. Once we have that, we find two potential behaviours that do not show up in the finite case.

First, it might happen that even though the chain is irreducible, it does not properly cycle through the state space and revisits every point with finite probability, but – intuitively speaking – heads off to infinity. Such a chain is called transient and does not converge.

Second, even if the chain is not transient, we might be able to find an invariant distribution (a measure in this case), but this measure might not be a probability distribution as it is not finite. Chains with this property are called null recurrent.

And finally, the concept of converge needs to be made more precise, and it turns out to have a subtle dependency on the starting point which can be fixed by assuming so called Harris recurrence. But all this can be done and delivers a very similar result – a convergence theorem for a large class of chains (in this case aperiodic Harris recurrent Markov chains). If you would like to get deeper into this and also see proofs for the claims made in this post and references, I invite you to take a look at my short introduction to Markov chains.

We close this post with one remark which, however, is of crucial importance for applications. Suppose we have a convergent Markov chain X_t, and a function f on the state space. If the X_t were independent and identically distributed, we could approximate the expectation value of f as

$\int_X f(x) dP = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)$

where dP is the (common) distribution of the X_t. For Markov chains, however, we know that the X_t are not independent – the whole point of having a Markov chain is that X_t+1 does actually depend on X_t. However, in our example above, we have seen that the dependency gets trivial for large t as in this example, all entries in the transition matrix become equal in the limit.

Of course this is a special case, but it turns out that even in general, the approximation of the expectation value by averages across the sample is still possible. Intuitively, this is not so surprising at all. If the distributions $K^n$ of the $X_n$ converge to some invariant measure $\pi$ , the $X_n$ will, for large n, be approximately identically distributed, namely according to $\pi$ . Moreover, heuristically we have for large n, m and a fixed starting point s:

$P(X_{n+m}=i, X_n=j) = K^m_{ji} K^n_{sj} \approx \pi_i \pi_j$

On the other hand, this product is again approximately the product

$\pi_i \pi_j \approx P(X_{n+m}=i) P(X_n=j)$

Thus, intuitively, X_n and X_m are almost independent if n and m are large, which makes us optimistic that the law of large numbers could actually hold.

In fact, if X_t converges (again I refer to my notes for a more precise definition of convergence in this case), it turns out that the law of large numbers remains valid if we integrate f with respect to the invariant distribution $\pi$ :

$\int_X f(x) d\pi = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)$

This is extremely useful in applications, where often the primary purpose is not so much to obtain a sample, but to approximate otherwise intractable integrals! Once we have found a Markov chain that converges to the invariant distribution $\pi$ , we can calculate integrals over $\pi$ by running a long Markov chain until converge and then taking the average value of the function that we want to integrate for a large number of subsequent points from this chain. In fact, this is how most applications work – and this is more or less what we also did when we applied Markov chains to the problem of calculating gradients in the PCD algorithm, where this chain is given by the state of the negative particles in subsequent iterations.

So given a distribution $\pi$ , we are now lead to the question how we can possibly find a Markov chain for which this distribution is invariant and which converges. The most general answer to this question is a class of algorithms known as Metropolis-Hastings algorithms which we study in the next post in this series.

Finite Markov chains

In this post, we will look in more detail into an important class of Markov chains – Markov chains on finite state spaces. Many of the subtleties that are present when studying Markov chains in general state spaces do not appear in the finite case, while most of the key ideas and features of Markov chains are still visible, so this is a good starting point if you want to grasp the key points.

So let us assume that our state space X is finite. For simplicity, we label our states as $\{1, 2, \cdots N \}$ where N is the number of states. We also assume that all our points are measurable, i.e we consider our state space as a discrete probability space.

Now consider a sequence of random variables X₀, X₁, …. How can we formalize the idea that X_t+1 depends on X_t in a randomized way?

As so often in probability theory, let us model the dependency as a conditional probability. The conditional probability for X_t+1 to take on a value i given X_t

$P(X_{t+1} = i | X_t)$

considered as a function of X_t will assign a conditional probability to each of the states i and for each value of X_t. Therefore we can write this as a matrix

$P(X_{t+1} = i | X_t = j) = K_{ji}$

To obtain a time homogeneous Markov chain, we also assume that the matrix K does not depend on the time t. Therefore we define a Markov chain on X to be a sequence $\{X_t\}_t$ of random variables taking values in X with the Markov property saying that for all times t, the conditional distribution for X_t+1 given all previous values $X_0, X_1, \dots, X_t$ only depends on X_t, i.e.

$P(X_{t+1} | X_t, X_{t-1}, \cdots ) = P(X_{t+1} | X_t)$

We also require that this conditional probability is independent of t and is therefore given by a matrix K as in the formula above.

Markov chains on finite state spaces are often visualized as a graph. Suppose, for instance, that our state space contains only two elements: $X = \{ 1 , 2\}$ . We can think of the combined values $X_t$ for all times t as a history of states or as a random walk in the state space. The Markov property then means that the probability to transition into a next state does not depend on the full history, but only on the current state – Markov chains do not have a memory.

In our state space with two elements only, the Markov chain is then described by four transition probabilities: the probability to stay in state 1 when the chain is in state 1, the probability to move to state 2 when being in state 1, the probability to stay in state 2 and the probability to move to state 1 after being in state 2. This can be visualized as follows.

Let us now calculate a few probabilities to get an idea for the relevant quantities in such a model. First, suppose that at time 0, the model is in state j with probability $\mu_j$ . What is the probability to be in state j after one step? Of course we can write

$P(X_1 = i) = \sum_j P(X_1 = i, X_0 = j)$

Now, according to the rules of conditional probabilities, we can express the joint probability as follows.

$P(X_1 = i, X_0 = j) = P(X_1 = i | X_0 = j) P(X_0 = j)$

Plugging this into our previous expression and using the definitions of K and $\pi$ , we now obtain

$P(X_1 = i) = \sum_j K_{ji} \mu_j$

Thus if we think of $\mu$ as a row vector, then the probability after one step is described by the matrix product $\mu K$ .

Now let us calculate a slightly different quantity. Assume that we know that a chain starts at j. What is the probability to be at i after two steps? Using once more the rules of conditional probability and the Markov property, we can write

$P(X_2=i | X_0=j) = \sum_k P(X_2=i | X_1=k)P(X_1=k | X_0=j)$

Intuitively, this is very appealing. To get from j to i in two steps, we can take the way via any intermediate state k. To get the total probability, we simply sum up all these different probabilities! If you have ever seen path integrals in quantum mechanics, this idea will look familiar.

Again, we can write this in matrix notation. Each of the conditional probabilities on the right hand side of the expression above is a matrix element, and we find that

$P(X_2 = i | X_0 = j) = \sum_k K_{ki} K_{jk}$

so that the probability to get in two steps from j to i is simply given by the elements of the matrix K². Similarly, the n-step transition probabilities are the entries of the matrix Kⁿ.

Let us look at an example to understand what is going on here, which is known as a finite random walk on a circle. Our state space consists of N distinct points which we place arbitrarily on a circle. We then define a Markov chain as follows. We start at some arbitrary point x. In each step, we move along the circle to a neighbored point – one point to the left with probability 1/2 and one point to the right with probability 1/2. By the very definition, the transition probabilities do not change over time and depend only on the current state, so is a finite Markov chain. If we label the points on the circle by $1, 2, \dots, N$ , then the transition matrix is given by a matrix of the form (for instance for N=4)

$K = \begin{pmatrix} 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \end{pmatrix}$

More generally, we can also allow the process to stay where it is with probability 1 – p, where p is then the probability to move, which leads us to the matrix

$K = \begin{pmatrix} 1-p & \frac{p}{2} & 0 & \frac{p}{2} \\ \frac{p}{2} & 1-p & \frac{p}{2} & 0 \\ 0 & \frac{p}{2} & 1-p & \frac{p}{2} \\ \frac{p}{2} & 0 & \frac{p}{2} & 1 - p \end{pmatrix}$

Let us try to figure out whether the target distribution, given by the matrix Kⁿ for large n, somehow converges.

To see this, we do two numerical experiments. First, we can easily simulate a random walk. Suppose that we have a function draw which accepts a distribution (given by a vector p whose elements add up to one) and draws a random value according to that distribution, i.e. it returns 1 with probability p₀, 2 with probability p₁ and so on. We can then simulate a random walk on the circle as follows.

def simulate_chain(N, p, steps=100, start=5):
    chain = []
    x = start % N
    chain.append(x)
    for i in range(steps):
        x = (x  + draw([p/2.0, 1.0 - p, p / 2.0]) - 2) % N
        chain.append(x)
    return chain

Here N is the number of points on the circle, steps is the number of simulation steps that we run and start is the starting point. The function draw then returns 1 with probability p/2, 3 with probability p/2 and 2 with probability 1 – p. Thus we move with probability p/2 to the right, with probability p/2 to the left and stay were we are with probability 1 – p. If we set p = 0.8, this results in the following transition matrix.

$\begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}$

We can visualize the results if we execute a large number of runs and draw histograms of the resulting positions 100, 200, 300, … steps. The result will look roughly like this (for this image, I have used p = 0.8 and N=10):

We see that after a few hundred steps, this seems to converge – actually this starts to look very much like a uniform distribution. As we already know that the distribution after n steps is given by the matrix Kⁿ, it makes sense to look at high powers of this matrix as well. To visualize this, I will use a method that I have seen in MacKays excellent book and that works as follows.

Visualizing one row of a matrix is not difficult. We can plot the entries in a two-dimensional diagram, where the x-axis corresponds to the column index and the y-axis corresponds to the value. A flat line is then a row in which all entries have the same value.

To display a full matrix, we use colors – we assign one color to each row index and plot the individual rows as just described. If we do this for the powers of the matrix K with the parameters p = 0.8 and N = 10 (and only chose some rows, for instance 1,4,7, to not run out of colors…) we obtain an image like the one below.

We can clearly see that the powers of the matrix K converge towards a matrix

$K^\infty = \lim_{n \rightarrow \infty} K^n$

where all entries in each row seem to have the same value. We have seen that the distribution after n steps assuming an initial distribution described by a row vector $\mu$ is $\mu K^n$ . Therefore the limit distribution is $\mu K^\infty$ . If we take $\mu$ to be a unit vector, we find that the rows of the matrix $K^\infty$ do actually represent the limit distribution of all chains that have started at a specific point i of the state space. Therefore the entries in the rows need to sum up to one, and thus, if they are all equal, need to be equal to 1 / N. If you calculate and print out high powers of the matrix K, you will in fact see that they approach the matrix where all entries are 0.1 (as we have chosen N = 10 in this example).

To completes our short introduction into Markov chains. We have seen that Markov chains model stochastic processes in discrete time in which the state at step t+1 depends only on the state at step t and the dependency is given by a function independent of the current time. In finite state spaces, Markov chains are described by a transition matrix K. The i-th row of the matrix Kⁿ is the distribution of chains starting at point i after n steps. Consequently, converge properties of the Markov chains can be related to converge of high powers Kⁿ and the apparatus of linear algebra can be applied.

In the next post, we will learn more about convergence – how it can be made precise, and how we can tell whether a given Markov chain converges. This will then allow us to construct Markov chains that converge towards a given target distribution and use them for sampling.

Monte Carlo methods and Markov chains – an introduction

In our short series on machine learning, we have already applied sampling methods several times. We have used and implemented Gibbs sampling, and so far we have simply accepted that the approach works. Time to look at this in a bit more detail in order to understand why it works and what the limitations of the algorithm are.

Regardless of whether you want to simulate ferromagnetic behavior in an Ising model, run a Hopfield network or train a Boltzmann machine, the fundamental problem that we have to solve is always the same. We are given a probability distribution P living on some state space X, and we are trying to create a sample, i.e. a set of points in the state space such that the probability for a point x to appear in this sample is equal to the probability P(x) given by the probability distribution.

The naive approach to this is simple: visit every point x in the state space and include that point with probability P(x). However, it is clear that with a large state space, this approach is not computationally feasible. In the example of a Boltzmann machine trained on handwritten digits with 28 x 28 pixels, our state space has 2⁷⁸⁴ elements, and there is no way we can visit them all one by one. Instead, we would need something like a randomized walk through the state space. We could start with same randomly chosen state X₀, then – using a randomized transition rule – move on to a point X₁ and so forth. Intuitively, we want to select our transition rule in such a way that the state space elements X_i selected in this way form a sample, i.e. such that our chain of state space locations visits regions with large values for P(x) more often than regions with low values of P(x). Thus we would systematically ignore regions of the state space with low probability which would greatly reduce the number of states that we have to visit to obtain a valid sample.

So, from a mathematical point of view, we consider a sequence of random variables X_i such that X_i+1 is related to X_i by some randomized transition rule. We also assume that this rule does not depend on the index i which is usually called the time. Thus we have a sequence of random variables X_i which is not independent, but almost independent – X_i+1 depends only on X_i and in way that itself does not vary with i. This is called a Markov chain (more precisely, a time homogeneous Markov chain).

Let us consider an example to illustrate the idea. As our state space, we choose the space or real numbers. We fix a starting value, say X₀=0, and we obtain the next value by adding a number that we draw from a standard normal distribution. Thus, mathematically, we assume that W_n are identically distributed and independent random variables, all distributed according to the standard normal distribution, and set

$X_{n+1} = X_n + W_n$

This is a Markov chain: the value X_n+1 depends only on X_n, not on any earlier elements of the chain. The transition rule is randomized, but itself does not depend on the time step n – all W_n have the same distribution. Let us implement this in Python to see how it works (the full notebook can be downloaded here).

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

view raw GaussianRandomWalk1.ipynb hosted with ❤ by GitHub

Here we have created and displayed three different random walks. All of them start at the same point (zero), and all of them follow the same transformation rule, but as the transformation rule is stochastic in nature, they all develop differently.

Now let us try to turn the view on this upside down. This time, we execute a larger number – 1000 – of random walks with 5000 steps each. But instead of plotting the sequence of points X_i for every walk, we display the distribution of the last point of each walk, i.e. we plot the distribution of the random variable X₄₉₉₉.

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

view raw GaussianRandomWalk2.ipynb hosted with ❤ by GitHub

This does in fact look more familiar. We see that most walks end up being close to zero at the end – steps in the positive direction and steps in the negative direction cancel each other. Only very few walks end up at an extreme position close to plus or minus 200 – this is not surprising as well, to arrive at an extreme point, we would need to draw 5000 times in a row an extreme value from the random normal distribution, which is a rather unlikely chain of events.

In this case, the distribution does actually not converge if we increase the number of steps – you can try this out and play with different values, i.e. replace 5000 by 50000 (this will run some time) and look at the distribution of X₄₉₉₉₉ – you will see that this is now spread out to roughly plus / minus 750 (in fact, the distribution is obtained as a multiple convolution of the standard normal distribution with itself and thus is again a normal distribution).

Even though the distribution does not converge, we have been able to sample from a specific distribution – in this case the distribution after 5000 steps – using only the ability to sample from a different distribution – in this case the random normal distribution. Obviously, in this special case, the result is trivial, but the principle that we have found looks interesting. Can we generalize this approach to obtain sampling methods for target distributions that are otherwise intractable?

Now this is exactly the idea behind the sampling approach that is commonly known as Markov chain Monte Carlo (MCMC) and which has become very popular, with applications to complex simulations in theoretical physics, to machine learning and even asset pricing and value-at-risk calculations.

So let us summarize how the MCMC approach works. Given a target distribution P(x), we first construct a Markov chain that converges to that target distribution. Once we have that, we can simulate a large number of runs and use the resulting points as our sample (in fact, in many cases we can also do with one run only, as we will see later). Thus in order to utilize Markov chains for sampling, we would need to understand under what conditions a Markov chain converges and if it converges, how we can relate the target distribution to the transformation rule. We will look into these points in more detail in future posts in this series.

Training a restricted Boltzmann machine on a GPU with TensorFlow

During the second half of the last decade, researchers have started to exploit the impressive capabilities of graphical processing units (GPUs) to speed up the execution of various machine learning algorithms (see for instance [1] and [2] and the references therein). Compared to a standard CPU, modern GPUs offer a breathtaking degree of parallelization – one of NVIDIAs current flagships, the Tesla V100, offers more than 5.000 CUDA cores that can perform work in parallel. As training and evaluating neural networks involves many floating operations on large matrices, they can benefit heavily from the special capabilities that a GPU provides.

So how can we make our code execute on a GPU? Of course you could program directly against the CUDA interface or similar interfaces like OpenCL. But specifically for the purposes of machine learning, there are easier options – over the last years, several open source frameworks like Theano, Torch, MXNet or TensorFlow have become available that make it comparatively easy to leverage a GPU for machine learning. In this post, I will use the TensorFlow framework, simply because so far this is the only one of these frameworks that I have used (though MXNet looks very interesting too and I might try that out and create a post on it at some point in the future).

It takes some time to get used to the programming model of TensorFlow which is radically different from the usual imparative programming style. As an example, let us suppose we wanted to add two matrices. In Python, using numpy, this would look as follows.

import numpy as np
a = np.matrix([[0, 1], [1, 0]])
b = np.matrix([[1, 0], [0, 1]])
c = a + b
print(c)

This program is described by a sequence of instructions (let us ignore the fact for a moment that these are of course functions that we call – ultimately, functions are composed of instructions). When we execute this program, the instructions are processed one by one. First, we assign a value to the variable a, then we assign a value to a variable b, then we add these two values and assign the result to a variable c and finally we print out the value of c.

The programming model behind TensorFlow (and other frameworks like Theano) is fundamentally different. Instead of describing a program as a sequence of instructions, the calculations are organized as a graph. The nodes in this graph correspond to operations. The edges joining the nodes represent the flow of data between the operations. In TensorFlow, data is always represented as a tensor, so the edges in the graph are tensors. An operation consumes data from its inputs, processes it and forwards it to the next operation in the graph as its output.

A program using TensorFlow typically consists of two phases. In the first phase, we build the graph, i.e. we define the operations and their inputs and outputs that make up the calculation that we want to perform. However, in this phase, no calculations are actually performed. Instead, this happens in the second phase when we actually run the graph. For that purpose, we create a session. Roughly speaking, a session defines an environment in which a graph can be executed. Once the session has been defined, we can invoke its run method. To the run method, we pass as an argument the operation in the graph that we want to execute. The run method will then trace the graph backwards and evaluate all operations that provide input to our target operation recursively, i.e. it will identify the subgraph that needs to be executed to evaluate our target operation.

Let us again use the example of a simple addition to illustrate this. The source code looks as follows.

import tensorflow as tf
#
# Build the model
#
a = tf.constant([[0, 1], [1, 0]], name="a")
b = tf.constant([[1, 0], [0, 1]], name="b")
c = tf.add(a,b, name="c")

#
# Create a session and run it
#
session = tf.Session()
print(session.run(c))

First, we import the tensorflow library itself. Then, in the next three lines, we build the graph. We define three nodes in the graph. The first two nodes are special operations that output simply a constant value. The third operation is the operation that performs the actual addition and uses the previously defined operations as input. Thus our final graph has three nodes and two edges, as shown below.h

In the next line, we create a TensorFlow session which we then run. The argument specifies which operation we want to execute and therefore determines which part of the graph we will actually run. The output of the run method is an ordinary numpy array which we then print out.

Let us now look at an example which is slightly more complicated. In the PCD algorithm, we can compute the contribution of the negative phase to the weight updates as follows.

E = expit(self.beta*(np.matmul(S0, self.W) + self.c))
pos = np.tensordot(S0, E, axes=((0),(0)))

Here S0 is a batch from the sample set, W is the current value of the weights and c is the current value of the bias. In TensorFlow, the code to build the corresponding part of the model looks quite similar.

S0 = tf.placeholder(name="S0", shape=[batch_size, self.visible], 
                    dtype=tf.float32)
W = tf.get_variable(name="W", 
                        dtype=tf.float32, 
                        shape=[self.visible, self.hidden],
                        initializer = tf.zeros_initializer(),
                        trainable=False)
c = tf.get_variable(name="c", 
                        dtype=tf.float32, 
                        shape=[1, self.hidden],
                        initializer = tf.zeros_initializer(),
                        trainable=False)
E = tf.sigmoid(self.beta*(tf.matmul(S0, W) + c), name="E")
pos = tf.tensordot(S0, E, axes=[[0],[0]], name="pos")

The first element that we define – S0 – is a so called placeholder. This is a bit like a constant, with the difference that its value can be specified per run, using an additional argument called feed dictionary to the Session.run method. The next two elements that we define are variables. Variables are similar to operations – they represent nodes in the network and provide an output, but have no input. Instead, they have a certain value and feed that value as outputs to other operations. We then use the built-in tensorflow operations sigmoid and tensordot to calculate the expectation values of the visible units and the positive phase.

The full model to train a restricted Boltzmann machine is of course a bit more complicated. TensorFlow comes with a very useful device called TensorBoard that can be used to visualize a graph constructed in TensorFlow. The image below has been created using TensorFlow and shows the full graph of our restricted Boltzmann machine.

TensorBoard offers the option to combine operations into groups which are then collapsed in the visual representation. In the image above, all groups are collapsed except the group representing the contribution from the positive phase. We can clearly see the flow of data as described above – we first multiply S0 and W, then add c to the result, multiply this by a constant (the inverse temperature, called x in the diagram) and then apply the sigmoid operation that we have called E. The result is then fed into other, collapsed groups like the group delta which holds the part of the model responsible for calculating the weight updates.

I will not go through the full source code that you can find on GitHub as usual – you will probably find the well written tutorial on the TensorFlow homepage useful when going through this. Instead, let us play around a bit with the result.

As the PC that is under my desk is almost seven years old and does not have a modern GPU, I did use a p2.xlarge instance from Amazon EC2 which gave me access to a Tesla K80 GPU and four Intel Xeon E5-2686 cores running at 2.3 GHz (be careful – this instance type is not covered by the free usage tier, so that will cost you a few dollars). I used the Amazon provided Deep Learning AMI based on Ubuntu 16.04. After logging into the instance, we first have to complete a few preparational steps.

$ source activate tensorflow_p36
$ git clone http://www.github.com/christianb93/MachineLearning.git
$ cd MachineLearning
$ export MPLBACKEND="AGG"
$ conda install scikit-learn
$ python3 RBM.py --algorithm=PCDTF

Here we activate the pre-configured TensorFlow environment, download the source code from GitHub, set the environment variable to define our Matplotlib backend, and download and install some required packages. Then we do a first run with the BAS dataset to verify that everything works. If that is the case, we can run the actual MNIST training and sampling.

$ python3 RBM.py --N=28 --data=MNIST --save=1 --hidden=128 --pattern=256 --batch_size=128 --epochs=40000 --run_samples=1 --sample_size=6,6 --beta=1.0 --sample=200000 --algorithm=PCDTF --precision=32

This produced the following sample of 6 x 6 digits.

The execution took roughly 5 minutes – 2 minutes for the training phase and 3 minutes for the sampling phase. During the training, the GPU utilization (captured with nvidia-smi -l 2) was at around 57% and stayed in that range during the sampling phase.

A second run using the switch --precision=64 to set the floating point precision to 64 bits did not substantially change the outcome or the performance.

Then a run with the same parameters was done in pure Python running on the four CPU cores provided by the p2.xlarge instance (--algorithm=PCD). During the training phase, the top command showed a CPU utilization of 400%, i.e. all four cores where at 100%. The utilization stayed in that range during the sampling phase. The training took 10:20 minutes, the sampling 8 minutes. Thus the total run time was 18 minutes compared to 5 minutes – a factor of 360%.

Following the advice on this post, I then played a bit with the settings of the GPU and adjusted the clock rates and the auto boost mode as follows.

sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac 2505,875

That brought the GPU utilization down to a bit less than 50%, but had a comparatively small impact on the run times which now were 1:40 min (instead of 2 min) for training and 2:30 min (instead of 3 min) for sampling. So the total run time was now a bit more than 4 minutes, which is a speed up of roughly 20% compared to the default settings. Compared to the CPU, we have now reached a speed up of almost 4,5.

Next, let us compare this to the run time on two CPUs only. To measure that, I grabbed an instance of the t2.large machine type that comes with 2 CPUs – according to /proc/cpuinfo, it is equipped with two Intel Xeon E5-2676 CPUs at 2.40GHz. Interestingly, the training phase only took roughly 8 minutes on that machine, which is even a bit faster than on the p2.xlarge which has four cores. The sampling phase was faster as well, taking only 6 minutes instead of 8 minutes. It seems that adding more CPUs increases the overhead for the synchronisation between the cores drastically so that it results in a performance penalty instead of a performance improvement. To verify this, I did a run on a p2.8xlarge with 32 CPUs and got a similar result – training took 9 minutes, sampling 6:50 minutes.

Finally, I could not resist the temptation to try this out on a more advanced GPU enabled machine. So I got a p3.2xlarge instance which contains one of the relatively new Tesla V100 GPUs. I did again adjust the application clocks using

sudo nvidia-smi -ac 877,1530

With these settings, one execution now took only about 1:20 minutes for the training and 1:50 min for the sampling. However, the GPU utilization was only at 30% – so we have reached a point where just having a faster GPU does not lead to a significant speed advantage any more. The following table summarizes the results of the various measurements.

Instance	Run time training	Run time sampling
p3.2xlarge (Tesla V100)	1:20 min	1:40 min
p2.large (Tesla K80)	1:40 min	2:30 min
p2.large (4 x CPU)	10 min	8 min
p2.8xlarge (32 x CPU)	9 min	6:50 min
t2.large (2 x CPU)	8 min	6 min

Of course we could now start to optimize the implementation. For the training phase, I assume that the bottleneck that limits GPU utilization is the use of the feed dictionary mechanism which could be replaced by queues to avoid overhead of switching back between CPU and GPU. During the sampling phase, we could also try to reduce the relative overhead of the run method by combining a certain number of steps – say 10 – into the graph and thus reducing the number of iterations that happen outside of the model. It would be interesting to play with this and see whether we can improve the performance significantly. But this is already a long post, so I will leave this for later…

References

1. R. Raina, A. Madhavan, A. Ng, Large-scale Deep Unsupervised Learning using Graphics Processors, Proceedings of the 26 th International Conference on Machine Learning (2009)
2. K. Chellapilla, S. Puri , P. Simard, High Performance Convolutional Neural Networks for Document Processing, International Workshop on Frontiers in Handwriting Recognition (2006)

Training restricted Boltzmann machines with persistent contrastive divergence

In the last post, we have looked at the contrastive divergence algorithm to train a restricted Boltzmann machine. Even though this algorithm continues to be very popular, it is by far not the only available algorithm. In this post, we will look at a different algorithm known as persistent contrastive divergence and apply it to the BAS data set and eventually to the MNIST data set.

Recall that one of the ideas of contrastive divergence is to use a pattern from the sample set as a starting point for a Gibbs sampler to calculate the contribution of the negative phase to the weight update. The idea behind persistent contrastive divergence (PCD), proposed first in [1], is slightly different. Instead of running a (very) short Gibbs sampler once for every iteration, the algorithm uses the final state of the previous Gibbs sampler as the initial start for the next iteration. Thus, in every iteration, we take the result from the previous iteration, run one Gibbs sampling step and save the result as starting point for the next iteration.

This amounts to running one long chain of states that are related by Gibbs sampling steps. Of course this is not exactly one longs Gibbs sampler, as the weights and therefore the probability distribution changes with each step. However, the idea is that when the learning rate is small, the weight change during two subsequent iterations is neglegible, and we effectively create one long Gibbs sampler which provides a good approximation to the actual distribution.

In practice, one often uses several chains that are run in parallel. Such a chain is sometimes called a negative particle. It is recommended in [1] to chose the number of particles to be equal to the batch size. In an implementation in Python, we can store the state of the negative particles in a matrix N where each row corresponds to one particle.

The idea to form one long Markov chain obviously works best if the learning rate is very small. On the other hand, this slows down the convergence of the gradient descent algorithm. In order to solve this, it is common to reduce the learning rate over time, for instance linearly with the number of iterations.

A second additional improvement that is usually implemented is a weight decay. Essentially, a weight decay is an additional penalty that is applied to avoid that the weights grow too large which would slow down the sampling procedure.

Let us now see how the PCD algorithm can be coded in Python. We will again store the model parameters and the state in a Python class. In the __init__ method of that class, we initialize the weights and the bias vectors and also set the particles to some randomly chosen initial value.

class PCDRBM (Base.BaseRBM):

    def __init__(self, visible = 8, hidden = 3, particles = 10, beta=2.0):
        self.visible= visible
        self.hidden = hidden
        self.beta = beta
        self.particles = particles
        #
        # Initialize weights with a random normal distribution
        #
        self.W = np.random.normal(loc=0.0, scale=0.01, size=(visible, hidden))
        #
        # set bias to zero
        #
        self.b = np.zeros(dtype=float, shape=(1, visible))
        self.c = np.zeros(dtype=float, shape=(1, hidden))
        #
        # Initialize the particles
        #
        self.N = np.random.randint(low=0, high=2, size=(particles,self.visible))
        self.global_step = 0

Assuming that we have a method runGibbsStep that runs one Gibbs sampling step with the given weight starting at some initial state, one iteration of the PCD algorithm now looks as follows.

#
# Update step size - we do this linearly over time
#
step = initial_step_size * (1.0 -(1.0*self.global_step)/(1.0*iterations*epochs))
#
# First we compute the negative phase. We run the
# Gibbs sampler for one step, starting at the previous state
# of the particles self.N
#
self.N, _ = self.runGibbsStep(self.N, size=self.particles)
#
# and use this to calculate the negative phase
#
Eb = expit(self.beta*(np.matmul(self.N, self.W) + self.c))
neg = np.tensordot(self.N, Eb, axes=((0),(0)))
#
# Now we compute the positive phase. We need the
# expectation values of the hidden units
#
E = expit(self.beta*(np.matmul(V, self.W) + self.c))
pos = np.tensordot(V, E, axes=((0),(0)))
#
# Now update weights
#
dW = step*self.beta*(pos -neg) / float(batch_size) - step*weight_decay*self.W / float(batch_size)
self.W += dW
self.b += step*self.beta*np.sum(V - self.N, 0) / float(batch_size)
self.c += step*self.beta*np.sum(E - Eb, 0) / float(batch_size)
self.global_step +=1

As always, the full source code is available from my machine learning GitHub repository. I have enhanced the code in RBM.py so that it accepts a command line parameter --algorithm that lets you choose between ordinary contrastive divergence and the PCD algorithm.

Let us now run a few trials. First, we will again use the BAS data set. You can download and run the code from the GitHub repository as follows.

$ git clone http://www.github.com/christianb93/MachineLearning.git
$ cd MachineLearning
$ python RBM.py --algorithm=PCD --run_reconstructions=1 --show_metrics=1

When the script completes, you should again see the two images. The first image displays how the reconstruction errors and weight changes behave during the training.

We see that the reconstruction error (the diagram on the right) decreases slower than it did for the ordinary contrastive divergence algorithm. On the left hand side, where the change of the weights is displayed, we can clearly see the impact of the linearly decreasing step size. The second picture shows again the result of a reconstruction attempt of slightly distorted patterns.

Let us now try out a different application of restricted Boltzmann machines – sampling. After a successful training phase, the model distribution given by the weights should be close to the empirical distribution of the training data. Thus, if we sample from the model distribution, using for instance Gibbs sampling, we should be able to obtain patterns. that somehow resemble the training data.

We will use this to generate handwritten digits based on the well known MNIST data set, more precisely the copy available at mldata.org. To download and read the data set, we use the method fetch_mldata provided by the scikit learn library. We will then train our network for 40.000 epochs using 60 images out of this data set and 128 hidden units and subsequently run 200.000 Gibbs sampling steps starting from a random pattern.

$ python RBM.py --algorithm=PCD --data=MNIST --N=28 --epochs=40000 --pattern=60 --hidden=128 --run_samples=1 --sample=200000 --save=1

Note that when you run this for the first time, the MNIST data set will be downloaded and stored in a folder in your home directory, so this might take some time (the file has a bit less than 60 MBytes).

The results are already very encouraging. Most patterns resemble a digit closely, only the image at the top left corner did obviously not converge properly. However, we still see a strong bias – only very few of the 9 digits that the data set contains appear. So we probably need to fine tune the parameters like number of hidden units, learning rate, weight decay or the number of epochs to obtain better results.

Unfortunately, when you start to play around to optimize this further, you will see that the run time of the algorithm has reached a point where quick iterations to try out different parameters become virtually impossible. I have been running this on my PC that has an Intel Core i7 CPU, and Python was able to distribute this nicely across all four physical cores, taking them to 100% utilization, but still the script was already running for 7 minutes. If we want to increase the number of iterations or the number of hidden units to be able to learn more pattern, the run time can easily go up to almost 30 minutes.

Of course professional training of neuronal networks is nowadays no longer been done on a CPU. Instead, modern frameworks use the power of graphical processing units (GPUs) that are optimized for exactly the type of work that we need – highly parallel processing of floating point matrices. Therefore, I will show you in the next post in this series how you can use the TensorFlow framework to move the workload to a GPU.

1. T. Tieleman, Training restricted Boltzmann machines using approximations to the likelihood gradient, International Conference on Machine Learning (ICML), 2008
2. A. Fischer, C. Igel, Training restricted Boltzmann machines: an introduction, Pattern Recognition Vol. 47 (2014), pp 25–39

Learning algorithms for restricted Boltzmann machines – contrastive divergence

In the previous post on RBMs, we have derived the following gradient descent update rule for the weights.

$\Delta W_{ij} = \beta \left[ \langle v_i \sigma(\beta a_j) \rangle_{\mathcal D} - \langle v_i \sigma(\beta a_j) \rangle_{P(v)} \right]$

In this post, we will see how this update rule can be efficiently implemented. The first thing that we note is that the term $\sigma(\beta a_j)$ that appears several times is simply the conditional probability for the hidden unit j to be “on” and, as only the values 0 and 1 are possible, at the same time the conditional expectation value of that unit given the values of the visible units – let us denote this quantity by $e_j$ . Our update rule now reads

$\Delta W_{ij} = \beta \left[ \langle v_i e_j \rangle_{\mathcal D} - \langle v_i e_j \rangle_{P(v)} \right]$

Theoretically, we know how to calculate this. The first term – the positive phase – is easy, this is just the average over the sample set.

The second term is more challenging. Theoretically, we would need a Gibbs sampler to calculate it using a Monte Carlo approach. One step of this sampler would proceed as follows.

Given the values v of the visible units, calculate the resulting expectation values e
Set hidden unit j to one with probability e_j
For each visible unit i, calculate the conditional probability p_i to be one given the new values of the hidden units
Set v_i to 1 with probability p_i

After some burn-in phase, we would then calculate the product $v_i e_j$ after each step and take the average of these values.

The crucial point is that for a naive implementation, we would start the Gibbs sampling procedure during each gradient descent iteration from scratch, i.e. with some randomly initialized values for the visible units. One of the ideas behind the algorithm known as contrastive divergence that was proposed by G. Hinton in [1] is to restart the Gibbs sampler not at a random value, but a randomly chosen vector from the data set! The idea behind this is that if we have been running the training for some time, the model distribution should be close to the empirical distribution of the data, so sampling a vector from the data should give us something close to the equilibrium state of the Gibbs sampling Markov chain (if you do not known what a Markov chain is – do not worry and just read on, I will cover Markov chains and the mathematics behind all this in a later post).

The second approximation that the contrastive divergence algorithm makes is to replace the expectation values in the positive and negative phase by a point estimate. For the positive phase, that means we simply calculate the value at one point from the data set. For the negative phase, we run the Gibbs sampling procedure – starting as explained above with a vector from the data set – and then simply compute the product $v_i e_j$ for the result.

It now turns out that, based on empirical observations, these approximations work extremely well – in fact, it turns out that instead of running a full Gibbs sampler with a few hundred or even a few thousand steps, one step is often sufficient! This is surprising, but open to an intuitive explanation – we run all this within the outer loop provided by the gradient descent algorithm, and if we chose the learning rate sufficiently small, the parameters do not change a lot between these steps, so that we effectively do something that is close to one long Gibbs sampling Markov chain.

With these simplifications, the constrastive divergence algorithm now looks as follows.

FOR EACH iteration DO

Sample a vector v from the data set

SET $e = \sigma(\beta( W^T v + c))$

FOR EACH hidden unit DO

SET $h_j = 1$ with probability $e_j$

FOR EACH visible unit DO

SET $\bar{v}_i = 1$ with probability $\sigma(\beta (W h + b))_i$

SET $\bar{e} = \sigma(\beta (W^T \bar{v} + c))$

SET $W = W + \lambda \beta \left[ v e^T - \bar{v} \bar{e}^T \right]$

SET $b = b + \lambda \beta \left[ v - \bar{v} \right]$

SET $c = c + \lambda \beta \left[ e - \bar{e} \right]$

DONE

The first six lines within an iteration constitute one Gibbs sampling step, starting with a value for the visible units from the data set, sampling the hidden units from the visible units and sampling the visible units from the hidden units. In the next line, we recalculate the expectation values of the hidden units given the (updated) values of the visible units. The value $\bar{v}_i \bar{e}_j$ is then the contribution of the negative phase to the update of $W_{ij}$ . We can summarize the contributions for all pairs of indices as the matrix $\bar{v} \bar{e}^T$ . Similarly, the positive phase contributes with $v e^T$ . In the next line, we update W with both contributions, where $\lambda$ is the learning rate. We then apply similar update rules to the bias for visible and hidden units – the derivation of these update rules from the expression for the likelihood function is done similar to the derivation of the update rules for the weights as shown in my last post.

Let us now implement this in Python. To have a small data set for our tests, we will use an artificial data set called bars and stripes that I have seen first in [3]. Given a number N, we can create an image with N x N pixels for every number x smallers than 2^N as follows. Each row corresponds to one binary digit of x. If this digit is one, the entire row is black, i.e. we have one black vertical stripe, otherwise the entire row is white. A second row of patterns is obtained by coloring the columns similarly instead of the rows. Thus we obtain 2^N+1 possible patterns, more than enough for our purposes. I have written a helper class BAS in Python that creates these patterns.

Next, let us turn to the actual RBM. We store the current state of the RBM in a class RBM that is initialized as follows.

class RBM:

    def __init__ (self, visible = 8, hidden = 3, beta = 1):
        self.visible = visible
        self.hidden = hidden
        self.beta = beta
        self.W = np.random.normal(loc = 0, scale = 0.01, size = (visible, hidden))
        self.b = np.zeros(shape = (1,visible))
        self.c = np.zeros(shape = (1,hidden))

Here W is the weight matrix, beta is the inverse temperature, and b and c are the bias vectors for the visible and hidden units.

Next we need a method that runs one step in a Gibbs sampling chain, starting with a state of the visible units captured in a matrix V (we calculate this in a mini-batch for more than one sample at a time, each row in the matrix represents one sample vector). Using once more the numpy library, this can be done as follows.

def runGibbsStep(self, V, size = 1):
    #
    # Sample hidden units from visible units
    #
    E = expit(self.beta*(np.matmul(V, self.W) + self.c))
    U = np.random.random_sample(size=(size, self.hidden))
    H = (U <= E).astype(int)
    #
    # and now sample visible units from hidden units
    #
    P = expit(self.beta*(np.matmul(H, np.transpose(self.W)) + self.b))
    U = np.random.random_sample(size=(size, self.visible))
    return (U <= P).astype(int), E

With this method at hand – which returns the new value for the visible units but the old value for the conditional expectation of the hidden units – we can now code our training routine.

def train(self,  V, iterations = 100, step = 0.01):
    batch_size = V.shape[0]
    #
    # Do the actual training. First we calculate the expectation
    # values of the hidden units given the visible units. The result
    # will be a matrix of shape (batch_size, hidden)
    #
    for _ in range(iterations):
        #
        # Run one Gibbs sampling step and obtain new values
        # for visible units and previous expectation values
        #
        Vb, E = self.runGibbsStep(V, batch_size)
        #
        # Calculate new expectation values
        #
        Eb = expit(self.beta*(np.matmul(Vb, self.W) + self.c))
        #
        # Calculate contributions of positive and negative phase
        # and update weights and bias
        #
        pos = np.tensordot(V, E, axes=((0),(0)))
        neg = np.tensordot(Vb, Eb, axes=((0),(0)))
        dW = step*self.beta*(pos -neg) / float(batch_size)
        self.W += dW
        self.b += step*self.beta*np.sum(V - Vb, 0) / float(batch_size)
        self.c += step*self.beta*np.sum(E - Eb, 0) / float(batch_size)

Let us now play around with this network a bit and visualize the training results. To do this, clone my repository and then run the simulation using

$ git clone  https://github.com/christianb93/MachineLearning.git
$ cd MachineLearning
$ python RBM.py  --run_reconstructions=1 --show_metrics=1

This will train a restricted Boltzmann machine on 20 images out of the BAS dataset with N=6. For the training, I have used standard parameters (which you can change using the various command line switches, use --help to see which parameters are available). The learning rate was set to 0.05. The number of iterations during training was set to 30.000, and 16 hidden units are used. The inverse temperature $\beta$ is set to 2.0. In each iteration, a mini-batch of 10 patterns is trained.

After every 500 iterations, the script prints out the current value of the reconstruction error. This is defined to be the norm of the difference between the value of the visible units when the Gibbs sampling step starts and the value after completing the Gibbs sampling step, i.e. this quantity measures how well the network is able to reconstruct the value of the visible units from the hidden units alone.

After the training phase is completed, the script will select eight patterns randomly. For each of these patterns, it will flip a few bits and then run 100 Gibbs sampling steps. If the training was successful, we expect that the result will be a reconstruction of the original image, i.e. the network would be able to match the distorted images to the original patterns.

When all the calculations have been completed, the network will display two images. The first image should roughly look like the image below.

This matrix visualizes the result of the reconstruction process described above. Each of the rows shows the outcome for one of the eight selected patterns. The first image in each row is the original pattern from the BAS data set. The second one is the distorted image some pixels have been flipped. The third image shows the result of the reconstruction run after 50 Gibbs iterations, and the last image shows the result after the full 100 iterations.

We see that in most cases, the network is able to correctly reconstruct the original image. However, there are also a fes rows that look suspicious. In the first row, we could hope that the network eventually converges if we execute more sampling steps. In the third row, however, the network converges to a member of the BAS data set, but to the wrong one.

The second diagram that the script produces displays the change to the weights after each iteration and the reconstruction error.

We see that both quantities quickly get smaller, but never stabilize at exactly zero. This is not really surprising – as we work with a non-zero temperature, we will always have some thermal fluctuations and the reconstruction error will never be constantly zero, but oscillate around a small value.

I invite you to play around with the parameters a bit to see how the network behaves. We can change the value of the inverse temperature with the parameter --beta, the number of hidden units with the parameter --hidden, the number of Gibbs steps used during the reconstruction with --sample and the step size with --step. If, for instance, you raise the temperature, the fluctuations of the reconstruction error will increase. If, one the other hand, we choose a very small temperature, the network converges very slowly. Making the step size too small or too large can also lead to non-convergence etc.

That completes this post on contrastive divergence. In the next post, I will show you an alternative algorithm that has gained a lot of popularity called persistent contrastive divergence (PCD), before we finally set out to implement an restricted Boltzmann machine on a GPU using the TensorFlow framework.

1. G. Hinton, Training products of experts by minimizing contrastive divergence, Journal Neural Computation Vol. 14, No. 8 (2002), 1771 1800
2. G. Hinton, A practical guide to training restricted Boltzmann machines, Technical Report University of Montreal TR-2010-003 (2010)
[3] D. MacKay, Information Theory, Inference and learning
algorithms, section 43, available online at this URL

Restricted Boltzmann machines

In the previous post, we have seen that a Boltzmann machine as studied so far suffers from two deficiencies. First, training is very slow as we have to run a Gibbs sampler until convergence for every iteration of the gradient descent algorithm. Second, we can only see the second moments of the data distribution and the learning rule ignores higher moments.

A class of networks called Restricted Boltzmann machines (RBM) has been designed to overcome these problems. An RBM is a Boltzmann machine with two additional architectural features. First, it has hidden units. This simply means that we split the set of all units in the network into two disjoint sets called visible units and the said hidden units. When we we train the network, we connect the data samples only to the visible units. The hidden units, however, also follow the dynamical rules of the network and serve as latent variables – you can think of them as additional parameters of the network which are adapted during training but are not directly prescribed by the training set, similar to a hidden layer in a feed-forward neuronal network.

Second, in a restricted Boltzmann machine, certain restrictions on the weights are in effect. Specifically, we only allow hidden units to be connected to visible units and vice versa, so there are no connections between hidden units and no connections between visible units. Effectively, a restricted Boltzmann machine is therefore organised in two layers – one layer containing the hidden units and one layer containing the visible units, as shown below.

What does this imply for the mathematical description of the network? In fact, we will see that this simplifies things considerably. First, corresponding to the differentiation between hidden and visible units, our index set can be written as

$\{ 1, \dots, N \} = I_v \cup I_h$

so that unit i is a hidden unit if i is in the set $I_v$ and a hidden unit if i is in the set $I_h$ . Second, it is common to use 0 and 1 as states instead of -1 and +1. Our state space then splits

$\{ 0, 1\}^N = {\mathcal S} = {\mathcal V} \times \mathcal {H}$

and correspondingly we can write any state as

$s = (v,h)$

where v specifies the state of the visible units and h the state of the hidden units. As only visible units correspond to actual input, the purpose of the training phase is now to adjust the marginal distribution

$P(v) = \sum_h P(v,h) = \frac{1}{Z} \sum_h e^{-\beta E(v,h)}$

such that is it as close as possible to the empirical distribution of the test data.

The expression for the energy also simplifies greatly, as all terms involving only hidden units and only visible units disappear. If we replace the matrix W that contains all connections by a reduced matrix – that we again call W – that only contains the remaining connections between visible and hidden units, we can express the energy as

$E(v,h) = - \sum_{i \in I_v, j \in I_h} W_{ij} v_i h_j$

In addition, we will now also add an explicit bias to both the hidden and visible units, so that our full energy is

$E(v,h) = - \sum_{i \in I_v, j \in I_h} W_{ij} v_i h_j - \sum_i v_i b_i - \sum_j h_j c_j$

Of course the matrix W is now no longer symmetric and not even quadratic (as the number of hidden units will in general not be the same as the number of visible units).

We can now again calculate the update rules as before. First, we write down the likelihood function

$l({\mathcal D} | W) = - \frac{1}{K} \ln P({\mathcal D} | W) = - \frac{1}{K} \sum_k \ln \sum_h e^{-\beta E(v^{(k)},h)}+ \ln Z$

where now $v^{(k)}$ is the k-the sample point corresponding to a set of values for the visible units.

Again we will need the derivatives of this with respect to the weights. For the second term – the logarithm of the partition function – we have already seen in the last post how this works. Recalling the results from this post, we easily find that

$\frac{\partial}{\partial W_{ij}} \ln Z = - \beta \langle \frac{\partial E}{\partial W_{ij}} \rangle_P = \beta \langle v_i h_j \rangle_P$

so that the derivative is again an expectation value which we could try to approximate using a sample of the model distribution. The first term requires a bit more work. Let us first calculate

$\frac{\partial }{\partial W_{ij}} \ln \sum_h e^{-\beta E(v,h)} = \frac{1}{Z P(v)} \sum_h \frac{\partial }{\partial W_{ij}} e^{-\beta E(v.h)}= - \beta \sum_h \frac{\partial E(v,h)}{\partial W_{ij}} P(h | v)$

But this is again an expectation value, this time it is an expectation value with respect to the conditional distribution of the hidden units given the visible units.

$\frac{\partial }{\partial W_{ij}} \ln \sum_h e^{-\beta E(v,h)} = - \beta \langle \frac{\partial E(v,h)}{\partial W_{ij}} \rangle_{P(\cdot | v)}$

The derivative of the energy with respect to the weights is as above, and we finally obtain the following update rule for the weights:

$\Delta W_{ij} = \lambda \beta \left[ \langle \langle v_i h_j \rangle_{P(\cdot | v)} \rangle_{\mathcal D} - \langle v_i h_j \rangle_P \right]$

Note that the first term is a double expectation value – for each sample $v^{(k)}$ for the visible units, we use the expectation value under the conditional distribution over the hidden units given this value for the visible units.

Now let us start to simplify this expression a bit further, leveraging the restrictions on the geometry of the network. Let us first try to find an expression for the conditional probability

$P(h_j = 1 | v)$

This is in fact easy to calculate in our situation. As the state of a hidden unit does not depend on the other hidden units, but only on the visible units, we find that

$P(h_j = 1 | v)= \sigma(\beta (\sum_i W_{ij} v_i + c_j)) = \sigma(\beta a_j)$

where

$a_j = \sum_i W_{ij} v_i + c_j$

is the activation of the hidden unit j. Using this, we can already simplify the first term in the update rule as follows:

$\langle v_i h_j \rangle_{P(\cdot | v)} = \sum_h P(h | v) v_i h_j = v_i \sum_{h : h_j = 1} P(h | v)$

But this is of course nothing but

$v_i P(h_j = 1 | v)$

so that we eventually find

$\langle v_i h_j \rangle_{P(\cdot | v)} = v_i \sigma (\beta a_j)$

A similar argument works for the second term in the update rule. We have

$\langle v_i h_j \rangle_P = \sum_v \sum_h v_i h_j P(v,h) = \sum_v v_i P(v) \sum_h h_j P(h | v)$

Now the second term sum is again the conditional probability for $h_j$ to be one given v, so that this turns into

$\langle v_i h_j \rangle_P = \sum_v v_i P(v) \sigma(\beta a_j) = \langle v_i \sigma(\beta a_j) \rangle_{P(v)}$

We therefore finally obtain the following simplified update rule.

$\Delta W_{ij} = \beta \left[ \langle v_i \sigma(\beta a_j) \rangle_{\mathcal D} - \langle v_i \sigma(\beta a_j) \rangle_{P(v)} \right]$

Thus again, we see that the gradient is composed of two terms, which we call the positive phase and the negative phase. In each phase, we sample the same expression, once over the data distribution and once over the marginal distribution.

How do we actually calculate these terms? The positive phase is easy – we have written this as an expectation value, but it is nothing but an ordinary sum. For each vector in the sample, we calculate the activation of the hidden unit j, apply the multiplication by $\beta$ and the sigmoid function and multiply the result with the value of the visible unit. So this is in fact an easily calculated analytical expression.

Whereas we have found an analytic expression for the positive phase, there is no obvious analytic expression for the negative phase, so we again need a sampling procedure to calculate this term. At this point, the special structure of the network again helps to make the sampling easier. Suppose we wanted to apply an ordinary Gibbs sampler, where instead of choosing the neuron that we update next randomly, we cycle sequentially through all the neurons. We could then do all the hidden neurons first and then continue with the visible units. Now, as the visible units only depend on the hidden units and vice versa, we could as well update all hidden units in parallel and then all visible units in parallel, using that as in the case of hidden units, the conditional probability for a visible unit to be one can be expressed as

$P(v_i = 1 | h) = \sigma(\beta (\sum_j W_{ij} h_j + b_i))$

This procedure is called Gibbs sampling with block updates. It is also obvious that sampling from the joint distribution $P(v,h)$ in this way and then ignoring the values of the hidden units in this way gives a sampler for the marginal distribution.

Therefore our algorithm to calculate the second term of the update rule would be as follows. We would start with some value for the visible units. Then we would calculate the probability that each hidden unit is on given these values for the visible units and update the hidden units according to this distribution. We would then use the new values for the hidden units, calculate the conditional distribution of the visible units and update the visible units according to this distribution. This would constitute a full Gibbs sampling step. We would repeat this process until convergence is reached and then sample for a few steps to calculate the expectation values above. Plugging this into the update rule and calculating the first term analytically, we would then obtain the needed update for the weights.

So it looks like we are back to our old problem – to calculate one weight update during the gradient descent procedure, we have to run a Gibbs sampler to convergence. Fortunately, it turns out that several approximations exist that make this calculation feasible. Next, we will look at two of these approaches – constrastive divergence and its companion persistent contrastive divergence (PCD). We will then implement both algorithms in Python and try it out, first on a small sample set and then finally on the MNIST data set. But this post has already grown a bit lengthy – so let us save this for the next post in this series.

Turn on the heating – from Hopfield networks to Boltzmann machines

In my recent post on Hopfield networks, we have seen that these networks suffer from the problem of spurious minima and that the deterministic nature of the dynamics of the network makes it difficult to escape from a local minimum. A possible approach to avoid this issue is to randomize the update rule. Intuitively, we want to move into a direction of lower energy most of the time, but sometimes allow the network to move a different direction, so that there is a certain probability to move away from a local minimum.

In a certain sense, a Boltzmann machine is exactly this – a stochastic version of a Hopfield network. If we want to pursue the physical analogy further, think of a Hopfield network as an Ising model at a very low temperature, and of a Boltzmann machine as a “warm” version of the same system – the higher the temperature, the higher the tendency of the network to behave randomly and to escape local minima. As for the Hopfield network, there are different versions of this model. We can allow the units to take any real value, or we can restrict the values to two values. In this post, we will restrict ourselves to binary units. Thus we consider a set of N binary units, taking values -1 and +1, so that our state space is again

$S = \{ -1, +1 \}^N$

Similar to a Hopfield network, each unit can be connected to every other unit, so that again the weights are given by an N x N matrix W that we assume to be symmetric and zero along the diagonal. For a state s, we define the energy to be

$E(s) = - \frac{1}{2} \langle s, Ws \rangle$

This energy defines a Boltzmann distribution on the state space, given by

$P(s) = \frac{1}{Z} e^{-\beta E(s)}$

Now our aim is to adjust the weights such that this distribution is the best possible approximation to the real distribution behind the training data.

How do we measure the distance between the current distribution and the target distribution? A common approach to do this is called the maximum likelihood approach: given a set of weights W, we try to maximize the probability for the training data under the distribution given by W. For convenience, one does usually not maximize this function directly, but instead minimizes minus the logarithm of this probability, divided by the number K of samples. In our case, we therefore try to minimize the loss function

$l({\mathcal D} | W) = - \frac{1}{K} \ln P({\mathcal D} | W)$

Now let us assume that our sample is given by K data points that we denote by $s^{(k]}$ . Assuming that the sample states are independent, we can write the probability for the data given the weights W as the product.

$P({\mathcal D} | W) = \prod_k P(s^{(k)} | W)$

Using the definition of the Boltzmann distribution and the partition function Z, we can therefore express our loss function as

$l({\mathcal D} | W) = \ln Z + \frac{\beta}{K} \sum_k E(s^{(k)})$

where $s^{(k)}$ is the k-th sample point, where we assume that all sample points are drawn independently.

Now how do we minimize this function? An obvious approach would be to use the gradient descent algorithm or one of its variants. To be able to do this, we need the gradient of the loss function. Let us first calculate the partial derivative for the first term, the logarithm of the partition function. This is

$\frac{\partial}{\partial W_{ij}} \ln Z = \frac{1}{Z} \sum_s \frac{\partial}{\partial W_{ij}} e^{-\beta E(s)} = - \beta \frac{1}{Z} \sum_s e^{-\beta E(s)} \frac{\partial}{\partial W_{ij}} E(s) = - \beta \sum_s P(s) \frac{\partial}{\partial W_{ij}} E(s)$

Now the sum on the right hand side of this equation is of the form “probability of a state times a function of this state”. In other words, this is an expectation value. Using the standard notation for expectation values, we can therefore write

$\frac{\partial}{\partial W_{ij}} \ln Z = - \beta \langle \frac{\partial}{\partial W_{ij}} E(x) \rangle_P$

If you remember that expectation values can be approximated using Monte Carlo methods, this is encouraging, at least we would have an idea how to calculate this. Let us see whether the second term can be expressed as an expectation values as well. In fact, this is even easier.

$\frac{\partial}{\partial W_{ij}} \frac{\beta}{K} \sum_k E(s^{(k)}) = - \frac{\beta}{2} \frac{1}{K} \sum_k s_i^{(k)} s_j^{(k)}$

Now this is again an expectation value – it is not an expectation value under the model distribution (the Boltzmann distribution) but under the empirical distribution of the data set.

$\frac{\partial}{\partial W_{ij}} \frac{\beta}{K} \sum_k E(s^{(k)}) = - \frac{\beta}{2} \langle s_i s_j \rangle_{\mathcal D}$

Finally, our expression for the first term still contains the derivative of the energy, which is easily calculated. Putting all of this together, we now obtain a formula for the gradient of the loss function which is maybe the most important single formula for Boltzmann machines that you need to remember.

$\frac{\partial}{\partial W_{ij}} l({\mathcal D} | W) = - \frac{\beta}{2} \left[ \langle s_i s_j \rangle_{\mathcal D} - \langle s_i s_j \rangle_P \right]$

When using the standard gradient descent algorithm, this expression for the gradient leads to the following update rule for the weights, where $\lambda$ is the step size.

$\Delta W_{ij} = \frac{1}{2} \lambda \beta \left[ \langle s_i s_j \rangle_{\mathcal D} - \langle s_i s_j \rangle_P \right]$

Let us pause for a moment and reflect what this formula tells us. First, if we have reached our goal – model distribution and sample distribution are identical – the gradient is zero and the algorithm stops.

Second, the first term is essentially the Hebbian learning rule that we have used to train our Hopfield network. In fact, this is the weighted sum over the product $s_i s_j$ across all sample points, i.e. we strengthen a connection between two units if the two units are strongly correlated in the sample set, and weaken the connection otherwise. The second term is a correction to the Hebbian rule that does not appear in a Hopfield network.

The third point that we can observe is a bit more subtle. To explain it, assume for a moment that the data has been normalized (which is very often done in actual applications) so that its average is zero, in other words such that the expectation values

$\langle s_i \rangle_{\mathcal D}$

of the coordinates are all zero. For the Boltzmann distribution, this will be the case anyway as the distribution is symmetric in s (and it would therefore not even make sense to try to achieve convergence with unnormalized data). Thus the two terms that appear in the above equation are simply the elements of the covariance matrix under empirical and model distribution. This implies that a Boltzmann machine is not able to distinguish two distributions that have the same second moments as the covariance matrix is all that it sees.

That is a bit disappointing as it limits the power of our model significantly. But this is not the only problem with Boltzmann machines. Whereas we could easily calculate the first term in our formula for the weight change, the second term is more difficult. In our discussion of the Ising model, we have already seen that we could use Gibbs sampling for this, but would need to run a Gibbs sampling chain to convergence which can easily take one million steps or more for large networks. Now this is embedded into gradient descent which is by itself an iterative algorithm! Imagine that one single gradient descent step could take a few minutes and then remember that we might need several thousand of these steps and you see that we are in trouble.

Fortunately, help is on the way – with a slightly simplified model called restricted Boltzmann machines, both problems can be solved. I will look at this class of networks in my next post in this series. If you do not want to wait until then, you can take a look at my notes on Boltzmann machines that also give you some more background on what we have discussed in this post.

Before we close, let me briefly describe what we could do with a Boltzmann machine if we had found a way to train it. Similar to a Hopfield network, Boltzmann machines are generative models. Thus once they are trained, we can either use them to create samples or to correct errors. If, for instance, each unit corresponds to a pixel in an image of a handwritten digit, we could sample from the model to obtain artificially created images that resemble handwritten digits. We could also use the network for pattern completion – if we have an image where a few pixels have been erased, we could start the network in the state given by the remaining pixels and some random values for the unknown pixels and hope that it converges to the memorized state, thus reconstructing the unknown part of the picture. However, specifically for restricted Boltzmann machines, we will see that an even more important application is to be used as feature extractor in deep layered networks.

So there are good reasons to continue analyzing these networks – so join me again in my next post when we discuss restricted Boltzmann machines.

Hopfield networks: practice

After having discussed Hopfield networks from a more theoretical point of view, let us now see how we can implement a Hopfield network in Python.

First let us take a look at the data structures. We will store the weights and the state of the units in a class HopfieldNetwork. The weights are stored in a matrix, the states in an array.

class HopfieldNetwork:

    #
    # Initialize a Hopfield network with N
    # neurons
    #
    def __init__(self, N):
        self.N = N
        self.W = np.zeros((N,N))
        self.s = np.zeros((N,1))

Next we write a method for the update rule. In a matrix notation, the activation of unit i is given as the dot product of the current state and the i-th row of the weight matrix (or the i-th column, as the matrix is symmetric). Therefore we can use the following update function.

#
# Run one simulation step
#
def runStep(self):
    i = np.random.randint(0,self.N)
    a = np.matmul(self.W[i,:], self.s)
    if a < 0:
        self.s[i] = -1
    else:
        self.s[i] = 1

Finally, there is the Hebbian learning rule. If we store the sample set in a matrix S such that each row corresponds to one sample, the learning rule is equivalent to

$W = S^T S$

Thus we can again use the matrix multiplication from the numpy package, and our learning rule is simply

def train(self, S):
    self.W = np.matmul(S.transpose(), S)

Now we need some pattern to train our network. For the sake of demonstration, let us use a small network with 5 x 5 units, each of them representing a pixel in a grayscale 5 x 5 image. For this post, I have hardcoded five simple patterns, but we could of course use any other training set.

To illustrate how the Hopfield network operates, we can now use the method train to train the network on a few of these patterns that we call memories. We then take these memories and randomly flip a few bits in each of them, in other words we simulate random errors in the pattern. We then place the network in these states and run the update rule several times. If you want to try this yourself, get the script Hopfield.py from my GitHub repository.

The image above shows the result of this exercise. Here we have stored three memories – the first column of images – in the network. The second image in each row then shows the distorted versions of these patterns after flipping five bits randomly. The next images in each row show the state of the network after 20, 40, 60, 80 and 100 iterations of the update rule. We see that in this case, all errors could be removed and the network did in fact converge to the original image.

However, the situation becomes worse if we try to store more memories. The next image shows the outcome of a simulation with the same basic parameters, but five instead of three memories.

We clearly see that not a single one of the distorted patterns converges to the original image. Apparently, we have exceeded the capacity of the network. In his paper, Hopfield – based on theoretical considerations and simulations – argues that the network can only store approximately $0.15 \cdot N$ patterns, where N is the number of units. In our case, with 25 units, this would be approximately 3 to 4 patterns.

If we exceed this limit, we seem to create local minima of the energy that do not correspond to any of the stored patterns. This problem is known as the problem of spurious minima which can also occur if we stay below the maximum capacity – if you do several runs with three memories, you will find that also in this case, spurious minima can occur.

The update rule of the Hopfield network is deterministic, its energy can never increase. Thus if the system moves into one of those local minima, it can never escape again and gets stuck. An Ising model at a finite, non-zero temperature behaves differently. As the update rule is stochastic, it is possible that the system moves away from a local minimum in a Gibbs sampling step, and therefore has the chance to escape from a spurious minimum. This is one of the reasons why researchers have tried to come up with stochastic versions of the Hopfield network. One of these stochastic versions is the Boltzmann network, and we will start to look at its theoretical foundations in the next post in this series.

Hopfield networks: theory

Having looked in some detail at the Ising model, we are now well equipped to tackle a class of neuronal networks that has been studied by several authors in the sixties, seventies and early eighties of the last century, but has become popular by an article [1] published by J. Hopfield in 1982.

The idea behind this and earlier research is as follows. Motivated by the analogy between a unit in a neuronal network and a neuron in a human brain, researchers were trying to understand how the neurons needed to be organized to be able to create abilities like associative memories, i.e. a memory that can be navigated by associations that bring up additional stored memories. To explain how the human brain organizes the connections between the neurons in optimal (well, as least useful) way, analogies with physical systems like the Ising model covered in this post which also exhibit some sort of spontaneous self organization, were pursued.

In particular, the analogy with stability attracted attention. In many physical systems, there are stable states. If the system is put into a state which is sufficiently close to such a stable state, it will, over time, move back into that stable state. A similar property is desirable for associative memory systems. If, for instance, such a system has memorized an image and is then placed in a state which is somehow close to that image, i.e. only a part of the image or a noisy version of the image is presented, it should converge into the memorized state representing the original image.

With that motivation, Hopfield described the following model of a neuronal network. Our network consists of individual units that can be in any of two states, “firing” and “not firing”. The system consists of N such units, we will denote the state of unit i by $s_i$ .

Any two units can be connected, and there is a matrix W whose elements represent the strength of the connection between the individual units, i.e. $w_{ij}$ is the strength of the connection between the units i and j. We assume that no neuron is connected to ifself, i.e. that $w_{ii} = 0$ , and that the matrix of weights is symmetric, i.e. that $w_{ij} = w_{ji}$ .

The activation of unit i is then obtained by summing up the weighted values of all neurons connected to it, i.e. given by

$a_i = \sum_j w_{ij} s_j$

Hopfield used a slightly different notation in his paper and assigned the values 0 and 1 to the two states, but we will again use -1 and +1.

So how does the Hopfield network operate? Suppose that the network is in a certain state. i.e. some of the neurons will be “firing”, represented by the value +1, and others will be passive, represented by the value -1. We now choose a neuron at random and calculate its activation function according to the formula above. We then determine the new state by the rule

$s_i' = \begin{cases} +1 & a_i \geq 0 \\ -1 & a_i < 0 \end{cases}$

In most cases, the network will actually converge after a finite number of steps, i.e. this rule does not change the state any more. To see why this happens, let us consider the function

$E(s) = - \frac{1}{2} \sum_{i,j} w_{ij} s_i s_j$

which is called the energy function of the model. Suppose that we pass from a state s to a state s’ by applying the update rule above. Let us assume that we have updated neuron i and changed its state from $s_i$ to $s_i'$ . Let

$\Delta s_i = s_i' - s_i$

Using the fact that the matrix W is symmetric, we can then write

$E(s') = -\frac{1}{2} \sum_{p,q \neq i} s_p s_q - \sum_p w_{pi} s_i' s_p$

which is the same as

$-\frac{1}{2} \sum_{p,q \neq i} s_p s_q - \sum_p w_{pi} s_i s_p - \sum_p w_{pi} (s_i' - s_i) s_p$

Thus we find that

$E(s') = E(s) - \Delta s_i \sum_p w_{ip} s_p$

Now the sum is simply the activation of neuron i. As our update rule guarantees that the product of $\Delta s_i$ and the activation of unit i is never negative, this implies that during the upgrade process, the energy function will always increase or stay the same. Thus the state will settle in a local minimum of the energy function.

At this point, we can already see some interesting analogies with the Ising model. Clearly, the units in a Hopfield network correspond to the particles in an Ising model. The state (firing or not) corresponds to the spin (upward or downward). The energy is almost literally the same as the energy of the Ising model without an external magnetic field.

Also the update rules are related. Recall that during a Gibbs sampling step for an Ising model, we calculate the conditional probability

$P = \sigma(2 \beta \langle J_i, s \rangle)$

Here the scalar product is the equivalent of the activation, and we could rewrite this as

$P = \sigma(2 \beta a_i)$

Let us now assume that the temperature is very small, so that $\beta$ is close to infinity. If the activation of unit i is positive, the probability will be very close to one. The Gibbs sampling rule will then almost certainly set the spin to +1. If the activation is negative, the probability will be zero, and we will set the spin to -1. Thus the update role of a Hopfield network corresponds to the Gibbs sampling step for an Ising model at temperature zero.

At nonzero temperatures, a Hopfield network and an Ising model start to behave differently. The Boltzmann distribution guarantees that the state with the lowest energies are most likely, but as the sampling process proceeds, the random element built into the Gibbs sampling rule implies that a state can evolve into another of higher energy as well, even though this is unlikely. For the Hopfield network, the update rule is completely deterministic, and the states will always evolve into states of lower or at least equal energy.

The memories that we are looking for are now the states of minimum energy. If we place the system in a nearby state and let it evolve according to the update rules, it will move over time back into a minimum and thus “remember” the original state.

This is nice, but how do we train a Hopfield network? Given some state s, we want to construct a weight matrix such that s is a local minimum. More generally, if we have already defined weights giving some local minima, we want to adjust the weights in order to create an additional minimum at s, if possible without changing the already existing minima significantly.

In Hopfields paper, this is done with the following learning rule.

$w_{ij} = \begin{cases} \sum_s S^{(s)}_i S^{(s)}_j & i \neq j \\ 0 & i = j \end{cases}$

where $S^{(1)}, \dots, S^{(K)}$ are the states that the network should remember (in a later post in this series, we will see that this rule can be obtained as the low temperature limit of a training algorithm called contrastive divergence that is used to train a certain class of Boltzmann machines).

Thus a state S contributes with a positive value to $w_{ij}$ if $S_i$ and $S_j$ have the same sign, i.e. are in the same state. This corresponds to a rule known as Hebbian learning rule that has been postulated as a principle of learning by D. Hebb and basically states that during learning, connections between neurons are enforced if these neurons fire together ([2], chapter 4).

Let us summarize what we have done so far. We have described a Hopfield network as a fully connected binary neuronal network with symmetric weight matrices and have defined the update rule and the learning rule for these networks. We have seen that the dynamics of the network resembles that of an Ising model at low temperatures. We now expect that a randomly chosen initial state will converge to one of the memorized states and that therefore, this model can serve as an associative memory.

In the next post, we will put this to work and implement and train a Hopfield network in Python to study its actual behavior.

References

1. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci. Vol. 79, No. 8 (1982), pp. 2554-2558
2. D.O. Hebb, The organization of behaviour, Wiley, New York 1949