Turn on the heating – from Hopfield networks to Boltzmann machines

In my recent post on Hopfield networks, we saw that these networks suffer from the problem of spurious minima and that the deterministic nature of the network dynamics makes it difficult to escape from a local minimum. A possible approach to avoid this issue is to randomize the update rule. Intuitively, we want to move in a direction of lower energy most of the time, but sometimes allow the network to move in a different direction, so that there is a certain probability to move away from a local minimum.

In a certain sense, a Boltzmann machine is exactly this – a stochastic version of a Hopfield network. If we want to pursue the physical analogy further, think of a Hopfield network as an Ising model at a very low temperature, and of a Boltzmann machine as a “warm” version of the same system – the higher the temperature, the higher the tendency of the network to behave randomly and to escape local minima. As for the Hopfield network, there are different versions of this model. We can allow the units to take any real value, or we can restrict them to two values. In this post, we will restrict ourselves to binary units. Thus we consider a set of N binary units, taking values -1 and +1, so that our state space is again

S = \{ -1, +1 \}^N

Similar to a Hopfield network, each unit can be connected to every other unit, so that again the weights are given by an N x N matrix W that we assume to be symmetric and zero along the diagonal. For a state s, we define the energy to be

E(s) = - \frac{1}{2} \langle s, Ws \rangle

This energy defines a Boltzmann distribution on the state space, given by

P(s) = \frac{1}{Z} e^{-\beta E(s)}

where Z = \sum_s e^{-\beta E(s)} is the partition function that normalizes the probabilities and \beta = 1 / kT is the inverse temperature.

Now our aim is to adjust the weights such that this distribution is the best possible approximation to the real distribution behind the training data.

How do we measure the distance between the current distribution and the target distribution? A common approach is the maximum likelihood approach: given a set of weights W, we try to maximize the probability of the training data under the distribution given by W. For convenience, one usually does not maximize this function directly, but instead minimizes minus the logarithm of this probability, divided by the number K of samples. In our case, we therefore try to minimize the loss function

l({\mathcal D} | W) = - \frac{1}{K} \ln P({\mathcal D} | W)

Now let us assume that our sample is given by K data points that we denote by s^{(k)}. Assuming that the sample states are independent, we can write the probability for the data given the weights W as the product

P({\mathcal D} | W) = \prod_k P(s^{(k)} | W)

Using the definition of the Boltzmann distribution and the partition function Z, we can therefore express our loss function as

l({\mathcal D} | W) = \ln Z + \frac{\beta}{K} \sum_k E(s^{(k)})

where, as before, s^{(k)} denotes the k-th sample point.
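
To see this, simply plug the definition of the Boltzmann distribution into the product above:

- \frac{1}{K} \ln \prod_k P(s^{(k)} | W) = - \frac{1}{K} \sum_k \left[ - \beta E(s^{(k)}) - \ln Z \right] = \ln Z + \frac{\beta}{K} \sum_k E(s^{(k)})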

Now how do we minimize this function? An obvious approach would be to use the gradient descent algorithm or one of its variants. To be able to do this, we need the gradient of the loss function. Let us first calculate the partial derivative for the first term, the logarithm of the partition function. This is

\frac{\partial}{\partial W_{ij}} \ln Z = \frac{1}{Z} \sum_s \frac{\partial}{\partial W_{ij}} e^{-\beta E(s)} = - \beta \frac{1}{Z} \sum_s e^{-\beta E(s)} \frac{\partial}{\partial W_{ij}} E(s) = - \beta \sum_s P(s) \frac{\partial}{\partial W_{ij}} E(s)

Now the sum on the right hand side of this equation is of the form “probability of a state times a function of this state”. In other words, this is an expectation value. Using the standard notation for expectation values, we can therefore write

\frac{\partial}{\partial W_{ij}} \ln Z = - \beta \left\langle \frac{\partial}{\partial W_{ij}} E(s) \right\rangle_P

If you remember that expectation values can be approximated using Monte Carlo methods, this is encouraging – at least we have an idea how to calculate this term. Let us see whether the second term can be expressed as an expectation value as well. In fact, this is even easier.

\frac{\partial}{\partial W_{ij}} \frac{\beta}{K} \sum_k E(s^{(k)}) = - \frac{\beta}{2} \frac{1}{K} \sum_k s_i^{(k)} s_j^{(k)}

Now this is again an expectation value – it is not an expectation value under the model distribution (the Boltzmann distribution) but under the empirical distribution of the data set.

\frac{\partial}{\partial W_{ij}} \frac{\beta}{K} \sum_k E(s^{(k)}) = - \frac{\beta}{2} \langle s_i s_j \rangle_{\mathcal D}

Finally, our expression for the first term still contains the derivative of the energy, which is easily calculated. Putting all of this together, we now obtain a formula for the gradient of the loss function which is maybe the most important single formula for Boltzmann machines that you need to remember.

\frac{\partial}{\partial W_{ij}} l({\mathcal D} | W) = - \frac{\beta}{2} \left[ \langle s_i s_j \rangle_{\mathcal D} - \langle s_i s_j \rangle_P \right]

When using the standard gradient descent algorithm, this expression for the gradient leads to the following update rule for the weights, where \lambda is the step size.

\Delta W_{ij} = \frac{1}{2} \lambda \beta \left[ \langle s_i s_j \rangle_{\mathcal D} - \langle s_i s_j \rangle_P \right]
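
To make this concrete, here is a minimal numpy sketch of this update step. The function name and parameters are my own choice for illustration, and we assume that we already have a way to draw samples from the model distribution, for instance via Gibbs sampling.

import numpy as np

def weight_update(data_samples, model_samples, beta=1.0, step_size=0.01):
    # rows of data_samples and model_samples are states in {-1,+1}^N;
    # model_samples are assumed to be drawn from the Boltzmann distribution
    data_corr = np.matmul(data_samples.T, data_samples) / data_samples.shape[0]
    model_corr = np.matmul(model_samples.T, model_samples) / model_samples.shape[0]
    dW = 0.5 * step_size * beta * (data_corr - model_corr)
    np.fill_diagonal(dW, 0)    # keep the diagonal of W at zero
    return dW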

Let us pause for a moment and reflect what this formula tells us. First, if we have reached our goal – model distribution and sample distribution are identical – the gradient is zero and the algorithm stops.

Second, the first term is essentially the Hebbian learning rule that we have used to train our Hopfield network. In fact, this term is the average of the product s_i s_j across all sample points, i.e. we strengthen the connection between two units if the two units are strongly correlated in the sample set, and weaken it otherwise. The second term is a correction to the Hebbian rule that does not appear in a Hopfield network.

The third point that we can observe is a bit more subtle. To explain it, assume for a moment that the data has been normalized (which is very often done in actual applications) so that its average is zero, in other words such that the expectation values

\langle s_i \rangle_{\mathcal D}

of the coordinates are all zero. For the Boltzmann distribution, this will be the case anyway, as the distribution is symmetric in s (and it would therefore not even make sense to try to achieve convergence with unnormalized data). Thus the two terms that appear in the above equation are simply the elements of the covariance matrix under the empirical and the model distribution. This implies that a Boltzmann machine is not able to distinguish between two distributions that have the same second moments – the covariance matrix is all that it sees.

That is a bit disappointing, as it limits the power of our model significantly. But this is not the only problem with Boltzmann machines. Whereas we can easily calculate the first term in our formula for the weight change, the second term is more difficult. In our discussion of the Ising model, we have already seen that we can use Gibbs sampling for this, but we would need to run a Gibbs sampling chain to convergence, which can easily take a million steps or more for large networks. And this is embedded into gradient descent, which is itself an iterative algorithm! Imagine that a single gradient descent step could take a few minutes, remember that we might need several thousand of these steps, and you see that we are in trouble.

Fortunately, help is on the way – with a slightly simplified class of models called restricted Boltzmann machines, both problems can be solved. I will look at this class of networks in my next post in this series. If you do not want to wait until then, you can take a look at my notes on Boltzmann machines, which also give you some more background on what we have discussed in this post.

Before we close, let me briefly describe what we could do with a Boltzmann machine if we had found a way to train it. Similar to Hopfield networks, Boltzmann machines are generative models. Thus once they are trained, we can either use them to create samples or to correct errors. If, for instance, each unit corresponds to a pixel in an image of a handwritten digit, we could sample from the model to obtain artificially created images that resemble handwritten digits. We could also use the network for pattern completion – if we have an image where a few pixels have been erased, we could start the network in the state given by the remaining pixels and some random values for the unknown pixels, and hope that it converges to the memorized state, thus reconstructing the unknown part of the picture. However, specifically for restricted Boltzmann machines, we will see that an even more important application is their use as feature extractors in deep layered networks.

So there are good reasons to continue analyzing these networks – so join me again in my next post when we discuss restricted Boltzmann machines.

 

Hopfield networks: practice

After having discussed Hopfield networks from a more theoretical point of view, let us now see how we can implement a Hopfield network in Python.

First let us take a look at the data structures. We will store the weights and the state of the units in a class HopfieldNetwork. The weights are stored in a matrix, the states in an array.

import numpy as np

class HopfieldNetwork:

    #
    # Initialize a Hopfield network with N
    # neurons
    #
    def __init__(self, N):
        self.N = N
        self.W = np.zeros((N,N))
        self.s = np.zeros((N,1))

Next we write a method for the update rule. In a matrix notation, the activation of unit i is given as the dot product of the current state and the i-th row of the weight matrix (or the i-th column, as the matrix is symmetric). Therefore we can use the following update function.

#
# Run one simulation step: choose a unit at random
# and update it (a method of the HopfieldNetwork class)
#
def runStep(self):
    i = np.random.randint(0, self.N)
    # activation of unit i: dot product of row i of W and the state
    a = np.matmul(self.W[i,:], self.s).item()
    if a < 0:
        self.s[i] = -1
    else:
        self.s[i] = 1

Finally, there is the Hebbian learning rule. If we store the sample set in a matrix S such that each row corresponds to one sample, the learning rule is equivalent to

W = S^T S

Thus we can again use the matrix multiplication from the numpy package. As the units take the values -1 and +1, the diagonal of S^T S equals the number of samples, so we zero it out explicitly to respect our convention w_{ii} = 0 – our learning rule is then simply

def train(self, S):
    # Hebbian learning rule W = S^T S, with the diagonal set to zero
    self.W = np.matmul(S.transpose(), S)
    np.fill_diagonal(self.W, 0)

Now we need some patterns to train our network. For the sake of demonstration, let us use a small network with 5 x 5 units, each of them representing a pixel in a binary 5 x 5 image. For this post, I have hardcoded five simple patterns, but we could of course use any other training set.

To illustrate how the Hopfield network operates, we can now use the method train to train the network on a few of these patterns that we call memories. We then take these memories and randomly flip a few bits in each of them, in other words we simulate random errors in the pattern. We then place the network in these states and run the update rule several times. If you want to try this yourself, get the script Hopfield.py from my GitHub repository.
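
If you prefer a quick self-contained experiment, the following sketch follows the same steps, with random patterns standing in for the hardcoded ones (and assuming that the methods shown above have been added to the HopfieldNetwork class):

import numpy as np

# three random 5 x 5 patterns standing in for the hardcoded memories
memories = np.random.choice([-1.0, 1.0], size=(3, 25))

network = HopfieldNetwork(25)
network.train(memories)

# simulate random errors: flip five randomly chosen bits of the first memory
state = memories[0].reshape(25, 1).copy()
flip = np.random.choice(25, size=5, replace=False)
state[flip] = -state[flip]

# place the network in the distorted state and run the update rule
network.s = state
for _ in range(100):
    network.runStep()

print(int(np.sum(network.s.flatten() == memories[0])), "of 25 pixels recovered")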

[Image: reconstruction of three stored patterns by the Hopfield network]

The image above shows the result of this exercise. Here we have stored three memories – the first column of images – in the network. The second image in each row then shows the distorted versions of these patterns after flipping five bits randomly. The next images in each row show the state of the network after 20, 40, 60, 80 and 100 iterations of the update rule. We see that in this case, all errors could be removed and the network did in fact converge to the original image.

However, the situation becomes worse if we try to store more memories. The next image shows the outcome of a simulation with the same basic parameters, but five instead of three memories.

[Image: the same simulation with five stored memories]

We clearly see that not a single one of the distorted patterns converges to the original image. Apparently, we have exceeded the capacity of the network. In his paper, Hopfield – based on theoretical considerations and simulations – argues that the network can only store approximately 0.15 \cdot N patterns, where N is the number of units. In our case, with 25 units, this would be approximately 3 to 4 patterns.

If we exceed this limit, we seem to create local minima of the energy that do not correspond to any of the stored patterns. This problem is known as the problem of spurious minima, which can also occur if we stay below the maximum capacity – if you do several runs with three memories, you will find that spurious minima can occur in this case as well.

The update rule of the Hopfield network is deterministic, so its energy can never increase. Thus if the system moves into one of those local minima, it can never escape again and gets stuck. An Ising model at a finite, non-zero temperature behaves differently. As the update rule is stochastic, it is possible that the system moves away from a local minimum in a Gibbs sampling step, and it therefore has a chance to escape from a spurious minimum. This is one of the reasons why researchers have tried to come up with stochastic versions of the Hopfield network. One of these stochastic versions is the Boltzmann machine, and we will start to look at its theoretical foundations in the next post in this series.

 

Hopfield networks: theory

Having looked in some detail at the Ising model, we are now well equipped to tackle a class of neuronal networks that was studied by several authors in the sixties, seventies and early eighties of the last century, but became popular through an article [1] published by J. Hopfield in 1982.

The idea behind this and earlier research is as follows. Motivated by the analogy between a unit in a neuronal network and a neuron in a human brain, researchers were trying to understand how neurons need to be organized to create abilities like associative memory, i.e. a memory that can be navigated by associations that bring up additional stored memories. To explain how the human brain organizes the connections between its neurons in an optimal (well, at least useful) way, researchers pursued analogies with physical systems like the Ising model covered in a previous post, which also exhibit some sort of spontaneous self-organization.

In particular, the analogy with stability attracted attention. In many physical systems, there are stable states. If the system is put into a state which is sufficiently close to such a stable state, it will, over time, move back into that stable state. A similar property is desirable for associative memory systems. If, for instance, such a system has memorized an image and is then placed in a state which is somehow close to that image, i.e. only a part of the image or a noisy version of the image is presented, it should converge into the memorized state representing the original image.

With that motivation, Hopfield described the following model of a neuronal network. Our network consists of individual units that can be in one of two states, “firing” and “not firing”. The system consists of N such units, and we will denote the state of unit i by s_i.

Any two units can be connected, and there is a matrix W whose elements represent the strength of the connection between the individual units, i.e. w_{ij} is the strength of the connection between the units i and j. We assume that no neuron is connected to itself, i.e. that w_{ii} = 0, and that the matrix of weights is symmetric, i.e. that w_{ij} = w_{ji}.

The activation of unit i is then obtained by summing up the weighted values of all neurons connected to it, i.e. given by

a_i = \sum_j w_{ij} s_j

Hopfield used a slightly different notation in his paper and assigned the values 0 and 1 to the two states, but we will again use -1 and +1.

So how does the Hopfield network operate? Suppose that the network is in a certain state, i.e. some of the neurons will be “firing”, represented by the value +1, and others will be passive, represented by the value -1. We now choose a neuron at random and calculate its activation according to the formula above. We then determine the new state by the rule

s_i' = \begin{cases} +1 & a_i \geq 0 \\ -1 & a_i < 0 \end{cases}

In most cases, the network will actually converge after a finite number of steps, i.e. this rule does not change the state any more. To see why this happens, let us consider the function

E(s) = - \frac{1}{2} \sum_{i,j} w_{ij} s_i s_j

which is called the energy function of the model. Suppose that we pass from a state s to a state s’ by applying the update rule above. Let us assume that we have updated neuron i and changed its state from s_i to s_i'. Let

\Delta s_i = s_i' - s_i

Using the fact that the matrix W is symmetric, we can then write

E(s') = -\frac{1}{2} \sum_{p,q \neq i} w_{pq} s_p s_q - \sum_p w_{pi} s_i' s_p

which is the same as

-\frac{1}{2} \sum_{p,q \neq i} w_{pq} s_p s_q - \sum_p w_{pi} s_i s_p - \sum_p w_{pi} (s_i' - s_i) s_p

Thus we find that

E(s') = E(s) - \Delta s_i \sum_p w_{ip} s_p

Now the sum is simply the activation of neuron i. As our update rule guarantees that the product of \Delta s_i and the activation of unit i is never negative, this implies that during the update process, the energy function will always decrease or stay the same. Thus the state will settle in a local minimum of the energy function.
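
We can also convince ourselves of this numerically – here is a quick sanity check of my own, using a randomly generated symmetric weight matrix:

import numpy as np

# random symmetric weight matrix with zero diagonal
N = 25
W = np.random.randn(N, N)
W = 0.5 * (W + W.T)
np.fill_diagonal(W, 0)

# apply the update rule repeatedly and check that the energy never increases
s = np.random.choice([-1.0, 1.0], size=N)
E = -0.5 * float(np.dot(s, np.matmul(W, s)))
for _ in range(1000):
    i = np.random.randint(N)
    s[i] = 1.0 if np.dot(W[i], s) >= 0 else -1.0
    E_new = -0.5 * float(np.dot(s, np.matmul(W, s)))
    assert E_new <= E + 1e-9
    E = E_new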

At this point, we can already see some interesting analogies with the Ising model. Clearly, the units in a Hopfield network correspond to the particles in an Ising model. The state (firing or not) corresponds to the spin (upward or downward). The energy is almost literally the same as the energy of the Ising model without an external magnetic field.

Also the update rules are related. Recall that during a Gibbs sampling step for an Ising model, we calculate the conditional probability

P = \sigma(2 \beta \langle J_i, s \rangle)

Here the scalar product is the equivalent of the activation, and we could rewrite this as

P = \sigma(2 \beta a_i)

Let us now assume that the temperature is very small, so that \beta is close to infinity. If the activation of unit i is positive, the probability will be very close to one. The Gibbs sampling rule will then almost certainly set the spin to +1. If the activation is negative, the probability will be close to zero, and we will almost certainly set the spin to -1. Thus the update rule of a Hopfield network corresponds to the Gibbs sampling step for an Ising model at temperature zero.

At nonzero temperatures, a Hopfield network and an Ising model start to behave differently. The Boltzmann distribution guarantees that the states with the lowest energies are most likely, but as the sampling process proceeds, the random element built into the Gibbs sampling rule implies that a state can also evolve into another state of higher energy, even though this is unlikely. For the Hopfield network, the update rule is completely deterministic, and the states will always evolve into states of lower or at least equal energy.

The memories that we are looking for are now the states of minimum energy. If we place the system in a nearby state and let it evolve according to the update rules, it will move over time back into a minimum and thus “remember” the original state.

This is nice, but how do we train a Hopfield network? Given some state s, we want to construct a weight matrix such that s is a local minimum. More generally, if we have already defined weights giving some local minima, we want to adjust the weights in order to create an additional minimum at s, if possible without changing the already existing minima significantly.

In Hopfield's paper, this is done with the following learning rule.

w_{ij} = \begin{cases} \sum_k S^{(k)}_i S^{(k)}_j & i \neq j \\ 0 & i = j \end{cases}

where S^{(1)}, \dots, S^{(K)} are the states that the network should remember (in a later post in this series, we will see that this rule can be obtained as the low temperature limit of a training algorithm called contrastive divergence that is used to train a certain class of Boltzmann machines).

Thus a state S contributes with a positive value to w_{ij} if S_i and S_j have the same sign, i.e. are in the same state. This corresponds to a rule known as the Hebbian learning rule, which was postulated as a principle of learning by D. Hebb and basically states that during learning, connections between neurons are reinforced if these neurons fire together ([2], chapter 4).

Let us summarize what we have done so far. We have described a Hopfield network as a fully connected binary neuronal network with a symmetric weight matrix and have defined the update rule and the learning rule for these networks. We have seen that the dynamics of the network resembles that of an Ising model at low temperatures. We now expect that a randomly chosen initial state will converge to one of the memorized states and that therefore, this model can serve as an associative memory.

In the next post, we will put this to work and implement and train a Hopfield network in Python to study its actual behavior.

References

1. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci. Vol. 79, No. 8 (1982), pp. 2554-2558
2. D.O. Hebb, The organization of behaviour, Wiley, New York 1949

The Ising model and Gibbs sampling

In the last post in the series on AI and machine learning, I have described the Boltzmann distribution which is a statistical distribution for the states of a system at constant temperature. We will now look at one of the most important applications of this distribution to an actual model, the Ising model.

This model was proposed by W. Lenz and first analysed in detail by his student E. Ising in his dissertation (of which [1] is a summary) to explain ferromagnetic behavior. In Ising's model, a solid, like a piece of iron, is composed of a large number N of individual particles, each of them at a fixed location. A particle acts as a magnetic dipole that can be oriented in two different ways, corresponding to the different orientations of its spin. Ignoring the spatial structure for the time being, we can thus describe the state of the model as a point in the state space

{\mathcal S} = \{ -1, 1\}^N

We denote the elements of the state space by s and the i-th spin by s_i \in \{-1,1\} where a value of +1 is interpreted as “spin up” and a value of -1 as “spin down”.

Now, in general, a magnetic dipole which is exposed to a magnetic field B and has a magnetic dipole moment m will have a potential energy

E = - m \cdot B

which is the scalar product of m and B, i.e. the state of minimum energy is the one where the dipole is oriented along the magnetic field. In a solid, there are two sources for the magnetic field that act on each particle – there might be an external magnetic field H and there might be an interaction with the other magnetic dipoles in the model. We therefore model the total energy of a state s as

E(s) = - \frac{1}{2} \sum_{i,j} J_{ij} s_i s_j - \sum_j h_j s_j

where J_{ii} = 0 (i.e. we exclude self-interactions).

Here the coefficient h_j represents the external field acting on the particle at position j. The matrix J represents the interactions between the particles. In Ising's original model, only nearby particles interact. In two dimensions, for instance, we think of the particles as being located on a grid, and each particle has four neighbors: the particles immediately above and below it and the particles on its left and right. We can define a model which has a boundary, or we can think of the grid as being toroidal, i.e. wrapping around.
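
As a concrete illustration (a sketch of my own, not the code from the repository mentioned below), the energy of a spin configuration on a toroidal two-dimensional grid with J_{ij} = 1 for neighbors and no external field can be computed as follows.

import numpy as np

def ising_energy(grid):
    # E(s) = -1/2 sum_{i,j} J_ij s_i s_j with J_ij = 1 for nearest neighbors
    # on a torus and h = 0; rolling the grid by one position in each
    # direction visits every bond exactly once
    return -float(np.sum(grid * (np.roll(grid, 1, axis=0)
                                 + np.roll(grid, 1, axis=1))))

grid = np.random.choice([-1, 1], size=(20, 20))
print(ising_energy(grid))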

We now consider the system as being in contact with a heat bath at a certain temperature T, i.e. the system can exchange internal energy with a thermal reservoir, leading to fluctuations in the orientations of the particles. This appears to be a reasonable model, we could, for instance, consider a comparatively small part of a solid and take the surrounding, much larger solid as the heat bath. The probability for the system to be in a state s is therefore given by the Boltzmann distribution:

p(s) = \frac{1}{Z} e^{-\beta E(s)}

The macroscopic quantities that are of primary interest are of course the average energy, but also the magnetization

M(s) = \frac{1}{N} \sum_i s_i

and its average value. At high temperatures and for H = 0, we expect that roughly half of the spins should be oriented in either direction, so the average magnetization should be zero. If we add an external field, then of course most of the dipoles will be aligned in the direction of this field, so the magnetization will be close to one or minus one. This fact – magnetization of a solid in the presence of an external magnetic field – is called paramagnetism. It turns out that for some materials, a non-zero magnetization can occur even if the external field is zero, as long as the temperature is below a certain value called the critical temperature – this behavior is known as ferromagnetism. Explaining this macroscopic behavior by a statistical model was the original intention of Isings work.

Now how do we actually evaluate our model? Our aim is to determine – for instance – the average magnetization at a given temperature. To do this naively, we would have to calculate the probabilities for all possible states s. Unfortunately, the number of states grows exponentially with the number N of particles. Suppose we wanted to use a toy model with only 40 x 40 spins – this is very small compared to the number of particles in an average macroscopic solid. As N = 1600 in this case,  we would have 2^{1600} different states, which is roughly 10^{482}. Comparing this to the estimated age of the universe in seconds (4 \cdot 10^{17} , see for instance this page), it is obvious that this is not a good idea.

However, we can try to approximate the average magnetization by using a sufficiently large sample. Thus we try to find a set of states s_1, \dots, s_K which is large, but manageable – maybe a few million – and hope, based on the law of large numbers, that the sample average \frac{1}{K} \sum_i M(s_i) will provide a good approximation for the real value. This approach is sometimes called Monte Carlo integration and is the workhorse of computational statistical mechanics.

How, then, do you create that sample? The answer is easy to write down, but difficult to motivate without the necessary background on Markov chains. Thus I will simply state the algorithm, which is called Gibbs sampling, and leave the theoretical background to another post (for the mathematically inclined reader, it is worth mentioning that the sample produced by the Gibbs sampler is not independent, but the law of large numbers still holds).

Before we can phrase the algorithm, we need another preparatory step – we need to calculate a conditional probability. Suppose that the system is in a state s and we have chosen an arbitrary coordinate i. We can then ignore the actual state of the spin s_i and ask for the conditional probability that this spin points upwards given all other spins. A not too difficult calculation (which is carried out in detail for instance in my notes) shows that this conditional probability is given by

P(s_i = 1 | \{ s_j\}_{j \neq i}) = \sigma(2 \beta ( \langle J_i, s \rangle + h_i))

Here J_i denotes the i-th row of the matrix J and the brackets denote the ordinary scalar product. With this expression, a single Gibbs sampling step now proceeds as follows, given a state s.

  • Randomly pick a coordinate i
  • Calculate the conditional probability P = P(s_i = 1 | \{ s_j\}_{j \neq i}) using the formula above
  • Draw a real number U between 0 and 1 from the uniform distribution
  • If U is at most equal to P, set the spin at position i to +1, otherwise set it to -1

The algorithm then starts with a randomly chosen state and subsequently applies a large number of Gibbs sampling steps. After some time, called the burn-in time, the states after each step then form the sample we are looking for.
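
In Python, a single chain of this procedure could be sketched as follows – the names are mine, and the code is deliberately kept close to the description above rather than optimized.

import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(J, h, beta, steps, burn_in):
    # J: N x N interaction matrix with zero diagonal, h: external field,
    # beta: inverse temperature
    N = J.shape[0]
    s = np.random.choice([-1.0, 1.0], size=N)
    samples = []
    for step in range(steps):
        i = np.random.randint(N)                           # pick a coordinate
        P = sigma(2.0 * beta * (np.dot(J[i], s) + h[i]))   # conditional probability
        s[i] = 1.0 if np.random.uniform() <= P else -1.0
        if step >= burn_in:
            samples.append(s.copy())
    return np.array(samples)

The average magnetization can then be estimated as the mean over the returned samples, e.g. np.mean(gibbs_sample(J, h, beta, 1000000, 100000)).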

Let us now turn to the actual implementation. We will restrict ourselves to the original model, i.e. J_{ij} = 1 if particles i and j are neighbors and zero otherwise, and also set the magnetic field to zero. As the sketch above suggests, the Gibbs sampling algorithm is straightforward to implement in Python. You can get my code from GitHub as follows.

$ git clone https://github.com/christianb93/MachineLearning.git

In the newly created directory MachineLearning, you should then see a file IsingModel.py. Run this as follows.

$ python IsingModel.py

This will create a new temporary directory with a unique name that identifies the run (on Linux / Unix systems, you will find the newly created directory under /tmp). In this directory, you will find three files. A file with extension .txt summarizes the parameters of the run. The file whose name ends with IsingPartI.png displays the simulation results. An example is

[Image: final states of a 40 x 40 Ising model for temperatures decreasing from 6.0 to 0.2]

Each of the little images represents one final state for a given temperature. In this example, a grid of 40 x 40 spins was calculated. The temperature was slowly decreased from 6.0 down to 0.2 in steps of 0.2. For each temperature, 4 million simulation steps were done, then the resulting grid was captured. The top row represents, from the left to the right, the temperatures 6.0, 5.8, 5.6, 5.4 and 5.2. Here we see the expected behaviour – patterns with roughly half of the particles in a spin-up position and half of the particles in a spin-down orientation.

In the bottom row of the diagram, which corresponds to the temperatures 1.0, 0.8, 0.6, 0.4 and 0.2, we also see the expected behavior for very low temperatures – all spins are oriented in the same direction. However, starting at temperature 1.8, we see that large scale patterns start to emerge, especially for the temperatures 2.2 and 2.4 (rightmost pictures in the fourth row from the top). For these temperatures, entire connected regions display the same orientation of the spins and thus a non-zero mean magnetization. As the temperature rises further, these patterns dissolve again.

This behaviour is typical for a so-called critical point and is what Ising was searching for. Ironically, Ising, who of course did not have the computational devices to run a simulation, wrongly concluded in his paper that this would not happen. Critical points are of great interest not only in statistical mechanics, but also in quantum field theory – we will not be able to explore this connection further, but it demonstrates how important the Ising model has become as a playground that brings together various branches of physics, computer science and mathematics.

The program has lots of parameters to play with – with the default values, for instance, it calculates comparatively small grids with 20 x 20 spins, so that a small number of iterations is sufficient; for larger grids you will need several million iterations. I recommend playing with this a bit to get a feeling for what happens. The image at the top of this article, for instance, was generated with the following command line:

$ python IsingModel.py --show=1 --N=160000 --rows=400 --cols=400 --steps=100000 --Tmax=2.4 --Tmin=1.7

and ran for roughly 13 minutes on my PC.

It is worth mentioning that neither the Gibbs sampling algorithm nor the chosen implementation are optimized. In fact, there are other algorithms, like the exact sampling algorithm (see for instance [2]), that provide much better results, and the implementation could also be improved greatly, for instance by using a Metropolis-Hastings checkerboard algorithm to allow for a high degree of parallelization and GPU computing (see for instance [3]). However, as understanding Gibbs sampling is vital for understanding Boltzmann machines, I have chosen to use a down-to-earth Gibbs sampling approach for this post – after all, its main purpose is not to gain new physical insights from simulation results, but to get acquainted with Gibbs sampling as a standard sampling method, and, as always, to have some fun.

If you would like to learn more on the Ising model, please have a look at my notes that provide more details, show how to compute the conditional probability used for the Gibbs sampling and also cover the one-dimensional Ising model.

In the meantime, you might want to take a look at the beautiful article by J. Harder on WordPress, who has some more sample pictures and a very interesting application of convolutional neuronal networks, which he trained to determine the temperature at which the simulation was run from the visual representation of the simulation result.

References

1. E. Ising, Beitrag zur Theorie des Ferromagnetismus, Zeitschrift f. Physik, Vol. 31, No. 1 (1924), 253-258
2. D. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, Cambridge 2003
3. M. Weigel, Simulating spin models on GPU, Computer Physics Communication 182 (2011), 1833-1836

The Boltzmann distribution

Boltzmann machines essentially learn statistical distributions. During the training phase, we present them with a data set, called the sample data, that follows some statistical distribution. As the weights of the model are adjusted as part of the learning algorithm, the statistical model represented by the Boltzmann machine changes, and the learning phase is successful if the model gets as close as possible to the distribution behind the training data.

The distribution that is learned by a Boltzmann machine is called a Boltzmann distribution (obviously, this is where the name comes from…) and is of great importance in statistical physics. In this post, we will try to understand how this distribution arises naturally from a few basic assumptions in statistical mechanics.

The first thing that we will need is a state space. Essentially, a state space is a (probability) space whose points are in one-to-one correspondence with the physical states of the system that we want to describe. To illustrate this, suppose you are given a system of identical particles. These particles can move freely in a certain area of space. The location of each individual particle can be described by its position in space, thus by three real numbers, and its velocity is described by three additional real numbers, one for each spatial component of the velocity. Thus as a state space for a single particle, we could choose a six-dimensional Euclidean space. For a system of N particles, we would therefore need a state space of dimension 6N.

But a state space does not need to be somehow related to a position in some physical space. As another example, consider a solid, modeled as N particles with fixed locations in space. Let us also assume that each of these particles has a spin, and that this spin can either point upwards or downwards. We could then describe the spin of an individual particle by +1 (spin up) or -1 (spin down), and therefore our state space for N particles would be

\Omega = \{-1, +1\}^N

For every point s in the state space, i.e. for every possible state of the system, we assume that we can define an energy E(s), so that the energy defines a function E on the state space. One of the fundamental questions of statistical physics is now:

Given a state space and an energy function E on that space, derive, for each state s, the probability that the system is in the state s.

What exactly do we mean by the probability to be in a specific state? We assume that we could theoretically construct a very large number of identical systems that are isolated against each other and independent. All these systems evolve according to the same rules. At any point in time, we can then pick a certain state and determine the fraction of systems that are in this state. This number is the probability to be in that state. The set of all these systems is called a statistical ensemble. The assignment of a probability to each state then defines a probability distribution on the state space. To avoid some technicalities, we will restrict ourselves to the case that the state space is finite, but of course more general cases can be considered as well.

As it stands, the question phrased above is impossible to answer – we need some additional assumptions on the system to be able to write down the probability distribution. We could, for instance, assume that the number of particles and the energy are fixed. A bit less restrictive is the assumption that the energy can vary, but that the temperature of the system (as well as the number of particles and the volume) is fixed – this is then called a canonical ensemble. Let us denote the temperature by T and assume now that it is constant.

We could describe a system for which these assumptions hold as being in contact with a second, very large system which is able to supply (or absorb) a virtually infinite amount of heat – this system is called a heat bath or thermal reservoir. The composite system that consists of our system of interest and the heat bath is assumed to be fully isolated. Thus its energy and its number of particles are constant. Our primary system can exchange heat with the heat bath and therefore its energy can change, while its temperature will stay constant and equal to the temperature of the heat bath.

[Diagram: the primary system in contact with a heat bath, forming the canonical ensemble]

Let us denote the energy of the overall system by U_{tot}, its entropy by S_{tot}, the energy of the heat bath by U_R and the entropy of the heat bath by S_R (if energy and entropy are new concepts for you, I recommend my short introduction to thermodynamics which summarizes the most important fundamentals of thermodynamics on roughly 40 pages and also contains more details on statistical ensembles and the Boltzmann distribution).

Now it turns out that making only these very general assumptions (contact with a heat bath, constant temperature) plus one additional assumption, we can already derive a formula for the probability distribution on the state space – even without knowing anything else about the state space! Let me sketch the argument to get a feeling for it; more details can be found in the document mentioned above.

Let us assume that we are given some state s of our primary system with energy E(s). We can combine this state with a state of the heat bath to form an allowed state of the composite system if the energies of both states add up to the total energy U_{tot}, i.e. with any state of the heat bath which has energy U_{tot} - E, where we write E = E(s) for brevity. Let us denote the number of states of the heat bath with that energy by

W_R(U_{tot} - E)

and let W_{tot} denote the number of states of the composite system with energy U_{tot}. The major additional assumption that we now make is that for the composite system, all states with energy U_{tot} are equally likely – this is known as the principle of indifference.

Let us take a moment to reflect on this. The principle of indifference (sometimes called the principle of insufficient reason) states (translated into our case) that if the states of the system cannot be distinguished by any of the available observable quantities (in this case the energy, which is equal to U_{tot} for all states of the composite system), all these states should be assigned equal probabilities. This appears reasonable – if one state were more likely than any other state, this state would somehow be distinguished, but could not be told apart by any of the measurable quantities. Of course, that something is not measurable does not mean that it does not exist – there could even be a quantity that is theoretically observable and distinguishes certain states, but is simply not on our radar. So this principle has to be applied with care, but it turns out that the resulting distribution gives a surprisingly good description of what we can measure in a large number of applications, so this assumption is somehow justified by its success (see also [1], chapters 15 and 21, for a short discussion and justification).

What does the principle of indifference imply for our primary system? The number of states of the composite system for which our primary system is in state s is exactly W_R(U_{tot} - E) , because every such state combines with s to an admissible state of the composite system. If all these states are equally likely, the probability for s is therefore just the fraction of these states among all possible states of the composite system, i.e.

p(s) = \frac{W_R(U_{tot} - E)}{W_{tot}}

The beauty of this equation is that we can express both the numerator and the denominator in terms of the entropy. And this is where the Austrian physicist Ludwig Boltzmann finally comes into play. Boltzmann identified the entropy of a system with the logarithm of the number of states available to the system – that is the famous relation

S = k \ln W

engraved on his tombstone. Here W is the number of states available to the system, and k is the Boltzmann constant.

Let us now use this expression for both numerator and denominator. We start with the denominator. If the system has reached thermal equilibrium, the primary system will have a certain energy U which is related to the total energy and the energy of the reservoir by

U_{tot} = U + U_R

Using the additivity of the entropy, we can therefore write

W_{tot} = \exp \frac{1}{k} \left[    S(U) + S_R(U_{tot} - U)    \right]

Observe that we assume the number of particles and the volume of both the reservoir and the composite system to be constant, so that the entropy really only depends on the energy. For the numerator, we need a different approach. Let us try a Taylor expansion around the energy U_{tot} - U = U_R. The first derivative of the entropy with respect to the energy is – by definition – the inverse of the temperature:

\frac{\partial S_R}{\partial U_R} = \frac{1}{T}

The second derivative turns out to be related to the heat capacity C_R of the reservoir. In fact, we have

\frac{\partial}{\partial U_R} \frac{1}{T} = - \frac{1}{T^2 C_R}

Now a heat bath is characterised by the property that we can add a virtually infinite amount of heat without raising the temperature. Thus the heat capacity is infinite, and the second derivative is (virtually) zero. Therefore our Taylor expansion is

S_R(U_{tot} - E(s)) = S_R(U_{tot} - U) + \frac{1}{T} (U - E(s))
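
Inserting this expansion into our expression for p(s), the entropy of the reservoir at the equilibrium energy U_R = U_{tot} - U cancels:

p(s) = \exp \frac{1}{k} \left[ S_R(U_{tot} - E(s)) - S(U) - S_R(U_{tot} - U) \right] = \exp \frac{1}{k} \left[ \frac{1}{T} (U - E(s)) - S(U) \right]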

Pulling out a factor \frac{1}{T}, we find that

p(s) = \exp \frac{1}{kT} \left[ (U - TS(U)) - E(s) \right]

Now the energy U is the average, i.e. the macroscopically observable, energy of the primary system. Therefore U - TS is the Helmholtz energy F. If we also introduce the usual notation \beta for the inverse of kT, we finally find that

p(s) = e^{\beta F} e^{- \beta E(s)}

This is usually written down a bit differently. To obtain this form, let us sum this expression over all possible states. As U and therefore F are averages and do not depend on the actual state, and all probabilities have to add up to one, we find that

1 = e^{\beta F} \sum_s e^{- \beta E(s)}

The last sum is called the partition function and usually denoted by Z. Therefore we obtain

p(s) = \frac{1}{Z} e^{- \beta E(s)}

which is the Boltzmann distribution as it usually appears in textbooks. So we have found a structurally simple expression for the probability distribution, starting with a rather general set of assumptions.

Note that this equation tells us that the state with the lowest energy will have the highest probability. Thus the system will prefer states with low energies, and – due to the exponential – states with a significantly higher energy tend to be very unlikely.
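
For a toy system with only a handful of units, we can even verify this by brute force. Here is a short numpy sketch of my own, using an energy function of the quadratic type that we will meet again in later posts – for realistic N, the 2^N states make this approach hopeless.

import numpy as np
from itertools import product

# toy system of N = 3 spins with a random symmetric coupling matrix
N = 3
W = np.random.randn(N, N)
W = 0.5 * (W + W.T)
np.fill_diagonal(W, 0)
beta = 1.0

# enumerate all 2^N states and compute the Boltzmann distribution
states = [np.array(c, dtype=float) for c in product([-1, 1], repeat=N)]
energies = np.array([-0.5 * np.dot(s, np.matmul(W, s)) for s in states])
Z = np.sum(np.exp(-beta * energies))       # the partition function
p = np.exp(-beta * energies) / Z           # Boltzmann probabilities
assert abs(np.sum(p) - 1.0) < 1e-12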

Due to the very mild assumptions, the Boltzmann distribution applies to a large range of problems – it can be used to derive the laws for an ideal gas, for systems of spins in a solid or for a black body. However, the formula is only simple at the first glance – the real complexity is hidden in the partition function. With a few short calculations, one can for instance show that the entropy can be derived from the partition function as

S = k \frac{\partial}{\partial T} T \ln Z
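
To see where this comes from, note that the normalization condition above can be rewritten as e^{-\beta F} = Z, i.e. F = -kT \ln Z. Using the thermodynamic identity S = - \frac{\partial F}{\partial T} (at constant volume and particle number), we obtain

S = k \ln Z + k T \frac{\partial}{\partial T} \ln Z = k \frac{\partial}{\partial T} \left( T \ln Z \right)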

Thus if we know the partition function as a function of the macroscopic variables, we know all the classical thermodynamical quantities like entropy, temperature and Helmholtz energy. In many cases, however, the partition function is not known, and we have to devise techniques to derive physically interesting results without a tractable expression for the partition function.

That was it for today. This blog was a bit theoretical, so next time we will apply this to a concrete model which is accessible for numerical simulations, namely the Ising model. Until then, you might want to go through my notes which contain some details on the Boltzmann distribution as well as examples and applications.

References

1. H.B. Callen, Thermodynamics and an introduction to thermostatistics, 2nd edition, John Wiley 1985
2. D.V. Schroeder, An introduction to thermal physics, Addison Wesley 2000

Boltzmann machines, spin, Markov chains and all that

The image above displays a set of handwritten digits on the left. They look a bit like being sketched on paper by someone in a hurry and then scanned and digitized, not very accurate but still mostly readable – yet they are artificial, produced by a neuronal network, more precisely by a so-called restricted Boltzmann machine.

On the right hand side, you see the (core part of) the code that has been used to produce this image. These are about forty lines of code, and there is some code around it which is not shown, but stripping off all the comments and boilerplate code, we could probably fit the algorithm into less than fifty lines of code.

I have always found this contrast fascinating. Creating something that resembles handwritten digits sounds incredibly complex, but can be done with an algorithm that can be condensed into a comparatively short program – how does that work? So I started to dig deeper, trying to understand neuronal networks in general and Boltzmann machines in particular – the mathematical foundations, the algorithms and the implementation.

Boltzmann machines are a rather special class of neuronal networks that have a reputation of being hard to train, but are important from a theoretical point of view, being closely related to seemingly remote fields like thermodynamics, statistical mechanics and stochastic processes. That makes them interesting, but also hard to understand. I embarked on that journey a couple of months ago, and I thought it would be nice to create a series of blog posts on this. My current thinking is to have one or two posts on each of the following topics over the next couple of weeks:

  • The Boltzmann distribution
  • The Ising model and Gibbs sampling
  • Hopfield networks: theory and practice
  • Boltzmann machines
  • Restricted Boltzmann machines and their applications

This is a lot of content, and please do not expect to see one post every other day. But maybe I should just start and we will see how it goes….

So let me try to roughly sketch what a Boltzmann machine is. First, it is a neuronal network. As such it is composed of units that can be compared to the neurons in the nervous system. Similar to a neuron, each unit has an input and an output.  The output of a neuron can serve as input for another neuron. In general, every neuron receives inputs from many other neurons and delivers outputs to many other neurons.

[Diagram: a simple neuronal network with three input units and one output unit]

The diagram above shows a very simple neuronal network. It consists of four neurons. The three of them on the left serve as input to the network. Think of them as the equivalent of cells in – say – your visual cortex that are activated by some external stimulus. The cell on the right is the output of the network. Its activation is computed based on the outputs of the neurons on the left and some parameters of the network called the weights which model the strength of the connection between the neurons and which we denote by w_i, i = 0, 1, 2, as follows.

o = \sigma(w_0 x_0 + w_1 x_1 + w_2 x_2) = \sigma(w^T X)

Here \sigma is a function called the activation function and X is the vector that is formed by x_0, x_1 and x_2. There are a few standard choices for the activation function, a common one being the sigmoid function.
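
As a small illustration (a sketch with arbitrary example numbers), the output of this network can be computed in a few lines of numpy:

import numpy as np

def sigma(x):
    # the sigmoid activation function
    return 1.0 / (1.0 + np.exp(-x))

w = np.array([0.5, -0.3, 0.8])    # example weights, chosen arbitrarily
X = np.array([1.0, 0.0, 1.0])     # example input
o = sigma(np.dot(w, X))           # output of the unit on the right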

How is such a network applied? Let us look at this for a problem which is the “Hello World” of machine learning – recognition of handwritten digits. In that problem, you start with a collection of digitized images of handwritten digits, like those that are known as the MNIST database. These images have 28×28 = 784 pixels. We want to design a neuronal network that can classify these images. Such a network should consequently have 784 input units and 10 output units (bear with me that I did not produce a picture of that network, even though it would probably be fun to do this with Neo4J). We present an image to the network by setting the value of the input unit i to the intensity of pixel i. Our aim is to adjust the weights in such a way that if the image represents digit n, only output n is significantly different from zero. This will allow us to classify unknown images – we simply present the image to the network and then see which output is activated.

But how do we find the correct values for the weights? We need weights that connect each of the ten output units to each of the 784 input units, i.e. we have 7840 weights. Thus we are looking for a point in a 7840 dimensional vector space – not easy. Here the process of training comes into play. We take our set of sample images and present them to the network, with initially randomly chosen weights. The output will then not be what we want, but differ from the target output by an error. That error can be expressed as a function of the weights.  The task is then to find a minimum of the error function, and there are ways to do this, most notably the procedure which is known as gradient descent.

As the name suggests, we need the gradient of the error function for that purpose. Fortunately, for the type of neuronal network that we have sketched so far, the gradient can be computed fairly easily – in fact the activation function is chosen on purpose to make dealing with the derivatives easy. We end up with a comparatively simple training algorithm for this type of network and maybe I will show how a simple implementation in Python in a later post – but for now let us move on to Boltzmann machines.

Boltzmann machines or more precisely restricted Boltzmann machines (RBMs) are also composed of units and weights, but work a bit differently.  There are inputs, which are usually called visible units. But there are no classical outputs. Instead, there is a layer of units that is connected to the visible units and is called hidden units, as shown in the following picture.

[Diagram: a restricted Boltzmann machine with a layer of visible and a layer of hidden units]

In the example of handwritten digits again, you would again have 784 visible units. However, the hidden units would not obviously correlate with the digit represented by the input. Instead, you would have a more or less arbitrary number of hidden units, say 300.

During training, you present one image to the network, again by setting the values of the visible units (the input) to the pixel intensities. You then compute the values of the hidden units – not deterministically, however, but with a random element. Roughly speaking, if the combined input to a hidden unit is p, you set the value of the unit to one with probability p. Then this process is repeated in the opposite direction, this time starting from the hidden units, which gives you new values for the visible units. You then compare these values to the original input and try to adjust the weights such that you get as close as possible to the original input (this is a rough and not very precise description of one of the possible learning algorithms, called contrastive divergence, but we will get more precise later on).
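
In code, a crude sketch of one such training step might look as follows – with binary 0/1 units, no bias terms and made-up names, and emphatically not yet the precise algorithm that we will derive later in this series.

import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_step(W, v0, step_size=0.01):
    # visible units v0 in {0,1}^m, weight matrix W of shape (m, n)
    p_h0 = sigma(np.matmul(v0, W))                          # hidden probabilities
    h0 = (np.random.uniform(size=p_h0.shape) < p_h0) * 1.0  # sample hidden units
    p_v1 = sigma(np.matmul(h0, W.T))                        # reconstruct the visible units
    v1 = (np.random.uniform(size=p_v1.shape) < p_v1) * 1.0
    p_h1 = sigma(np.matmul(v1, W))
    # move the weights towards the data and away from the reconstruction
    W += step_size * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    return W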

If you succeed, you will be able to reconstruct the value of the visible units from the values of the hidden units. But there are less hidden units, so that the network has apparently learned a condensed representation of the input that still captures the essence of the input. In the case of digits, you can visualize the state of the network after training and obtain something like

[Image: visualizations of the weights of some hidden units after training]

These are visualizations of the weights connected to some of the hidden units of a Boltzmann machine after the training phase. We can see that some of the units have apparently “learned” characteristic features of the digits, like the vertical stroke that appears in the digits one and seven. These features can be used for several purposes.

First, we can use the features as input for other neuronal networks. We now have only 300 inputs instead of originally 784, and that might simplify the problem a bit and make the process of training the next network easier.

Or, we could start with some random values for the hidden units and calculate the resulting values of the visible units to create sample images – and in fact this is basically what I have done to produce the samples at the top of this post (using an algorithm called PCD, but we will get to this).

Note that Boltzmann machines differ significantly from the type of neural network that we have considered earlier. One major difference is that to train a Boltzmann machine, you do not need to know the digit that the image represents. At no point have we used the information that the images in our database represent ten different digits – that information was not used in the design of the network nor during the training. This approach to machine learning is called unsupervised learning and is obviously very versatile – you do not have to tell the machine what the structure of the data is, it will detect the features independently.

Second, a Boltzmann machine can not only classify images, but can also create images that resemble a given set of input data. That can be used to reconstruct partially available images or to create new images from scratch – networks with this ability are called generative networks.

The price we have to pay for this is that Boltzmann machines are hard to train. The point is that we can still define some sort of error function, but we cannot easily calculate its gradient – there is no straightforward analytic expression for it that could be evaluated within a reasonable amount of time (if you know statistical physics a bit, that might remind you of the partition function, and that is more than pure coincidence, as I will show you in a later post). So we need to approximate the gradient. Technically, the gradient is an integral

\int f(x) dx

for some known, but complicated function f. Even if we cannot find an analytic expression for this, we can try to approximate it. Your first idea might be “Riemann sums”, but it turns out that this is not a good idea, as our function lives in the space of all weights, which has a very high dimension. Instead, we will use an approach called Monte Carlo integration, where we represent the integral as an expectation value, draw a sample and approximate the expectation value by the sample average. This is where stochastic methods like Markov chains will come into play. And finally we will see that the behaviour of our network during training has some striking analogies with the behaviour of certain physical systems, like solids exposed to a magnetic field at low temperatures, which are described by a model called the Ising model, and we will learn how techniques that physicists have developed for this type of problem apply to neuronal networks.

That is it for today – I hope I could give you a rough idea of what is ahead of us. At least I hope that you start to be curious how all this works out – so looking forward to the next post where I will start with some background from physics.