Turn on the heating – from Hopfield networks to Boltzmann machines

In my recent post on Hopfield networks, we have seen that these networks suffer from the problem of spurious minima and that the deterministic nature of the dynamics of the network makes it difficult to escape from a local minimum. A possible approach to avoid this issue is to randomize the update rule. Intuitively, we want to move into a direction of lower energy most of the time, but sometimes allow the network to move a different direction, so that there is a certain probability to move away from a local minimum.

In a certain sense, a Boltzmann machine is exactly this – a stochastic version of a Hopfield network. If we want to pursue the physical analogy further, think of a Hopfield network as an Ising model at a very low temperature, and of a Boltzmann machine as a “warm” version of the same system – the higher the temperature, the higher the tendency of the network to behave randomly and to escape local minima. As for the Hopfield network, there are different versions of this model. We can allow the units to take any real value, or we can restrict the values to two values. In this post, we will restrict ourselves to binary units. Thus we consider a set of N binary units, taking values -1 and +1, so that our state space is again

$S = \{ -1, +1 \}^N$

Similar to a Hopfield network, each unit can be connected to every other unit, so that again the weights are given by an N x N matrix W that we assume to be symmetric and zero along the diagonal. For a state s, we define the energy to be

$E(s) = - \frac{1}{2} \langle s, Ws \rangle$

This energy defines a Boltzmann distribution on the state space, given by

$P(s) = \frac{1}{Z} e^{-\beta E(s)}$

Now our aim is to adjust the weights such that this distribution is the best possible approximation to the real distribution behind the training data.

How do we measure the distance between the current distribution and the target distribution? A common approach to do this is called the maximum likelihood approach: given a set of weights W, we try to maximize the probability for the training data under the distribution given by W. For convenience, one does usually not maximize this function directly, but instead minimizes minus the logarithm of this probability, divided by the number K of samples. In our case, we therefore try to minimize the loss function

$l({\mathcal D} | W) = - \frac{1}{K} \ln P({\mathcal D} | W)$

Now let us assume that our sample is given by K data points that we denote by $s^{(k]}$. Assuming that the sample states are independent, we can write the probability for the data given the weights W as the product.

$P({\mathcal D} | W) = \prod_k P(s^{(k)} | W)$

Using the definition of the Boltzmann distribution and the partition function Z, we can therefore express our loss function as

$l({\mathcal D} | W) = \ln Z + \frac{\beta}{K} \sum_k E(s^{(k)})$

where $s^{(k)}$ is the k-th sample point, where we assume that all sample points are drawn independently.

Now how do we minimize this function? An obvious approach would be to use the gradient descent algorithm or one of its variants. To be able to do this, we need the gradient of the loss function. Let us first calculate the partial derivative for the first term, the logarithm of the partition function. This is

$\frac{\partial}{\partial W_{ij}} \ln Z = \frac{1}{Z} \sum_s \frac{\partial}{\partial W_{ij}} e^{-\beta E(s)} = - \beta \frac{1}{Z} \sum_s e^{-\beta E(s)} \frac{\partial}{\partial W_{ij}} E(s) = - \beta \sum_s P(s) \frac{\partial}{\partial W_{ij}} E(s)$

Now the sum on the right hand side of this equation is of the form “probability of a state times a function of this state”. In other words, this is an expectation value. Using the standard notation for expectation values, we can therefore write

$\frac{\partial}{\partial W_{ij}} \ln Z = - \beta \langle \frac{\partial}{\partial W_{ij}} E(x) \rangle_P$

If you remember that expectation values can be approximated using Monte Carlo methods, this is encouraging, at least we would have an idea how to calculate this. Let us see whether the second term can be expressed as an expectation values as well. In fact, this is even easier.

$\frac{\partial}{\partial W_{ij}} \frac{\beta}{K} \sum_k E(s^{(k)}) = - \frac{\beta}{2} \frac{1}{K} \sum_k s_i^{(k)} s_j^{(k)}$

Now this is again an expectation value – it is not an expectation value under the model distribution (the Boltzmann distribution) but under the empirical distribution of the data set.

$\frac{\partial}{\partial W_{ij}} \frac{\beta}{K} \sum_k E(s^{(k)}) = - \frac{\beta}{2} \langle s_i s_j \rangle_{\mathcal D}$

Finally, our expression for the first term still contains the derivative of the energy, which is easily calculated. Putting all of this together, we now obtain a formula for the gradient of the loss function which is maybe the most important single formula for Boltzmann machines that you need to remember.

$\frac{\partial}{\partial W_{ij}} l({\mathcal D} | W) = - \frac{\beta}{2} \left[ \langle s_i s_j \rangle_{\mathcal D} - \langle s_i s_j \rangle_P \right]$

When using the standard gradient descent algorithm, this expression for the gradient leads to the following update rule for the weights, where $\lambda$ is the step size.

$\Delta W_{ij} = \frac{1}{2} \lambda \beta \left[ \langle s_i s_j \rangle_{\mathcal D} - \langle s_i s_j \rangle_P \right]$

Let us pause for a moment and reflect what this formula tells us. First, if we have reached our goal – model distribution and sample distribution are identical – the gradient is zero and the algorithm stops.

Second, the first term is essentially the Hebbian learning rule that we have used to train our Hopfield network. In fact, this is the weighted sum over the product $s_i s_j$ across all sample points, i.e. we strengthen a connection between two units if the two units are strongly correlated in the sample set, and weaken the connection otherwise. The second term is a correction to the Hebbian rule that does not appear in a Hopfield network.

The third point that we can observe is a bit more subtle. To explain it, assume for a moment that the data has been normalized (which is very often done in actual applications) so that its average is zero, in other words such that the expectation values

$\langle s_i \rangle_{\mathcal D}$

of the coordinates are all zero. For the Boltzmann distribution, this will be the case anyway as the distribution is symmetric in s (and it would therefore not even make sense to try to achieve convergence with unnormalized data). Thus the two terms that appear in the above equation are simply the elements of the covariance matrix under empirical and model distribution. This implies that a Boltzmann machine is not able to distinguish two distributions that have the same second moments as the covariance matrix is all that it sees.

That is a bit disappointing as it limits the power of our model significantly. But this is not the only problem with Boltzmann machines. Whereas we could easily calculate the first term in our formula for the weight change, the second term is more difficult. In our discussion of the Ising model, we have already seen that we could use Gibbs sampling for this, but would need to run a Gibbs sampling chain to convergence which can easily take one million steps or more for large networks. Now this is embedded into gradient descent which is by itself an iterative algorithm! Imagine that one single gradient descent step could take a few minutes and then remember that we might need several thousand of these steps and you see that we are in trouble.

Fortunately, help is on the way – with a slightly simplified model called restricted Boltzmann machines, both problems can be solved. I will look at this class of networks in my next post in this series. If you do not want to wait until then, you can take a look at  my notes on Boltzmann machines that also give you some more background on what we have discussed in this post.

Before we close, let me briefly describe what we could do with a Boltzmann machine if we had found a way to train it. Similar to a Hopfield network, Boltzmann machines are generative models. Thus once they are trained, we can either use them to create samples or to correct errors. If, for instance, each unit corresponds to a pixel in an image of a handwritten digit, we could sample from the model to obtain artificially created images that resemble handwritten digits. We could also use the network for pattern completion – if we have an image where a few pixels have been erased, we could start the network in the state given by the remaining pixels and some random values for the unknown pixels and hope that it converges to the memorized state, thus reconstructing the unknown part of the picture. However, specifically for restricted Boltzmann machines, we will see that an even more important application is to be used as feature extractor in deep layered networks.

So there are good reasons to continue analyzing these networks – so join me again in my next post when we discuss restricted Boltzmann machines.