The Ising model and Gibbs sampling

In the last post in the series on AI and machine learning, I have described the Boltzmann distribution which is a statistical distribution for the states of a system at constant temperature. We will now look at one of the most important applications of this distribution to an actual model, the Ising model.

This model was proposed by W. Lenz and first analysed in detail by his student E. Ising in his dissertation (of which [1] is a summary) to explain ferromagnetic behavior. In Isings model, a solid, like a piece of iron, is composed of a large number N of individual particles, each of them at a fixed location. A particle acts as a magnetic dipole that can be oriented in two different ways, corresponding to the different orientations of its spin. Ignoring the spatial structure for the time being, we can thus describe the state of the model as a point in the state space

{\mathcal S} = \{ -1, 1\}^N

We denote the elements of the state space by s and the i-th spin by s_i \in \{-1,1\} where a value of +1 is interpreted as “spin up” and a value of -1 as “spin down”.

Now, in general, a magnetic dipole which is exposed to a magnetic field B and has a magnetic dipole moment m will have a potential energy

E = - m \cdot B

which is the scalar product of m and B, i.e. the state of minimum energy is the one where the dipole is oriented along the magnetic field. In a solid, there are two sources for the magnetic field that act on each particle – there might be an external magnetic field H and there might be an interaction with the other magnetic dipoles in the model. We therefore model the total energy of a state s as

E(s) = - \frac{1}{2} \sum_{i,j} J_{ij} s_i s_j - \sum_j h_j s_j

where J_{ii} = 0 (i.e. we exclude self-interactions).

Here the coefficient h_j represents the external field acting on the particle at position j. The matrix J represents the interactions between the particles. In Isings original model, only nearby particles interact. In two dimensions, for instance, we think of the particles as being located on a grid and each particle has four neighbors: the particles immediately above and below it and the particles on the left and on the right. We can define a model which has a boundary or we can think of the grid as being toroidal, i.e. wrapping around.

We now consider the system as being in contact with a heat bath at a certain temperature T, i.e. the system can exchange internal energy with a thermal reservoir, leading to fluctuations in the orientations of the particles. This appears to be a reasonable model, we could, for instance, consider a comparatively small part of a solid and take the surrounding, much larger solid as the heat bath. The probability for the system to be in a state s is therefore given by the Boltzmann distribution:

p(s) = \frac{1}{Z} e^{-\beta E(s)}

The macroscopic quantities that are of primary interest are of course the average energy, but also the magnetization

M(s) = \frac{1}{N} \sum_i s_i

and its average value. At high temperatures and for H = 0, we expect that roughly half of the spins should be oriented in either direction, so the average magnetization should be zero. If we add an external field, then of course most of the dipoles will be aligned in the direction of this field, so the magnetization will be close to one or minus one. This fact – magnetization of a solid in the presence of an external magnetic field – is called paramagnetism. It turns out that for some materials, a non-zero magnetization can occur even if the external field is zero, as long as the temperature is below a certain value called the critical temperature – this behavior is known as ferromagnetism. Explaining this macroscopic behavior by a statistical model was the original intention of Isings work.

Now how do we actually evaluate our model? Our aim is to determine – for instance – the average magnetization at a given temperature. To do this naively, we would have to calculate the probabilities for all possible states s. Unfortunately, the number of states grows exponentially with the number N of particles. Suppose we wanted to use a toy model with only 40 x 40 spins – this is very small compared to the number of particles in an average macroscopic solid. As N = 1600 in this case,  we would have 2^{1600} different states, which is roughly 10^{482}. Comparing this to the estimated age of the universe in seconds (4 \cdot 10^{17} , see for instance this page), it is obvious that this is not a good idea.

However, we can try to approximate the average magnetization by using a sufficiently large sample. Thus we try to find a set of states s_i which is large, but doable – maybe a few million – and hope, based on the law of large numbers, that the sample average, i.e. the sum \sum_i M(s_i) , will provide a good approximation for the real value. This approach is sometimes called Monte Carlo integration and the workhorse of computational statistical mechanics.

How, then, do you create that sample? The answer is easy to write down, but difficult to motivate without the necessary background on Markov chains. Thus I will simply state the algorithm which is called Gibbs sampling and leave the theoretical background to another post (for the mathematically inclined reader, it is worth mentioning that the sample produced by the Gibbs sampler is not independent, but the law of large number still holds).

Before we can phrase the algorithm, we need another preparational step – we need to calculate a conditional probability. Suppose that the system is in a state s and we have chosen an arbitrary coordinate i. We can then ignore the actual state of the spin s_i and ask for the conditional probability that this spin points upwards given all other spins. A not too difficult calculation (which is carried out in detail for instance in my notes) shows that this conditional probability is given by

P(s_i = 1 | \{ s_j\}_{j \neq i}) = \sigma(2 \beta ( \langle J_i, s \rangle + h_i))

Here J_i denotes the i-th row of the matrix J and the brackets denote the ordinary scalar product. With this expression, a single Gibbs sampling step now proceeds as follows, given a state s.

  • Randomly pick a coordinate i
  • Calculate the conditional probability P = P(s_i = 1 | \{ s_j\}_{j \neq i}) using the formula above
  • Draw a real number U between 0 and 1 from the uniform distribution
  • If U is at most equal to P, set the spin at position i to +1, otherwise set it to -1

The algorithm then starts with a randomly chosen state and subsequently applies a large number of Gibbs sampling steps. After some time, called the burn-in time, the states after each step then form the sample we are looking for.

After all that theory, let us now turn to the practical implementation. We will restrict ourselves to the original model, i.e. J_{ij} = 1 if particles i and j are neighbors and zero otherwise, and also set the magnetic field to zero. The Gibbs sampling algorithm as outlined above is straightforward to implement in Python.  You can get my code from GitHub as follows.

$ git clone

In the newly created directory MachineLearning, you should then see a file Run this as follows.

$ python

This will create a new temporary directory with a name that is unique and specifies the run (on Linux / Unix systems, you will find the newly created directory in /tmp/. In this subdirectory, you will find three files. A file with extension .txt summarizes the parameters of the run. The file that ends with IsingPartI.png displays the simulation results. An example is


Each of the little images represents one final state for a given temperature. In this example, a grid of 40 x 40 spins was calculated. The temperature was slowly decreased from 6.0 down to 0.2 in steps of 0.2. For each temperature, 4 million simulation steps were done, then the resulting grid was captured. The top row represents, from the left to the right, the temperatures 6.0, 5.8, 5.6, 5.4 and 5.2. Here we see the expected behaviour – patterns with roughly half of the particles in a spin-up position and half of the particles in a spin-down orientation.

In the bottom row of the diagram, that corresponds to the temperatures 1.0, 0.8, 0.6, 0.4 and 0.2, we also see the expected behavior for very low temperatures – all spins are oriented in the same direction. However, starting at temperatures 1.8, we see that large scale patterns start to emerge. especially for the temperatures 2.2 and 2.4 (rightmost pictures in the fourth row from the top). For these temperatures, entire connected regions display the same orientation of the spins and thus a non-zero mean magnetization. As the temperature rises, these patterns dissolve again.

This behaviour is typical for a so-called critical point and is what Ising was searching for. Ironically, Ising, who of course did not have the computational devices to run a simulation, concluded in his paper wrongly that this would not happen. Critical points are of great interest not only in statistical mechanics, but also in quantum field theory – we will not be able to explore this connection further, but it demonstrates how important the Ising model has become as a playground to bring together various branches from physics, computer science and mathematics.

The program has lots of parameters to play with – with the default values, for instance, it calculates comparatively small grids with 20 x 20 spins, so that a small number of iterations is sufficient, for larger grids you will need several million iterations. I recommend to play with this a bit to get a feeling for what happens. The image at the top of this article, for instance, was generated with the following command line:

$ python --show=1 --N=160000 --rows=400 --cols=400 --steps=100000 --Tmax=2.4 --Tmin=1.7

and ran for roughly 13 minutes on my PC.

It is worth mentioning that neither the Gibbs sampling algorithm nor the chosen implementation are optimized. In fact, there are other algorithms like the exact sampling algorithm (see for instance [2]) that are providing much better results, and also the implementation could be improved greatly, for instance by using a Metropolis-Hastings checkerboard algorithm to allow for high parallelization and GPU computing (see for instance [3]). However, as understanding Gibbs sampling is vital for understanding Boltzmann machines, I have chosen to use a down-to-earth Gibbs sampling approach for this post – after all, its main purpose is not to gain new physical insights from simulation results but to get acquainted with Gibbs sampling as a standard sampling method, and, as always, to have some fun.

If you would like to learn more on the Ising model, please have a look at my notes that provide more details, show how to compute the conditional probability used for the Gibbs sampling and also cover the one-dimensional Ising model.

In the meantime, you might want to take a look at the beautiful article of J. Harder on WordPress who has some more sample pictures and a very interesting application of convolutional neuronal networks that he trained to be able to determine the temperatue at which the simulation was run from the visual representation of the simulation result.


1. E. Ising, Beitrag zur Theorie des Ferromagnetismus,
Zeitschrift f. Physik, Vol. 31, No.1 (1924), 253-258
2. D. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, Cambridge 2003
3. M. Weigel, Simulating spin models on GPU, Computer Physics Communication 182 (2011), 1833-1836

The Boltzmann distribution

Boltzmann machines essentially learn statistical distributions. During the training phase, we present them a data set called the sample data that follows some statistical distribution. As the weights of the model are adjusted as part of the learning algorithm, the statistical model represented by the Boltzmann machine changes, and the learning phase is successful if the model gets as closely to the training data as possible.

The distribution that is learned by a Boltzmann machine is called a Boltzmann distribution (obviously, this is where the name comes from…) and is of great importance in statistical physics. In this post, we will try to understand how this distribution arises naturally from a few basic assumptions in statistical mechanics.

The first thing that we will need is a state space. Essentially, a state space is a (probability) space whose points are in one-to-one correspondence with the physical states of the system that we want to describe. To illustrate this, suppose you are given a system of identical particles. These particles can move freely in a certain are of space.  The location of each individual particle can be described by its location in space, thus by three real numbers, and its velocity is described by three additional real numbers, one for each spatial component of the velocity. Thus as a state space for a single particle, we could choose a six-dimensional euclidian space. For a system of N particles, we would therefore need a state space of dimension 6N.

But a state space does not need to be somehow related to a position in some physical space. As another example, consider a solid, modeled as N particles with fixed locations in space. Let us also assume that each of these particles has a spin, and that this spin can either point upwards or downwards. We could then describe the spin of an individual particle by +1 (spin up) or -1 (spin down), and therefore our state space for N particles would be

\Omega = \{-1, +1\}^N

For every point s in the state space, i.e. for every possible state of the system, we assume that we can define an energy E(s), so that the energy defines a function E on the state space. One of the fundamental questions of statistical physics is now:

Given a state space and an energy function E on that space, derive, for each state s, the probability that the system is in the state s

What exactly do we mean by the probability to be in a specific state? We assume that we could theoretically construct a very large number of identical systems that are isolated against each other and independent. All these systems evolve according to the same rules. At any point in time, we can then pick a certain state and determine the fraction of systems that are in this state. This number is the probability to be in that state. The set of all these systems is called a statistical ensemble. The assignment of a probability to each state then defines a probability distribution on the state space. To avoid some technicalities, we will restrict ourselves to the case that the state space is finite, but of course more general cases can be considered as well.

As it stands, the question phrased above is impossible to answer – we need some additional assumptions on the system to be able to write down the probability distribution. We could, for instance, assume that the number of particles and the energy are fixed.  A bit less restrictive is the assumption that the energy can vary, but that the temperature of the system (and the number of particles as well as the volume) is fixed – this is then called a canonical ensemble. Let us denote the temperate by T and assume now that it is constant.

We could describe a system for which these assumptions hold as being in contact with a second, very large system which is able to supply (or absorb) a virtually infinite amount of heat – this system is called a heat bath or thermal reservoir. The composite system that consists of our system of interest and the heat bath is assumed to be fully isolated. Thus its energy and the number of particles is constant. Our primary system can exchange heat with the heat bath and therefore its energy can change, while its temperate will stay constant and equal to the temperature of the heat bath.


Let us denote the energy of the overall system by U_{tot}, its entropy by S_{tot}, the energy of the heat bath by U_R and the entropy of the heat bath by S_R (if energy and entropy are new concepts for you, I recommend my short introduction to thermodynamics which summarizes the most important fundamentals of thermodynamics on roughly 40 pages and also contains more details on statistical ensembles and the Boltzmann distribution).

Now it turns out that making only these very general assumptions (in contact with heat bath, constant temperature) plus one additional  assumption, we can already derive a formula for the probability distribution on the state space – even without knowing anything else on the state space! Let me sketch the argument to get a feeling for it, more details can be found in the document mentioned above.

Let us assume that we are given some state s of our primary system with energy E(s). We can combine this state with a state of the heat bath to an allowed state of the composite system if the energies of both states add up to the total energy U_{tot}, i.e. with any state of the heat bath which has energy U_{tot} - E. Let us denote the number of states of the heat bath with that energy by

W_R(U_{tot} - E)

and let W_{tot} denote the number of states of the composite system with energy U_{tot}. The major additional assumption that we now make is that for the composite system, all states with energy U_{tot} are equally likely – this is known as the principle of indifference.

Let us take a moment to reflect on this. The principle of indifference (sometimes called the principle of insufficient reason) states (translated into our case) that if the states of the system cannot be distinguished by any of the available observable quantities (in this case the energy which is equal to U_{tot} for all states of the composite system), all these states should be assigned equal probabilities. This appears somehow reasonable, if one state was more likely than any other state, this state would somehow be distinguished but could not be told apart by any of the measurable quantities. Of course, that something is not measurable does not mean that it does not exist – there could even be a quantity that is theoretically observable and distinguishes certain states, but is simply not on our radar. So this principle has to be applied with care, but it turns out that the resulting distribution gives a surprisingly good description of what we can measure in a large number of applications, so this assumption is somehow justified by its success (see also {1] chapter 15 and 21 for a short discussion and justification).

What does the principle of indifference imply for our primary system? The number of states of the composite system for which our primary system is in state s is exactly W_R(U_{tot} - E) , because every such state combines with s to an admissible state of the composite system. If all these states are equally likely, the probability for s is therefore just the fraction of these states among all possible states of the composite system, i.e.

p(s) = \frac{W_R(U_{tot} - E)}{W_{tot}}

The beauty of this equation is that we can express both, the numerator and the denominator, by the entropy. And this is where finally the Austrian physicist Ludwig Boltzmann comes into play. Boltzmann identified the entropy of a system with the logarithm of the number of states available to the system – that is the famous relation that is expressed as

S = k \ln W

on his tomb stone. Here W is the number of states available to the system, and k is the Boltzmann constant.

Let us now use this expression for both numerator and denominator. We start with the denominator. If the system has reached thermal equilibrium, the primary system will have a certain energy U which is related to the total energy and the energy of the reservoir by

U_{tot} = U + U_R

Using the additivity of the entropy, we can therefore write

W_{tot} = \exp \frac{1}{k} \left[    S(U) + S_R(U_{tot} - U)    \right]

Observe that we assume the number of particles and the volume of both the reservoir and the composite system to be constant, so that the entropy does really only depend on the volume (or at least we assume that the dependency on the volume is negligible). For the numerator, we need a different approach. Let us try a Taylor expansion around the energy U_{tot} - U = U_R. The first derivative of the entropy with respect to the energy is – by definition – the inverse of the temperature:

\frac{\partial S_R}{\partial U_R} = \frac{1}{T}

The second derivative turns out to be related to the heat capacity C_R of the reservoir. In fact, we have

\frac{\partial}{\partial U_R} \frac{1}{T} = - \frac{1}{T^2 C_R}

Now a heat bath is characterised by the property that we can add a virtually infinite amount of heat without raising the temperature. Thus the heat capacity is infinite, and the second derivative is (virtually) zero. Therefore our Taylor expansion is

S_R(U_{tot} - E(s)) = S_R(U_{tot} - U) + \frac{1}{T} (U - E(s))

Putting all this together, a few terms cancel out, and we find that

p(s) = \exp \frac{1}{kT} \left[ (U - TS(U)) - E(s) \right]

Now the energy U is the average, i.e. the macroscopically observable, energy of the primary system. Therefore U - TS is the Helmholtz energy F. If we also introduce the usual notation \beta for the inverse of kT, we finally find that

p(s) = e^{\beta F} e^{- \beta E(s)}

This is usually written down a bit differently. To obtain this form, let us sum this over all possible states. As U and therefore F are averages and do not depend on the actual state, but all probabilities have to add up to one, we find that

1 = e^{\beta F} \sum_s e^{- \beta E(s)}

The last sum is called the partition function and usually denoted by Z. Therefore we obtain

p(s) = \frac{1}{Z} e^{- \beta E(s)}

which is the Boltzmann distribution as it usually appears in textbooks. So we have found a structurally simple expression for the probability distribution, starting with a rather general set of assumptions.

Note that this equation tells us that the state with the lowest energy will have the highest probability. Thus the system will prefer states with low energies, and – due to the exponential – states with a significantly higher energy tend to be very unlikely.

Due to the very mild assumptions, the Boltzmann distribution applies to a large range of problems – it can be used to derive the laws for an ideal gas, for systems of spins in a solid or for a black body. However, the formula is only simple at the first glance – the real complexity is hidden in the partition function. With a few short calculations, one can for instance show that the entropy can be derived from the partition function as

S = k \frac{\partial}{\partial T} T \ln Z

Thus if we known the partition function as a function of the macroscopic variables, we know all the classical thermodynamical quantities like entropy, temperature and Helmholtz energy. In many cases, however, the partition function is not known, and we have to devise techniques to derive physically interesting results without a tractable expression for the partition function.

That was it for today. This blog was a bit theoretical, so next time we will apply this to a concrete model which is accessible for numerical simulations, namely the Ising model. Until then, you might want to go through my notes which contain some details on the Boltzmann distribution as well as examples and applications.


1. H.B. Callen, Thermodynamics and an introduction to thermostatistics, 2nd edition, John Wiley 1985
2. D.V. Schroeder, An introduction to thermal physics, Addison Wesley 2000