## Single qubit NMR based quantum computation

In the previous post, we have sketched the basic ideas behind NMR based quantum computation. In this post, we will discuss single qubits and single qubit operations in more depth.

## The rotating frame of reference

In NMR based quantum computing, quantum gates are realized by applying oscillating magnetic fields to our probe. As an oscillating field is time dependent, the Hamiltonian will be time dependent as well, making some calculations a bit more difficult. To avoid this, it is useful to pass to a different frame of reference, called the rotating frame of reference.

To explain this, let us first study a more general setting. Assume that we are looking at a quantum system with Hamiltonian H and a state vector $|\psi \rangle$. Suppose further that we are given a unitary group, i.e. a time-dependent unitary operator

$T(t) = e^{-iAt}$

with a hermitian matrix A. We can then consider the transformed vector

$|\tilde{\psi}\rangle = T(t) |\psi \rangle$

Using the product rule and the Schrödinger equation, we can easily calculate the time derivative of this vector and obtain

$i \hbar \frac{d}{dt} |\tilde{\psi}\rangle = \hbar A |\tilde{\psi} \rangle + T(t) H |\psi\rangle = \tilde{H} |\tilde{\psi}\rangle$

with

$\tilde{H} = THT^* + \hbar A$

In other words, the transformed vector again evolves over time according to an equation which is formally a Schrödinger equation if we replace the original Hamiltonian by the transformed Hamiltonian $\tilde{H}$.

Let us now apply this to the system describing a single nuclear spin with spin 1/2 in a constant magnetic field B along the z-axis of the laboratory system. In the laboratory frame, the Hamiltonian is then given by

$H = \omega I_z$

with the Larmor frequency $\omega$ and the spin operator

$I_z = \frac{\hbar}{2} \sigma_z$

We now pass into a new frame of reference by applying the transformation

$T(t) = \exp ( \frac{i\omega_{ref}}{\hbar} t I_z )$

with an arbitratily chosen reference frequency $\omega_{ref}$. Geometrically, this is a rotation around the z-axis by the angle $\omega_{ref}t$. Using the formula above and the fact that T commutes with the original Hamiltonian, we find that the transformed Hamiltonian is

$\tilde{H} = (\omega - \omega_{ref}) I_z$

This is of the same form as the original Hamiltonian, with a corrected Larmor frequency

$\Omega = \omega - \omega_{ref}$

In particular, the Hamiltonian is trivial if the reference frequency is equal to the Larmor frequency.

Intuitively, this is easy to understand. We know that the time evolution in the laboratory frame is described by a precession with the frequency $\omega$. When we choose $\omega_{ref} = \omega$, we place ourselves in a frame of reference rotating with the same frequency around the z-axis. In this rotating frame, the state vector will be constant, corresponding to the fact that the new Hamiltonian vanishes. If we choose a reference frequency different from the Larmor frequency, we will observe a precession with the frequency $\Omega$.

Let us now repeat this for a different Hamiltonian – the Hamiltonian that governs the time evolution in the presence of an oscillating magnetic field. More precisely, we will look at the Hamiltonian of a rotating magnetic field (which is a good approximation for an oscillating magnetic field by an argument known as rotating wave approximation, see my notes for more details on this). In the presence of such a field, the Hamiltonian in the laboratory frame is

$H = \omega I_z + \omega_{nut} [I_x \cos (\omega_{ref}t + \Phi_p) + I_y \sin (\omega_{ref}t + \Phi_p) ]$

To calculate the Hamiltonian in the rotating frame, we have – according to the above formula – to apply the conjugation with T to each of the terms appearing in this Hamiltonian and add a correction term.

Now the transformation T is a rotation around the z-axis, and the result of applying a rotation around the z-axis to the operators Ix and Iy is well known and in fact easy to calculate using the commutation relations between the Pauli matrices. The correction term cancels the first term of the Hamiltonian as above. The transformed Hamiltonian is then given by

$\tilde{H} = \omega_{nut} [I_x \cos \Phi_p + I_y \sin \Phi_p ]$

In other words, the time dependence has disappeared and only the phase term remains. Again, this is not really surprising – if we look at a rotating magnetic field from a frame of reference that is rotating around the same axis with the same frequency, the result is a constant magnetic field.

## The density matrix of a single qubit

We are now ready to formally describe a single qubit, given by all nuclei at a specific position in a system consisting of a large number of molecules. According to the formalism of statistical quantum mechanics, this ensemble is described by a 2×2 density matrix $\rho$. The time evolution of this density matrix is governed by the Liouville-von Neumann equation

$i\hbar \frac{d}{dt} \rho = [H,\rho]$

The density matrix is a hermitian matrix with trace 1. Therefore the matrix $\rho - \frac{1}{2}$ is a traceless hermitian matrix. Any such matrix can be expressed as a linear combination of the Pauli matrices with real coefficients. Consequently, we can write

$\rho(t) = \frac{1}{2} + f(t) \cdot I$

where f is a three-vector with real coefficients and the dot product is a shorthand notation for

$f \cdot I = f_x I_x + f_y I_y + f_z I_z$

Similarly, the most general time-independent Hamiltonian can be written as

$H = \frac{1}{2} tr(H) + a \cdot I$

We can remove the trace by adding a constant, which does not change the physics and corresponds to a shift of the energy scale. Further, we can express the vector a as the product of a unit vector and a scalar. Thus the most general Hamiltonian we need to consider is

$H = \omega_{eff} n \cdot I$

with a real number $\omega_{eff}$ (the reason for this notation will become apparent in a second) and a unit vector n.

Let us now plug this into the Liouville equation. Applying the very useful general identity (which is easily proved by a direct calculation)

$[a\cdot I, b \cdot I] = i \hbar (a \times b) \cdot I$

for any two vectors a and b, we find that

$\dot{f} = - \omega_{eff} [f \times n]$

This equation is often called the Bloch equation. By splitting f into a component perpendicular to n and a component parallel to n, one can easily see that the solution is a rotation around the axis n with frequency $\omega_{eff}$.

What is the physical interpretation of this result and the physical meaning of f? To see this, let us calculate the expectation value of the magnetic moment induced by the spin of our system. The x-component of the magnetic moment, for instance, corresponds to the observable $\gamma I_x$. Therefore, according to the density matrix formalism, the expectation value of the x-component of the magnetic moment $\mu$ is

$\langle \mu_x \rangle = \gamma tr (\rho I_x)$

If we compute the matrix product $\rho I_x$ and use the fact that the trace of a product of two different Pauli matrices is zero, we find that the only term that contributes to the trace is the term fx, i.e.

$\langle \mu_x \rangle = \frac{\gamma \hbar^2}{4} f_x$

Similar calculations work for the other components and we find that, up to a constant, the vector f is the net magnetic moment of the probe.

## A typical NMR experiment

After all these preparations, we now have all tools at our disposal to model the course of events during a typical NMR experiment in terms of the density matrix.

Let us first try to understand the initial state of the system. In a real world experiment, none of the qubits is fully isolated. In addition, the qubits interact, and they interact with the surroundings. We can model these interactions by treating the qubits as being in contact with a heat bath of constant temperature T. According to the rules of quantum statistical mechanics, the equilibrium state, i.e. the state into which the system settles down after some time, is given by the Boltzmann distribution, i.e.

$\rho(t=0) = \frac{1}{Z} \exp (- \frac{H}{kT})$

In the absence of an additional rotating field, the Hamiltonian in the laboratory frame is given by

$H = \omega I_z$

Therefore

$\rho(t=0) = \frac{1}{Z} \exp ( \frac{-\omega}{kT} I_z)$

Using the relations

$I_z = \frac{\hbar}{2} \sigma_z$

and

$\omega = -\gamma B \sigma_z$

with the gyromagnetic moment $\gamma$, we can write this as

$\rho(t=0) = \frac{1}{Z} \exp ( \frac{1}{2} {{\hbar \gamma B}\over{kT}} \sigma_z)$

Let us now introduce the energy ratio

$\beta = \frac{\hbar \gamma B}{kT}$

The energy in the numerator is the energy scale associated with the Larmor frequency. For a proton in a magnetic field of a few Tesla, for example, this will be in the order of 10-25 Joule. The energy in the denominator is the thermal energy. If the experiment is conducted at room temperature, say T=300K, then this energy is in the order of 10-21 Joule (see this notebook for some calculations). This yields a value of roughly 10-4 for $\beta$. If we calculate the exponential in the Boltzmann distribution by expanding into a power series, we can therefore neglect all terms except the constant term and the term linear in $\beta$. This gives the approximation

$\rho(t=0) = \frac{1}{Z} (1 + \frac{1}{2} \beta \sigma_z)$

called the high temperature approximation. We can determine the value of Z by calculating the trace and find that Z = 2, so that

$\rho(t=0) = \frac{1}{2} + \frac{1}{4} \beta \sigma_z = \begin{pmatrix} \frac{1}{2} + \frac{\beta}{4} & 0 \\ 0 & \frac{1}{2} - \frac{\beta}{4} \end{pmatrix} = \frac{1}{2} + \frac{1}{2} \frac{\beta}{\hbar} I_z$

If we compare this to the general form of the density matrix discussed above, we find that the thermal state has a net magnetization in the direction of the z-axis (for positive $\beta$). This is what we expect – with our sign conventions, the energy is lowest if the spin axis is in the direction of the z-axis, so that slightly more nuclei will have their spins oriented in this direction, leading to a net magnetic moment.

To calculate how this state changes over time, we again pass to the rotating frame of reference. As the initial density matrix clearly commutes with the rotation around the z-axis, the density matrix in the rotating frame is the same. If we choose the reference frequency to be exactly the Larmor frequency, the Hamiltonian given by the static magnetic field along the z-axis vanishes, and the density matrix does not change over time.

When, however, we apply an additional pulse, i.e. an additional rotating magnetic field, for some time $\tau$, this changes. We have already seen that in the rotating frame, this pulse adds an additional term

$\omega_{nut} (I_x \cos \Phi_p + I_y \sin \Phi_p)$

to the Hamiltonian. This has the form discussed above – a scalar times a dot product of a unit vector with the vector I = (Ix, Iy, Iz). Therefore, we find that the time evolution induced by this Hamiltonian is a rotation around the vector

$(\cos \Phi_p, \sin \Phi_p, 0)$

with the frequency $\omega_{nut}$. If, for instance, we choose $\Phi_p = 0$, the vector f – and thus the magnetic moment – will slowly rotate around the x-axis. If we turn off the pulse after a time $\tau$ such that

$\omega_{nut} \tau = \frac{\pi}{2}$

the net magnetization will be parallel to the y-axis. After the pulse has been turned off, the density matrix in the rotating frame is again constant, so the magnetic moment stays along the y-axis. In the laboratory frame, however, this corresponds to a magnetic moment which rotates with the Larmor frequency around the z-axis. This magnetic moment will induce a voltage in a coil placed in the x-y-plane which can be measured. The result will be an oscillating current, with frequency equal to the Larmor frequency. Over time, the state will slowly return into the thermal equilibrium, resulting in a decay of the oscillation.

This is a good point in time to visualize what is happening. Given the explicit formulas for the density matrix derived above, it is not difficult to numerically simulate the state changes and NMR signals during an experiment as the one that we have just described (if you want to take a look at the required code, you can find a Python notebook here)

The diagram below shows the result of such a simulation. Here, we have simulated a carbon nucleus in a TCE (Trichloroethylene) molecule. This molecule – pictured below (source Wikipedia) – has two central carbon nuclei.

A small percentage of all TCE molecules in a probe will have two 13C nuclei instead of the more common 12C nuclei, which have spin 1/2 and therefore should be visible in the NMR spectrum. At 11.74 Tesla, an isolated 13C carbon nucleus has a Larmor precession frequency of 125 MHz. However, when the nuclei are part of a larger molecule as in our case, each nucleus is shielded from an external magnetic field by the surrounding cloud of electrons. As the electron configuration for both nuclei is different, the observed Larmor frequencies differ by a small amount known as the chemical shift.

At the start of the simulation, the system was put into a thermal state. Then, an RF pulse was applied to flip the magnetization in the direction of the x-axis, and then a sample was taken over 0.1 seconds, resulting in the following signal.

The signal that we see looks at the first glance as expected. We see an oscillating signal with an amplitude that is slowly decaying. However, you might notice that the frequency of the oscillation is clearly not 125 MHz. Instead, the period is roughly 0.001 seconds, corresponding to a frequency of 1200 Hz. The reason for this is that in an NMR spectrometer, the circuit processing the received signal will typically apply a combination of a mixer and a low pass filter to effectively shift the frequency by an adjustable reference frequency. In our case, the reference frequency was adjusted to be 1200 Hz above the Larmor frequency of the carbon nucleus, so the signal will oscillate with a frequency of 1200 Hz. In practice, the reference frequency determines a window in the frequency space in which we can detect signals, and all frequencies outside this window will be suppressed by the low pass filter.

Now let us take a look at a more complicated signal. We again place the system in the thermal equilibrium state first, but then apply RF pulses to flip the spin of both carbon nuclei into the x-axis (in a simulation, this is easy, in a real experiment, this requires some thought, as the Larmor frequencies of these two carbon nuclei differ only by a small chemical shift of 900 Hz). We then again take a sample and plot the signal. The result is shown below.

This time, we see a superposition of two oscillations. The first oscillations is what we have already seen – an oscillation with 1200 Hz, which is the difference of the chosen reference frequency and the Larmor frequency of the first carbon. The second oscillation corresponds to a frequency of roughly 300 Hz. This is the signal caused by the Larmor precession of the second spin. As we again measure the difference between the real frequency and the reference frequency, we can conclude that this frequency differs from the Larmor frequency of the first spin by 900 Hz.

In reality, the signal that we observe is the superposition of many different oscillations and is not easy to interpret – even with a few oscillations, it soon becomes impossible to extract the frequencies by a graphical analysis as we have done it so far. Instead, one usually digitizes the signal and applies a Fourier transform (or, more precisely, a discrete Fourier transform). The following diagram shows the result of applying such a DFT to the signal above.

Here, we have shifted the x-axis by the difference between the Larmor frequency $\omega_0$ of the first nucleus and the reference frequency, so that the value zero corresponds to $\omega_0$. We clearly see two peaks. The first peak at zero, i.e. at Larmor frequency $\omega_0$, is the signal emitted by the first nucleus. The second peak is shifted by 900 Hz and is emitted by the second nucleus. In general, each nucleus will result in one peak (ignoring couplings that we will study in a later post) and the differences between the peaks belonging to nuclei of the same isotope are the chemical shifts.

Let us quickly summarize what we have learned. An ensemble of spin systems is described by a density which in turn is given by the net magnetization vector f. The result of applying a pulse to this state is a rotation around an axis given by the phase of the pulse (in fact, the phase can be adjusted to rotate around any axis in the x-y-plane, and as any rotation can be written as a decomposition of such rotations, we can generate an arbitrary rotation). The net magnetization can be measured by placing a coil close to the probe and measuring the induced voltage.

This does already look like we are able to produce a reasonable single qubit. The vector f appears to correspond – after some suitable normalization – to points on the Bloch sphere, and as we can realize rotations, we should be able to realize arbitrary single qubit quantum gates. But what about multiple qubits? Of course a molecule typically has more than one nucleus, and we could try to use additional nuclei to create additional qubits, but there is a problem – in order to realize multi-qubit gates, these qubits have to interact. In addition, we need to be able to prepare our NMR system in a useful initial state and, at the end of the computation, we need to measure the outcome. These main ingredients of NMR based quantum computing will be the subject of the next post.

## Quantum error correction: an introduction to toric codes

While playing with the IBM Q experience in some of my recent posts, we have seen that real qubits are subject to geometric restrictions – two-qubit gates cannot involve arbitrary qubits, but only qubits that are in some sense neighbors. This suggests that efficient error correction codes need to tie to the geometry of the quantum device. Toric codes are an example of these codes that create interesting connections between topology and quantum computing.

## The stabilizer group of a toric code

In their original form introduced by A. Kitaev in 1997 (see e.g. [1]), toric codes are designed to operate on quantum circuits arranged on a torus. Specifically, we consider a 2-dimensional discrete lattice L with periodic boundary conditions. On each edge of this lattice, we place exactly one qubit.

In algebraic topology, one can associate with such a lattice an abelian group called the group of one-chains of the lattice and denoted by $C_1(L)$. The elements of this group are simply binary linear combinations of edges, i.e. subsets of all edges. Addition in this group is given by joining the subsets, while dropping elements that appear twice. As there is one qubit sitting on every edge, every chain c, i.e. every subset of edges, will yield an operator that we call Xc and which is simply given by applying the Pauli X operator to each of the qubits sitting on an edge appearing in c. Similarly, one can define an operator Zc for evey chain c by combining the Pauli Z operators for the qubits touched by c.

This sounds a bit daunting and formal, so time to look at a picture and a few examples.

This diagram shows a graphical representation of our lattice L. The solid lines represent the edges of the lattice, their intersections are the vertices, see the labels at the lower right corner of the image. Cornered by four vertices and surrounded by four edges, there are the faces of the lattice. The black dots in the diagram indicate the qubits on the lattice.

Now let us look at the face marked in red and labeled by f at the top of the diagram. This face is surrounded by four edges colored red as well whose union is called the boundary of the face. This boundary is formalized as a one-chain denoted by $\partial f \in C_1(L)$. As described above, we can associate an operator $Z_{\partial f}$ to this one chain, which is simply given by letting a Pauli Z operator act on each of the qubits sitting on the red edges. This operator is called the face operator associated with the face f and denoted by Zf

On the right of the face, we have – in blue – marked a single vertex v. This vertex again defines a chain, namely the subset of all edges – marked in blue – that touch the vertex. Again, we can associate an operator to this chain, this time we choose the product of the Pauli X operators acting on the qubits on the blue edges. This operator is called the vertex operator for the vertex v and denoted by Xv.

Our lattice has many faces and many vertices. For each face, we have a face operator and for each vertex, we have a vertex operator. It is not difficult to check that all these operators do in fact commute (note that a vertex touches every boundary twice). Therefore the group S generated by all these operators is an abelian group.

Now recall from one of my previous posts that the stabilizer formalism can be used to associate to every abelian subgroup of the Pauli group a code. The code obtained from the group S generated by all face and vertex operators is called the toric code.

Let us try to better understand the code space TS. Suppose that our lattice has E edges, V vertices and F faces. Then there are E qubits, i.e. the Hilbert space of physical qubits has dimension 2E. The stabilizer group is generated by the V vertex operators and the F face operators, i.e. by V + F elements. However, these elements are not independent – there are relations imposed by the toric geometry. In fact, the product of all vertex operators is one and the product of all face operators is one (this is easiest seen with a bit of background in algebraic topology – for those readers who have that background, I refer to my notes on quantum error correction for the mathematical details behind this and some other statements in this post). Thus there are in fact V + F – 2 generators. As each generator introduces a constraint on the code space that cuts the Hilbert space dimension down by a factor two, the dimension of the code space is

2E – (V+F – 2) = 22 – (V + F – E)

Now the number V + F – E is known as the Euler number of the torus in algebraic topology, and it also known that the Euler number of the torus is zero. Therefore the dimension of the code space is 22 = 4, i.e. the code space encodes two logical qubits (this will change if we later move on to a planar geometry).

Let us now try to understand the logical gates acting on these two logical qubits. By the general theory of stabilizer codes, we know that the logical operators are those operators that commute will all elements in the stabilizer. What does this mean geometrically? To see this, consider the following diagram.

Consider the vertical red line in the diagram, labeled by c1. This line is a subset of edges, i.e. a one chain. To this chain, we can associate a Z operator Zc1, which is the product of all Pauli Z operators acting on the qubits touched by the red line. Being a Z operator, this operator commutes with all face operators. What about the vertex operators?

Close to the right border of the lattice, we have indicated the vertex operator Xv associated with the vertex v. This vertex operator acts on four qubits, indicated by the edges marked with the blue dashed line. Two of these edges also appear in the chain c1. Therefore, when we commute Xv and Zc1, we pick up two signs that cancel and the operators commute. Thus Zc1 commutes with all vertex and face operators and is in the normalizer. The same is true for the operator associated with the second red line. These two Z operators therefore act on the logical qubit. By convention, we can define the logical state $|0 \rangle_L$ to be the state which is left invariant by these two operators.

It is interesting to note that the vertex operator Xv commutes with Zc1 because the chain c1 crosses the vertex v and therefore contains both edges touching v. If however, a chain c ends at v, then we will only pick up one sign when commuting the X past the Z operators. Geometrically, a vertex operator Xv anti-commutes with a Z-operator Zc associated with a chain c if and only if the vertex v is in the boundary of c, otherwise it commutes. The red line in our diagram is distinguished by the fact that its boundary is empty – it is a closed circle or a cycle in the language of algebraic topology that goes around the torus once. It therefore commutes with all vertex operators and thus is an element of the stabilizer.

Similarly, the dotted blue lines represent X operators, given by the product of the Pauli X operators acting on those qubits intersected by the blue line. A similar argument shows that, as these lines also go once around the torus, the X operators associated with them are in the normalizer of S as well. Thus we obtain two X operators that act as logical Pauli X operators on our code space. Given the logical X and Z operators, we can therefore implement all single qubit gates on our code space. It turns out that multi-qubit gates can be implemented by braid operations, but we will leave this to a later post when we study planar codes.

## The particle interpretation of the toric code and anyons

Let us now turn to the particle interpretation of error correction in these codes, that will bring the famous anyons and ideas from topological quantum computing into play. For that purpose, consider the operator

$H = - \sum_v X_v - \sum_f Z_f$

which is the sum of all face and vertex operators. Clearly, this operator is hermitian, and can therefore be thought of as the operator of a physical system. Does this system have a particle interpretation?

To see this, let us try to determine the eigenstates of the Hamiltonian, i.e. the energy levels of our system. As the Xv and the Zf all commute, we can find a simultaneous basis of eigenstates that will diagonalize the Hamiltonian. Let us call this basis $\{ |\psi_i \rangle \}$. Each of these vectors is an eigenstate of all face and vertex operators, and we denote the corresponding eigenvalues by xvi and zfi. Clearly, $|\psi_i \rangle$ is then an eigenvector of H with eigenvalue

$- \sum_f z_{fi} - \sum_v x_{vi}$

This is minimal if and only if all the xvi and zfi are equal to one, which is the case if $|\psi_i \rangle$ is an element of the code space. Thus we have identified the elements of the code space as the ground states of our system!

Now let us introduce an error into our system. One way to do this is to pick a chain c, i.e. a path on the lattice, and apply the Zc operator to one of the ground states $|g \rangle$. Let us try to determine the error syndrome of this state, i.e. let us try to figure out for which elements of the stabilizer, this state is no longer in the +1 eigenspace.

As all Z operators commute, the state is clearly still in the +1 eigenspace of all the face operators Zf. If Xv is a vertex operator, the state will be in the -1 eigenspace if and only if Zc anti-commutes with Xv. But we have already seen that this happens if and only if v is in the boundary of c, i.e. v is one of the two vertices where the path c starts and ends! Thus the error syndrome is localized at those two vertices – if we imagine the vertex operator measurements as being signaled by green (outcome plus one) and red (outcome minus one) lights attached to the respective vertex, the error would be represented two red lights sitting at these vertices.

If we now expand the state $Z|g \rangle$ into eigenstates of the Hamiltonian, the above discussion shows that the eigenvalue has increased by four. Thus errors correspond to excited states, and the operators Zc are creation operators that create excited states that are in a certain sense localized at the vertices of the lattice. It is tempting and common to therefore call these excitations quasi-particles, similar to the phonons in a crystal lattice.

So far we have associated an error operator Zc and therefore a particle with every chain, i.e. a path that runs along the edges. We can also consider a path that runs perpendicular to all the edges – in algebraic topology, this would be called a chain in the dual lattice or a co-chain. To every such path, we can associate an X operator, and this X operator acting on a ground state will create another excitation. The particles associated with a chain is called an electric charge e by Kitaev, and the particle created by a co-chain is called an magnetic vortex m.

As indicated in the diagram above, electric charges come in pairs, are located on vertices of the lattice and manifest themselves as -1 eigenvalues of the corresponding vertex operators, whereas magnetic particles live on pairs of faces of the vertex and appear as -1 eigenvalues of face operators. In this diagram, the solid red line creates a pair of electric particles, and the solid blue line creates a pair of magnetic particles.

How do these particles relate to the composition of chains? Suppose that we are given a chain c12 running from vertex 1 to vertex 2 and a second chain c23 running from vertex 2 to vertex 3. Then there are two ways to create the particle sitting at the vertices 1 and 3 – we can either first act with the Z operator associated with the chain c12 and then let the operator associated with the chain c23 act, or we can form the sum c13 of the chains which now runs from 1 to 3 and let the operator associated with this chain act. Of course, the result will be the same. Thus we can interpret the act of applying the operator given by the chain c23 to an excited state as moving one of the particles from vertex 2 to vertex 3. The same applies to magnetic particles.

Now let us again consider the image above. Look at the blue dashed loop. According to what we have just said, we can start with the particle pair created by the solid blue line and then move one of the particles around that closed loop. The result will be a particle pair living at the same location as the original pair. You might think that the new state is in fact identical to the old state, but this is not true – the state picks up a phase factor minus one. This is the reason why Kitaev calls our quasi-particles anyons, a term generally used in particle physics to characterize particles with non-trivial statistics (in our case, mutual statistics)

Exchanging particles by moving them around each other can be described as an action of the pure braid group, i.e. as an action of the group of braids that return each particle to its original position (this subgroup, the kernel of the natural homomorphism from the braid group to the permutation group, is sometimes called the group of colored braids). If we perturb a braid slightly, the action will remain unchanged, so the unitary operators which are given by that action are by construction rather stable in the presence of noise. In [1], Kitaev developed the idea to use braid group representations to implement quantum gates. Unfortunately, the set of operators that we can obtain in this way using the toric code is not sufficient to implement a universal quantum computer – we will get back to this point in a later post when we look at the actual procedures used to perform computations in a toric or surface code. Therefore Kitaev goes on to describe certain hypothetical particles called non-abelian anyons which, if they can be physically realized, provide sufficiently rich unitary operations. This approach is called topological quantum computation. We will not go into further details but refer to the original work [1].

## Measurement and error correction for toric codes

Let us quickly summarize what we have found so far. The code space of a toric code can be thought of as the ground state of a physical system, describing a collection of two-state systems located on the edges of a lattice. Errors correspond to fundamental excitations, i.e. quasi-particles, that come in pairs and sit on vertices and faces of the lattice. These errors can be detected by measuring face and vertex operators.

How can we perform these measurements in practice? To understand this, let us first recall a general approach for measuring eigenvalues. Suppose that U is an operator that is hermitian and at the same time unitary. Consider the following circuit which is a combination of two Hadamard gates with a controlled-U operation.

Let us calculate the state after applying this circuit. The initial state is

$|0 \rangle |\psi \rangle$

After applying the Hadamard gate to the first qubit, we obtain

$\frac{1}{\sqrt{2}} \big[ |0 \rangle |\psi \rangle + |1 \rangle |\psi \rangle \big]$

which the controlled U operation turns into

$\frac{1}{\sqrt{2}} \big[ |0 \rangle \otimes |\psi \rangle + |1 \rangle \otimes U |\psi \rangle \big]$

Now U is unitary, and therefore its eigenvalues are plus and minus one. So we can decompose our state $|\psi \rangle$ as

$\alpha |\psi_+ \rangle + \beta |\psi_- \rangle$

where $|\psi_+\rangle$ is an eigenstate with eigenvalue plus one and $|\psi_-\rangle$ an eigenstate with eigenvalue minus one. Plugging this into the above equation and applying the final Hadamard operator shows that the final outcome of the circuit will be

$\alpha |0 \rangle \otimes |\psi_+\rangle + \beta |1 \rangle \otimes |\psi_- \rangle$

If we now apply a measurement to the control qubit, which serves as an ancilla here, the state is projected onto either $|1 \rangle \otimes |\psi_- \rangle$ or $|0 \rangle \otimes |\psi_+ \rangle$, with probabilities $|\beta|^2$ and $|\alpha|^2$. The remaining qubits end up in the state $|\psi_-\rangle$ or $|\psi_+ \rangle$, i.e. in an eigenstate. Thus applying this circuitry and measuring the ancilla has the same implications as measuring the operator U and yields the same information.

Let us now apply this idea to the problem of measuring one of the operators Xv. This operator is a tensor product of four $\sigma_x$ operators, i.e. inversion operators. Therefore a controlled-Xv circuit is nothing but a sequence of controlled inversions, i.e. CNOT gates. Thus we can measure a vertex operator using the following circuit.

Here the four qubits at the bottom correspond to the four edges meeting the vertex v. So we need one ancilla qubit for doing the measurement, and we need to be able to combine it in gates with all the four qubits sitting on the edges meeting v. For the physical implementation, the easiest approach is to place this additional qubit called measurement qubit in the vertex itself, in contrast to the data qubits that we have already placed on the edges. So we add one data qubit to each vertex.

A circuit along the same lines can be constructed to measure Zf for a face f. Using that $H \sigma_x H = \sigma_z$, we can construct such a circuit as follows. In the first time step, apply a Hadamard to both the ancilla and all target qubits. Then apply CNOT operators to all target qubits, controlled by the ancilla, and then apply Hadamard operators again. Now, it is well known that conjugating the CNOT with Hadamard gates switches the roles of target qubits and control qubits. This gives us the circuit displayed below.

Again, for each face f, we need an ancilla qubit to measure Zf. This qubit will have to interact with all data qubits on the edges bounding this face, so it is easiest to position it in the center of that face.

Overall, we have now added F + V measurement qubits to our original E data qubits. Now the Euler characteristic of the torus is zero, and therefore F + V = E, showing that this procedure actually doubles the number of physical qubits.

We have now arrived at a description of the toric code (and its cousin the surface code) that one can often find as a definition in the literature but which somehow obscures the relation to the surface topology a bit.

• We start with a lattice L with E edges and identify opposite borders
• On each vertex, we place a data qubit
• On each face, we place a measurement qubit called a Z-qubit
• On each vertex, we place a measurement qubit called an X-qubit
• To each X-qubit, we associate the operator that acts as a bit flip on the four data qubits next to the measurement qubit
• To each Z-qubit, we associate the operator that acts a a phase flip on the four data qubits next to the measurement qubit

Thus every data qubit is acted on by four operators, the two Z-operators corresponding to the adjacent faces and the two X-operators corresponding to the adjacent vertices. The additional measurement qubits – indicated by white circles in contrast to the dark circles being the data qubits – are shown below.

For one face and one vertex, we have indicated the associated operators Xv and Zf by adding letters X and Z to show how these operators act on the nearby data qubits. This is the graphical representation typically used in the literature (like [3]). Note that some authors use different conventions and associate Z-operators with faces and X-operators with vertices, but this is of course completely equivalent due to the duality relations between a lattice and a dual lattice.

This completes our short introduction to toric codes. In the next post in my series on quantum error correction, we will pass on to a planar geometry which turns the toric code into a surface code.

## References

1 A. Kitaev, Fault-tolerant quantum computation by anyons, arXiv:quant-ph/9707021
2. H. Bombin, An introduction to topological quantum codes, in Quantum error correction, edited by D.A. Lidar and T.A. Brun, Cambridge University Press, New York, 2013, available as arXiv:quant-ph/1311.0277
3. A.G. Fowler, M. Mariantoni, J.M. Martinis, A.N. Cleland, Surface codes: Towards practical large-scale quantum computation, arXiv:1208.0928v2

## Quantum error correction with stabilizer codes

In our previous discussion of quantum error correction, we have assumed that quantum gates can act on any two physical qubits. In reality, however, this is not true – only nearby qubits and interact, and our error correction needs to take the geometric arrangements of the qubits into account. The link between these geometric constraints and the theory of quantum error correction is the stabilizer formalism described first in [2].

This post will be a bit more formal, but is a necessary preparation to be able to define and understand the surface code and other topological codes. To introduce and motivate the formalism, let us for a moment come back to the simple example studied before – the three-bit code. In this code, the logical states were

$|0 \rangle_L = |000 \rangle$

and

$|1 \rangle_L = |111 \rangle$

These logical states span our two-dimensional code space and describe one logical qubit. A single bit flip error will flip one of the three involved qubits. Suppose, for instance, that a bit flip occurs on the first or second qubit. This error will modify our logical states as follows (the first two lines represent a bit flip on the first qubit, the next two lines a bit flip on the second qubit

$|0 \rangle_L \mapsto |100 \rangle$

$|1 \rangle_L \mapsto |011 \rangle$

$|0 \rangle_L \mapsto |010 \rangle$

$|1 \rangle_L \mapsto |101 \rangle$

Now consider the operator

$M_1 = \sigma_z^1 \sigma_z^2$

where the upper index on a Pauli matrix indicates the qubit on which the operator acts. The error-free logical states are both eigenstates of this operator with eigenvalue +1. The states that result out of a bit-flip error are also eigenstates of this operator, but with eigenvalue -1. Thus, a measurement of this operator will tell us whether a bit flip on the first or second qubit has occurred. Similarly, to also detect a bit flip on the third qubit, we need a measurement for a second operator, namely

$M_2 = \sigma_z^1 \sigma_z^3$

Let us now write down a few properties of these operators. First, they are hermitian and therefore correspond to measurements. Second, the logical states are eigenvectors of these operators with eigenvalues +1 and the code space is exactly the subspace of the three-qubit state space that is fixed by M1 and M2. We can think of these operators as defining linear constraints that together define the code space, as indicated below.

Moreover, if a bit flip error occurs, the resulting state will again be an eigenvalue, but for at least one of the operators the new eigenvalue will be -1. Thus, errors correspond to violated constraints. Finally, the two operators commute and can therefore be measured simultaneously.

These properties allow us to express the three bit code entirely in terms of the operators M1 and M2. The code space is the subspace which is left invariant by these operators. The syndrome measurement can be done by measuring both operators simultaneously, which is possible because they commute. Finally, all involved states – code space and states after bit flip errors – are eigenstates and therefore the measurement process does not collapse a superposition of these states and can therefore be executed without interfering with the actual quantum algorithm that we want to run.

This correspondence between sets of operators – the Mi in our case – and quantum error correction is the core of the stabilizer formalism. In a more general setting, we are looking at the Hilbert space of dimension 2n spanned by n physical qubits. Products of n Pauli matrices (where we include the identity matrix) are acting on this Hilbert space and form a group ${\mathcal G}_n$ called the Pauli group. Put differently, the Pauli group consists of those linear operators on the Hilbert space that can be written as a product

$g = \mu (\sigma_x^1)^{a_1} (\sigma_y^1)^{b_1} (\sigma_z^1)^{c_1}\cdots (\sigma_x^n)^{a_n}(\sigma_y^n)^{b_n} (\sigma_z^n)^{c_n}$

with coefficients $a_i, b_ic, c_i$ and an overall phase factor $\mu \in \{ \pm 1, \pm i \}$ (we can even assume that all the $b_i$ are equal to one as $\sigma_x \sigma_z$ is a multiple of $\sigma_y$). Any two elements of the Pauli group either commute or anti-commute. Within this group, we now consider a finite set $\{ M_i \}$ of commuting elements and the subgroup S of the Pauli group generated by this set. This group (which is abelian as it is generated by mutually commuting elements) is called the stabilizer group. To this group, we can associate the subspace that consists of all vectors that are fixed by all elements of the group, i.e. the space

$T_S = \{ |\psi \rangle \, \text{such that} \, s|\psi \rangle = |\psi \rangle \forall s \in S \}$

This space will be the code space of our code. If the group S is generated by l independent generators, the dimension of the code space can be seen to be 2n-l.

In the example of the three bit code, we had n=3 and l=2, which gives us a two-dimensional code space, corresponding to one logical qubit (if you try work out the details, you will see that in order for this to be true, we have to assume that $-1 \notin S$ – you might want to take a look at my notes or the chapters on quantum error correction in [1] for the mathematical details).

Which errors is our code able to detect? Suppose that E is an error operator that is itself in the Pauli group. We know that any two elements of the Pauli group either commute or anti-commute. Let us assume that E does in fact anti-commute with some element s of S. If $|\psi \rangle$ is a state in the code space, we can then calculate

$s E |\psi \rangle = - E s |\psi \rangle = - E |\psi \rangle$

Therefore the state $E |\psi \rangle$ is now in the -1 eigenspace of s. Thus if we measure all elements of S, the outcome -1 for one of the measurements will tell us that an error has occurred. A similar argument shows that if the product of any two errors anti-commutes with at least one element of S, then we can also correct the error based on only the measurement results of elements in S. Mathematically speaking, the set of all elements of the Pauli group that commute with all elements of S is the centralizer of S, which in this case turns out to be equal to the normalizer N(S) of S, and a set $\{E_\alpha \}$ of errors can be detected and corrected if and only if

$E_\alpha E_\beta \in S \cup (\mathcal{G} - N(S))$

for any two error operators $E_\alpha, E_\beta$. So the elements of the Pauli group that are neither in S nor in N(S) correspond to correctable errors.

What do the elements of the normalizer correspond to? Suppose that n is an element in the normalizer and therefore commutes with all elements of S. Then, for any element s in S and any element $|\psi \rangle$ of the code space, we have

$s n |\psi \rangle = n s |\psi \rangle = n |\psi \rangle$

Consequently, the element $n |\psi \rangle$ is again fixed by all elements of S and therefore in the code space TS. Thus the elements of N(S) generate automorphisms of the code space, i.e. logical operations. It is one of the strengths of the stabilizer formalism that once we have the stabilizer group S, we can not only derive the code space and the correctable errors, but can also use group theoretic considerations to describe logical operations – this will become clearer as we study surface codes in a later post.

For later reference, let us briefly summarize what we have found. Given l independent generators si of the stabilizer S, we can define a code space as the set of all vectors that are fixed by the si. This subspace will encode n – l logical qubits in n physical qubits. To detect an error, we measure all the si. Each measurement will give us minus one or plus one. Any occurrence of minus one indicates an error (this is why the combined result of all measurements is usually called the error syndrome) and will also tell us which operation we have to apply to correct the error. To implement logical quantum gates on our code space, we can apply elements in the normalizer N(S) that map the code space into itself. Thus we have expressed error detection, error correction and logical operations entirely in terms of the stabilizer group and the language of group theory.

But how do we find useful stabilizer codes? One approach is given by the check matrix formalism, which allows us to express stabilizer codes in terms matrices and metrics, i.e. in terms of linear algebra. This approach is generalizing the parity check matrix from the classical theory of error correction. A second source of stabilizers, however, comes from a more surprising direction – algebraic topology. In fact, given a surface described in terms of a cell complex, we can associate elements of the Pauli group with every cycle and every co-cycle. For certain choices of cycles and cocycles, this will give us abelian subgroups of the Pauli group which in turn create error correction codes. This is how geometrical constraints in the actual implementation enter the scene and leads to a class of codes known as surface codes and toric codes that we will start to study in my next post.

1. E. Rieffel, W. Polak, Quantum computing – a gentle introduction, MIT Press, Cambridge 2011
2. D. Gottesman, Stabilizer Codes and Quantum Error Correction, Caltech Ph.D. Thesis, available as arXiv:quant-ph/9705052
3. J. Kempe, Approaches to quantum error correction, S\’eminaire Poincar\’e 1 (2005), pp. 65–93, available as arXiv:quant-ph/0612185

## Into the quantum lab – first steps with IBMs Q experience

Even though physical implementations of quantum computers make considerable progress, it is not likely that you will have one of them under your desk in the next couple of years. Fortunately, some firms like IBM and Rigetti have decided to make some of their quantum devices available only so that you can play with them. In this post, we will have a first look at IBMs cloud offering – the IBM Q experience.

IBM offers different ways to interact with their quantum computers. The easiest access is provided by the interactive composer which allows you to create a quantum circuit online and to run it on a 5 qubit quantum computer based on superconducting qubits after registering.

The screenshot below shows the editor screen. At the top, you find an overview of the available quantum devices, both realising five (physical) qubits, and some of the main characteristics like error rates and decoherence times.

In the lower area of the screen, you can compose quantum circuits graphically. You can add quantum gates that manipulate the five qubits q[0] – q[4] and measurements. After setting up your circuit, you can either simulate the circuit or you can run it on the actual device. It is also possible to export the circuit as a QASM code snippet and to save experiments and results as so-called quantum scores.

Let us now run this circuit on the actual quantum hardware. Doing this consumes units, i.e. credits. A standard run consists of 1024 repetitions and one measurement after each repetition and consumes three units (which are replenished after the execution has completed). The execution is queued and, typically after a few minutes, the results are made available online (and you receive a mail informing you about the execution). When running a circuit, the platform also checks whether that circuit was executed before (by you our someone else) and if yes offers you immediate access to the previous results without consuming any units. The screenshot below shows the result of one of these runs.

The interesting part is the diagram at the top that shows the measurement outcomes in the computational basis, i.e. for each member of the computational basis on the x-axis, the value shows the percentage of measurements having that basis element as result. In this example, we see that almost always, the result is 00000, as expected from the circuit in which we do a double inversion of the first qubit. However, in some cases – corresponding to 1,3 percent of all runs – the outcome is 00001. This is a nice example for an error that occurs whenever we switch from a simulator to a real physical device. Note that in the IBM Q experience, the qubit q[0] is the least significant qubit and the rightmost qubit in the notation of the computational basis.

Now let us try something else – we want to put our system into an equal superposition

$\sum_i |i \rangle$

which is the usual starting point for most quantum algorithms. We know that we can achieve this by applying a Hadamard gate on each qubit. The following screenshot shows the resulting circuit for three qubits and the results of a (cached) run.

As expected, we see that our measurement results are spread across eight (23) vectors of the computational basis. However, we also see that again, the result does not exactly match the theoretical prediction – the values are not exactly equally distributed but we see slight deviations as before.

Finally, let us try out a multi-qubit gate. The screenshot below shows the result of a run with two CNOT gates. The first CNOT gate receives a logical one on its control bit, the second CNOT gate a logical zero. Thus, the result should be 00011, i.e. the first (least significant) qubits are inverted. In reality, we again get this result plus some noise, represented by unexpected outcomes in the measurements.

Playing with the composer is fun, and makes it easy to create simple circuits from scratch. However, more complicated circuits soon become difficult to define in this way. For those circuits, the IBM platform offers an API that can be used to run experiments using Python scripts that can much easier be developed and debugged step by step. We will look at this option in a later post. Until then, I recommend you get an account and create some circuits yourself – happy hacking!

## The EM algorithm and Gaussian mixture models – part II

In this post, I will discuss the general form of the EM algorithm to obtain a maximum likelihood estimator for a model with latent variables.

First, let us describe our model. We suppose that we are given some joint distribution of a random variable X (the observed variables) and and random variable Z (the latent variables) and are interested in maximizing the likelihood of an observed sample x of the visible variable X. We also assume that the joint distribution depends on a parameter $\Theta$, in practice this could be weights, bias terms or any other parameters. To simplify things a bit, we will also assume that the latent variable is finite. Our aim is to maximize the log likelihood, which we can – under these assumptions – express as follows.

$\ln P(x |\Theta) = \ln \sum_z P(x,z | \Theta)$

Even if the joint distribution belongs to some exponential family, the fact that we need to consider the logarithm of the sum and not the sum of the logarithms makes this expression and its gradient difficult to calculate. So let us look for a different approach.

To do this, let us assume that we are given a value $\Theta$ of the parameter and let us try to understand how the likelihood changes if we pass from $\Theta$ to some other value $\Theta'$. For that purpose, we introduce a term that is traditionally called Q and defined as follows (all this is a bit abstract, but will become clearer later when we do an example):

$Q(\Theta'; \Theta) = E \left[ \ln P(x,z | \Theta') | x, \Theta \right]$

That looks a bit complicated, so let me explain the notation a bit. We want to define a function Q that will be a function of the new parameter value $\Theta'$. This function will, in addition, depend on the current value $\Theta$ which we consider as a parameter.

The right hand side is an expectation value. In fact, for each value of the visible variable x and the parameter $\Theta$, we have a probability distribution on the space in which Z lives, given by the conditional probability of Z given x and $\Theta$. Whenever we have a function depending on z, we can therefore form the expectation value as usual, i.e. as the weighted sum over all function values, weighted by the probability of z. In particular, we can do this for the function $\ln P(x,z | \Theta')$ of z. Thus the right hand side is, spelled out

$E \left[ \ln P(x,z | \Theta') | x, \Theta \right] = \sum_z \ln P(x,z | \Theta') P(z | x, \Theta)$

That is now again a sum of logarithms, not a logarithm of a sum, and we can hope to be able to deal with this much better.

This is nice, but so far we have only introduced a rather complicated additional object – what do we gain? It turns out that essentially, maxizing Q will effectively maximize the likelihood. Let us make this bold statement a bit more precise. Suppose we are able to iteratively maximize Q. Expressed formally, this would mean that we are able to find a sequence $\Theta^0, \Theta^1, \dots$ of parameters such that when passing from $\Theta^t$ to $\Theta^{t+1}$, the value of Q does not decrease, i.e.

$Q(\Theta^{(t+1)};\Theta^{(t)}) \geq Q(\Theta^{(t)} ; \Theta^{(t)})$

Then this very same sequence will be a sequence of parameters for which the log-likelihood is non-decreasing as well, i.e.

$\ln P(x | \Theta^{(t+1)}) \geq \ln P(x | \Theta^{(t)})$

I will not include a proof for this in this post (the proof identifies the difference between any two subsequent steps as a Kullback-Leibler divergence and makes use of the fact that a Kullback-Leibler divergence is never negative, you can find a a proof in the references, in particular in [2], or in my more detailed notes on the EM algorithm and Gaussian mixture models, where I also briefly touch on convergence). Instead, let us try to understand how this can be used in practice.

Suppose that we have already constructed some parameter $\Theta^t$. The algorithm then proceeds in two steps. First, we calculate Q as a function of the new parameter $\Theta^{t+1}$. As Q is defined as an expectation value, this step is called the expectation step. Once we have that, we try to maximize Q, i.e. we try to find a value for the parameter $\Theta^{t+1}$ such that $Q(\Theta^{t+1}; \Theta^t)$ is maximized. This part of the algorithm is therefore called the maximization step. Then we start over, using $\Theta^{t+1}$ as new starting point. Thus the algorithm alternates between an expectation and a maximization step, leading to the name EM algorithm.

To see how this works in practice, let us now return to our original example – Gaussian mixtures. Here the parameter $\Theta$ is given by

$\Theta = (\mu, \pi, \Sigma)$

We consider a random variable X which has N components, each of which being a vector Xn in a d-dimensional space and corresponding to one sample vector (so in the language of machine learning, N will be our batch size). Similarly, Z consists of N components zn which again are subject to the restriction that only one of the znk be different from zero. We assume that the joint representation of our model is given by

$P(x , z) = \prod_n \prod_k \pi_k^{z_{nk}}{\mathcal N} (x_n, \mu_k , \Sigma_k)^{z_{nk}}$

Now we need to compute Q. The calculation is a bit lengthy, and I will skip most of it here (you can find the details in the references or my notes for this post). To state the result, we have to introduce a quantity called responsibility which is defined as follows.

$r_{nk} = \frac{\pi_k {\mathcal N}(x_n ; \mu_k, \Sigma_k)} {\sum_{j} \pi_j {\mathcal N}(x_n ; \mu_j, \Sigma_j) }$

From this definition, is it clear that the responsibility is always between zero and one, and in fact is has an interpretation of a (conditional) probability that sample n belongs to cluster k. Note that the responsibility is a function of the model parameters $\mu, \pi$ and $\Sigma$. Using this notation, we can now write down the result of calculating the Q function:

$Q(\Theta'; \Theta) = \sum_n \sum_k r_{nk} \ln \pi'_k + \sum_n \sum_k r_{nk} \ln {\mathcal N}(x ; \mu'_k, \Sigma'_k)$

where the responsibilities are calculated using the old parameter set $\Theta = (\mu, \pi, \Sigma)$. For a fixed $\Theta$, this is a function of the new parameters $\mu', \pi'$ and $\Sigma'$, and we can now try to maximize this function with respect to these parameters. This calculation is not difficult, but again a bit tiresome (and requires the use of Lagrangian multipliers as there is a constraint on the $\pi_k$), and I again refer to my notes for the details. When the dust settles, we obtain three simple expressions. First, the new values for the cluster means are given by

$\mu'_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$

This starts to look familiar – this is the same expression that we did obtain for the cluster centers for the k-means algorithm! In fact, if we arrange the rnk as a matrix, the denominator is the sum across column k and the numerators is the weighted sum over all data points. However, there is one important difference – it is no longer true that given n, only one of the rnk will be different from one. Instead, the rnk are soft assignments that model the probability that the data point xn belongs to cluster k.

To write down the expression for the new value of the covariance matrix, we again need a notation first. Set

$N_k = \sum_n r_{nk}$

which can be interpreted as the number of soft assignments of points to cluster k. With that notation, the new value for the covariance matrix is

$\Sigma'_k = \frac{1}{N_k} \sum_n r_{nk} (x_n - {\mu'}_k)(x_n - {\mu'}_k)^T$

Finally, we can use the method of Lagrange multipliers to maximize with respect to the weights $\pi_k$ (which always need to sum up to one), and again obtain a rather simple expression for the new value.

$\pi'_k = \frac{N_k}{N}$

Again, this is intuitively very appealing – the probability to be in cluster k is updated to be the number of points with a soft assignment to k divided by the total number of assignments

We now have all the ingredients in place to apply the algorithm in practice. Let us summarize how this will work. First, we start with some initial value for the parameters – the weights $\pi$, the covariance matrices $\Sigma_k$ and the means $\mu_k$ – which could, for instance, be chosen randomly. Then, we calculate the responsibilities as above – essentially, this is the expectation step, as it amounts to finding Q.

In the M-step, we then use these responsibilities and the formulas above to calculate the new values of the weights, the means and the covariance matrix. This involves a few matrix operations, which can be nicely expressed by the operations provide by the numpy library.

In the example above, we have created two sets of 500 sample points from different Gaussian mixture distributions and then applied the k-means algorithm and the EM algorithm. In the top row, we see the results of running the k-means algorithm. The color indicates the result of the algorithm, the shape of the marker indicates the original cluster to which the point belongs. The bottom row displays the results of the EM algorithm, using the same pattern.

We see that while the first sample (diagrams on the left) can be clustered equally well by both algorithms, the k-means algorithm is not able to properly cluster the second sample (diagrams on the right), while the EM algorithm is still able to assign most of the points to the correct cluster.

If you want to run this yourself, you can – as always – find the source code on GitHub. When playing with the code and the parameters, you will notice that the results can differ substantially between two consecutive runs, this is due to the random choice of the initial parameters that have a huge impact on convergence (and sometimes the code will even fail because the covariance matrix can become singular, I have not yet fixed this). Using the switch --data=Iris, you can also apply the EM algorithm to the Iris data set (you need to have a copy of the Iris data file in the current working directory) and find again that the results vary significantly with different starting points.

The EM algorithm has the advantage of being very general, and can therefore be applied to a wide range of problems and not just Gaussian mixture models. A nice example is the Baum Welch algorithm for training Hidden Markov models which is actually an instance of the EM algorithm. The EM algorithm can even be applied to classical multi-layer feed forward networks, treating the hidden units as latent variables, see [3] for an overview and some results. So clearly this algorithm should be part of every Data Scientists toolbox.

### References

1. C.M. Bishop, Pattern recognition and machine learning, Springer, New York 2006
2. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM-algorithm, Journ. Royal Stat. Soc. Series B. Vol. 39 No. 1 (1977), pp. 1-38
3. Shu-Kay Ng, G.J. McLachlan, Using the EM Algorithm to Train Neural Networks: Misconceptions and a New Algorithm for Multiclass Classification, IEEE Transaction on Neural Networks, Vol. 15, No. 3, May 2004

## The EM algorithm and Gaussian mixture models – part I

In the last few posts on machine learning, we have looked in detail at restricted Boltzmann machines. RBMs are a prime example for unsupervised learning – they learn a given distribution and are able to extract features from a data set, without the need to label the data upfront.

However, there are of course many other, by now almost classical, algorithms for unsupervised learning. In this and the following posts, I will explain one of the most commonly applied algorithms in this field, namely the EM algorithm.

As a preparation, let us start with a very fundamental exercise – clustering. So suppose that we are given a data set which we suspect to consist of K clusters. We want to identify these clusters, i.e. to each point in the dataset, we want to assign the number of the cluster to which the point belongs.

More precisely, let us suppose that we are given a set ${\mathcal D} = \{ x_i, \dots, x_N\}$ of points in some euclidian space and a number K of clusters. We want to identify the centre $\mu_i$ of each cluster and then assign the data points xi to some of the $\mu_j$, namely to the $\mu_j$ which is closest to xi. We can encode the assignment of data points to clusters in a matrix Rij where Rij = 1 if data point xi belongs to cluster j with centre $\mu_j$. Thus each row of R corresponds to a data point and is non-zero in exactly one column, where it is one. This is known as 1-of-K encoding.

For each value of i, the squared distance of xi to the centre of the cluster to which it is assigned is then given by

$\sum_j R_{ij} \| x_i - \mu_j |^2$

where we note that only one of the summands is different from zero. Assigning the data points to the clusters and positioning the centre points of the clusters then amounts to minimizing the loss function

$l(\mu, R) = \sum_i \sum_j R_{ij} \langle x_i - \mu_j , x_i - \mu_j \rangle$

where the brackets denote the usual euclidean scalar product.

Now let us see how we can optimize this function. For a fixed vector $\mu$ of cluster centers, we can easily minimize Rij by assigning each data point to the cluster whose center is closest to xi. Thus we set

$R_{ij} = \begin{cases} 1 & \text{if} \, j = \arg \min \| x_i - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}$

Conversely, given a matrix R which we hold fixed, it is likewise easy to minimize $\mu_j$. As there are no further constraints on $\mu_j$, we can find the minimum by differentiating the loss function. We find that the derivative is zero if and only if

$0 = \sum_i R_{ij}(x_i - \mu_j)$

holds for all j. Assuming for a moment that each cluster has at least one data point assigned to it, i.e. that none of the columns of R contains zeroes only, we can solve this by

$\mu_j = \frac{\sum_i x_i R_{ij}}{\sum_i R_{ij}}$

which is also easily seen to be a local minimum by taking the second derivative.

Note that this term has an obvious geometric interpretation. The denominator in this expression is the number of data points that are currently assigned to cluster j. The numerator is the sum of all data points assigned to this cluster. Thus the minimum is the mean value of the positions of the data points that are currently assigned to the cluster (a physicist would call this the center of gravity of the cluster). This is the reason why this method is called the k-means algorithm. If no data points are assigned to the cluster, the loss function does not depend on $\mu_j$ and we can choose any value that we want.

The algorithm now works as follows. First, we choose centers $\mu_j$ at will – it is common to use some of the data points for this purpose. Then we assign each data point to a center using the formula for Rij derived above. We then adjust the center points $\mu_j$ and reallocate the points to the new centers and so forth.

From our discussion above, it is clear that each full iteration of this procedure will reduce the loss function. This does of course not imply that the algorithm converges to a global minimum, as it might get stuck in local minima or saddle points. In practice, the algorithm is executed until the cluster assignments and centers do not change any more substantially or for a given number of steps.

The diagram above shows the result of applying this algorithm to a set of points that is organized in two clusters. To generate the data, 100 samples were drawn from 2-dimensional Gaussian distributions. On the left hand side, half of the the samples were centered at the point (5,1), the other samples at (1,4), and both had standard deviation 0.6. On the right hand side, the same centers were used, but only a small number of samples were drawn from the second distribution which had standard deviation 0.5, whereas most samples came from the first distribution with standard deviation 0.8. Then 10 iterations of the k-means algorithm were applied to the resulting sample set. The points in the sample were then plotted with a color indicating the assignment to clusters resulting from the matrix R. The actual cluster from which the sample was drawn is indicated by the shape – a point is cluster one, a diamond is cluster two.

We see that in the example on the left hand side, the algorithm has correctly assigned all points to their original cluster. For the example on the right hand side, the situation is different – the algorithm did actually assign significantly more points to the blue cluster, i.e. there are many wrong assignments (blue points). This does not change substantially if we increase the number of iterations, even with 100 iterations, we still see many wrong assigments for this example. If you want to reproduce the example and play a bit with the parameters, you can get the sourcecode for a straightforward implementation in Python from my GitHub repository.

The K-means algorithm is very simple and straightforward, but seems to have limitations because it cannot determine the shape of a distribution, only its center. It turns out that K-means is a special case of a broader class of algorithms that we now study, hoping to find more robust algorithms.

In our example, we have generated sample data as a combination of two Gaussian distributions. What if we just change the game and simply assume that our data is of this type? In other words, we assume an underlying probabilistic model for our sample data. Once we have that, we can pull all the tricks that statistical inference can offer – we can calculate maximum likelihood and maximum posterior probability, we can try to determine the posterior or even sample from the model.

Thus let us try to fit our data to a model of the form

$P(x) = \sum_k \pi_k {\mathcal N}(x ; \mu_k, \Sigma_k)$

where ${\mathcal N}(x ; \mu_k, \Sigma_k)$ is a multivariate Gaussian distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$, i.e.

${\mathcal N}(x ; \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp (-\frac{1}{2} \langle x - \mu_k, \Sigma_k^{-1}(x - \mu_k)\rangle)$

and the $\pi_k$ are non-negative real numbers with $\sum_k \pi_k = 1$.

Let us now see how this equation looks like if we use a 1-of-K encoding. We introduce a random variable Z that takes values in $\{ 0, 1\}^K$ with the additional constraint that only one of the Zk is allowed to be different from zero. We interpret $\pi_k$ as the probability

$\pi_k = P(Z_k = 1)$

Then

$P(Z = z) = \prod_k \pi_k ^{z_k}$

and we can write

$P(X=x) = \sum_z P(Z=z) P(X=x | Z=z) = \sum_z P(x,z)$

where $P(z)$ is as above and

$P(X = x | Z = z) = \prod_k {\mathcal N}(x ; \mu_k, \Sigma_k)^{z_k}$

This is a very general type of distribution which reflects a common pattern in machine learning, namely the introduction of so called latent or hidden variables. In general, latent or hidden variables are random variables that are a part of the model which cannot be observed, i.e. are not part of the input or the output of the model. We have seen latent variables in action several times – adding hidden units to a neural network introduces latent variables and makes the model much more powerful, the hidden layer of a restricted Boltzmann machine serves as memory to learn features, and latent variables that are used to construct a mixture of Gaussians as above allow us to model a much broader class of distributions than a model with just one Gaussian.

Intuitively, it is also clear how to sample from such a model. In a first step, we sample from Z, in other words we determine the index k randomly according to the distribution given by the $\pi_k$. Once we have k, we then sample from the conditional distribution $P(X = x | Z = z)$. As we already have k, this amounts to sampling from the Gaussian distribution ${\mathcal N}(x ; \mu_k, \Sigma_k)$ with mean $\mu_k$ and covariance matrix $\Sigma_k$.

In the example above, we have first applied this procedure to a one-dimensional Gaussian mixture with K=2. The histogram shows the result of the sampling procedure, the solid lines are the probability density functions of the two individual Gaussian distributions, multiplied by the respective weight. On the right hand side, the result of sampling a Gaussian mixture in two dimensions is displayed (you can download the notebook used to produce this example here).

When we now try to fit this model to the sample data, we again have to calculate the likelihood function and try to maximize it. However, it turns out that the gradient of the likelihood is not so easy to calculate, and it might very well be that there is no closed form solution or that we obtain rather complicated expressions for the gradient. Fortunately, there is an alternative that works for very general models and does not require knowledge of the gradient – the EM algorithm. In the next post, I will present the algorithm in this general setup, before we apply it to our original problem and compare the results with the k-means algorithm.

### References

1. C.M. Bishop, Pattern recognition and machine learning, Springer, New York 2006
2. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM-algorithm, Journ. Royal Stat. Soc. Series B. Vol. 39 No. 1 (1977), pp. 1-38

## Why you need statistics to understand neuronal networks

When I tried to learn about neuronal networks first, I did what probably most of us would do – I started to look for tutorials, blogs etc. on the web and was surprised by the vast amount of resources that I found. Almost every blog or webpage about neuronal networks has a section on training a simple neuronal network, maybe on the MNIST data set, using a framework like TensorFlow, Theano or MXNET. When you follow such a tutorial, a network is presented as a collection of units and weights. You see how the output of the network is calculated and then an error function – sometimes least squares, sometimes something else – is presented. Often, a regulation term is applied, and then you are being told that the automatic gradient calculation features of the framework will do the gradient descent algorithm for you and you just have to decide on an optimizer and run the network and enjoy the results.

Sooner or later, however, you will maybe start to ask a few questions. Why that particular choice of the loss function? Where does the regulator come from? What is a good initial value for the weights and why? Where does the sigmoid function come from? And many, many other questions….

If you then decide to dig deeper, using one of the many excellent textbooks or even try to read some of the original research papers (and some are actually quite readable), you will very soon be confronted with terms like entropy, maximum likelihood, posterior distribution, Gaussian mixtures and so on, and you will realize that the mathematics of neuronal networks has a strong overlap with mathematical statistics. But why? Why is that a good language to discuss neuronal networks, and why should you take the time to refresh your statistics knowledge if you really want to understand neuronal networks? In this post, I will try to argue that statistical inference comes up very naturally when you try to study neuronal networks.

Many neuronal networks are designed to excel at classification tasks. As an example, suppose you wanted to design and train a neuronal network that, given data about an animal, classifies the animal as either a bird or some other animal (called a “non-bird” for convenience). So our starting point is a set modelling all possible objects that could be presented to the network. How exactly we model this set is not so important, more important is that in general, the network will not have access to all the data about the animal, but only to certain attributes of elements in the set called features. So there could be a feature which we call X1 and which is defined as

$X_1 = \text{the animal can fly}$

taking values in $\{0,1\}$. Another data point the network could get is

$X_2 = \text{length of animal in cm}$

taking values in the reals and so forth. More generally, we assume that on the set of all possible objects, we have certain functions Xi taking values in, say, the real numbers. Based on these numbers, the network will then try to take a decision whether a given animal is a bird or not. Thus we do not have to deal directly with our space of objects, but use the functions Xi as primary objects.

If the network had a chance to look at every possible animal, this would be easy, even though it would cost a lot memory – it could simply remember all possible combinations of features and for each feature, store the correct answer. In reality however, this does not work. Instead, we have access to a small subset of the data – a sample – for which we can evaluate the Xi. Based on this subset, we then have to derive a model which gives the right answer in as many cases as possible. Thus we try to make a statement about the full space of things that could be presented to our network for classification based on a small sample.

This is where probabilities come into play very naturally. We need to assume that our sample has been chosen randomly, but still we need to make assertions about the full set. This is exactly what inferential statistics is doing. The fact that our sample is chosen randomly turns our Xi into random variables. Similarly, the variable

$Y = \text{is a bird}$

taking values in $\{0,1\}$ is a random variable, and we try to gain information on the distribution of Y across the full population based on its values on a given set of labelled samples, i.e. a set of samples where Y is known. Thus Y would represent the labels or targets in the language of neuronal networks. Applying the methods of statistical inference to this situation would typically start by choosing a statistical model and than using estimators or hypothesis testing to make deductions.

Apart from the fact that we have to derive information on the full population based on a sample, there is another reason why probabilities appear naturally in the theory of machine learning. In many cases, the available input – being a reduction of the full set of data – is not sufficient to classify the sample with full certainty. To see this, let us go back to our examples. How would you derive the property “bird” from the given data “can fly” and “length”? Not all animals than can fly are birds – and not all birds can fly. So we have to try to distinguish for instance a butterfly from a hummingbird based on the length. The smallest hummingbird – a bee hummingbird – is about 5 cm in length. The largest known butterfly – the Queen’s Alexandra birdwing – can be as long as 8 cm (both informations taken from Wikipedia). Thus our data is not sufficient to clearly distinguish butterflies and birds in all cases!

However, very small birds and very large butterflies have one thing in common – they are rare. So chances are that a given animal that can fly and is larger than 5 cm is actually a bird (yes, I know, there are bats….). In other words, if again Y denotes the variable which is 1 on birds and 0 on all other animals, we can in general not hope that Y is a function of the Xi, but we can hope that given some values of the Xi, the probability $P(Y=1)$ to be a bird is a function of the Xi. In other words, using the language of conditional probabilities,

$P(Y=1 | X = x) = f(x)$

with some unknown function f. In a Bayesian interpretation of probability, the certainty with which can say “this animal is a bird” is a function of the values xi of the observable variables Xi.

With these considerations, we now arrive at the following mathematical model for what a classification algorithm is about. We are given a probability space $(P, \Omega)$ with a vector valued random variable X. The attributes of a sample are described by the feature vector X in some subset of m-dimensional euclidian space, where m is the number of different features. In our example, m=2, as we try to classify animals based on two properties. The result of the classification is described by a random variable Y taking – for the simple case of a binary classification problem – values in $\{0,1\}$. We then assume that

$P(Y =1 | X=x) = f(x;w_0)$

where $f(\cdot;w)$ is a function parametrized by some parameter w that we call the weights of the model. The actual value w0 of w is unknown. Based on a sample for X and Y, we then try to fit the model, i.e. we try to find a value for w such that $f(\cdot, w)$ models the actual conditional distribution of Y as good as possible. Once the fitting phase is completed, we can then use the model to derive predictions about objects which are not in our initial sample set.

This model sounds a bit abstract, but many feed forward neuronal networks can be described with this or similar models. And we can now apply the full machinery of mathematical statistics – we can calculate cross entropies and maximum likelihood, we can analyse converge and variance, we can apply the framework of Bayesian statistics and Monte Carlo methods. This is the reason why statistics is so essential when it comes to describing and analyzing neuronal networks. So on the next rainy Sunday afternoon, you might want to grab a steaming hot cup of coffee, head towards your arm chair and spent some time with one of the many good exposures on this topic, like chapter IV in MacKays book on Machine Learning, or Bishops “Pattern recognition and machine learning” or chapter 3 of the deep learning book by Goodfellow, Bengio and Courville.

## The Metropolis-Hastings algorithm

In this post, we will investigate the Metropolis-Hastings algorithm, which is still one of the most popular algorithms in the field of Markov chain Monte Carlo methods, even though its first appearence (see [1]) happened in 1953, more than 60 years in the past. It does for instance appear on the CiSe top ten list of the most important algorithms of the 20th century (I got this and the link from this post on WordPress).

Before we get into the algorithm, let us once more state the problem that the algorithm is trying to solve. Suppose you are given a probability distribution $\pi$ on some state space X (most often this will be a real euclidian space on which you can do floating point arithmetic). You might want to imagine the state space as describing possible states of a physical system, like spin configurations in a ferromagnetic medium similar to what we looked at in my post on the Ising model. The distribution $\pi$ then describes the probability for the system to be in a specific state. You then have some quantity, given as a function f on the state space. Theoretically, this is a quantity that you can calculate for each individual state. In most applications, however, you will never be able to observe an individual state. Instead, you will observe an average, weighted by the probability of occurence. In other words, you observe the expectation value

$\langle f \rangle = \int_X f d\pi$

of the quantity f. Thus to make a prediction that can be verified or falsified by an observation, you will have to calculate integrals of this type.

Now, in practice, this can be very hard. One issue is that in order to naively calculate the integral, you would have to transverse the entire state space, which is not feasible for most realistic problems as this tends to be a very high dimensional space. Closely related to this is a second problem. Remember, for instance, that a typical distribution like the Boltzmann distribution is given by

$\pi(x) = \frac{1}{Z} e^{-\beta E(x)}$

The term in the numerator is comparatively easy to calculate. However, the term in the denominator is the partition function, and is itself an integral over the state space! This makes even the calculation of $\pi(x)$ for a single point in the state space intractable.

But there is hope – even though calculating the values of $\pi$ for one point might be impossible, in a distribution like this, calculating ratios of probabilities is easy, as the partition function cancels out and we are left with the exponential of an energy difference! The Metropolis-Hasting algorithm leverages this and also solves our state space problem by using a Markov chain to approximate the integral. So the idea is to build a Markov chain Xt that converges and has $\pi$ as an invariant distribution, so that we can approximate the integral by

$\langle f \rangle = \int_X f d\pi \approx \frac{1}{N} \sum_{t=1}^N f(X_t)$

for large values of N.

But how do we construct a Markov chain that converges to a given distribution? The Metropolis Hastings approach to solve this works as follows.

The first thing that we do is to choose a proposal density q on our state space X, i.e. a measurable function

$q \colon X \times X \rightarrow [0,\infty)$

such that for each x, $\int q(x,y) dy = 1$.

Then q defines a Markov chain, where the probability to transition into a measurable set A being at a point x is given by the integral

$Q(x,A) = \int_{\mathcal X} q(x,y) dy$

Of course this is not yet the Markov chain that we want – it has nothing to do with $\pi$, so there is no reason it should converge to $\pi$. To fix this, we now adjust the kernel to incorporate the behaviour of $\pi$. For that purpose, define

$\alpha(x,y) = \begin{cases} \min \{ 1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)} \} & \text{if } \pi(x) q(x,y) > 0 \\ 1 & \text{if } \pi(x) q(x,y) = 0 \end{cases}$

This number is called the acceptance probability, and, as promised, it only contains ratios of probabilities, so that factors like the partition function cancel and do not have to be computed.

The Metropolis Hastings algorithm now proceeds as follows. We start with some arbitrary point x0. When the chain has arrived at xn, we first draw a candidate y for the next location from the proposal distribution $q(x_n, \cdot)$. We now calculate $\alpha$ according to the formula above. We then accept the proposal with probability $\alpha$, i.e. we draw a random sample U from a uniform distribution and accept if $U \leq \alpha$. If the proposal is accepted, we set xn+1 = y, otherwise we set xn+1 = xn, i.e. we stay where we are.

Clearly, the xn are samples from a Markov chain, as the position at step xn only depends on the position at step xn-1. But is still appears to be a bit mysterious why this should work. To shed light on this, let us consider a case where the expressions above simplify a bit. So let us assume that the proposal density q is symmetric, i.e. that

$q(x,y) = q(y, x)$

This is the original Metropolis algorithm as proposed in [1]. If we also assume that $\pi$ and q are nowhere zero, the acceptance probability simplifies to

$\alpha(x,y) = \min \{ 1, \frac{\pi(y)}{\pi(x)} \}$

Thus we accept the proposal if $\pi(y) \geq \pi(x)$ with probability one. This is very similar to a random search for a global maximum – we start at some point x, choose a candidate for a point with higher value of $\pi$ at random and proceed to this point. The major difference is that we also accept candidates with $\pi(y) < \pi(x)$ with a non-zero probability. This allows the algorithm to escape a local maximum much better. Intuitively, the algorithm will still try to spend more time in regions with large values of $\pi$, as we would expect from an attempt to sample from the distribution $\pi$.

The image above illustrates this procedure. The red graph displays the distribution $\pi$. If our algorithm is currently at step xn, the purpose is to move “up-hill”, i.e. to the left in our example. If we draw a point like y from q which goes already in the right directory, we will always accept this proposal and move to y. If, however, we draw a point like y’, at which $\pi$ is smaller, we would accept this point with a non-zero probability. Thus if we have reached a local maximum like the one on the right hand side of the diagram, there is still a chance that we can escape from there and move towards the real maximum to the left.

In this form, the algorithm is extremely easy to implement. All we need is a function propose that creates the next proposal, and a function p that calculates the value of the probability density $\pi$ at some point. Then an implementation in Python is as follows.

import numpy as np
chain = []
X = 0
chain.append(X)
for n in range(args.steps):
Y = propose(X)
U = np.random.uniform()
alpha = p(Y) / p(X)
if (U <= alpha):
X = Y
chain.append(X)


In the diagram below, this algorithm has been applied to a Cauchy distribution with mode zero and scale one, using a normal distribution with mean x and standard deviation 0.5 as a proposal for the next location. The chain was calculated for 500.000 steps. The diagram in the upper part shows the values of the chain during the simulation.

Then the first 100.000 steps were discarded and considered as "burn-in" time for the chain to stabilize. Out of the remaining 400.000 sample points, points where chosen with a distance of 500 time steps to obtain a sample which is approximately independent and identically distributed. This is called subsampling and typically not necessary for Monte Carlo integration (see [2], chapter 1 for a short discussion of the need of subsampling), but is done here for the sake of illustration. The resulting subsample is plotted as a histogramm in the lower left corner of the diagram. The yellow line is the actual probability density.

We see that after a few thousand steps, the chain converges, but continues to have spikes. However, the sampled distribution is very close to the sample generated by the Python standard method (which is to take the quotient of two independent samples from a standard normal distribution).

In the diagram at the bottom, I have displayed how the integral of two functions ($\sin(x)$ and $\cos(x)$) approximated using the partial sums develops over time. We see that even though we still have huge spikes, the integral remains comparatively stable and converges already after a few thousand iterations. Even if we run the simulation only for 1000 steps, we already get close to the actual values zero (for $\sin(x)$ for symmetry reasons) and $\approx 0.3678$ (for $\cos(x)$, obtained using the scipy.integrate.quad integration routine).

In the second diagram in the middle row, I have plotted the autocorrelation versus the lag, as an indicator for the failure of the sample points to be independent. Recall that for two samples X and Y, the Pearson correlation coefficient is the number

$\frac{E((X-\bar{X})(Y-\bar{(Y)})}{\sigma_X \sigma_Y}$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y. In our case, given a lag, i.e. a number l less than the length of the chain, we can form two samples, one consisting of the points $X_0, X_2, \dots$ and the second one consisting of the points of the shifted series $X_l, X_{l+1}, X_{l+2}, \dots$. The autocorrelation with lag l is then defined to be the correlation coefficient between these two series. In the diagram, we can see how the autocorrelation depends on the lag. We see that for a large lag, the autocorrelation becomes small, supporting our intuition that the series and the shifted series become independent. However, if we execute several simulation runs, we will also find that in some cases, the convergence of the autocorrelation is very slow, so care needs to be taken when trying to obtain a nearly independent sample from the chain.

In practice, the autocorrelation is probably not a good measure for the convergence of a Markov chain. It is important to keep in mind that obtaining an independent sample is not the point of the Markov chain – the real point is that even though the sample is autocorrelated, we can approximate expectation values fairly well. However, I have included the autocorrelation here for the sake of illustration.

This form of proposal distributions is not the only one that is commonly used. Another choice that appears often is called an independence sampler. Here the proposal distribution is chosen to be independent of the current location x of the chain. This gives us an algorithm that resembles the importance sampling method and also shares some of the difficulties associated with it – in my notes on Markov chain Monte Carlo methods, I have included a short discussion and a few examples. These notes also contain further references and a short discussion of why and when the Markov chain underlying a Metropolis-Hastings sampler converges.

Other variants of the algorithm work by updating – in a high-dimensional space – either only one variable at a time or entire blocks of variables that are known to be independent.

Finally, if we are dealing with a state space that can be split as a product $X_1 \times X_2$, we can use the conditional probability given either x1 or x2 as a proposal distribution. Thus, we first fix x2, and draw a new value for x1 from the conditional probability for x1 given the current value of x2. Then we move to this new coordinate, fix x1, draw from the conditional distribution of x2 given x1 and set the new value of x2 accordingly. It can be shown (see for example [5]) that the acceptance probability is one in this case. So we end up with the Gibbs sampling algorithm that we have already used in the previous post on Ising models.

Monte Carlo sampling methods are a broad field, and even though this has already been a long post, we have only scratched the surface. I invite you to consult some of the references below and / or my notes for more details. As always, you will also find the sample code on GitHub and might want to play with this to reproduce the examples above and see how different settings impact the result.

In a certain sense, this post is the last post in the series on restricted Boltzmann machines, as it provides (at least some of) the mathematical background behind the Gibbs sampling approach that we used there. Boltzmann machines are examples for stochastic neuronal networks that can be applied to unsupervised learning, i.e. the allow a model to learn from a sample distribution without the need for labeled data. In the next few posts on machine learning, I will take a closer look at some other algorithms that can be used for unsupervised learning.

### References

1. N. Metropolis,A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculation by fast computing machines, J. Chem. Phys. Vol. 21, No. 6 (1953), pp. 1087-1092
2. S. Brooks, A. Gelman, C.L. Jones,X.L. Meng (ed.), Handbook of Markov chain Monte Carlo, Chapman Hall / CRC Press, Boca Raton 2011
3. W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, Vol. 57 No. 1 (1970), pp. 97-109
R.M. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993
5. C.P. Robert, G. Casella, Monte Carlo Statistical Methods,
Springer, New York 1999

## Recurrent and ergodic Markov chains

Today, we will look in more detail into convergence of Markov chains – what does it actually mean and how can we tell, given the transition matrix of a Markov chain on a finite state space, whether it actually converges.

So suppose that we are given a Markov chain on a finite state space, with transition probabilities described by a matrix K as in the previous post on this topic.

We have seen that the distribution of Xn is given by $\mu K^n$ where $\mu$ is the initial distribution. So if Kn converges, we can expect the distribution of Xn to converge as well. If there is a matrix $K^\infty$ such that

$\lim_{n \rightarrow \infty} K^n = K^\infty$

then of course

$K^\infty K = (\lim_{n \rightarrow \infty} K^n) K = \lim_{n \rightarrow \infty} K^{n+1} = K^\infty$

In other words, each row $\pi$ of $K^\infty$ will have the property that $\pi K = \pi$. Interpreting K as transition probabilities, this implies that if the distribution of Xn is $\pi$, the distribution of Xn+1 will again be $\pi$. Any distribution, described by a row vector $\pi$ with row sum 1, for which this holds is called an invariant distribution and traditionally denoted by $\pi$. In other words, invariant distributions correspond to eigenvectors of the transposed matrix KT with eigenvalue 1, and our argument has shown that if $K^n$ converges to $K^\infty$, then every row of $K^\infty$ will be an invariant distribution.

So convergence implies the existence of an invariant distribution. Let us next try to understand whether this invariant distribution is unique. There are obvious examples where this is not the case, the most trivial one being the unit matrix

$K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

Obviously $K^n = K$ and the chain converges, but every vector with row sum one is an invariant distribution. This chain is too rigid, because in whatever state we start, we will stay in this state forever. It turns out that in order to ensure uniquess of an invariant distribution, we need a certain property that makes sure that the states can move around freely which is called irreducibility.

Intuitively, we say that a Markov chain is irreducible if any state can be reached from any other state in finitely many steps. For a finite state space, this is rather easy to formalize. A finite Markov chain is called irreducible if, given two states i and j, we can find some power n such that $K^n_{ij} > 0$. In other words, given a row index i and a column index j, we can find a power n such that the element of Kn at (i,j) is not zero. It turns out that if a chain is not irreducible, we can split the state space into smaller areas that the chain – once it has entered one of them – does not leave again and on which it is irreducible. So irreducible Markov chains are the buildings blocks of more general Markov chains, and the study of many properties of Markov chains can be reduced to the irreducible case.

Now let us assume that our chain is in fact irreducible. What else do we have to ask for to make sure that it converges? Again, let us as look at an obvious example.

$K = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

All even powers of this matrix are the unit matrix and all odd powers are K itself, so the Markov chain described by this transition matrix is clearly irreducible. However, there is still something wrong. If, for instance, we are in state 1 at the time t, then the probability to still be in state 1 at time t+1 is zero. Thus we are actually forced to move to state 2. In a certain sense, there is still an additional constraint on the the behaviour of the chain – the state space can be split in two pieces and we cycle through these pieces with every step. Such a chain is called periodic. If a chain is periodic, it is again obvious that we can think of it as two different chains, one chain given by the random variables X0, X2,… at even times and the other chain given by the random variables X1, X3, … at odd times. Thus again, there is a way to split the chain into parts.

So let us now focus on chains that are irreducible and aperiodic. Can we tell whether the chain converges? The answer is surprisingly simple and one of the most powerful results in the theory or Markov chains: every finite Markov chain which is irreducible and aperiodic converges. Thus once we have verified aperiodicity and irreducibility, we can be sure that the limit $K^\infty$ exists. We can say even more – we have seen that the rows of the limit are invariant distributions for K. However, one can show that an irreducible Markov chain can only have one invariant distribution. Thus we can conclude that all rows of the matrix $K^\infty$ are identical. In fact, they are all equal to the (unique) invariant distribution that again corresponds to an eigenvector of KT with eigenvalue one.

This gives us two different approaches to calculating the invariant distribution and the limit $K^\infty$. First, we can form high powers Kn to approximate the limit $K^\infty$. Or, alternatively, we can search for eigenvectors of the transposed matrix KT with eigenvalue 1 using one of the known algorithms to calculate eigenvectors and eigenvalues.

Let us do this for the example of the random walk on a circle that we have studied earlier. For simplicity, we will use N = 4 this time. The transition matrix with p = 0.8 is

$K = \begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}$

We can then easily implement both approaches in Python using the numpy library.

Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

So we see that the approach to take the eigenvalues and eigenvectors gives us – in this case – the exact result (and a direct computation confirms that this is really an eigenvector), whereas the matrix power with n=20 gives a decent approximation. Both results confirm what we have observed visually in the last post – the distribution converges, and it does in fact converge to the uniform distribution. So roughly speaking, after sufficiently many steps, we end up everyhwere in the state space with equal probability.

So far, we have only discussed the case of a finite state space. In the general case, the situation is more complicated. First, we need a better definition of irreducibility, as on a general state space, the probability measure of a single point tends to be zero. It turns out that the solution is to define irreducibility with respect to a measure on the state space, which is called $\psi$-irreducibility. Once we have that, we find two potential behaviours that do not show up in the finite case.

First, it might happen that even though the chain is irreducible, it does not properly cycle through the state space and revisits every point with finite probability, but – intuitively speaking – heads off to infinity. Such a chain is called transient and does not converge.

Second, even if the chain is not transient, we might be able to find an invariant distribution (a measure in this case), but this measure might not be a probability distribution as it is not finite. Chains with this property are called null recurrent.

And finally, the concept of converge needs to be made more precise, and it turns out to have a subtle dependency on the starting point which can be fixed by assuming so called Harris recurrence. But all this can be done and delivers a very similar result – a convergence theorem for a large class of chains (in this case aperiodic Harris recurrent Markov chains). If you would like to get deeper into this and also see proofs for the claims made in this post and references, I invite you to take a look at my short introduction to Markov chains.

We close this post with one remark which, however, is of crucial importance for applications. Suppose we have a convergent Markov chain Xt, and a function f on the state space. If the Xt were independent and identically distributed, we could approximate the expectation value of f as

$\int_X f(x) dP = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)$

where dP is the (common) distribution of the Xt. For Markov chains, however, we know that the Xt are not independent – the whole point of having a Markov chain is that Xt+1 does actually depend on Xt. However, in our example above, we have seen that the dependency gets trivial for large t as in this example, all entries in the transition matrix become equal in the limit.

Of course this is a special case, but it turns out that even in general, the approximation of the expectation value by averages across the sample is still possible. Intuitively, this is not so surprising at all. If the distributions $K^n$ of the $X_n$ converge to some invariant measure $\pi$, the $X_n$ will, for large n, be approximately identically distributed, namely according to $\pi$. Moreover, heuristically we have for large n, m and a fixed starting point s:

$P(X_{n+m}=i, X_n=j) = K^m_{ji} K^n_{sj} \approx \pi_i \pi_j$

On the other hand, this product is again approximately the product

$\pi_i \pi_j \approx P(X_{n+m}=i) P(X_n=j)$

Thus, intuitively, Xn and Xm are almost independent if n and m are large, which makes us optimistic that the law of large numbers could actually hold.

In fact, if Xt converges (again I refer to my notes for a more precise definition of convergence in this case), it turns out that the law of large numbers remains valid if we integrate f with respect to the invariant distribution $\pi$:

$\int_X f(x) d\pi = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)$

This is extremely useful in applications, where often the primary purpose is not so much to obtain a sample, but to approximate otherwise intractable integrals! Once we have found a Markov chain that converges to the invariant distribution $\pi$, we can calculate integrals over $\pi$ by running a long Markov chain until converge and then taking the average value of the function that we want to integrate for a large number of subsequent points from this chain. In fact, this is how most applications work – and this is more or less what we also did when we applied Markov chains to the problem of calculating gradients in the PCD algorithm, where this chain is given by the state of the negative particles in subsequent iterations.

So given a distribution $\pi$, we are now lead to the question how we can possibly find a Markov chain for which this distribution is invariant and which converges. The most general answer to this question is a class of algorithms known as Metropolis-Hastings algorithms which we study in the next post in this series.

## Finite Markov chains

In this post, we will look in more detail into an important class of Markov chains – Markov chains on finite state spaces. Many of the subtleties that are present when studying Markov chains in general state spaces do not appear in the finite case, while most of the key ideas and features of Markov chains are still visible, so this is a good starting point if you want to grasp the key points.

So let us assume that our state space X is finite. For simplicity,  we label our states as $\{1, 2, \cdots N \}$ where N is the number of states. We also assume that all our points are measurable, i.e we consider our state space as a discrete probability space.

Now consider a sequence of random variables X0, X1, …. How can we formalize the idea that Xt+1 depends on Xt in a randomized way?

As so often in probability theory, let us model the dependency as a conditional probability. The conditional probability for Xt+1 to take on a value i given Xt

$P(X_{t+1} = i | X_t)$

considered as a function of Xt will assign a conditional probability to each of the states i and for each value of Xt. Therefore we can write this as a matrix

$P(X_{t+1} = i | X_t = j) = K_{ji}$

To obtain a time homogeneous Markov chain, we also assume that the matrix K does not depend on the time t. Therefore we define a Markov chain on X to be a sequence $\{X_t\}_t$ of random variables taking values in X with the Markov property saying that for all times t, the conditional distribution for Xt+1 given all previous values $X_0, X_1, \dots, X_t$ only depends on Xt, i.e.

$P(X_{t+1} | X_t, X_{t-1}, \cdots ) = P(X_{t+1} | X_t)$

We also require that this conditional probability  is independent of t and is therefore given by a matrix K as in the formula above.

Markov chains on finite state spaces are often visualized as a graph. Suppose, for instance, that our state space contains only two elements: $X = \{ 1 , 2\}$. We can think of the combined values $X_t$ for all times t as a history of states or as a random walk in the state space. The Markov property then means that the probability to transition into a next state does not depend on the full history, but only on the current state – Markov chains do not have a memory.

In our state space with two elements only, the Markov chain is then described by four transition probabilities: the probability to stay in state 1 when the chain is in state 1, the probability to move to state 2 when being in state 1, the probability to stay in state 2 and the probability to move to state 1 after being in state 2. This can be visualized as follows.

Let us now calculate a few probabilities to get an idea for the relevant quantities in such a model. First, suppose that at time 0, the model is in state j with probability $\mu_j$. What is the probability to be in state j after one step? Of course we can write

$P(X_1 = i) = \sum_j P(X_1 = i, X_0 = j)$

Now, according to the rules of conditional probabilities, we can express the joint probability as follows.

$P(X_1 = i, X_0 = j) = P(X_1 = i | X_0 = j) P(X_0 = j)$

Plugging this into our previous expression and using the definitions of K and $\pi$, we now obtain

$P(X_1 = i) = \sum_j K_{ji} \mu_j$

Thus if we think of $\mu$ as a row vector, then the probability after one step is described by the matrix product $\mu K$.

Now let us calculate a slightly different quantity. Assume that we know that a chain starts at j. What is the probability to be at i after two steps? Using once more the rules of conditional probability and the Markov property, we can write

$P(X_2=i | X_0=j) = \sum_k P(X_2=i | X_1=k)P(X_1=k | X_0=j)$

Intuitively, this is very appealing. To get from j to i in two steps, we can take the way via any intermediate state k. To get the total probability, we simply sum up all these different probabilities! If you have ever seen path integrals in quantum mechanics, this idea will look familiar.

Again, we can write this in matrix notation. Each of the conditional probabilities on the right hand side of the expression above is a matrix element, and we find that

$P(X_2 = i | X_0 = j) = \sum_k K_{ki} K_{jk}$

so that the probability to get in two steps from j to i is simply given by the elements of the matrix K2. Similarly, the n-step transition probabilities are the entries of the matrix Kn.

Let us look at an example to understand what is going on here, which is known as a finite random walk on a circle. Our state space consists of N distinct points which we place arbitrarily on a circle. We then define a Markov chain as follows. We start at some arbitrary point x. In each step, we move along the circle to a neighbored point – one point to the left with probability 1/2 and one point to the right with probability 1/2. By the very definition, the transition probabilities do not change over time and depend only on the current state, so is a finite Markov chain. If we label the points on the circle by $1, 2, \dots, N$, then the transition matrix is given by a matrix of the form (for instance for N=4)

$K = \begin{pmatrix} 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \end{pmatrix}$

More generally, we can also allow the process to stay where it is with probability 1 – p, where p is then the probability to move, which leads us to the matrix

$K = \begin{pmatrix} 1-p & \frac{p}{2} & 0 & \frac{p}{2} \\ \frac{p}{2} & 1-p & \frac{p}{2} & 0 \\ 0 & \frac{p}{2} & 1-p & \frac{p}{2} \\ \frac{p}{2} & 0 & \frac{p}{2} & 1 - p \end{pmatrix}$

Let us try to figure out whether the target distribution, given by the matrix Kn for large n, somehow converges.

To see this, we do two numerical experiments. First, we can easily simulate a random walk. Suppose that we have a function draw which accepts a distribution (given by a vector p whose elements add up to one) and draws a random value according to that distribution, i.e. it returns 1 with probability p0, 2 with probability p1 and so on. We can then simulate a random walk on the circle as follows.

def simulate_chain(N, p, steps=100, start=5):
chain = []
x = start % N
chain.append(x)
for i in range(steps):
x = (x  + draw([p/2.0, 1.0 - p, p / 2.0]) - 2) % N
chain.append(x)
return chain


Here N is the number of points on the circle, steps is the number of simulation steps that we run and start is the starting point. The function draw then returns 1 with probability p/2, 3 with probability p/2 and 2 with probability 1 – p. Thus we move with probability p/2 to the right, with probability p/2 to the left and stay were we are with probability 1 – p. If we set p = 0.8, this results in the following transition matrix.

$\begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}$

We can visualize the results if we execute a large number of runs and draw histograms of the resulting positions 100, 200, 300, … steps. The result will look roughly like this (for this image, I have used p = 0.8 and N=10):

We see that after a few hundred steps, this seems to converge – actually this starts to look very much like a uniform distribution. As we already know that the distribution after n steps is given by the matrix Kn, it makes sense to look at high powers of this matrix as well. To visualize this, I will use a method that I have seen in MacKays excellent book and that works as follows.

Visualizing one row of a matrix is not difficult. We can plot the entries in a two-dimensional diagram, where the x-axis corresponds to the column index and the y-axis corresponds to the value. A flat line is then a row in which all entries have the same value.

To display a full matrix, we use colors – we assign one color to each row index and plot the individual rows as just described. If we do this for the powers of the matrix K with the parameters p = 0.8 and N = 10 (and only chose some rows, for instance 1,4,7, to not run out of colors…) we obtain an image like the one below.

We can clearly see that the powers of the matrix K converge towards a matrix

$K^\infty = \lim_{n \rightarrow \infty} K^n$

where all entries in each row seem to have the same value. We have seen that the distribution after n steps assuming an initial distribution described by a row vector $\mu$ is $\mu K^n$. Therefore the limit distribution is $\mu K^\infty$. If we take $\mu$ to be a unit vector, we find that the rows of the matrix $K^\infty$ do actually represent the limit distribution of all chains that have started at a specific point i of the state space. Therefore the entries in the rows need to sum up to one, and thus, if they are all equal, need to be equal to 1 / N. If you calculate and print out high powers of the matrix K, you will in fact see that they approach the matrix where all entries are 0.1 (as we have chosen N = 10 in this example).

To completes our short introduction into Markov chains. We have seen that Markov chains model stochastic processes in discrete time in which the state at step t+1 depends only on the state at step t and the dependency is given by a function independent of the current time. In finite state spaces, Markov chains are described by a transition matrix K. The i-th row of the matrix Kn is the distribution of chains starting at point i after n steps. Consequently, converge properties of the Markov chains can be related to converge of high powers Kn and the apparatus of linear algebra can be applied.

In the next post, we will learn more about convergence – how it can be made precise, and how we can tell whether a given Markov chain converges. This will then allow us to construct Markov chains that converge towards a given target distribution and use them for sampling.