Its been a few days since I started to play with Paperspace, and I have come across a couple of interesting features that the platform has – enough for a second post on this topic.

First, GIT integration. Recall that the usual process is to zip the current working directory and submit the resulting file along with the job, the ZIP file is then unzipped in the container in which the job is running and the contents of the ZIP file constitute the working directory. However, if you want to run code that requires, for instance, custom libraries, it is much easier to instruct Paperspace to get the contents of the working directory from GitHub. You can do that by supplying a GIT URL using the --workspace switch. The example below, for instance, instructs Paperspace to pull my code for an RBM from GitHub and to run it as a job.

#!/bin/sh
#
# Run the RBM as a job on Paperspace. Assume that you have the paperspace NodeJS
# CLI and have done a paperspace login before to store your credentials
#
#
~/node_modules/.bin/paperspace jobs create  \
--workspace "git+https://github.com/christianb93/MachineLearning" \
--command "export MPLBACKEND=AGG ; python3 RBM.py \
--N=28 --data=MNIST \
--save=1 \
--tmpdir=/artifacts \
--hidden=128 \
--pattern=256 --batch_size=128 \
--epochs=40000 \
--run_samples=1 \
--sample_size=6,6 \
--beta=1.0 --sample=200000 \
--algorithm=PCDTF --precision=32" \
--machineType K80 \
--container "paperspace/tensorflow-python" \
--project "MachineLearning"


Be careful, the spelling of the URL must be exactly like this to be recognized as a GIT URL, i.e. “git+https” followed by the hostname without the “www”, if you use http instead of https or http://www.github.com instead of github.com, the job will fail (the documentation at this point could be better, and I have even had to look at the source code of the CLI to figure out the syntax). This is a nice feature, using that along with the job logs, I can easily reconstruct which version of the code has actually been executed, and it supports working in a team that is sharing GitHub repositories well.

Quite recently, Paperspace did apparently also add the option to use persistent storage in jobs to store data across job runs (see this announcement). Theoretically, the storage should be shared between notebooks and jobs in the same region, but as I have not yet found out how to start a notebook in a specific region, I could not try this out.

Another feature that I liked is that the container that you specify can actually be any container from the Docker Hub, for instance ubuntu. The only restriction is that Paperspace seems to overwrite the entrypoint in any case and will try to run bashinside the container to finally execute the command that you provide, so containers that do not have a bash in the standard execution path will not work. Still, you could use this to prepare your own containers, maybe with pre-installed data sets or libraries, and ask Paperspace to run them.

Finally, for those of us who are Python addicts, there is also a Python API for submitting and managing jobs in Paperspace. Actually, this API offers you two ways to run a Python script on Paperspace. First, you can import the paperspace package into your script and then, inside the script, do a paperspace.run(), as in the following example.

import paperspace
paperspace.run()
print('This will only be running on Paperspace')


What will happen behind the scenes is that the paperspace module takes your code, removes any occurrences of the paperspace package itself, puts the code into a temporary file and submits that as a job to Paperspace. You can then work with that job as with any other job, like monitoring it on the console or via the CLI.

That is nice and easy, but not everyone likes to hardcode the execution environment into the code. Fortunately, you can also simply import the paperspace package and use it to submit an arbitrary job, much like the NodeJs based CLI can do it. The code below demonstrates how to create a job using the Python API and download the output automatically (this script can also be found on GitHub).

import paperspace.jobs
import paperspace.config

import requests

#
# Define parameters
#
params = {}
#
# We want to use GIT, so we use the parameter workspaceFileName
#
params['workspaceFileName'] = "git+https://github.com/christianb93/MachineLearning"
params['machineType'] = "K80"
params['command'] = "export MPLBACKEND=AGG ; python3 RBM.py \
--N=28 --data=MNIST \
--save=1 \
--tmpdir=/artifacts \
--hidden=128 \
--pattern=256 --batch_size=128 \
--epochs=40000 \
--run_samples=1 \
--sample_size=6,6 \
--beta=1.0 --sample=200000 \
--algorithm=PCDTF --precision=32"
params['container'] = 'paperspace/tensorflow-python'
params['project'] = "MachineLearning"
params['dest'] = "/tmp"

#
# Get API key
#
apiKey = apikey()
print("Using API key ", apiKey)
#
# Create the job. We do NOT use the create method as it cannot
# handle the GIT feature, but assemble the request ourselves
#
http_method = 'POST'
path = '/' + 'jobs' + '/' + 'createJob'

r = requests.request(http_method, paperspace.config.CONFIG_HOST + path,
params=params, files={})
job = r.json()

if 'id' not in job:
print("Error, could not get jobId")
print(job)
exit(1)

jobId = job['id']
print("Started job with jobId ", jobId)
params['jobId']  = jobId

#
# Now poll until the job is complete

if job['state'] == 'Pending':
print('Waiting for job to run...')
job = paperspace.jobs.waitfor({'jobId': jobId, 'state': 'Running'})

print("Job is now running")
print("Use the following command to observe its logs: ~/node_modules/.bin/paperspace jobs logs --jobId ", jobId, "--tail")

job = paperspace.jobs.waitfor({'jobId': jobId, 'state': 'Stopped'})
print("Job is complete: ", job)

#
# Finally get artifacts
#
paperspace.jobs.artifactsGet(params)


There are some additional features that the Python API seems to have that I have not yet tried out. First, you can apparently specify an init script that will be run before the command that you provide (though the use of that is limited, as you could put this into your command as well). Second, and more important, you can provide a requirements file according to the pip standard to ask Paperspace to install any libraries that are not available in the container before running your command.

Overall, my impression is that these APIs make it comparatively easy to work with jobs on Paperspace. You can submit jobs, monitor them and get their outputs, and you enjoy the benefit that you are only billed for the actual duration of the job. So if you are interested in a job based execution environment for your Machine Learning models, it is definitely worth a try, even though it takes some time to get familiar with the environment.

## First steps with Paperspace Gradient

So far, I have exclusively been using AWS EC2 when I needed access to a GPU – not because I have carefully compared the available offerings and taken a deliberate decision, but simply because I already had an EC2 account and know the platform.

However, I though it would be interesting to try out other platforms as well. In this post, I will talk a bit about my experience with Paperspace. This provider has several offerings – Core, which is basically an IaaS service, and Gradient, which allows you to access Jupyter notebooks and run jobs in ready-made environments optimized for Machine Learning – and of course I wanted to try this.

It should be noted that some time has passed between trying this out for the first time (roughly in May) and publication of this post in July, so bear with me when some of the details have changed in the meantime – Paperspace is still under development.

## First steps

After signing up, you are routed to a page where you can choose between two products – Paperspace Core and Paperspace Gradient. I did choose Gradient (after providing the requested credit card information). The first thing I did try was to bring up a Jupyter notebook.

When you select that option, you have to make two choices. First, Jupyter notebooks are started in Docker containers, and you have to pick one of the available containers. Second, and more important, you have to select a machine – you have a choice between several CPU based and several GPU based models with different fees associated with them.

After a few seconds, your notebook is up and running (with the base account, you can only have one notebook server at any point in time). If you hit “Open”, a new tab will open and you will see the usual Jupyter home screen.

Your notebook folder will be prepopulated with some tutorials. The one I tried first is one of the classical MNIST / CNN tutorials. Unfortunately, when I tried to run it, the kernel died several times in a row – not very encouraging (it worked two days later, and overall there seem to be a few sporadic errors that come and go over time..).

Next, I could not resist the temptation to open a terminal. The Docker image seems to be based on a very basic Ubuntu distribution. I could successfully do an apt-get update && apt-get install git. So you could probably start to download things and work directly from the console – but of course this is not really the idea.

After playing for some time with the notebook, you can – again on the Paperspace notebook console stop your notebook (make sure to do this, you will be charged while the notebook is running). Once the notebook has stopped, you can click on the little arrow to the right of the notebook name, which will give you the option to download any files in the notebook directory that you have created in your session.

Once stopped, there is no way to restart a notebook, but you can clone a notebook which will create a copy of the previous notebook and start that copy, so you can continue to work where you left off. This works, but is a bit tiresome as you have to delete the obsolete copy manually.

## Jobs

The next thing I tried is to create a job. For that purpose, you will first have to install the Paperspace CLI which in turn requires node.js and npm. So here is what you need to do on Ubuntu:

$cd ~$ apt install nodejs
$apt install npm$ npm install paperspace-node
$sudo ln -s /usr/bin/nodejs /usr/bin/node  This will create a directory node_modules in your home directory and within that directory, a directory .bin. To test the paperspace CLI, you can run a command like $ ./node_modules/.bin/paperspace --version


Next, switch to an empty directory and in that directory, run

$~/node_modules/.bin/paperspace project init  This will initiate a new paperspace project, i.e. it will create a subdirectory .ps_project containing a JSON configuration file. Next, you need an API key that you can get on your Paperspace home page. The API key is an authentication token that is used by the API – store that number in a safe place. Once we have that token, it is time to start our first job.  ~/node_modules/.bin/paperspace jobs create --container Test-Container --command "nvidia-smi" --apiKey "xxxxx" --machineType K80  where xxxxx needs to be replaced by your API key. Instead of providing your API key with every command, you can also run $ ~/node_modules/.bin/paperspace login


which will add your credentials to a file in the .paperspace directory in your home directory.

Essentially, what happens when you run a job is that the local directory and all its subdirectories will be zipped into a file, a container will be set up on a Paperspace server, the content of the ZIP file will be extracted into this container and the command that you have specified will execute.

You can now get a list of your processes and their status either on the Paperspace console, where you also have immediate access to the log output, or from the command line using

$~/node_modules/.bin/paperspace jobs list  At this point, I was again a bit disappointed – the job appears to be running and is even displayed in the web console, but when it completes, I get an error “503 – Service unavailable” and no log output is provided. I raised a request with the support, and roughly 2 hours later the submission suddenly worked – I have not yet found out whether the support has really done anything or whether a part of the infrastructure was really down at this point in time. As a temporary workaround, I managed to run a job by redirecting error output and standard output to a file. For instance, to run the script KMeans.py, I did use $ ~/node_modules/.bin/paperspace jobs create  --command "export MPLBACKEND=AGG ; python KMeans.py > /artifacts/log 2>&1" --machineType C2


Once the job is complete, you can download whatever it has added to the directory /artifacts using

$~/node_modules/.bin/paperspace jobs artifactsGet --jobId "js26y3pi6jk056"  where the job ID is the ID of the job and will be displayed by the create command. Finally, the command jobs destroy --jobId=... can be used to delete a job after execution. So far, I have to admit that I am not so happy with what I have seen. I hit upon several issues in the standard setup, and when playing around, I found that it can take a long time for a job to be scheduled on a GPU (depending very much on the machine type – my impression was that for machine types for which Paperspace uses GCP, like K80 or P100, your job will run quickly, but for other types like GPU+ it can take a long time). In addition, as everything is running in a container, the initial steps in job can be time consuming. TensorFlow, for instance, is known to take longer when it is started the first time, and in a fresh container, every time is a first time, so you will see a significant startup time. This gets worse if you need to download data sets, as this will have to be repeated with every new run. It is apparently not yet possible to mount a permanent volume into your container to avoid this or to reuse a stopped container (update: as of July, Paperspace has announced that the /storage directory is a persistent storage available across notebooks and jobs, but I have not yet tried this). But maybe this is a premature judgement, and I decided that I will still continue to try it out. In one of the next posts, I will present some more advanced features and in particular the Python API that Paperspace offers. ## The EM algorithm and Gaussian mixture models – part II In this post, I will discuss the general form of the EM algorithm to obtain a maximum likelihood estimator for a model with latent variables. First, let us describe our model. We suppose that we are given some joint distribution of a random variable X (the observed variables) and and random variable Z (the latent variables) and are interested in maximizing the likelihood of an observed sample x of the visible variable X. We also assume that the joint distribution depends on a parameter $\Theta$, in practice this could be weights, bias terms or any other parameters. To simplify things a bit, we will also assume that the latent variable is finite. Our aim is to maximize the log likelihood, which we can – under these assumptions – express as follows. $\ln P(x |\Theta) = \ln \sum_z P(x,z | \Theta)$ Even if the joint distribution belongs to some exponential family, the fact that we need to consider the logarithm of the sum and not the sum of the logarithms makes this expression and its gradient difficult to calculate. So let us look for a different approach. To do this, let us assume that we are given a value $\Theta$ of the parameter and let us try to understand how the likelihood changes if we pass from $\Theta$ to some other value $\Theta'$. For that purpose, we introduce a term that is traditionally called Q and defined as follows (all this is a bit abstract, but will become clearer later when we do an example): $Q(\Theta'; \Theta) = E \left[ \ln P(x,z | \Theta') | x, \Theta \right]$ That looks a bit complicated, so let me explain the notation a bit. We want to define a function Q that will be a function of the new parameter value $\Theta'$. This function will, in addition, depend on the current value $\Theta$ which we consider as a parameter. The right hand side is an expectation value. In fact, for each value of the visible variable x and the parameter $\Theta$, we have a probability distribution on the space in which Z lives, given by the conditional probability of Z given x and $\Theta$. Whenever we have a function depending on z, we can therefore form the expectation value as usual, i.e. as the weighted sum over all function values, weighted by the probability of z. In particular, we can do this for the function $\ln P(x,z | \Theta')$ of z. Thus the right hand side is, spelled out $E \left[ \ln P(x,z | \Theta') | x, \Theta \right] = \sum_z \ln P(x,z | \Theta') P(z | x, \Theta)$ That is now again a sum of logarithms, not a logarithm of a sum, and we can hope to be able to deal with this much better. This is nice, but so far we have only introduced a rather complicated additional object – what do we gain? It turns out that essentially, maxizing Q will effectively maximize the likelihood. Let us make this bold statement a bit more precise. Suppose we are able to iteratively maximize Q. Expressed formally, this would mean that we are able to find a sequence $\Theta^0, \Theta^1, \dots$ of parameters such that when passing from $\Theta^t$ to $\Theta^{t+1}$, the value of Q does not decrease, i.e. $Q(\Theta^{(t+1)};\Theta^{(t)}) \geq Q(\Theta^{(t)} ; \Theta^{(t)})$ Then this very same sequence will be a sequence of parameters for which the log-likelihood is non-decreasing as well, i.e. $\ln P(x | \Theta^{(t+1)}) \geq \ln P(x | \Theta^{(t)})$ I will not include a proof for this in this post (the proof identifies the difference between any two subsequent steps as a Kullback-Leibler divergence and makes use of the fact that a Kullback-Leibler divergence is never negative, you can find a a proof in the references, in particular in [2], or in my more detailed notes on the EM algorithm and Gaussian mixture models, where I also briefly touch on convergence). Instead, let us try to understand how this can be used in practice. Suppose that we have already constructed some parameter $\Theta^t$. The algorithm then proceeds in two steps. First, we calculate Q as a function of the new parameter $\Theta^{t+1}$. As Q is defined as an expectation value, this step is called the expectation step. Once we have that, we try to maximize Q, i.e. we try to find a value for the parameter $\Theta^{t+1}$ such that $Q(\Theta^{t+1}; \Theta^t)$ is maximized. This part of the algorithm is therefore called the maximization step. Then we start over, using $\Theta^{t+1}$ as new starting point. Thus the algorithm alternates between an expectation and a maximization step, leading to the name EM algorithm. To see how this works in practice, let us now return to our original example – Gaussian mixtures. Here the parameter $\Theta$ is given by $\Theta = (\mu, \pi, \Sigma)$ We consider a random variable X which has N components, each of which being a vector Xn in a d-dimensional space and corresponding to one sample vector (so in the language of machine learning, N will be our batch size). Similarly, Z consists of N components zn which again are subject to the restriction that only one of the znk be different from zero. We assume that the joint representation of our model is given by $P(x , z) = \prod_n \prod_k \pi_k^{z_{nk}}{\mathcal N} (x_n, \mu_k , \Sigma_k)^{z_{nk}}$ Now we need to compute Q. The calculation is a bit lengthy, and I will skip most of it here (you can find the details in the references or my notes for this post). To state the result, we have to introduce a quantity called responsibility which is defined as follows. $r_{nk} = \frac{\pi_k {\mathcal N}(x_n ; \mu_k, \Sigma_k)} {\sum_{j} \pi_j {\mathcal N}(x_n ; \mu_j, \Sigma_j) }$ From this definition, is it clear that the responsibility is always between zero and one, and in fact is has an interpretation of a (conditional) probability that sample n belongs to cluster k. Note that the responsibility is a function of the model parameters $\mu, \pi$ and $\Sigma$. Using this notation, we can now write down the result of calculating the Q function: $Q(\Theta'; \Theta) = \sum_n \sum_k r_{nk} \ln \pi'_k + \sum_n \sum_k r_{nk} \ln {\mathcal N}(x ; \mu'_k, \Sigma'_k)$ where the responsibilities are calculated using the old parameter set $\Theta = (\mu, \pi, \Sigma)$. For a fixed $\Theta$, this is a function of the new parameters $\mu', \pi'$ and $\Sigma'$, and we can now try to maximize this function with respect to these parameters. This calculation is not difficult, but again a bit tiresome (and requires the use of Lagrangian multipliers as there is a constraint on the $\pi_k$), and I again refer to my notes for the details. When the dust settles, we obtain three simple expressions. First, the new values for the cluster means are given by $\mu'_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$ This starts to look familiar – this is the same expression that we did obtain for the cluster centers for the k-means algorithm! In fact, if we arrange the rnk as a matrix, the denominator is the sum across column k and the numerators is the weighted sum over all data points. However, there is one important difference – it is no longer true that given n, only one of the rnk will be different from one. Instead, the rnk are soft assignments that model the probability that the data point xn belongs to cluster k. To write down the expression for the new value of the covariance matrix, we again need a notation first. Set $N_k = \sum_n r_{nk}$ which can be interpreted as the number of soft assignments of points to cluster k. With that notation, the new value for the covariance matrix is $\Sigma'_k = \frac{1}{N_k} \sum_n r_{nk} (x_n - {\mu'}_k)(x_n - {\mu'}_k)^T$ Finally, we can use the method of Lagrange multipliers to maximize with respect to the weights $\pi_k$ (which always need to sum up to one), and again obtain a rather simple expression for the new value. $\pi'_k = \frac{N_k}{N}$ Again, this is intuitively very appealing – the probability to be in cluster k is updated to be the number of points with a soft assignment to k divided by the total number of assignments We now have all the ingredients in place to apply the algorithm in practice. Let us summarize how this will work. First, we start with some initial value for the parameters – the weights $\pi$, the covariance matrices $\Sigma_k$ and the means $\mu_k$ – which could, for instance, be chosen randomly. Then, we calculate the responsibilities as above – essentially, this is the expectation step, as it amounts to finding Q. In the M-step, we then use these responsibilities and the formulas above to calculate the new values of the weights, the means and the covariance matrix. This involves a few matrix operations, which can be nicely expressed by the operations provide by the numpy library. In the example above, we have created two sets of 500 sample points from different Gaussian mixture distributions and then applied the k-means algorithm and the EM algorithm. In the top row, we see the results of running the k-means algorithm. The color indicates the result of the algorithm, the shape of the marker indicates the original cluster to which the point belongs. The bottom row displays the results of the EM algorithm, using the same pattern. We see that while the first sample (diagrams on the left) can be clustered equally well by both algorithms, the k-means algorithm is not able to properly cluster the second sample (diagrams on the right), while the EM algorithm is still able to assign most of the points to the correct cluster. If you want to run this yourself, you can – as always – find the source code on GitHub. When playing with the code and the parameters, you will notice that the results can differ substantially between two consecutive runs, this is due to the random choice of the initial parameters that have a huge impact on convergence (and sometimes the code will even fail because the covariance matrix can become singular, I have not yet fixed this). Using the switch --data=Iris, you can also apply the EM algorithm to the Iris data set (you need to have a copy of the Iris data file in the current working directory) and find again that the results vary significantly with different starting points. The EM algorithm has the advantage of being very general, and can therefore be applied to a wide range of problems and not just Gaussian mixture models. A nice example is the Baum Welch algorithm for training Hidden Markov models which is actually an instance of the EM algorithm. The EM algorithm can even be applied to classical multi-layer feed forward networks, treating the hidden units as latent variables, see [3] for an overview and some results. So clearly this algorithm should be part of every Data Scientists toolbox. ### References 1. C.M. Bishop, Pattern recognition and machine learning, Springer, New York 2006 2. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM-algorithm, Journ. Royal Stat. Soc. Series B. Vol. 39 No. 1 (1977), pp. 1-38 3. Shu-Kay Ng, G.J. McLachlan, Using the EM Algorithm to Train Neural Networks: Misconceptions and a New Algorithm for Multiclass Classification, IEEE Transaction on Neural Networks, Vol. 15, No. 3, May 2004 ## The EM algorithm and Gaussian mixture models – part I In the last few posts on machine learning, we have looked in detail at restricted Boltzmann machines. RBMs are a prime example for unsupervised learning – they learn a given distribution and are able to extract features from a data set, without the need to label the data upfront. However, there are of course many other, by now almost classical, algorithms for unsupervised learning. In this and the following posts, I will explain one of the most commonly applied algorithms in this field, namely the EM algorithm. As a preparation, let us start with a very fundamental exercise – clustering. So suppose that we are given a data set which we suspect to consist of K clusters. We want to identify these clusters, i.e. to each point in the dataset, we want to assign the number of the cluster to which the point belongs. More precisely, let us suppose that we are given a set ${\mathcal D} = \{ x_i, \dots, x_N\}$ of points in some euclidian space and a number K of clusters. We want to identify the centre $\mu_i$ of each cluster and then assign the data points xi to some of the $\mu_j$, namely to the $\mu_j$ which is closest to xi. We can encode the assignment of data points to clusters in a matrix Rij where Rij = 1 if data point xi belongs to cluster j with centre $\mu_j$. Thus each row of R corresponds to a data point and is non-zero in exactly one column, where it is one. This is known as 1-of-K encoding. For each value of i, the squared distance of xi to the centre of the cluster to which it is assigned is then given by $\sum_j R_{ij} \| x_i - \mu_j |^2$ where we note that only one of the summands is different from zero. Assigning the data points to the clusters and positioning the centre points of the clusters then amounts to minimizing the loss function $l(\mu, R) = \sum_i \sum_j R_{ij} \langle x_i - \mu_j , x_i - \mu_j \rangle$ where the brackets denote the usual euclidean scalar product. Now let us see how we can optimize this function. For a fixed vector $\mu$ of cluster centers, we can easily minimize Rij by assigning each data point to the cluster whose center is closest to xi. Thus we set $R_{ij} = \begin{cases} 1 & \text{if} \, j = \arg \min \| x_i - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}$ Conversely, given a matrix R which we hold fixed, it is likewise easy to minimize $\mu_j$. As there are no further constraints on $\mu_j$, we can find the minimum by differentiating the loss function. We find that the derivative is zero if and only if $0 = \sum_i R_{ij}(x_i - \mu_j)$ holds for all j. Assuming for a moment that each cluster has at least one data point assigned to it, i.e. that none of the columns of R contains zeroes only, we can solve this by $\mu_j = \frac{\sum_i x_i R_{ij}}{\sum_i R_{ij}}$ which is also easily seen to be a local minimum by taking the second derivative. Note that this term has an obvious geometric interpretation. The denominator in this expression is the number of data points that are currently assigned to cluster j. The numerator is the sum of all data points assigned to this cluster. Thus the minimum is the mean value of the positions of the data points that are currently assigned to the cluster (a physicist would call this the center of gravity of the cluster). This is the reason why this method is called the k-means algorithm. If no data points are assigned to the cluster, the loss function does not depend on $\mu_j$ and we can choose any value that we want. The algorithm now works as follows. First, we choose centers $\mu_j$ at will – it is common to use some of the data points for this purpose. Then we assign each data point to a center using the formula for Rij derived above. We then adjust the center points $\mu_j$ and reallocate the points to the new centers and so forth. From our discussion above, it is clear that each full iteration of this procedure will reduce the loss function. This does of course not imply that the algorithm converges to a global minimum, as it might get stuck in local minima or saddle points. In practice, the algorithm is executed until the cluster assignments and centers do not change any more substantially or for a given number of steps. The diagram above shows the result of applying this algorithm to a set of points that is organized in two clusters. To generate the data, 100 samples were drawn from 2-dimensional Gaussian distributions. On the left hand side, half of the the samples were centered at the point (5,1), the other samples at (1,4), and both had standard deviation 0.6. On the right hand side, the same centers were used, but only a small number of samples were drawn from the second distribution which had standard deviation 0.5, whereas most samples came from the first distribution with standard deviation 0.8. Then 10 iterations of the k-means algorithm were applied to the resulting sample set. The points in the sample were then plotted with a color indicating the assignment to clusters resulting from the matrix R. The actual cluster from which the sample was drawn is indicated by the shape – a point is cluster one, a diamond is cluster two. We see that in the example on the left hand side, the algorithm has correctly assigned all points to their original cluster. For the example on the right hand side, the situation is different – the algorithm did actually assign significantly more points to the blue cluster, i.e. there are many wrong assignments (blue points). This does not change substantially if we increase the number of iterations, even with 100 iterations, we still see many wrong assigments for this example. If you want to reproduce the example and play a bit with the parameters, you can get the sourcecode for a straightforward implementation in Python from my GitHub repository. The K-means algorithm is very simple and straightforward, but seems to have limitations because it cannot determine the shape of a distribution, only its center. It turns out that K-means is a special case of a broader class of algorithms that we now study, hoping to find more robust algorithms. In our example, we have generated sample data as a combination of two Gaussian distributions. What if we just change the game and simply assume that our data is of this type? In other words, we assume an underlying probabilistic model for our sample data. Once we have that, we can pull all the tricks that statistical inference can offer – we can calculate maximum likelihood and maximum posterior probability, we can try to determine the posterior or even sample from the model. Thus let us try to fit our data to a model of the form $P(x) = \sum_k \pi_k {\mathcal N}(x ; \mu_k, \Sigma_k)$ where ${\mathcal N}(x ; \mu_k, \Sigma_k)$ is a multivariate Gaussian distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$, i.e. ${\mathcal N}(x ; \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp (-\frac{1}{2} \langle x - \mu_k, \Sigma_k^{-1}(x - \mu_k)\rangle)$ and the $\pi_k$ are non-negative real numbers with $\sum_k \pi_k = 1$. Let us now see how this equation looks like if we use a 1-of-K encoding. We introduce a random variable Z that takes values in $\{ 0, 1\}^K$ with the additional constraint that only one of the Zk is allowed to be different from zero. We interpret $\pi_k$ as the probability $\pi_k = P(Z_k = 1)$ Then $P(Z = z) = \prod_k \pi_k ^{z_k}$ and we can write $P(X=x) = \sum_z P(Z=z) P(X=x | Z=z) = \sum_z P(x,z)$ where $P(z)$ is as above and $P(X = x | Z = z) = \prod_k {\mathcal N}(x ; \mu_k, \Sigma_k)^{z_k}$ This is a very general type of distribution which reflects a common pattern in machine learning, namely the introduction of so called latent or hidden variables. In general, latent or hidden variables are random variables that are a part of the model which cannot be observed, i.e. are not part of the input or the output of the model. We have seen latent variables in action several times – adding hidden units to a neural network introduces latent variables and makes the model much more powerful, the hidden layer of a restricted Boltzmann machine serves as memory to learn features, and latent variables that are used to construct a mixture of Gaussians as above allow us to model a much broader class of distributions than a model with just one Gaussian. Intuitively, it is also clear how to sample from such a model. In a first step, we sample from Z, in other words we determine the index k randomly according to the distribution given by the $\pi_k$. Once we have k, we then sample from the conditional distribution $P(X = x | Z = z)$. As we already have k, this amounts to sampling from the Gaussian distribution ${\mathcal N}(x ; \mu_k, \Sigma_k)$ with mean $\mu_k$ and covariance matrix $\Sigma_k$. In the example above, we have first applied this procedure to a one-dimensional Gaussian mixture with K=2. The histogram shows the result of the sampling procedure, the solid lines are the probability density functions of the two individual Gaussian distributions, multiplied by the respective weight. On the right hand side, the result of sampling a Gaussian mixture in two dimensions is displayed (you can download the notebook used to produce this example here). When we now try to fit this model to the sample data, we again have to calculate the likelihood function and try to maximize it. However, it turns out that the gradient of the likelihood is not so easy to calculate, and it might very well be that there is no closed form solution or that we obtain rather complicated expressions for the gradient. Fortunately, there is an alternative that works for very general models and does not require knowledge of the gradient – the EM algorithm. In the next post, I will present the algorithm in this general setup, before we apply it to our original problem and compare the results with the k-means algorithm. ### References 1. C.M. Bishop, Pattern recognition and machine learning, Springer, New York 2006 2. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM-algorithm, Journ. Royal Stat. Soc. Series B. Vol. 39 No. 1 (1977), pp. 1-38 ## Why you need statistics to understand neuronal networks When I tried to learn about neuronal networks first, I did what probably most of us would do – I started to look for tutorials, blogs etc. on the web and was surprised by the vast amount of resources that I found. Almost every blog or webpage about neuronal networks has a section on training a simple neuronal network, maybe on the MNIST data set, using a framework like TensorFlow, Theano or MXNET. When you follow such a tutorial, a network is presented as a collection of units and weights. You see how the output of the network is calculated and then an error function – sometimes least squares, sometimes something else – is presented. Often, a regulation term is applied, and then you are being told that the automatic gradient calculation features of the framework will do the gradient descent algorithm for you and you just have to decide on an optimizer and run the network and enjoy the results. Sooner or later, however, you will maybe start to ask a few questions. Why that particular choice of the loss function? Where does the regulator come from? What is a good initial value for the weights and why? Where does the sigmoid function come from? And many, many other questions…. If you then decide to dig deeper, using one of the many excellent textbooks or even try to read some of the original research papers (and some are actually quite readable), you will very soon be confronted with terms like entropy, maximum likelihood, posterior distribution, Gaussian mixtures and so on, and you will realize that the mathematics of neuronal networks has a strong overlap with mathematical statistics. But why? Why is that a good language to discuss neuronal networks, and why should you take the time to refresh your statistics knowledge if you really want to understand neuronal networks? In this post, I will try to argue that statistical inference comes up very naturally when you try to study neuronal networks. Many neuronal networks are designed to excel at classification tasks. As an example, suppose you wanted to design and train a neuronal network that, given data about an animal, classifies the animal as either a bird or some other animal (called a “non-bird” for convenience). So our starting point is a set modelling all possible objects that could be presented to the network. How exactly we model this set is not so important, more important is that in general, the network will not have access to all the data about the animal, but only to certain attributes of elements in the set called features. So there could be a feature which we call X1 and which is defined as $X_1 = \text{the animal can fly}$ taking values in $\{0,1\}$. Another data point the network could get is $X_2 = \text{length of animal in cm}$ taking values in the reals and so forth. More generally, we assume that on the set of all possible objects, we have certain functions Xi taking values in, say, the real numbers. Based on these numbers, the network will then try to take a decision whether a given animal is a bird or not. Thus we do not have to deal directly with our space of objects, but use the functions Xi as primary objects. If the network had a chance to look at every possible animal, this would be easy, even though it would cost a lot memory – it could simply remember all possible combinations of features and for each feature, store the correct answer. In reality however, this does not work. Instead, we have access to a small subset of the data – a sample – for which we can evaluate the Xi. Based on this subset, we then have to derive a model which gives the right answer in as many cases as possible. Thus we try to make a statement about the full space of things that could be presented to our network for classification based on a small sample. This is where probabilities come into play very naturally. We need to assume that our sample has been chosen randomly, but still we need to make assertions about the full set. This is exactly what inferential statistics is doing. The fact that our sample is chosen randomly turns our Xi into random variables. Similarly, the variable $Y = \text{is a bird}$ taking values in $\{0,1\}$ is a random variable, and we try to gain information on the distribution of Y across the full population based on its values on a given set of labelled samples, i.e. a set of samples where Y is known. Thus Y would represent the labels or targets in the language of neuronal networks. Applying the methods of statistical inference to this situation would typically start by choosing a statistical model and than using estimators or hypothesis testing to make deductions. Apart from the fact that we have to derive information on the full population based on a sample, there is another reason why probabilities appear naturally in the theory of machine learning. In many cases, the available input – being a reduction of the full set of data – is not sufficient to classify the sample with full certainty. To see this, let us go back to our examples. How would you derive the property “bird” from the given data “can fly” and “length”? Not all animals than can fly are birds – and not all birds can fly. So we have to try to distinguish for instance a butterfly from a hummingbird based on the length. The smallest hummingbird – a bee hummingbird – is about 5 cm in length. The largest known butterfly – the Queen’s Alexandra birdwing – can be as long as 8 cm (both informations taken from Wikipedia). Thus our data is not sufficient to clearly distinguish butterflies and birds in all cases! However, very small birds and very large butterflies have one thing in common – they are rare. So chances are that a given animal that can fly and is larger than 5 cm is actually a bird (yes, I know, there are bats….). In other words, if again Y denotes the variable which is 1 on birds and 0 on all other animals, we can in general not hope that Y is a function of the Xi, but we can hope that given some values of the Xi, the probability $P(Y=1)$ to be a bird is a function of the Xi. In other words, using the language of conditional probabilities, $P(Y=1 | X = x) = f(x)$ with some unknown function f. In a Bayesian interpretation of probability, the certainty with which can say “this animal is a bird” is a function of the values xi of the observable variables Xi. With these considerations, we now arrive at the following mathematical model for what a classification algorithm is about. We are given a probability space $(P, \Omega)$ with a vector valued random variable X. The attributes of a sample are described by the feature vector X in some subset of m-dimensional euclidian space, where m is the number of different features. In our example, m=2, as we try to classify animals based on two properties. The result of the classification is described by a random variable Y taking – for the simple case of a binary classification problem – values in $\{0,1\}$. We then assume that $P(Y =1 | X=x) = f(x;w_0)$ where $f(\cdot;w)$ is a function parametrized by some parameter w that we call the weights of the model. The actual value w0 of w is unknown. Based on a sample for X and Y, we then try to fit the model, i.e. we try to find a value for w such that $f(\cdot, w)$ models the actual conditional distribution of Y as good as possible. Once the fitting phase is completed, we can then use the model to derive predictions about objects which are not in our initial sample set. This model sounds a bit abstract, but many feed forward neuronal networks can be described with this or similar models. And we can now apply the full machinery of mathematical statistics – we can calculate cross entropies and maximum likelihood, we can analyse converge and variance, we can apply the framework of Bayesian statistics and Monte Carlo methods. This is the reason why statistics is so essential when it comes to describing and analyzing neuronal networks. So on the next rainy Sunday afternoon, you might want to grab a steaming hot cup of coffee, head towards your arm chair and spent some time with one of the many good exposures on this topic, like chapter IV in MacKays book on Machine Learning, or Bishops “Pattern recognition and machine learning” or chapter 3 of the deep learning book by Goodfellow, Bengio and Courville. ## The Metropolis-Hastings algorithm In this post, we will investigate the Metropolis-Hastings algorithm, which is still one of the most popular algorithms in the field of Markov chain Monte Carlo methods, even though its first appearence (see [1]) happened in 1953, more than 60 years in the past. It does for instance appear on the CiSe top ten list of the most important algorithms of the 20th century (I got this and the link from this post on WordPress). Before we get into the algorithm, let us once more state the problem that the algorithm is trying to solve. Suppose you are given a probability distribution $\pi$ on some state space X (most often this will be a real euclidian space on which you can do floating point arithmetic). You might want to imagine the state space as describing possible states of a physical system, like spin configurations in a ferromagnetic medium similar to what we looked at in my post on the Ising model. The distribution $\pi$ then describes the probability for the system to be in a specific state. You then have some quantity, given as a function f on the state space. Theoretically, this is a quantity that you can calculate for each individual state. In most applications, however, you will never be able to observe an individual state. Instead, you will observe an average, weighted by the probability of occurence. In other words, you observe the expectation value $\langle f \rangle = \int_X f d\pi$ of the quantity f. Thus to make a prediction that can be verified or falsified by an observation, you will have to calculate integrals of this type. Now, in practice, this can be very hard. One issue is that in order to naively calculate the integral, you would have to transverse the entire state space, which is not feasible for most realistic problems as this tends to be a very high dimensional space. Closely related to this is a second problem. Remember, for instance, that a typical distribution like the Boltzmann distribution is given by $\pi(x) = \frac{1}{Z} e^{-\beta E(x)}$ The term in the numerator is comparatively easy to calculate. However, the term in the denominator is the partition function, and is itself an integral over the state space! This makes even the calculation of $\pi(x)$ for a single point in the state space intractable. But there is hope – even though calculating the values of $\pi$ for one point might be impossible, in a distribution like this, calculating ratios of probabilities is easy, as the partition function cancels out and we are left with the exponential of an energy difference! The Metropolis-Hasting algorithm leverages this and also solves our state space problem by using a Markov chain to approximate the integral. So the idea is to build a Markov chain Xt that converges and has $\pi$ as an invariant distribution, so that we can approximate the integral by $\langle f \rangle = \int_X f d\pi \approx \frac{1}{N} \sum_{t=1}^N f(X_t)$ for large values of N. But how do we construct a Markov chain that converges to a given distribution? The Metropolis Hastings approach to solve this works as follows. The first thing that we do is to choose a proposal density q on our state space X, i.e. a measurable function $q \colon X \times X \rightarrow [0,\infty)$ such that for each x, $\int q(x,y) dy = 1$. Then q defines a Markov chain, where the probability to transition into a measurable set A being at a point x is given by the integral $Q(x,A) = \int_{\mathcal X} q(x,y) dy$ Of course this is not yet the Markov chain that we want – it has nothing to do with $\pi$, so there is no reason it should converge to $\pi$. To fix this, we now adjust the kernel to incorporate the behaviour of $\pi$. For that purpose, define $\alpha(x,y) = \begin{cases} \min \{ 1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)} \} & \text{if } \pi(x) q(x,y) > 0 \\ 1 & \text{if } \pi(x) q(x,y) = 0 \end{cases}$ This number is called the acceptance probability, and, as promised, it only contains ratios of probabilities, so that factors like the partition function cancel and do not have to be computed. The Metropolis Hastings algorithm now proceeds as follows. We start with some arbitrary point x0. When the chain has arrived at xn, we first draw a candidate y for the next location from the proposal distribution $q(x_n, \cdot)$. We now calculate $\alpha$ according to the formula above. We then accept the proposal with probability $\alpha$, i.e. we draw a random sample U from a uniform distribution and accept if $U \leq \alpha$. If the proposal is accepted, we set xn+1 = y, otherwise we set xn+1 = xn, i.e. we stay where we are. Clearly, the xn are samples from a Markov chain, as the position at step xn only depends on the position at step xn-1. But is still appears to be a bit mysterious why this should work. To shed light on this, let us consider a case where the expressions above simplify a bit. So let us assume that the proposal density q is symmetric, i.e. that $q(x,y) = q(y, x)$ This is the original Metropolis algorithm as proposed in [1]. If we also assume that $\pi$ and q are nowhere zero, the acceptance probability simplifies to $\alpha(x,y) = \min \{ 1, \frac{\pi(y)}{\pi(x)} \}$ Thus we accept the proposal if $\pi(y) \geq \pi(x)$ with probability one. This is very similar to a random search for a global maximum – we start at some point x, choose a candidate for a point with higher value of $\pi$ at random and proceed to this point. The major difference is that we also accept candidates with $\pi(y) < \pi(x)$ with a non-zero probability. This allows the algorithm to escape a local maximum much better. Intuitively, the algorithm will still try to spend more time in regions with large values of $\pi$, as we would expect from an attempt to sample from the distribution $\pi$. The image above illustrates this procedure. The red graph displays the distribution $\pi$. If our algorithm is currently at step xn, the purpose is to move “up-hill”, i.e. to the left in our example. If we draw a point like y from q which goes already in the right directory, we will always accept this proposal and move to y. If, however, we draw a point like y’, at which $\pi$ is smaller, we would accept this point with a non-zero probability. Thus if we have reached a local maximum like the one on the right hand side of the diagram, there is still a chance that we can escape from there and move towards the real maximum to the left. In this form, the algorithm is extremely easy to implement. All we need is a function propose that creates the next proposal, and a function p that calculates the value of the probability density $\pi$ at some point. Then an implementation in Python is as follows. import numpy as np chain = [] X = 0 chain.append(X) for n in range(args.steps): Y = propose(X) U = np.random.uniform() alpha = p(Y) / p(X) if (U <= alpha): X = Y chain.append(X)  In the diagram below, this algorithm has been applied to a Cauchy distribution with mode zero and scale one, using a normal distribution with mean x and standard deviation 0.5 as a proposal for the next location. The chain was calculated for 500.000 steps. The diagram in the upper part shows the values of the chain during the simulation. Then the first 100.000 steps were discarded and considered as "burn-in" time for the chain to stabilize. Out of the remaining 400.000 sample points, points where chosen with a distance of 500 time steps to obtain a sample which is approximately independent and identically distributed. This is called subsampling and typically not necessary for Monte Carlo integration (see [2], chapter 1 for a short discussion of the need of subsampling), but is done here for the sake of illustration. The resulting subsample is plotted as a histogramm in the lower left corner of the diagram. The yellow line is the actual probability density. We see that after a few thousand steps, the chain converges, but continues to have spikes. However, the sampled distribution is very close to the sample generated by the Python standard method (which is to take the quotient of two independent samples from a standard normal distribution). In the diagram at the bottom, I have displayed how the integral of two functions ($\sin(x)$ and $\cos(x)$) approximated using the partial sums develops over time. We see that even though we still have huge spikes, the integral remains comparatively stable and converges already after a few thousand iterations. Even if we run the simulation only for 1000 steps, we already get close to the actual values zero (for $\sin(x)$ for symmetry reasons) and $\approx 0.3678$ (for $\cos(x)$, obtained using the scipy.integrate.quad integration routine). In the second diagram in the middle row, I have plotted the autocorrelation versus the lag, as an indicator for the failure of the sample points to be independent. Recall that for two samples X and Y, the Pearson correlation coefficient is the number $\frac{E((X-\bar{X})(Y-\bar{(Y)})}{\sigma_X \sigma_Y}$ where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y. In our case, given a lag, i.e. a number l less than the length of the chain, we can form two samples, one consisting of the points $X_0, X_2, \dots$ and the second one consisting of the points of the shifted series $X_l, X_{l+1}, X_{l+2}, \dots$. The autocorrelation with lag l is then defined to be the correlation coefficient between these two series. In the diagram, we can see how the autocorrelation depends on the lag. We see that for a large lag, the autocorrelation becomes small, supporting our intuition that the series and the shifted series become independent. However, if we execute several simulation runs, we will also find that in some cases, the convergence of the autocorrelation is very slow, so care needs to be taken when trying to obtain a nearly independent sample from the chain. In practice, the autocorrelation is probably not a good measure for the convergence of a Markov chain. It is important to keep in mind that obtaining an independent sample is not the point of the Markov chain – the real point is that even though the sample is autocorrelated, we can approximate expectation values fairly well. However, I have included the autocorrelation here for the sake of illustration. This form of proposal distributions is not the only one that is commonly used. Another choice that appears often is called an independence sampler. Here the proposal distribution is chosen to be independent of the current location x of the chain. This gives us an algorithm that resembles the importance sampling method and also shares some of the difficulties associated with it – in my notes on Markov chain Monte Carlo methods, I have included a short discussion and a few examples. These notes also contain further references and a short discussion of why and when the Markov chain underlying a Metropolis-Hastings sampler converges. Other variants of the algorithm work by updating – in a high-dimensional space – either only one variable at a time or entire blocks of variables that are known to be independent. Finally, if we are dealing with a state space that can be split as a product $X_1 \times X_2$, we can use the conditional probability given either x1 or x2 as a proposal distribution. Thus, we first fix x2, and draw a new value for x1 from the conditional probability for x1 given the current value of x2. Then we move to this new coordinate, fix x1, draw from the conditional distribution of x2 given x1 and set the new value of x2 accordingly. It can be shown (see for example [5]) that the acceptance probability is one in this case. So we end up with the Gibbs sampling algorithm that we have already used in the previous post on Ising models. Monte Carlo sampling methods are a broad field, and even though this has already been a long post, we have only scratched the surface. I invite you to consult some of the references below and / or my notes for more details. As always, you will also find the sample code on GitHub and might want to play with this to reproduce the examples above and see how different settings impact the result. In a certain sense, this post is the last post in the series on restricted Boltzmann machines, as it provides (at least some of) the mathematical background behind the Gibbs sampling approach that we used there. Boltzmann machines are examples for stochastic neuronal networks that can be applied to unsupervised learning, i.e. the allow a model to learn from a sample distribution without the need for labeled data. In the next few posts on machine learning, I will take a closer look at some other algorithms that can be used for unsupervised learning. ### References 1. N. Metropolis,A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculation by fast computing machines, J. Chem. Phys. Vol. 21, No. 6 (1953), pp. 1087-1092 2. S. Brooks, A. Gelman, C.L. Jones,X.L. Meng (ed.), Handbook of Markov chain Monte Carlo, Chapman Hall / CRC Press, Boca Raton 2011 3. W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, Vol. 57 No. 1 (1970), pp. 97-109 R.M. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993 5. C.P. Robert, G. Casella, Monte Carlo Statistical Methods, Springer, New York 1999 ## Recurrent and ergodic Markov chains Today, we will look in more detail into convergence of Markov chains – what does it actually mean and how can we tell, given the transition matrix of a Markov chain on a finite state space, whether it actually converges. So suppose that we are given a Markov chain on a finite state space, with transition probabilities described by a matrix K as in the previous post on this topic. We have seen that the distribution of Xn is given by $\mu K^n$ where $\mu$ is the initial distribution. So if Kn converges, we can expect the distribution of Xn to converge as well. If there is a matrix $K^\infty$ such that $\lim_{n \rightarrow \infty} K^n = K^\infty$ then of course $K^\infty K = (\lim_{n \rightarrow \infty} K^n) K = \lim_{n \rightarrow \infty} K^{n+1} = K^\infty$ In other words, each row $\pi$ of $K^\infty$ will have the property that $\pi K = \pi$. Interpreting K as transition probabilities, this implies that if the distribution of Xn is $\pi$, the distribution of Xn+1 will again be $\pi$. Any distribution, described by a row vector $\pi$ with row sum 1, for which this holds is called an invariant distribution and traditionally denoted by $\pi$. In other words, invariant distributions correspond to eigenvectors of the transposed matrix KT with eigenvalue 1, and our argument has shown that if $K^n$ converges to $K^\infty$, then every row of $K^\infty$ will be an invariant distribution. So convergence implies the existence of an invariant distribution. Let us next try to understand whether this invariant distribution is unique. There are obvious examples where this is not the case, the most trivial one being the unit matrix $K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ Obviously $K^n = K$ and the chain converges, but every vector with row sum one is an invariant distribution. This chain is too rigid, because in whatever state we start, we will stay in this state forever. It turns out that in order to ensure uniquess of an invariant distribution, we need a certain property that makes sure that the states can move around freely which is called irreducibility. Intuitively, we say that a Markov chain is irreducible if any state can be reached from any other state in finitely many steps. For a finite state space, this is rather easy to formalize. A finite Markov chain is called irreducible if, given two states i and j, we can find some power n such that $K^n_{ij} > 0$. In other words, given a row index i and a column index j, we can find a power n such that the element of Kn at (i,j) is not zero. It turns out that if a chain is not irreducible, we can split the state space into smaller areas that the chain – once it has entered one of them – does not leave again and on which it is irreducible. So irreducible Markov chains are the buildings blocks of more general Markov chains, and the study of many properties of Markov chains can be reduced to the irreducible case. Now let us assume that our chain is in fact irreducible. What else do we have to ask for to make sure that it converges? Again, let us as look at an obvious example. $K = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ All even powers of this matrix are the unit matrix and all odd powers are K itself, so the Markov chain described by this transition matrix is clearly irreducible. However, there is still something wrong. If, for instance, we are in state 1 at the time t, then the probability to still be in state 1 at time t+1 is zero. Thus we are actually forced to move to state 2. In a certain sense, there is still an additional constraint on the the behaviour of the chain – the state space can be split in two pieces and we cycle through these pieces with every step. Such a chain is called periodic. If a chain is periodic, it is again obvious that we can think of it as two different chains, one chain given by the random variables X0, X2,… at even times and the other chain given by the random variables X1, X3, … at odd times. Thus again, there is a way to split the chain into parts. So let us now focus on chains that are irreducible and aperiodic. Can we tell whether the chain converges? The answer is surprisingly simple and one of the most powerful results in the theory or Markov chains: every finite Markov chain which is irreducible and aperiodic converges. Thus once we have verified aperiodicity and irreducibility, we can be sure that the limit $K^\infty$ exists. We can say even more – we have seen that the rows of the limit are invariant distributions for K. However, one can show that an irreducible Markov chain can only have one invariant distribution. Thus we can conclude that all rows of the matrix $K^\infty$ are identical. In fact, they are all equal to the (unique) invariant distribution that again corresponds to an eigenvector of KT with eigenvalue one. This gives us two different approaches to calculating the invariant distribution and the limit $K^\infty$. First, we can form high powers Kn to approximate the limit $K^\infty$. Or, alternatively, we can search for eigenvectors of the transposed matrix KT with eigenvalue 1 using one of the known algorithms to calculate eigenvectors and eigenvalues. Let us do this for the example of the random walk on a circle that we have studied earlier. For simplicity, we will use N = 4 this time. The transition matrix with p = 0.8 is $K = \begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}$ We can then easily implement both approaches in Python using the numpy library. Sorry, something went wrong. Reload? Sorry, we cannot display this file. Sorry, this file is invalid so it cannot be displayed. So we see that the approach to take the eigenvalues and eigenvectors gives us – in this case – the exact result (and a direct computation confirms that this is really an eigenvector), whereas the matrix power with n=20 gives a decent approximation. Both results confirm what we have observed visually in the last post – the distribution converges, and it does in fact converge to the uniform distribution. So roughly speaking, after sufficiently many steps, we end up everyhwere in the state space with equal probability. So far, we have only discussed the case of a finite state space. In the general case, the situation is more complicated. First, we need a better definition of irreducibility, as on a general state space, the probability measure of a single point tends to be zero. It turns out that the solution is to define irreducibility with respect to a measure on the state space, which is called $\psi$-irreducibility. Once we have that, we find two potential behaviours that do not show up in the finite case. First, it might happen that even though the chain is irreducible, it does not properly cycle through the state space and revisits every point with finite probability, but – intuitively speaking – heads off to infinity. Such a chain is called transient and does not converge. Second, even if the chain is not transient, we might be able to find an invariant distribution (a measure in this case), but this measure might not be a probability distribution as it is not finite. Chains with this property are called null recurrent. And finally, the concept of converge needs to be made more precise, and it turns out to have a subtle dependency on the starting point which can be fixed by assuming so called Harris recurrence. But all this can be done and delivers a very similar result – a convergence theorem for a large class of chains (in this case aperiodic Harris recurrent Markov chains). If you would like to get deeper into this and also see proofs for the claims made in this post and references, I invite you to take a look at my short introduction to Markov chains. We close this post with one remark which, however, is of crucial importance for applications. Suppose we have a convergent Markov chain Xt, and a function f on the state space. If the Xt were independent and identically distributed, we could approximate the expectation value of f as $\int_X f(x) dP = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)$ where dP is the (common) distribution of the Xt. For Markov chains, however, we know that the Xt are not independent – the whole point of having a Markov chain is that Xt+1 does actually depend on Xt. However, in our example above, we have seen that the dependency gets trivial for large t as in this example, all entries in the transition matrix become equal in the limit. Of course this is a special case, but it turns out that even in general, the approximation of the expectation value by averages across the sample is still possible. Intuitively, this is not so surprising at all. If the distributions $K^n$ of the$X_n$converge to some invariant measure $\pi$, the $X_n$ will, for large n, be approximately identically distributed, namely according to $\pi$. Moreover, heuristically we have for large n, m and a fixed starting point s: $P(X_{n+m}=i, X_n=j) = K^m_{ji} K^n_{sj} \approx \pi_i \pi_j$ On the other hand, this product is again approximately the product $\pi_i \pi_j \approx P(X_{n+m}=i) P(X_n=j)$ Thus, intuitively, Xn and Xm are almost independent if n and m are large, which makes us optimistic that the law of large numbers could actually hold. In fact, if Xt converges (again I refer to my notes for a more precise definition of convergence in this case), it turns out that the law of large numbers remains valid if we integrate f with respect to the invariant distribution $\pi$: $\int_X f(x) d\pi = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)$ This is extremely useful in applications, where often the primary purpose is not so much to obtain a sample, but to approximate otherwise intractable integrals! Once we have found a Markov chain that converges to the invariant distribution $\pi$, we can calculate integrals over $\pi$ by running a long Markov chain until converge and then taking the average value of the function that we want to integrate for a large number of subsequent points from this chain. In fact, this is how most applications work – and this is more or less what we also did when we applied Markov chains to the problem of calculating gradients in the PCD algorithm, where this chain is given by the state of the negative particles in subsequent iterations. So given a distribution $\pi$, we are now lead to the question how we can possibly find a Markov chain for which this distribution is invariant and which converges. The most general answer to this question is a class of algorithms known as Metropolis-Hastings algorithms which we study in the next post in this series. ## Finite Markov chains In this post, we will look in more detail into an important class of Markov chains – Markov chains on finite state spaces. Many of the subtleties that are present when studying Markov chains in general state spaces do not appear in the finite case, while most of the key ideas and features of Markov chains are still visible, so this is a good starting point if you want to grasp the key points. So let us assume that our state space X is finite. For simplicity, we label our states as $\{1, 2, \cdots N \}$ where N is the number of states. We also assume that all our points are measurable, i.e we consider our state space as a discrete probability space. Now consider a sequence of random variables X0, X1, …. How can we formalize the idea that Xt+1 depends on Xt in a randomized way? As so often in probability theory, let us model the dependency as a conditional probability. The conditional probability for Xt+1 to take on a value i given Xt $P(X_{t+1} = i | X_t)$ considered as a function of Xt will assign a conditional probability to each of the states i and for each value of Xt. Therefore we can write this as a matrix $P(X_{t+1} = i | X_t = j) = K_{ji}$ To obtain a time homogeneous Markov chain, we also assume that the matrix K does not depend on the time t. Therefore we define a Markov chain on X to be a sequence $\{X_t\}_t$ of random variables taking values in X with the Markov property saying that for all times t, the conditional distribution for Xt+1 given all previous values $X_0, X_1, \dots, X_t$ only depends on Xt, i.e. $P(X_{t+1} | X_t, X_{t-1}, \cdots ) = P(X_{t+1} | X_t)$ We also require that this conditional probability is independent of t and is therefore given by a matrix K as in the formula above. Markov chains on finite state spaces are often visualized as a graph. Suppose, for instance, that our state space contains only two elements: $X = \{ 1 , 2\}$. We can think of the combined values $X_t$ for all times t as a history of states or as a random walk in the state space. The Markov property then means that the probability to transition into a next state does not depend on the full history, but only on the current state – Markov chains do not have a memory. In our state space with two elements only, the Markov chain is then described by four transition probabilities: the probability to stay in state 1 when the chain is in state 1, the probability to move to state 2 when being in state 1, the probability to stay in state 2 and the probability to move to state 1 after being in state 2. This can be visualized as follows. Let us now calculate a few probabilities to get an idea for the relevant quantities in such a model. First, suppose that at time 0, the model is in state j with probability $\mu_j$. What is the probability to be in state j after one step? Of course we can write $P(X_1 = i) = \sum_j P(X_1 = i, X_0 = j)$ Now, according to the rules of conditional probabilities, we can express the joint probability as follows. $P(X_1 = i, X_0 = j) = P(X_1 = i | X_0 = j) P(X_0 = j)$ Plugging this into our previous expression and using the definitions of K and $\pi$, we now obtain $P(X_1 = i) = \sum_j K_{ji} \mu_j$ Thus if we think of $\mu$ as a row vector, then the probability after one step is described by the matrix product $\mu K$. Now let us calculate a slightly different quantity. Assume that we know that a chain starts at j. What is the probability to be at i after two steps? Using once more the rules of conditional probability and the Markov property, we can write $P(X_2=i | X_0=j) = \sum_k P(X_2=i | X_1=k)P(X_1=k | X_0=j)$ Intuitively, this is very appealing. To get from j to i in two steps, we can take the way via any intermediate state k. To get the total probability, we simply sum up all these different probabilities! If you have ever seen path integrals in quantum mechanics, this idea will look familiar. Again, we can write this in matrix notation. Each of the conditional probabilities on the right hand side of the expression above is a matrix element, and we find that $P(X_2 = i | X_0 = j) = \sum_k K_{ki} K_{jk}$ so that the probability to get in two steps from j to i is simply given by the elements of the matrix K2. Similarly, the n-step transition probabilities are the entries of the matrix Kn. Let us look at an example to understand what is going on here, which is known as a finite random walk on a circle. Our state space consists of N distinct points which we place arbitrarily on a circle. We then define a Markov chain as follows. We start at some arbitrary point x. In each step, we move along the circle to a neighbored point – one point to the left with probability 1/2 and one point to the right with probability 1/2. By the very definition, the transition probabilities do not change over time and depend only on the current state, so is a finite Markov chain. If we label the points on the circle by $1, 2, \dots, N$, then the transition matrix is given by a matrix of the form (for instance for N=4) $K = \begin{pmatrix} 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \end{pmatrix}$ More generally, we can also allow the process to stay where it is with probability 1 – p, where p is then the probability to move, which leads us to the matrix $K = \begin{pmatrix} 1-p & \frac{p}{2} & 0 & \frac{p}{2} \\ \frac{p}{2} & 1-p & \frac{p}{2} & 0 \\ 0 & \frac{p}{2} & 1-p & \frac{p}{2} \\ \frac{p}{2} & 0 & \frac{p}{2} & 1 - p \end{pmatrix}$ Let us try to figure out whether the target distribution, given by the matrix Kn for large n, somehow converges. To see this, we do two numerical experiments. First, we can easily simulate a random walk. Suppose that we have a function draw which accepts a distribution (given by a vector p whose elements add up to one) and draws a random value according to that distribution, i.e. it returns 1 with probability p0, 2 with probability p1 and so on. We can then simulate a random walk on the circle as follows. def simulate_chain(N, p, steps=100, start=5): chain = [] x = start % N chain.append(x) for i in range(steps): x = (x + draw([p/2.0, 1.0 - p, p / 2.0]) - 2) % N chain.append(x) return chain  Here N is the number of points on the circle, steps is the number of simulation steps that we run and start is the starting point. The function draw then returns 1 with probability p/2, 3 with probability p/2 and 2 with probability 1 – p. Thus we move with probability p/2 to the right, with probability p/2 to the left and stay were we are with probability 1 – p. If we set p = 0.8, this results in the following transition matrix. $\begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}$ We can visualize the results if we execute a large number of runs and draw histograms of the resulting positions 100, 200, 300, … steps. The result will look roughly like this (for this image, I have used p = 0.8 and N=10): We see that after a few hundred steps, this seems to converge – actually this starts to look very much like a uniform distribution. As we already know that the distribution after n steps is given by the matrix Kn, it makes sense to look at high powers of this matrix as well. To visualize this, I will use a method that I have seen in MacKays excellent book and that works as follows. Visualizing one row of a matrix is not difficult. We can plot the entries in a two-dimensional diagram, where the x-axis corresponds to the column index and the y-axis corresponds to the value. A flat line is then a row in which all entries have the same value. To display a full matrix, we use colors – we assign one color to each row index and plot the individual rows as just described. If we do this for the powers of the matrix K with the parameters p = 0.8 and N = 10 (and only chose some rows, for instance 1,4,7, to not run out of colors…) we obtain an image like the one below. We can clearly see that the powers of the matrix K converge towards a matrix $K^\infty = \lim_{n \rightarrow \infty} K^n$ where all entries in each row seem to have the same value. We have seen that the distribution after n steps assuming an initial distribution described by a row vector $\mu$ is $\mu K^n$. Therefore the limit distribution is $\mu K^\infty$. If we take $\mu$ to be a unit vector, we find that the rows of the matrix $K^\infty$ do actually represent the limit distribution of all chains that have started at a specific point i of the state space. Therefore the entries in the rows need to sum up to one, and thus, if they are all equal, need to be equal to 1 / N. If you calculate and print out high powers of the matrix K, you will in fact see that they approach the matrix where all entries are 0.1 (as we have chosen N = 10 in this example). To completes our short introduction into Markov chains. We have seen that Markov chains model stochastic processes in discrete time in which the state at step t+1 depends only on the state at step t and the dependency is given by a function independent of the current time. In finite state spaces, Markov chains are described by a transition matrix K. The i-th row of the matrix Kn is the distribution of chains starting at point i after n steps. Consequently, converge properties of the Markov chains can be related to converge of high powers Kn and the apparatus of linear algebra can be applied. In the next post, we will learn more about convergence – how it can be made precise, and how we can tell whether a given Markov chain converges. This will then allow us to construct Markov chains that converge towards a given target distribution and use them for sampling. ## Monte Carlo methods and Markov chains – an introduction In our short series on machine learning, we have already applied sampling methods several times. We have used and implemented Gibbs sampling, and so far we have simply accepted that the approach works. Time to look at this in a bit more detail in order to understand why it works and what the limitations of the algorithm are. Regardless of whether you want to simulate ferromagnetic behavior in an Ising model, run a Hopfield network or train a Boltzmann machine, the fundamental problem that we have to solve is always the same. We are given a probability distribution P living on some state space X, and we are trying to create a sample, i.e. a set of points in the state space such that the probability for a point x to appear in this sample is equal to the probability P(x) given by the probability distribution. The naive approach to this is simple: visit every point x in the state space and include that point with probability P(x). However, it is clear that with a large state space, this approach is not computationally feasible. In the example of a Boltzmann machine trained on handwritten digits with 28 x 28 pixels, our state space has 2784 elements, and there is no way we can visit them all one by one. Instead, we would need something like a randomized walk through the state space. We could start with same randomly chosen state X0, then – using a randomized transition rule – move on to a point X1 and so forth. Intuitively, we want to select our transition rule in such a way that the state space elements Xi selected in this way form a sample, i.e. such that our chain of state space locations visits regions with large values for P(x) more often than regions with low values of P(x). Thus we would systematically ignore regions of the state space with low probability which would greatly reduce the number of states that we have to visit to obtain a valid sample. So, from a mathematical point of view, we consider a sequence of random variables Xi such that Xi+1 is related to Xi by some randomized transition rule. We also assume that this rule does not depend on the index i which is usually called the time. Thus we have a sequence of random variables Xi which is not independent, but almost independent – Xi+1 depends only on Xi and in way that itself does not vary with i. This is called a Markov chain (more precisely, a time homogeneous Markov chain). Let us consider an example to illustrate the idea. As our state space, we choose the space or real numbers. We fix a starting value, say X0=0, and we obtain the next value by adding a number that we draw from a standard normal distribution. Thus, mathematically, we assume that Wn are identically distributed and independent random variables, all distributed according to the standard normal distribution, and set $X_{n+1} = X_n + W_n$ This is a Markov chain: the value Xn+1 depends only on Xn, not on any earlier elements of the chain. The transition rule is randomized, but itself does not depend on the time step n – all Wn have the same distribution. Let us implement this in Python to see how it works (the full notebook can be downloaded here). Sorry, something went wrong. Reload? Sorry, we cannot display this file. Sorry, this file is invalid so it cannot be displayed. Here we have created and displayed three different random walks. All of them start at the same point (zero), and all of them follow the same transformation rule, but as the transformation rule is stochastic in nature, they all develop differently. Now let us try to turn the view on this upside down. This time, we execute a larger number – 1000 – of random walks with 5000 steps each. But instead of plotting the sequence of points Xi for every walk, we display the distribution of the last point of each walk, i.e. we plot the distribution of the random variable X4999. Sorry, something went wrong. Reload? Sorry, we cannot display this file. Sorry, this file is invalid so it cannot be displayed. This does in fact look more familiar. We see that most walks end up being close to zero at the end – steps in the positive direction and steps in the negative direction cancel each other. Only very few walks end up at an extreme position close to plus or minus 200 – this is not surprising as well, to arrive at an extreme point, we would need to draw 5000 times in a row an extreme value from the random normal distribution, which is a rather unlikely chain of events. In this case, the distribution does actually not converge if we increase the number of steps – you can try this out and play with different values, i.e. replace 5000 by 50000 (this will run some time) and look at the distribution of X49999 – you will see that this is now spread out to roughly plus / minus 750 (in fact, the distribution is obtained as a multiple convolution of the standard normal distribution with itself and thus is again a normal distribution). Even though the distribution does not converge, we have been able to sample from a specific distribution – in this case the distribution after 5000 steps – using only the ability to sample from a different distribution – in this case the random normal distribution. Obviously, in this special case, the result is trivial, but the principle that we have found looks interesting. Can we generalize this approach to obtain sampling methods for target distributions that are otherwise intractable? Now this is exactly the idea behind the sampling approach that is commonly known as Markov chain Monte Carlo (MCMC) and which has become very popular, with applications to complex simulations in theoretical physics, to machine learning and even asset pricing and value-at-risk calculations. So let us summarize how the MCMC approach works. Given a target distribution P(x), we first construct a Markov chain that converges to that target distribution. Once we have that, we can simulate a large number of runs and use the resulting points as our sample (in fact, in many cases we can also do with one run only, as we will see later). Thus in order to utilize Markov chains for sampling, we would need to understand under what conditions a Markov chain converges and if it converges, how we can relate the target distribution to the transformation rule. We will look into these points in more detail in future posts in this series. ## Training a restricted Boltzmann machine on a GPU with TensorFlow During the second half of the last decade, researchers have started to exploit the impressive capabilities of graphical processing units (GPUs) to speed up the execution of various machine learning algorithms (see for instance [1] and [2] and the references therein). Compared to a standard CPU, modern GPUs offer a breathtaking degree of parallelization – one of NVIDIAs current flagships, the Tesla V100, offers more than 5.000 CUDA cores that can perform work in parallel. As training and evaluating neural networks involves many floating operations on large matrices, they can benefit heavily from the special capabilities that a GPU provides. So how can we make our code execute on a GPU? Of course you could program directly against the CUDA interface or similar interfaces like OpenCL. But specifically for the purposes of machine learning, there are easier options – over the last years, several open source frameworks like Theano, Torch, MXNet or TensorFlow have become available that make it comparatively easy to leverage a GPU for machine learning. In this post, I will use the TensorFlow framework, simply because so far this is the only one of these frameworks that I have used (though MXNet looks very interesting too and I might try that out and create a post on it at some point in the future). It takes some time to get used to the programming model of TensorFlow which is radically different from the usual imparative programming style. As an example, let us suppose we wanted to add two matrices. In Python, using numpy, this would look as follows. import numpy as np a = np.matrix([[0, 1], [1, 0]]) b = np.matrix([[1, 0], [0, 1]]) c = a + b print(c)  This program is described by a sequence of instructions (let us ignore the fact for a moment that these are of course functions that we call – ultimately, functions are composed of instructions). When we execute this program, the instructions are processed one by one. First, we assign a value to the variable a, then we assign a value to a variable b, then we add these two values and assign the result to a variable c and finally we print out the value of c. The programming model behind TensorFlow (and other frameworks like Theano) is fundamentally different. Instead of describing a program as a sequence of instructions, the calculations are organized as a graph. The nodes in this graph correspond to operations. The edges joining the nodes represent the flow of data between the operations. In TensorFlow, data is always represented as a tensor, so the edges in the graph are tensors. An operation consumes data from its inputs, processes it and forwards it to the next operation in the graph as its output. A program using TensorFlow typically consists of two phases. In the first phase, we build the graph, i.e. we define the operations and their inputs and outputs that make up the calculation that we want to perform. However, in this phase, no calculations are actually performed. Instead, this happens in the second phase when we actually run the graph. For that purpose, we create a session. Roughly speaking, a session defines an environment in which a graph can be executed. Once the session has been defined, we can invoke its run method. To the run method, we pass as an argument the operation in the graph that we want to execute. The run method will then trace the graph backwards and evaluate all operations that provide input to our target operation recursively, i.e. it will identify the subgraph that needs to be executed to evaluate our target operation. Let us again use the example of a simple addition to illustrate this. The source code looks as follows. import tensorflow as tf # # Build the model # a = tf.constant([[0, 1], [1, 0]], name="a") b = tf.constant([[1, 0], [0, 1]], name="b") c = tf.add(a,b, name="c") # # Create a session and run it # session = tf.Session() print(session.run(c))  First, we import the tensorflow library itself. Then, in the next three lines, we build the graph. We define three nodes in the graph. The first two nodes are special operations that output simply a constant value. The third operation is the operation that performs the actual addition and uses the previously defined operations as input. Thus our final graph has three nodes and two edges, as shown below.h In the next line, we create a TensorFlow session which we then run. The argument specifies which operation we want to execute and therefore determines which part of the graph we will actually run. The output of the run method is an ordinary numpy array which we then print out. Let us now look at an example which is slightly more complicated. In the PCD algorithm, we can compute the contribution of the negative phase to the weight updates as follows. E = expit(self.beta*(np.matmul(S0, self.W) + self.c)) pos = np.tensordot(S0, E, axes=((0),(0)))  Here S0 is a batch from the sample set, W is the current value of the weights and c is the current value of the bias. In TensorFlow, the code to build the corresponding part of the model looks quite similar. S0 = tf.placeholder(name="S0", shape=[batch_size, self.visible], dtype=tf.float32) W = tf.get_variable(name="W", dtype=tf.float32, shape=[self.visible, self.hidden], initializer = tf.zeros_initializer(), trainable=False) c = tf.get_variable(name="c", dtype=tf.float32, shape=[1, self.hidden], initializer = tf.zeros_initializer(), trainable=False) E = tf.sigmoid(self.beta*(tf.matmul(S0, W) + c), name="E") pos = tf.tensordot(S0, E, axes=[[0],[0]], name="pos")  The first element that we define – S0 – is a so called placeholder. This is a bit like a constant, with the difference that its value can be specified per run, using an additional argument called feed dictionary to the Session.run method. The next two elements that we define are variables. Variables are similar to operations – they represent nodes in the network and provide an output, but have no input. Instead, they have a certain value and feed that value as outputs to other operations. We then use the built-in tensorflow operations sigmoid and tensordot to calculate the expectation values of the visible units and the positive phase. The full model to train a restricted Boltzmann machine is of course a bit more complicated. TensorFlow comes with a very useful device called TensorBoard that can be used to visualize a graph constructed in TensorFlow. The image below has been created using TensorFlow and shows the full graph of our restricted Boltzmann machine. TensorBoard offers the option to combine operations into groups which are then collapsed in the visual representation. In the image above, all groups are collapsed except the group representing the contribution from the positive phase. We can clearly see the flow of data as described above – we first multiply S0 and W, then add c to the result, multiply this by a constant (the inverse temperature, called x in the diagram) and then apply the sigmoid operation that we have called E. The result is then fed into other, collapsed groups like the group delta which holds the part of the model responsible for calculating the weight updates. I will not go through the full source code that you can find on GitHub as usual – you will probably find the well written tutorial on the TensorFlow homepage useful when going through this. Instead, let us play around a bit with the result. As the PC that is under my desk is almost seven years old and does not have a modern GPU, I did use a p2.xlarge instance from Amazon EC2 which gave me access to a Tesla K80 GPU and four Intel Xeon E5-2686 cores running at 2.3 GHz (be careful – this instance type is not covered by the free usage tier, so that will cost you a few dollars). I used the Amazon provided Deep Learning AMI based on Ubuntu 16.04. After logging into the instance, we first have to complete a few preparational steps. $ source activate tensorflow_p36
$git clone http://www.github.com/christianb93/MachineLearning.git$ cd MachineLearning
$export MPLBACKEND="AGG"$ conda install scikit-learn
$python3 RBM.py --algorithm=PCDTF  Here we activate the pre-configured TensorFlow environment, download the source code from GitHub, set the environment variable to define our Matplotlib backend, and download and install some required packages. Then we do a first run with the BAS dataset to verify that everything works. If that is the case, we can run the actual MNIST training and sampling. $ python3 RBM.py --N=28 --data=MNIST --save=1 --hidden=128 --pattern=256 --batch_size=128 --epochs=40000 --run_samples=1 --sample_size=6,6 --beta=1.0 --sample=200000 --algorithm=PCDTF --precision=32


This produced the following sample of 6 x 6 digits.

The execution took roughly 5 minutes – 2 minutes for the training phase and 3 minutes for the sampling phase. During the training, the GPU utilization (captured with nvidia-smi -l 2) was at around 57% and stayed in that range during the sampling phase.

A second run using the switch --precision=64 to set the floating point precision to 64 bits did not substantially change the outcome or the performance.

Then a run with the same parameters was done in pure Python running on the four CPU cores provided by the p2.xlarge instance (--algorithm=PCD). During the training phase, the top command showed a CPU utilization of 400%, i.e. all four cores where at 100%. The utilization stayed in that range during the sampling phase. The training took 10:20 minutes, the sampling 8 minutes. Thus the total run time was 18 minutes compared to 5 minutes – a factor of 360%.

Following the advice on this post, I then played a bit with the settings of the GPU and adjusted the clock rates and the auto boost mode as follows.

sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac 2505,875


That brought the GPU utilization down to a bit less than 50%, but had a comparatively small impact on the run times which now were 1:40 min (instead of 2 min) for training and 2:30 min (instead of 3 min) for sampling. So the total run time was now a bit more than 4 minutes, which is a speed up of roughly 20% compared to the default settings. Compared to the CPU, we have now reached a speed up of almost 4,5.

Next, let us compare this to the run time on two CPUs only. To measure that, I grabbed an instance of the t2.large machine type that comes with 2 CPUs – according to /proc/cpuinfo, it is equipped with two Intel Xeon E5-2676 CPUs at 2.40GHz. Interestingly, the training phase only took roughly 8 minutes on that machine, which is even a bit faster than on the p2.xlarge which has four cores. The sampling phase was faster as well, taking only 6 minutes instead of 8 minutes. It seems that adding more CPUs increases the overhead for the synchronisation between the cores drastically so that it results in a performance penalty instead of a performance improvement. To verify this, I did a run on a p2.8xlarge with 32 CPUs and got a similar result – training took 9 minutes, sampling 6:50 minutes.

Finally, I could not resist the temptation to try this out on a more advanced GPU enabled machine. So I got a p3.2xlarge instance which contains one of the relatively new Tesla V100 GPUs. I did again adjust the application clocks using

sudo nvidia-smi -ac 877,1530


With these settings, one execution now took only about 1:20 minutes for the training and 1:50 min for the sampling. However, the GPU utilization was only at 30% – so we have reached a point where just having a faster GPU does not lead to a significant speed advantage any more. The following table summarizes the results of the various measurements.

Instance Run time training Run time sampling
p3.2xlarge (Tesla V100) 1:20 min 1:40 min
p2.large (Tesla K80) 1:40 min 2:30 min
p2.large (4 x CPU) 10 min 8 min
p2.8xlarge (32 x CPU) 9 min 6:50 min
t2.large (2 x CPU) 8 min 6 min

Of course we could now start to optimize the implementation. For the training phase, I assume that the bottleneck that limits GPU utilization is the use of the feed dictionary mechanism which could be replaced by queues to avoid overhead of switching back between CPU and GPU. During the sampling phase, we could also try to reduce the relative overhead of the run method by combining a certain number of steps – say 10 – into the graph and thus reducing the number of iterations that happen outside of the model. It would be interesting to play with this and see whether we can improve the performance significantly. But this is already a long post, so I will leave this for later…

### References

1. R. Raina, A. Madhavan, A. Ng, Large-scale Deep Unsupervised Learning using Graphics Processors, Proceedings of the 26 th International Conference on Machine Learning (2009)
2. K. Chellapilla, S. Puri , P. Simard, High Performance Convolutional Neural Networks for Document Processing, International Workshop on Frontiers in Handwriting Recognition (2006)