More on Paperspace Gradient

It's been a few days since I started to play with Paperspace, and I have come across a couple of interesting features on the platform – enough for a second post on this topic.

First, Git integration. Recall that the usual process is to zip the current working directory and submit the resulting file along with the job. The ZIP file is then unzipped in the container in which the job is running, and its contents constitute the working directory. However, if you want to run code that requires, for instance, custom libraries, it is much easier to instruct Paperspace to get the contents of the working directory from GitHub. You can do that by supplying a Git URL using the --workspace switch. The example below, for instance, instructs Paperspace to pull my code for an RBM from GitHub and to run it as a job.

#!/bin/sh
#
# Run the RBM as a job on Paperspace. Assume that you have the paperspace NodeJS
# CLI and have done a paperspace login before to store your credentials
#
#
~/node_modules/.bin/paperspace jobs create  \
        --workspace "git+https://github.com/christianb93/MachineLearning" \
        --command "export MPLBACKEND=AGG ; python3 RBM.py \
        --N=28 --data=MNIST \
        --save=1 \
        --tmpdir=/artifacts \
        --hidden=128 \
        --pattern=256 --batch_size=128 \
        --epochs=40000 \
        --run_samples=1 \
        --sample_size=6,6 \
        --beta=1.0 --sample=200000 \
        --algorithm=PCDTF --precision=32" \
        --machineType K80 \
        --container "paperspace/tensorflow-python" \
        --project "MachineLearning"

Be careful – the URL must be spelled exactly like this to be recognized as a Git URL, i.e. “git+https” followed by the hostname without the “www”. If you use http instead of https, or http://www.github.com instead of github.com, the job will fail (the documentation could be better at this point, and I even had to look at the source code of the CLI to figure out the syntax). This is a nice feature: using it along with the job logs, I can easily reconstruct which version of the code has actually been executed, and it supports working in a team that shares GitHub repositories.

Quite recently, Paperspace has apparently also added the option to use persistent storage in jobs to store data across job runs (see this announcement). Theoretically, the storage should be shared between notebooks and jobs in the same region, but as I have not yet found out how to start a notebook in a specific region, I could not try this out.

Another feature that I liked is that the container that you specify can actually be any container from the Docker Hub, for instance ubuntu. The only restriction is that Paperspace seems to overwrite the entrypoint in any case and will try to run bash inside the container to finally execute the command that you provide, so containers that do not have bash in the standard execution path will not work. Still, you could use this to prepare your own containers, maybe with pre-installed data sets or libraries, and ask Paperspace to run them.

Finally, for those of us who are Python addicts, there is also a Python API for submitting and managing jobs on Paperspace. This API offers two ways to run a Python script on Paperspace. First, you can import the paperspace package into your script and then, inside the script, call paperspace.run(), as in the following example.

import paperspace
paperspace.run()
print('This will only be running on Paperspace')

What will happen behind the scenes is that the paperspace module takes your code, removes any occurrences of the paperspace package itself, puts the code into a temporary file and submits that as a job to Paperspace. You can then work with that job as with any other job, like monitoring it on the console or via the CLI.

That is nice and easy, but not everyone likes to hardcode the execution environment into the code. Fortunately, you can also simply import the paperspace package and use it to submit an arbitrary job, much like the Node.js-based CLI does. The code below demonstrates how to create a job using the Python API and download its output automatically (this script can also be found on GitHub).

import paperspace.jobs
from paperspace.login import apikey
import paperspace.config

import requests


#
# Define parameters
#
params = {}
# 
# We want to use GIT, so we use the parameter workspaceFileName
# instead of workspace
#
params['workspaceFileName'] = "git+https://github.com/christianb93/MachineLearning"
params['machineType'] = "K80"
params['command'] = "export MPLBACKEND=AGG ; python3 RBM.py \
                --N=28 --data=MNIST \
                --save=1 \
                --tmpdir=/artifacts \
                --hidden=128 \
                --pattern=256 --batch_size=128 \
                --epochs=40000 \
                --run_samples=1 \
                --sample_size=6,6 \
                --beta=1.0 --sample=200000 \
                --algorithm=PCDTF --precision=32"
params['container'] = 'paperspace/tensorflow-python'
params['project'] = "MachineLearning"
params['dest'] = "/tmp"

#
# Get API key
#
apiKey = apikey()
print("Using API key ", apiKey)
#
# Create the job. We do NOT use the create method as it cannot
# handle the GIT feature, but assemble the request ourselves
#
http_method = 'POST'
path = '/' + 'jobs' + '/' + 'createJob'


r = requests.request(http_method, paperspace.config.CONFIG_HOST + path,
                             headers={'x-api-key': apiKey},
                             params=params, files={})
job = r.json()
    
if 'id' not in job:
    print("Error, could not get jobId")
    print(job)
    exit(1)

jobId = job['id']
print("Started job with jobId ", jobId)
params['jobId']  = jobId

#
# Now poll until the job is complete
#

if job['state'] == 'Pending':
    print('Waiting for job to run...')
    job = paperspace.jobs.waitfor({'jobId': jobId, 'state': 'Running'})

print("Job is now running")
print("Use the following command to observe its logs: ~/node_modules/.bin/paperspace jobs logs --jobId ", jobId, "--tail")

job = paperspace.jobs.waitfor({'jobId': jobId, 'state': 'Stopped'})
print("Job is complete: ", job)

#
# Finally get artifacts
#
print("Downloading artifacts to directory ", params['dest'])
paperspace.jobs.artifactsGet(params)

There are some additional features of the Python API that I have not yet tried out. First, you can apparently specify an init script that will be run before the command that you provide (though the use of this is limited, as you could put it into your command as well). Second, and more importantly, you can provide a requirements file according to the pip standard to ask Paperspace to install any libraries that are not available in the container before running your command.
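Such a requirements file is an ordinary text file listing one package per line in the standard pip format. A hypothetical example – the package names and version numbers here are placeholders only:

# requirements.txt
scipy==1.1.0
scikit-learn>=0.19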

Overall, my impression is that these APIs make it comparatively easy to work with jobs on Paperspace. You can submit jobs, monitor them and get their outputs, and you enjoy the benefit that you are only billed for the actual duration of the job. So if you are interested in a job based execution environment for your Machine Learning models, it is definitely worth a try, even though it takes some time to get familiar with the environment.

First steps with Paperspace Gradient

So far, I have exclusively been using AWS EC2 when I needed access to a GPU – not because I have carefully compared the available offerings and taken a deliberate decision, but simply because I already had an EC2 account and know the platform.

However, I thought it would be interesting to try out other platforms as well. In this post, I will talk a bit about my experience with Paperspace. This provider has several offerings – Core, which is basically an IaaS service, and Gradient, which allows you to access Jupyter notebooks and run jobs in ready-made environments optimized for Machine Learning – and of course I wanted to try this.

It should be noted that some time has passed between trying this out for the first time (roughly in May) and the publication of this post in July, so bear with me if some of the details have changed in the meantime – Paperspace is still under development.

First steps

After signing up, you are routed to a page where you can choose between two products – Paperspace Core and Paperspace Gradient. I chose Gradient (after providing the requested credit card information). The first thing I tried was to bring up a Jupyter notebook.

When you select that option, you have to make two choices. First, Jupyter notebooks are started in Docker containers, and you have to pick one of the available containers. Second, and more important, you have to select a machine – you have a choice between several CPU based and several GPU based models with different fees associated with them.

After a few seconds, your notebook is up and running (with the base account, you can only have one notebook server at any point in time). If you hit “Open”, a new tab will open and you will see the usual Jupyter home screen.

Your notebook folder will be prepopulated with some tutorials. The one I tried first is one of the classical MNIST / CNN tutorials. Unfortunately, when I tried to run it, the kernel died several times in a row – not very encouraging (it worked two days later, and overall there seem to be a few sporadic errors that come and go over time…).

Next, I could not resist the temptation to open a terminal. The Docker image seems to be based on a very basic Ubuntu distribution. I could successfully do an apt-get update && apt-get install git. So you could probably start to download things and work directly from the console – but of course this is not really the idea.

After playing for some time with the notebook, you can – again on the Paperspace notebook console – stop your notebook (make sure to do this, as you will be charged while the notebook is running). Once the notebook has stopped, you can click on the little arrow to the right of the notebook name, which will give you the option to download any files in the notebook directory that you have created during your session.

Once stopped, there is no way to restart a notebook, but you can clone a notebook which will create a copy of the previous notebook and start that copy, so you can continue to work where you left off. This works, but is a bit tiresome as you have to delete the obsolete copy manually.

Jobs

The next thing I tried is to create a job. For that purpose, you will first have to install the Paperspace CLI which in turn requires node.js and npm. So here is what you need to do on Ubuntu:

$ cd ~
$ sudo apt install nodejs
$ sudo apt install npm
$ npm install paperspace-node
$ sudo ln -s /usr/bin/nodejs /usr/bin/node

This will create a directory node_modules in your home directory and within that directory, a directory .bin. To test the paperspace CLI, you can run a command like

$ ./node_modules/.bin/paperspace --version

Next, switch to an empty directory and in that directory, run

$ ~/node_modules/.bin/paperspace project init

This will initialize a new Paperspace project, i.e. it will create a subdirectory .ps_project containing a JSON configuration file. Next, you need an API key that you can get on your Paperspace home page. The API key is an authentication token that is used by the API – store it in a safe place.

Once we have that token, it is time to start our first job.

$ ~/node_modules/.bin/paperspace jobs create --container Test-Container --command "nvidia-smi" --apiKey "xxxxx" --machineType K80

where xxxxx needs to be replaced by your API key. Instead of providing your API key with every command, you can also run

$ ~/node_modules/.bin/paperspace login

which will add your credentials to a file in the .paperspace directory in your home directory.

Essentially, what happens when you run a job is that the local directory and all its subdirectories will be zipped into a file, a container will be set up on a Paperspace server, the content of the ZIP file will be extracted into this container and the command that you have specified will execute.

You can now get a list of your processes and their status either on the Paperspace console, where you also have immediate access to the log output, or from the command line using

$ ~/node_modules/.bin/paperspace jobs list

At this point, I was again a bit disappointed – the job appeared to be running and was even displayed in the web console, but when it completed, I got an error “503 – Service unavailable” and no log output was provided. I raised a request with support, and roughly two hours later the submission suddenly worked – I have not found out whether support actually did anything or whether a part of the infrastructure was indeed down at this point in time.

As a temporary workaround, I managed to run a job by redirecting error output and standard output to a file. For instance, to run the script KMeans.py, I did use

$ ~/node_modules/.bin/paperspace jobs create  --command "export MPLBACKEND=AGG ; python KMeans.py > /artifacts/log 2>&1" --machineType C2

Once the job is complete, you can download whatever it has added to the directory /artifacts using

$ ~/node_modules/.bin/paperspace jobs artifactsGet --jobId "js26y3pi6jk056"

where the job ID is the one displayed by the create command. Finally, the command jobs destroy --jobId=... can be used to delete a job after execution.

So far, I have to admit that I am not so happy with what I have seen. I hit upon several issues in the standard setup, and when playing around, I found that it can take a long time for a job to be scheduled on a GPU. This depends very much on the machine type – my impression was that for machine types for which Paperspace uses GCP, like K80 or P100, your job will run quickly, but for other types like GPU+ it can take a long time. In addition, as everything is running in a container, the initial steps of a job can be time consuming. TensorFlow, for instance, is known to take longer when it is started for the first time, and in a fresh container, every time is a first time, so you will see a significant startup time. This gets worse if you need to download data sets, as this will have to be repeated with every new run. It is apparently not yet possible to mount a permanent volume into your container to avoid this, or to reuse a stopped container (update: as of July, Paperspace has announced that the /storage directory is a persistent storage available across notebooks and jobs, but I have not yet tried this).

But maybe this is a premature judgement, and I decided that I will still continue to try it out. In one of the next posts, I will present some more advanced features and in particular the Python API that Paperspace offers.

The EM algorithm and Gaussian mixture models – part II

In this post, I will discuss the general form of the EM algorithm to obtain a maximum likelihood estimator for a model with latent variables.

First, let us describe our model. We suppose that we are given some joint distribution of a random variable X (the observed variables) and a random variable Z (the latent variables), and that we are interested in maximizing the likelihood of an observed sample x of the visible variable X. We also assume that the joint distribution depends on a parameter \Theta – in practice, this could be weights, bias terms or any other parameters. To simplify things a bit, we will also assume that the latent variable is finite. Our aim is to maximize the log-likelihood, which we can – under these assumptions – express as follows.

\ln P(x |\Theta) = \ln \sum_z P(x,z | \Theta)

Even if the joint distribution belongs to some exponential family, the fact that we need to consider the logarithm of the sum and not the sum of the logarithms makes this expression and its gradient difficult to calculate. So let us look for a different approach.

To do this, let us assume that we are given a value \Theta of the parameter and let us try to understand how the likelihood changes if we pass from \Theta to some other value \Theta'. For that purpose, we introduce a term that is traditionally called Q and defined as follows (all this is a bit abstract, but will become clearer later when we do an example):

Q(\Theta'; \Theta) = E \left[  \ln P(x,z | \Theta')  | x, \Theta \right]

That looks a bit complicated, so let me explain the notation a bit. We want to define a function Q that will be a function of the new parameter value \Theta'. This function will, in addition, depend on the current value \Theta which we consider as a parameter.

The right hand side is an expectation value. In fact, for each value of the visible variable x and the parameter \Theta, we have a probability distribution on the space in which Z lives, given by the conditional probability of Z given x and \Theta. Whenever we have a function depending on z, we can therefore form the expectation value as usual, i.e. as the weighted sum over all function values, weighted by the probability of z. In particular, we can do this for the function \ln P(x,z | \Theta') of z. Thus the right hand side is, spelled out

E \left[  \ln P(x,z | \Theta')  | x, \Theta \right] = \sum_z \ln P(x,z | \Theta') P(z | x, \Theta)

That is now again a sum of logarithms, not a logarithm of a sum, and we can hope to be able to deal with this much better.

This is nice, but so far we have only introduced a rather complicated additional object – what do we gain? It turns out that, essentially, maximizing Q will effectively maximize the likelihood. Let us make this bold statement a bit more precise. Suppose we are able to iteratively maximize Q. Expressed formally, this would mean that we are able to find a sequence \Theta^0, \Theta^1, \dots of parameters such that when passing from \Theta^t to \Theta^{t+1}, the value of Q does not decrease, i.e.

Q(\Theta^{(t+1)};\Theta^{(t)}) \geq Q(\Theta^{(t)} ; \Theta^{(t)})

Then this very same sequence will be a sequence of parameters for which the log-likelihood is non-decreasing as well, i.e.

\ln P(x | \Theta^{(t+1)}) \geq \ln P(x | \Theta^{(t)})

I will not include a proof for this in this post (the proof identifies the difference between any two subsequent steps as a Kullback-Leibler divergence and makes use of the fact that a Kullback-Leibler divergence is never negative; you can find a proof in the references, in particular in [2], or in my more detailed notes on the EM algorithm and Gaussian mixture models, where I also briefly touch on convergence). Instead, let us try to understand how this can be used in practice.

Suppose that we have already constructed some parameter \Theta^t. The algorithm then proceeds in two steps. First, we calculate Q as a function of the new parameter \Theta^{t+1}. As Q is defined as an expectation value, this step is called the expectation step. Once we have that, we try to maximize Q, i.e. we try to find a value for the parameter \Theta^{t+1} such that Q(\Theta^{t+1}; \Theta^t) is maximized. This part of the algorithm is therefore called the maximization step. Then we start over, using \Theta^{t+1} as new starting point. Thus the algorithm alternates between an expectation and a maximization step, leading to the name EM algorithm.

To see how this works in practice, let us now return to our original example – Gaussian mixtures. Here the parameter \Theta is given by

\Theta = (\mu, \pi, \Sigma)

We consider a random variable X which has N components, each of which is a vector Xn in a d-dimensional space corresponding to one sample vector (so in the language of machine learning, N will be our batch size). Similarly, Z consists of N components zn which again are subject to the restriction that only one of the znk is allowed to be different from zero. We assume that the joint distribution of our model is given by

P(x , z) = \prod_n \prod_k \pi_k^{z_{nk}} {\mathcal N}(x_n ; \mu_k, \Sigma_k)^{z_{nk}}

Now we need to compute Q. The calculation is a bit lengthy, and I will skip most of it here (you can find the details in the references or my notes for this post). To state the result, we have to introduce a quantity called responsibility which is defined as follows.

r_{nk} = \frac{\pi_k {\mathcal N}(x_n ; \mu_k, \Sigma_k)} {\sum_{j} \pi_j {\mathcal N}(x_n ; \mu_j, \Sigma_j) }

From this definition, it is clear that the responsibility is always between zero and one, and in fact it has an interpretation as the (conditional) probability that sample n belongs to cluster k. Note that the responsibility is a function of the model parameters \mu, \pi and \Sigma. Using this notation, we can now write down the result of calculating the Q function:

Q(\Theta'; \Theta) = \sum_n \sum_k r_{nk} \ln \pi'_k + \sum_n \sum_k r_{nk} \ln {\mathcal N}(x_n ; \mu'_k, \Sigma'_k)

where the responsibilities are calculated using the old parameter set \Theta = (\mu, \pi, \Sigma). For a fixed \Theta, this is a function of the new parameters \mu', \pi' and \Sigma', and we can now try to maximize this function with respect to these parameters. This calculation is not difficult, but again a bit tiresome (and requires the use of Lagrange multipliers, as there is a constraint on the \pi_k), and I again refer to my notes for the details. When the dust settles, we obtain three simple expressions. First, the new values for the cluster means are given by

\mu'_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

This starts to look familiar – this is the same expression that we obtained for the cluster centres in the k-means algorithm! In fact, if we arrange the rnk as a matrix, the denominator is the sum across column k and the numerator is the weighted sum over all data points. However, there is one important difference – it is no longer true that, for a given n, only one of the rnk is different from zero. Instead, the rnk are soft assignments that model the probability that the data point xn belongs to cluster k.

To write down the expression for the new value of the covariance matrix, we again need a notation first. Set

N_k = \sum_n r_{nk}

which can be interpreted as the number of soft assignments of points to cluster k. With that notation, the new value for the covariance matrix is

\Sigma'_k = \frac{1}{N_k} \sum_n r_{nk} (x_n - {\mu'}_k)(x_n - {\mu'}_k)^T

Finally, we can use the method of Lagrange multipliers to maximize with respect to the weights \pi_k (which always need to sum up to one), and again obtain a rather simple expression for the new value.

\pi'_k = \frac{N_k}{N}

Again, this is intuitively very appealing – the probability to be in cluster k is updated to be the number of points with a soft assignment to k, divided by the total number of assignments.

We now have all the ingredients in place to apply the algorithm in practice. Let us summarize how this will work. First, we start with some initial value for the parameters – the weights \pi, the covariance matrices \Sigma_k and the means \mu_k – which could, for instance, be chosen randomly. Then, we calculate the responsibilities as above – essentially, this is the expectation step, as it amounts to finding Q.

In the M-step, we then use these responsibilities and the formulas above to calculate the new values of the weights, the means and the covariance matrices. This involves a few matrix operations, which can be nicely expressed using the operations provided by the numpy library.
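To make this concrete, here is a minimal numpy sketch of the two steps, using scipy to evaluate the Gaussian densities – a simplified version for illustration, not the exact code from my repository.

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    # responsibilities r_nk - the probability that sample n belongs to cluster k
    K = len(weights)
    r = np.array([weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
                  for k in range(K)]).T
    return r / r.sum(axis=1, keepdims=True)

def m_step(X, r):
    N, K = r.shape
    Nk = r.sum(axis=0)                  # number of soft assignments per cluster
    means = (r.T @ X) / Nk[:, None]     # weighted means of the data points
    covs = []
    for k in range(K):
        d = X - means[k]
        covs.append((r[:, k, None] * d).T @ d / Nk[k])
    weights = Nk / N                    # new cluster weights
    return weights, means, np.array(covs)

Iterating e_step and m_step until the parameters no longer change substantially then yields the maximum likelihood estimate.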

[Figure GMM: clustering results – k-means (top row) and EM algorithm (bottom row) applied to two sets of 500 sample points drawn from Gaussian mixture distributions]

In the example above, we have created two sets of 500 sample points from different Gaussian mixture distributions and then applied the k-means algorithm and the EM algorithm. In the top row, we see the results of running the k-means algorithm. The color indicates the result of the algorithm, the shape of the marker indicates the original cluster to which the point belongs. The bottom row displays the results of the EM algorithm, using the same pattern.

We see that while the first sample (diagrams on the left) can be clustered equally well by both algorithms, the k-means algorithm is not able to properly cluster the second sample (diagrams on the right), while the EM algorithm is still able to assign most of the points to the correct cluster.

If you want to run this yourself, you can – as always – find the source code on GitHub. When playing with the code and the parameters, you will notice that the results can differ substantially between two consecutive runs. This is due to the random choice of the initial parameters, which has a huge impact on convergence (and sometimes the code will even fail because the covariance matrix can become singular; I have not yet fixed this). Using the switch --data=Iris, you can also apply the EM algorithm to the Iris data set (you need to have a copy of the Iris data file in the current working directory) and find again that the results vary significantly with different starting points.

The EM algorithm has the advantage of being very general, and it can therefore be applied to a wide range of problems, not just Gaussian mixture models. A nice example is the Baum-Welch algorithm for training Hidden Markov models, which is actually an instance of the EM algorithm. The EM algorithm can even be applied to classical multi-layer feed-forward networks, treating the hidden units as latent variables – see [3] for an overview and some results. So clearly this algorithm should be part of every data scientist's toolbox.

References

1. C.M. Bishop, Pattern recognition and machine learning, Springer, New York 2006
2. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM-algorithm, Journ. Royal Stat. Soc. Series B. Vol. 39 No. 1 (1977), pp. 1-38
3. Shu-Kay Ng, G.J. McLachlan, Using the EM Algorithm to Train Neural Networks: Misconceptions and a New Algorithm for Multiclass Classification, IEEE Transactions on Neural Networks, Vol. 15, No. 3, May 2004

The EM algorithm and Gaussian mixture models – part I

In the last few posts on machine learning, we have looked in detail at restricted Boltzmann machines. RBMs are a prime example for unsupervised learning – they learn a given distribution and are able to extract features from a data set, without the need to label the data upfront.

However, there are of course many other, by now almost classical, algorithms for unsupervised learning. In this and the following posts, I will explain one of the most commonly applied algorithms in this field, namely the EM algorithm.

As a preparation, let us start with a very fundamental exercise – clustering. So suppose that we are given a data set which we suspect to consist of K clusters. We want to identify these clusters, i.e. to each point in the dataset, we want to assign the number of the cluster to which the point belongs.

More precisely, let us suppose that we are given a set {\mathcal D} = \{ x_1, \dots, x_N\} of points in some euclidean space and a number K of clusters. We want to identify the centre \mu_j of each cluster and then assign each data point xi to one of the \mu_j, namely to the \mu_j which is closest to xi. We can encode the assignment of data points to clusters in a matrix Rij where Rij = 1 if data point xi belongs to cluster j with centre \mu_j. Thus each row of R corresponds to a data point and is non-zero in exactly one column, where it is one. This is known as 1-of-K encoding.

For each value of i, the squared distance of xi to the centre of the cluster to which it is assigned is then given by

\sum_j R_{ij} \| x_i - \mu_j \|^2

where we note that only one of the summands is different from zero. Assigning the data points to the clusters and positioning the centre points of the clusters then amounts to minimizing the loss function

l(\mu, R) = \sum_i \sum_j R_{ij} \langle x_i - \mu_j , x_i - \mu_j \rangle

where the brackets denote the usual euclidean scalar product.

Now let us see how we can optimize this function. For a fixed vector \mu of cluster centres, we can easily minimize the loss with respect to R by assigning each data point to the cluster whose centre is closest to it. Thus we set

R_{ij} =  \begin{cases} 1 & \text{if} \; j = \arg \min_k \| x_i - \mu_k \|^2 \\ 0 & \text{otherwise} \end{cases}

Conversely, given a matrix R which we hold fixed, it is likewise easy to minimize the loss with respect to the \mu_j. As there are no further constraints on the \mu_j, we can find the minimum by differentiating the loss function. We find that the derivative is zero if and only if

0 = \sum_i R_{ij}(x_i - \mu_j)

holds for all j. Assuming for a moment that each cluster has at least one data point assigned to it, i.e. that none of the columns of R contains zeroes only, we can solve this by

\mu_j = \frac{\sum_i x_i R_{ij}}{\sum_i R_{ij}}

which is also easily seen to be a local minimum by taking the second derivative.

Note that this term has an obvious geometric interpretation. The denominator in this expression is the number of data points that are currently assigned to cluster j. The numerator is the sum of all data points assigned to this cluster. Thus the minimum is the mean value of the positions of the data points that are currently assigned to the cluster (a physicist would call this the center of gravity of the cluster). This is the reason why this method is called the k-means algorithm. If no data points are assigned to the cluster, the loss function does not depend on \mu_j and we can choose any value that we want.

The algorithm now works as follows. First, we choose centers \mu_j at will – it is common to use some of the data points for this purpose. Then we assign each data point to a center using the formula for Rij derived above. We then adjust the center points \mu_j and reallocate the points to the new centers and so forth.

From our discussion above, it is clear that each full iteration of this procedure will reduce the loss function. This does of course not imply that the algorithm converges to a global minimum, as it might get stuck in local minima or saddle points. In practice, the algorithm is executed until the cluster assignments and centres no longer change substantially, or for a given number of steps.
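The following is a minimal numpy implementation of this procedure – a sketch for illustration, not the exact code in my repository.

import numpy as np

def k_means(X, K, iterations=10):
    # pick K of the data points as initial cluster centres
    mu = X[np.random.choice(len(X), K, replace=False)].copy()
    for _ in range(iterations):
        # assignment step: find the index of the closest centre for each point
        dist = ((X[:, None, :] - mu[None, :, :])**2).sum(axis=2)
        R = dist.argmin(axis=1)
        # update step: move each centre to the mean of its assigned points
        for j in range(K):
            if np.any(R == j):
                mu[j] = X[R == j].mean(axis=0)
    return mu, R

Here R holds, for each data point, the index of the cluster to which it is assigned, which is equivalent to the 1-of-K encoded matrix used above.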

[Figure KMeans: results of the k-means algorithm applied to two sets of sample points drawn from Gaussian distributions]

The diagram above shows the result of applying this algorithm to a set of points that is organized in two clusters. To generate the data, 100 samples were drawn from 2-dimensional Gaussian distributions. On the left hand side, half of the samples were centered at the point (5,1), the other samples at (1,4), and both had standard deviation 0.6. On the right hand side, the same centres were used, but only a small number of samples were drawn from the second distribution, which had standard deviation 0.5, whereas most samples came from the first distribution with standard deviation 0.8. Then 10 iterations of the k-means algorithm were applied to the resulting sample set. The points in the sample were then plotted with a color indicating the assignment to clusters resulting from the matrix R. The actual cluster from which a sample was drawn is indicated by the shape – a point is cluster one, a diamond is cluster two.

We see that in the example on the left hand side, the algorithm has correctly assigned all points to their original cluster. For the example on the right hand side, the situation is different – the algorithm did actually assign significantly more points to the blue cluster, i.e. there are many wrong assignments (blue points). This does not change substantially if we increase the number of iterations – even with 100 iterations, we still see many wrong assignments for this example. If you want to reproduce the example and play a bit with the parameters, you can get the source code for a straightforward implementation in Python from my GitHub repository.

The K-means algorithm is very simple and straightforward, but seems to have limitations because it cannot determine the shape of a distribution, only its center. It turns out that K-means is a special case of a broader class of algorithms that we now study, hoping to find more robust algorithms.

In our example, we have generated sample data as a combination of two Gaussian distributions. What if we just change the game and simply assume that our data is of this type? In other words, we assume an underlying probabilistic model for our sample data. Once we have that, we can pull all the tricks that statistical inference can offer – we can calculate maximum likelihood and maximum posterior probability, we can try to determine the posterior or even sample from the model.

Thus let us try to fit our data to a model of the form

P(x) = \sum_k \pi_k {\mathcal N}(x ; \mu_k, \Sigma_k)

where {\mathcal N}(x ; \mu_k, \Sigma_k) is a multivariate Gaussian distribution with mean \mu_k and covariance matrix \Sigma_k, i.e.

{\mathcal N}(x ; \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma_k)}} \exp \left( -\frac{1}{2} \langle x - \mu_k, \Sigma_k^{-1}(x - \mu_k)\rangle \right)

and the \pi_k are non-negative real numbers with \sum_k \pi_k = 1.

Let us now see how this equation looks like if we use a 1-of-K encoding. We introduce a random variable Z that takes values in \{ 0, 1\}^K with the additional constraint that only one of the Zk is allowed to be different from zero. We interpret \pi_k as the probability

\pi_k = P(Z_k = 1)

Then

P(Z = z) = \prod_k \pi_k ^{z_k}

and we can write

P(X=x) = \sum_z P(Z=z) P(X=x | Z=z) = \sum_z P(x,z)

where P(z) is as above and

P(X = x | Z = z) = \prod_k {\mathcal N}(x ; \mu_k, \Sigma_k)^{z_k}

This is a very general type of distribution which reflects a common pattern in machine learning, namely the introduction of so called latent or hidden variables. In general, latent or hidden variables are random variables that are a part of the model which cannot be observed, i.e. are not part of the input or the output of the model. We have seen latent variables in action several times – adding hidden units to a neural network introduces latent variables and makes the model much more powerful, the hidden layer of a restricted Boltzmann machine serves as memory to learn features, and latent variables that are used to construct a mixture of Gaussians as above allow us to model a much broader class of distributions than a model with just one Gaussian.

Intuitively, it is also clear how to sample from such a model. In a first step, we sample from Z, in other words we determine the index k randomly according to the distribution given by the \pi_k. Once we have k, we then sample from the conditional distribution P(X = x | Z = z). As we already have k, this amounts to sampling from the Gaussian distribution {\mathcal N}(x ; \mu_k, \Sigma_k) with mean \mu_k and covariance matrix \Sigma_k.
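Expressed in Python, this two-step sampling procedure could look like the following minimal sketch.

import numpy as np

def sample_mixture(weights, means, covs, N):
    X = np.zeros((N, means.shape[1]))
    for n in range(N):
        # step 1: sample the latent variable, i.e. choose the component k
        k = np.random.choice(len(weights), p=weights)
        # step 2: sample from the Gaussian with mean mu_k and covariance Sigma_k
        X[n] = np.random.multivariate_normal(means[k], covs[k])
    return X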

[Figure GaussianMixture: samples drawn from a one-dimensional Gaussian mixture with K=2 (left, histogram) and from a two-dimensional Gaussian mixture (right)]

In the example above, we have first applied this procedure to a one-dimensional Gaussian mixture with K=2. The histogram shows the result of the sampling procedure, the solid lines are the probability density functions of the two individual Gaussian distributions, multiplied by the respective weight. On the right hand side, the result of sampling a Gaussian mixture in two dimensions is displayed (you can download the notebook used to produce this example here).

When we now try to fit this model to the sample data, we again have to calculate the likelihood function and try to maximize it. However, it turns out that the gradient of the likelihood is not so easy to calculate, and it might very well be that there is no closed form solution or that we obtain rather complicated expressions for the gradient. Fortunately, there is an alternative that works for very general models and does not require knowledge of the gradient – the EM algorithm. In the next post, I will present the algorithm in this general setup, before we apply it to our original problem and compare the results with the k-means algorithm.

References

1. C.M. Bishop, Pattern recognition and machine learning, Springer, New York 2006
2. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM-algorithm, Journ. Royal Stat. Soc. Series B. Vol. 39 No. 1 (1977), pp. 1-38

Controlling Docker containers with Python

In the last few posts on the bitcoin blockchain, I have made extensive use of Docker containers to quickly set up test environments. However, it turned out to be a bit tiresome to run the containers, attach to them, execute commands etc. to get them into a defined state. Time to learn how this can be automated easily using our beloved Python – thanks to the wonderful Docker Python SDK.

This package uses the Docker REST API and offers an intuitive object model to represent containers, images, networks and so on. The API can be made available via a TCP port in the Docker configuration, but be very careful if you do this – everybody who has access to that port will have full control over your Docker engine. Fortunately, the Python package can also connect via the standard Unix domain socket on the file system, which is not exposed to the outside world.

As always, you need to install the package first using

$ pip install docker

Let us now go through some objects and methods in the API one by one. At the end of the post, I will show you a complete Python notebook that orchestrates the bitcoind containers that we have used in our tests in the bitcoin series.

The first object we have to discuss is the client. Essentially, a client encapsulates a connection to the Docker engine. Creating a client with the default configuration is very easy.

import docker
client = docker.client.from_env()

The client object has only very few methods like client.info() or client.version() that return global status information. The more interesting part of this object are the collections attached to it. Let us start with images, which can be accessed via client.images. To retrieve a specific instance, we can use client.images.list(), passing as an argument a name or a filter. For instance, when we know that there is exactly one image called “alice:latest”, we can get a reference to it as follows.

alice = client.images.list("alice:latest")[0]

Other commands, like pull or push, are the equivalents of the corresponding docker CLI commands.

Let us now turn to the client.containers collection. Maybe the most important method that this collection offers is the run method. For instance, to run a container and capture its output, use

output = client.containers.run("alpine", "ls", auto_remove=True)

This will run a container based on the alpine image and pass “ls” as an argument (which will effectively execute ls, as the alpine container will run a shell for you) and return the output as a sequence of bytes. The container will be removed after execution is complete.

By setting detach=True, the container will run in detached mode and the call will return immediately. In this case, the returned object is not the output, but a reference to the created container which you can use later to work with the container. If, for instance, you wanted to start an instance of the alice container, you could do that using

alice = client.containers.run("alice:latest", auto_remove=True, detach=True)

You could then use the returned handle to inspect the logs (alice.logs()), to commit, to exec code in the container similar to the docker exec command (alice.exec_run) and so forth.

To demonstrate the possibilities that you have, let us look at an example. The following notebook will start two instances (alice, bob) of the bitcoin-alpine image that you have hopefully built when following my series on bitcoin. It then uses the collection client.networks to figure out to which IP address on the bridge network bob is connected. Then we attach to the alice container and run bitcoin-cli in this container to instruct the bitcoind to connect to the instance running in the container bob.

We then use the bitcoin-cli running inside the container alice to move the blockchain into a defined state – we mine a few blocks, import a known private key into the wallet, transfer a certain amount to a defined address and mine a few additional blocks to confirm the transfer.

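The core logic of this notebook could look like the following minimal sketch, assuming the bitcoin-alpine image from the earlier posts and the RPC credentials user/password used throughout this series.

import docker

client = docker.from_env()

# start two detached instances of the bitcoin-alpine image
alice = client.containers.run("bitcoin-alpine", name="alice",
                              detach=True, auto_remove=True)
bob = client.containers.run("bitcoin-alpine", name="bob",
                            detach=True, auto_remove=True)

# find the IP address of bob on the default bridge network
bridge = client.networks.get("bridge")
bridge.reload()
bob_ip = bridge.attrs['Containers'][bob.id]['IPv4Address'].split('/')[0]

# instruct the bitcoind in the alice container to connect to bob
cmd = 'bitcoin-cli -regtest -rpcuser=user -rpcpassword=password ' \
      'addnode "%s" add' % bob_ip
alice.exec_run(cmd)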

Make sure to stop all containers again when you are done – it is comparatively easy to produce a large number of stopped containers if you are not careful, especially when using this for automated tests. I usually run the containers with the --rm flag on the command line or the auto_remove=True flag in Python to make sure that they are removed by the Docker engine automatically when they stop.

Of course nobody would use this just to run a few Docker containers with a defined network setup – there are much better tools like Docker Swarm or other container management solutions for this. However, the advantage of using the Python SDK is that we can interact with the containers, run commands, perform tests etc. And all of this can be integrated nicely into integration tests, for instance using the test fixtures provided by pytest. A fixture could bring up the environment, could be defined on module level or test level depending on the context, and can add a finalizer to shut down the environment again after the test has been executed. This allows for a very flexible test setup and offers a wide range of options for automated testing.
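As an illustration, here is a minimal sketch of such a fixture – the image name bitcoin-alpine is again the one from my bitcoin series, and the test itself is only a placeholder.

import docker
import pytest

@pytest.fixture(scope="module")
def bitcoind():
    # bring up the test environment once per test module
    client = docker.from_env()
    container = client.containers.run("bitcoin-alpine",
                                      detach=True, auto_remove=True)
    yield container
    # finalizer - tear the environment down again after the tests
    container.stop()

def test_container_is_up(bitcoind):
    bitcoind.reload()
    assert bitcoind.status == "running"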

This post could only give a very brief introduction to the Python Docker SDK, and we did not discuss pytest and fixtures in any detail – but I invite you to browse the Docker SDK documentation and the pytest documentation on fixtures, and I hope you enjoy playing with this!

How the number of bitcoins is limited

In some of the previous posts, we have already come across the file chainparams.cpp in the source code of the bitcoin reference client. It is interesting to go through this file and understand the meaning of the various parameters defined there. One of them should catch your attention:

class CMainParams : public CChainParams {
public:
    CMainParams() {
        strNetworkID = "main";
        consensus.nSubsidyHalvingInterval = 210000;

What does this parameter mean? It is in fact not used awfully often – apart from some unit tests, I could only locate it once, namely in validation.cpp.

CAmount GetBlockSubsidy(int nHeight, const Consensus::Params& consensusParams)
{
    int halvings = nHeight / consensusParams.nSubsidyHalvingInterval;
    // Force block reward to zero when right shift is undefined.
    if (halvings >= 64)
        return 0;

    CAmount nSubsidy = 50 * COIN;
    // Subsidy is cut in half every 210,000 blocks which will occur approximately every 4 years.
    nSubsidy >>= halvings;
    return nSubsidy;
}

The output of this function plays an important role when a new block is mined in miner.cpp – this is the amount (in Satoshi) that a miner earns in addition to the fees! Put differently, this is the amount of bitcoin that is created when a block is mined.

What this code tells us is that the amount of bitcoin added during mining starts with 50 and is divided by two every 210,000 blocks. So the total amount of bitcoin mined is given by the formula

210000 \cdot 50 + 210000 \cdot 25 + 210000 \cdot 12.5 + \dots = \sum_{n=0}^\infty 210000 \cdot 50 \cdot \frac{1}{2^n}

The mathematicians among us will recognize this as a geometric series

210000 \cdot 50 \cdot \sum_{n=0}^\infty q^n

with q = 0.5. This series converges, and its value is

210000 \cdot 50 \cdot 2 = 21 \cdot 10^{6}

Therefore the amount of bitcoin that is created by mining – and thus the overall supply of bitcoin – can never exceed roughly 21 million bitcoin. You might have heard that before: bitcoin is by design a currency with a controlled supply, which is guaranteed by the continuous reduction of the subsidy as the number of blocks increases (and not because the value of a bitcoin transaction is stored in a 64 bit integer – in fact, this would explain why the value of a single transaction output cannot exceed a certain value, but not why the total sum of all ever issued bitcoins is limited). This is the point in the source code that is responsible for this.

Of course I am cheating a bit – the value of a bitcoin is discrete, not a real number. Mining will stop once the value of the block subsidy falls below one Satoshi, as this is the smallest amount that can be represented. Let us see when this happens. The subsidy is given by the formula

s = \frac{5 \cdot 10^{9}}{2^n}

where n is obtained by dividing the block height (i.e. length of the chain) by 210000 and converting to an integer. Solving s = 1 for n, we obtain

n = \log_2 (5 \cdot 10^{9}) \approx 32.2

Therefore the bitcoin amount created with a block will drop to zero with block

m = 210000 \cdot 33 = 6930000

The total number of bitcoins created until then can be approximated (ignoring rounding) by the partial sum

210000 \cdot 50 \cdot \sum_{n=0}^{32} q^n = \frac{210000 \cdot 50 \cdot (1 - q^{33})}{1 - q}

which gives 20999999.997 bitcoin, i.e. almost exactly 21 million bitcoin as expected. We can also estimate when this will have happened. Looking at blockchain.info, we see that at the time of writing, approximately 513,000 blocks have already been mined, so we still need roughly 6,417,000 blocks. A block is generated roughly every 10 minutes, so there are 6 additional blocks per hour and therefore 52,560 blocks being added per year. Thus it will take roughly 122 years from now until all these blocks have been mined, i.e. this will happen somewhere around the year 2140. So still some time to go until then…

If you do not trust the math, you can also simulate this in a little Python program. In fact, this will give you a bit less than what we have calculated above, as the subsidy is rounded down to an integer number of Satoshi during the calculation, which our geometric series above does not properly reflect.

COIN = 10**8                 # number of Satoshi per BTC
nHeight = 0                  # current block height
satoshi = 0                  # total amount mined so far, in Satoshi
while True:
    halvings = nHeight // 210000
    subsidy = (50 * COIN) >> halvings    # block subsidy in Satoshi
    satoshi += subsidy
    if subsidy < 1:
        break
    nHeight += 1

print("Total bitcoin amount: ", satoshi / 10**8)

You can also easily determine how many bitcoin are available at each point in time. At 513,000 blocks, these are roughly 16.9 million BTC, which, at a price of 10,000 USD, corresponds to a market capitalization of roughly 170 billion USD.

Mining bitcoins with Python

In this post, we will learn to build a very simple miner in Python. Of course this miner will be comparatively slow and limited and only be useful in our test network, but it will hopefully help to explain the principles behind mining.

When we want to mine a block, we first need some information on the current state of the blockchain, like the hash of the current last block, the current value of the difficulty or the coin base value, i.e. the number of BTC that we earn when mining the block. When we are done building the block, we need to find a way to submit it into the bitcoin network so that it is accepted by all nodes and permanently added to the chain.

If we were member of a mining pool, there would be a mining server that would provide us the required information. As we want to build a solo mining script, we need to communicate with bitcoin core to get that information and to submit our final block. Fortunately, the RPC interface of bitcoin core offers methods to facilitate that communication that were introduced with BIP22.

First, there is the method getblocktemplate. It will deliver all the required information that we need to build a valid block, and it will even propose transactions that we should include in the block. These transactions are taken from the so-called mempool, which is a collection of transactions that the bitcoin server knows about and that have not yet been added to a block (see miner.cpp/BlockAssembler::addPackageTxs in the bitcoin core source code for details on how the selection process works).
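To get an idea of what this looks like, here is a minimal sketch of such a call using the requests package, assuming the regtest setup from the previous posts, i.e. a bitcoind listening on port 18332 with the RPC credentials user/password.

import requests

def rpc_call(method, params=None):
    # a plain JSON-RPC call to the RPC interface of bitcoin core
    payload = {"jsonrpc": "1.0", "id": "miner",
               "method": method, "params": params or []}
    r = requests.post("http://localhost:18332",
                      auth=("user", "password"), json=payload)
    return r.json()['result']

template = rpc_call("getblocktemplate")
print("Height of new block:  ", template['height'])
print("Previous block:       ", template['previousblockhash'])
print("Proposed transactions:", len(template['transactions']))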

If the client is done building the block, it can submit the final block using the method submitblock. This method will run a couple of checks on the block, for instance that it can be correctly decoded, that the first transaction – and only the first – is a coinbase transaction, that it is not a duplicate and that the proof-of-work is valid. If all the checks pass, it will add the block to the local copy of the blockchain. If a check fails, it will return a corresponding error code to the caller.

With that understanding, let us now write down the processing logic for our simple miner, assuming that we only want to mine one additional block. First, we will use getblocktemplate to get the basic parameters that we need and the transactions that we can include. Then we will create a new block and a new coinbase transaction. We then add the coinbase transaction, followed by all the other transactions, to the block.

Once we have that block, we enter a loop. Within the loop, we calculate the hash of the block and compare this against the target. If we can meet the target, we are done and submit the newly created block using submitblock. Otherwise, we increment the nonce and try again.
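In Python, the skeleton of this loop could look as follows. Note that build_block, serialize_header and encode_block are placeholders for the corresponding serialization routines in my btc package, not actual function names, and rpc_call is the helper sketched above.

import hashlib

template = rpc_call("getblocktemplate")
block = build_block(template)            # hypothetical helper
target = int(template['target'], 16)

while True:
    # hash the serialized block header twice with SHA256
    header = serialize_header(block)     # hypothetical helper
    h = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    # the hash, read as a little-endian integer, must not exceed the target
    if int.from_bytes(h, 'little') <= target:
        break
    block.nonce += 1                     # otherwise increment the nonce and retry

rpc_call("submitblock", [encode_block(block)])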

We have discussed most of these steps in some detail in the previous posts, with one exception – coinbase transactions. A coinbase transaction is, by definition, a transaction which generates bitcoin because it has valid outputs, but does not spend any UTXOs. Technically, a coinbase transaction is a transaction which (see CTransaction::IsCoinBase())

  • has exactly one transaction input
  • the previous transaction ID in this input is zero (i.e. a hexadecimal string consisting of 32 zeros “00”)
  • the index of the previous output is -1 (encoded as  0xFFFFFFFF)

As it does not refer to any output, the signature script of a coinbase transaction will never be executed. It can therefore essentially contain an arbitrary value. The only restriction defined by the protocol is described in BIP34, which defines that the first bytes of the signature script should be a valid script that consists of the height of the new block as pushed data. The remainder of the coinbase signature script (which is limited to 100 bytes in total) can be used by the miner at will.

Many miners use this freedom to solve a problem with the length of the nonce in the block header. Here, the nonce is a 32 bit value, which implies that a miner can try 2^32, i.e. roughly 4 billion, different combinations. Modern mining hardware based on ASICs can search that range within fractions of a second, and the current difficulty is so high that it is rather likely that no solution can be found by just changing the nonce. So we have to change other fields in the block header to try out more hashes.

What are good candidates for this? We could of course use the block creation time, but the bitcoin server validates this field and will reject the block if it deviates significantly from the current time. Instead, miners typically use the coinbase signature script as an extra nonce that they modify to increase the range of possible hashes. Therefore the fields after the height are often combinations of an extra nonce and additional data, like the name of the mining pool. Increasing the extra nonce is a bit less effective than increasing the nonce in the block header, as it changes the hash of the coinbase transaction and therefore forces us to recalculate the Merkle root – this is why it is most often implemented as an outer loop, with the inner loop being the increments of the nonce in the block header.

To illustrate this, let us look at an example. The coinbase signature script of the coinbase transaction in block #400020 is:

03941a060004c75ccf5604a070c007089dcd1424000202660a636b706f6f6c102f426974667572792f4249503130302f

If we decode this, we find that the first part is in fact a valid script and corresponds to the following sequence of instructions (keep in mind that all integers are encoded as little endian within the script):

OP_PUSH 400020
OP_0
OP_PUSH 1456430279
OP_PUSH 130052256
OP_PUSH 7350439741450669469

As specified by BIP34, the first pushed data is the height of that block as expected. After the OP_0, we see another push instruction, pushing the Unix system time corresponding to Thu Feb 25 20:57:59 2016, which is the creation time of the block.

The next pushed data is a bit less obvious. After looking at the source code of the mining software used, I assume that this is the number of nanoseconds within the second returned by the Unix system call clock_gettime. This is then followed by an eight byte integer (7350439741450669469) which is the extra nonce.

The next part of the signature script is not actually a valid script, but a string – a newline character (0xa), followed by the string “ckpool”. This is a fixed sequence of bytes that indicates the mining software used.

Finally, there is one last push operation which pushes the string “/Bitfury/BIP100/”, which tells us that the block has been mined by the Bitfury pool and that this pool supports BIP100.

Enough theory – let us put this to work! Using the utility functions in my btc Python package, it is now not difficult to write a short program that performs the actual mining.

However, we need some preparations to set up our test environment, which are related to our plan to use the getblocktemplate RPC call. This call performs a few validations that can be a bit tricky in a test environment. First, it will verify that the server is connected, i.e. we need at least one peer. So we need to start two Docker containers – let us call them alice and bob again – and find out the IP address of the container bob in the Docker bridge network. The three following statements should do this for you.

$ docker run --rm -d -p 18332:18332 --name="alice" alice
$ docker run --rm -d  --name="bob" bitcoin-alpine
$ docker network inspect bridge | grep  -A 4  "bob" - | grep "IPv4" -

Assuming that this gives you 172.17.0.3 (replace this with whatever the result is in your case), we can now again use the addnode RPC call to connect the two nodes.

$ bitcoin-cli --rpcuser=user --rpcpassword=password -regtest addnode "172.17.0.3" add

The next validation that the bitcoin server will perform when we ask for a block template is that the local copy of the blockchain is up to date. It does this by verifying that the time stamp of the last block in the chain is less than 24 hours in the past. As it is likely that a bit more time has passed since you created the Alice container, we therefore need to use the mining functionality built into bitcoin core to create at least one new block.

$ bitcoin-cli --rpcuser=user --rpcpassword=password -regtest generate 1

Now we are ready to run our test. The next few lines will download the code from GitHub and create one transaction that will then be included in the block. We will create this transaction using the script SendMoney.py that we have already used in an earlier post.

$ git clone https://github.com/christianb93/bitcoin.git
$ cd bitcoin
$ python SendMoney.py
$ python Miner.py

You should then see an output telling you that a block with two transactions (one coinbase transaction and the transaction that we have generated) was mined, along with the previous height of the blockchain and the new height which should be higher by one.

Let us now verify that everything works. First, let us get the hash of the current last block.

$ bitcoin-cli --rpcuser=user --rpcpassword=password -regtest getchaintips
[
  {
    "height": 109,
    "hash": "07849d5c8ddcdc609d7acc3090bc48bbe4403c36008d46b5a291185334efe1bf",
    "branchlen": 0,
    "status": "active"
  }
]

Take the value from the hash field in the output and feed it into a call to getblock:

$ bitcoin-cli --rpcuser=user --rpcpassword=password -regtest getblock "07849d5c8ddcdc609d7acc3090bc48bbe4403c36008d46b5a291185334efe1bf"
{
  "hash": "07849d5c8ddcdc609d7acc3090bc48bbe4403c36008d46b5a291185334efe1bf",
  "confirmations": 1,
  "strippedsize": 367,
  "size": 367,
  "weight": 1468,
  "height": 109,
  "version": 536870912,
  "versionHex": "20000000",
  "merkleroot": "8769987458af75adc80d6792848e5cd5cb8178a9584157bb4be79b77cda95909",
  "tx": [
    "7ba3186ca8e5aae750614a3211422423a0a217f5999d0de6dfeb8968aeb01900", 
    "1094360149626a421b4ddbc7eb58a815762700316c36407770b96ffc36a7735b"
  ],
  "time": 1522952768,
  "mediantime": 1521904060,
  "nonce": 1,
  "bits": "207fffff",
  "difficulty": 4.656542373906925e-10,
  "chainwork": "00000000000000000000000000000000000000000000000000000000000000dc",
  "previousblockhash": "2277e40bf4c0ebde3fb5f38fcbd384e39df3471ad192cc46f66ea8d8d96327e7"
}

The second entry in the list tx should now match the ID of the newly created transaction which was displayed when executing the SendMoney.py script. This proves that our new transaction has been included in the block.

Congratulations, you have just mined 50 BTC – unfortunately only in your local installation, not in the production network. Of course, real miners work differently, using mining pools to split the work between many different nodes and modern ASICs to achieve the hash rates that you need to be successful in the production network. But at least we have built a simple miner more or less from scratch, relying only on the content of this and the previous posts in this series and without using any of the many Python bitcoin libraries that are out there.

This concludes my current series on the bitcoin blockchain –  I hope you enjoyed the posts and had a bit of fun. If you want to learn more, here are a few excellent sources of information that I recommend.

  1. Of course the ultimate source of information is always the bitcoin core source code itself, which we have already consulted several times
  2. The Bitcoin wiki contains many excellent pages on most of what we have discussed
  3. There is of course the original bitcoin paper, which you should now be able to read and understand
  4. And of course there are tons of good books out there – I personally liked Mastering Bitcoin by A. Antonopoulos, which is also available online


The Metropolis-Hastings algorithm

In this post, we will investigate the Metropolis-Hastings algorithm, which is still one of the most popular algorithms in the field of Markov chain Monte Carlo methods, even though its first appearance (see [1]) dates back to 1953, more than 60 years ago. It does for instance appear on the CiSE top ten list of the most important algorithms of the 20th century (I got this and the link from this post on WordPress).

Before we get into the algorithm, let us once more state the problem that the algorithm is trying to solve. Suppose you are given a probability distribution \pi on some state space X (most often this will be a real Euclidean space on which you can do floating point arithmetic). You might want to imagine the state space as describing possible states of a physical system, like spin configurations in a ferromagnetic medium similar to what we looked at in my post on the Ising model. The distribution \pi then describes the probability for the system to be in a specific state. You then have some quantity, given as a function f on the state space. Theoretically, this is a quantity that you can calculate for each individual state. In most applications, however, you will never be able to observe an individual state. Instead, you will observe an average, weighted by the probability of occurrence. In other words, you observe the expectation value

\langle f \rangle = \int_X f d\pi

of the quantity f. Thus to make a prediction that can be verified or falsified by an observation, you will have to calculate integrals of this type.

Now, in practice, this can be very hard. One issue is that in order to naively calculate the integral, you would have to traverse the entire state space, which is not feasible for most realistic problems as this tends to be a very high dimensional space. Closely related to this is a second problem. Remember, for instance, that a typical distribution like the Boltzmann distribution is given by

\pi(x) = \frac{1}{Z} e^{-\beta E(x)}

The term in the numerator is comparatively easy to calculate. However, the term in the denominator is the partition function, and is itself an integral over the state space! This makes even the calculation of \pi(x) for a single point in the state space intractable.

But there is hope – even though calculating the value of \pi at one point might be impossible, in a distribution like this, calculating ratios of probabilities is easy, as the partition function cancels out and we are left with the exponential of an energy difference! The Metropolis-Hastings algorithm leverages this and also solves our state space problem by using a Markov chain to approximate the integral. So the idea is to build a Markov chain Xt that converges and has \pi as an invariant distribution, so that we can approximate the integral by

\langle f \rangle = \int_X f d\pi \approx \frac{1}{N} \sum_{t=1}^N f(X_t)

for large values of N.
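To make this cancellation explicit for the Boltzmann distribution above, note that

\frac{\pi(y)}{\pi(x)} = \frac{e^{-\beta E(y)}}{e^{-\beta E(x)}} = e^{-\beta (E(y) - E(x))}

so that only the energy difference between the two states enters, and the intractable partition function Z drops out.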

But how do we construct a Markov chain that converges to a given distribution? The Metropolis-Hastings approach to solving this works as follows.

The first thing that we do is to choose a proposal density q on our state space X, i.e. a measurable function

q \colon X \times X \rightarrow [0,\infty)

such that for each x, \int q(x,y) dy = 1.

Then q defines a Markov chain, where the probability to transition into a measurable set A, being at a point x, is given by the integral

Q(x,A) = \int_A q(x,y) dy

Of course this is not yet the Markov chain that we want – it has nothing to do with \pi, so there is no reason it should converge to \pi. To fix this, we now adjust the kernel to incorporate the behaviour of \pi. For that purpose, define

\alpha(x,y) =  \begin{cases} \min \{ 1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)} \} & \text{if } \pi(x) q(x,y) > 0 \\ 1 & \text{if } \pi(x) q(x,y) = 0  \end{cases}

This number is called the acceptance probability, and, as promised, it only contains ratios of probabilities, so that factors like the partition function cancel and do not have to be computed.

The Metropolis Hastings algorithm now proceeds as follows. We start with some arbitrary point x0. When the chain has arrived at xn, we first draw a candidate y for the next location from the proposal distribution q(x_n, \cdot). We now calculate \alpha according to the formula above. We then accept the proposal with probability \alpha, i.e. we draw a random sample U from a uniform distribution and accept if U \leq \alpha. If the proposal is accepted, we set xn+1 = y, otherwise we set xn+1 = xn, i.e. we stay where we are.

Clearly, the xn are samples from a Markov chain, as the position at step n only depends on the position at step n-1. But it still appears to be a bit mysterious why this should work. To shed light on this, let us consider a case where the expressions above simplify a bit. So let us assume that the proposal density q is symmetric, i.e. that

q(x,y) = q(y, x)

This is the original Metropolis algorithm as proposed in [1]. If we also assume that \pi and q are nowhere zero, the acceptance probability simplifies to

\alpha(x,y) =  \min \{ 1, \frac{\pi(y)}{\pi(x)} \}

Thus we accept the proposal with probability one if \pi(y) \geq \pi(x). This is very similar to a random search for a global maximum – we start at some point x, choose a candidate for a point with a higher value of \pi at random and proceed to this point. The major difference is that we also accept candidates with \pi(y) < \pi(x) with a non-zero probability. This allows the algorithm to escape from a local maximum, which a pure uphill search could never leave. Intuitively, the algorithm will still try to spend more time in regions with large values of \pi, as we would expect from an attempt to sample from the distribution \pi.

[Figure: SymmetricMetropolisHastings]

The image above illustrates this procedure. The red graph displays the distribution \pi. If our algorithm is currently at step xn, the purpose is to move “up-hill”, i.e. to the left in our example. If we draw a point like y from q which already goes in the right direction, we will always accept this proposal and move to y. If, however, we draw a point like y’, at which \pi is smaller, we would still accept this point with a non-zero probability. Thus if we have reached a local maximum like the one on the right hand side of the diagram, there is still a chance that we can escape from there and move towards the real maximum to the left.

In this form, the algorithm is extremely easy to implement. All we need is a function propose that creates the next proposal, and a function p that calculates the value of the probability density \pi at some point. Then an implementation in Python is as follows.

import numpy as np

# we assume that propose(X) draws a candidate from the proposal
# density q(X, .) and that p(X) evaluates the (potentially
# unnormalized) target density at X
steps = 500000
chain = []
X = 0
chain.append(X)
for n in range(steps):
  Y = propose(X)
  U = np.random.uniform()
  alpha = p(Y) / p(X)
  if U <= alpha:
    X = Y
  chain.append(X)

In the diagram below, this algorithm has been applied to a Cauchy distribution with mode zero and scale one, using a normal distribution with mean x and standard deviation 0.5 as a proposal for the next location. The chain was calculated for 500,000 steps. The diagram in the upper part shows the values of the chain during the simulation.
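For reference, here is a minimal sketch of the two functions that this experiment assumes, i.e. the normal proposal with standard deviation 0.5 and the (unnormalized) Cauchy density:

import numpy as np

def propose(X):
  # draw the candidate from a normal distribution centered at the current location
  return np.random.normal(loc=X, scale=0.5)

def p(X):
  # Cauchy density with mode zero and scale one, up to normalization
  return 1.0 / (1.0 + X**2)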

Then the first 100,000 steps were discarded and considered as "burn-in" time for the chain to stabilize. Out of the remaining 400,000 sample points, points were chosen with a distance of 500 time steps to obtain a sample which is approximately independent and identically distributed. This is called subsampling and is typically not necessary for Monte Carlo integration (see [2], chapter 1 for a short discussion of the need for subsampling), but is done here for the sake of illustration. The resulting subsample is plotted as a histogram in the lower left corner of the diagram. The yellow line is the actual probability density.

[Figure: MetropolisCauchySample]

We see that after a few thousand steps, the chain converges, but continues to have spikes. However, the sampled distribution is very close to the sample generated by the Python standard method (which is to take the quotient of two independent samples from a standard normal distribution).

In the diagram at the bottom, I have displayed how the integrals of two functions (\sin(x) and \cos(x)), approximated using the partial sums, develop over time. We see that even though we still have huge spikes, the integral remains comparatively stable and converges after a few thousand iterations. Even if we run the simulation only for 1000 steps, we already get close to the actual values zero (for \sin(x), for symmetry reasons) and \approx 0.3678 (for \cos(x), obtained using the scipy.integrate.quad integration routine).

In the second diagram in the middle row, I have plotted the autocorrelation versus the lag, as an indicator for the failure of the sample points to be independent. Recall that for two samples X and Y, the Pearson correlation coefficient is the number

\frac{E\left[(X-\bar{X})(Y-\bar{Y})\right]}{\sigma_X \sigma_Y}

where \sigma_X and \sigma_Y are the standard deviations of X and Y. In our case, given a lag, i.e. a number l less than the length of the chain, we can form two samples, one consisting of the points X_0, X_1, X_2, \dots and the second one consisting of the points of the shifted series X_l, X_{l+1}, X_{l+2}, \dots. The autocorrelation with lag l is then defined to be the correlation coefficient between these two series. In the diagram, we can see how the autocorrelation depends on the lag. We see that for a large lag, the autocorrelation becomes small, supporting our intuition that the series and the shifted series become independent. However, if we execute several simulation runs, we will also find that in some cases, the convergence of the autocorrelation is very slow, so care needs to be taken when trying to obtain a nearly independent sample from the chain.
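As an illustration, assuming that chain is a numpy array holding the sample points, the autocorrelation at a given positive lag could be computed along the following lines:

import numpy as np

def autocorrelation(chain, lag):
  # Pearson correlation coefficient between the series and the
  # series shifted by the given lag
  return np.corrcoef(chain[:-lag], chain[lag:])[0, 1]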

In practice, the autocorrelation is probably not a good measure for the convergence of a Markov chain. It is important to keep in mind that obtaining an independent sample is not the point of the Markov chain – the real point is that even though the sample is autocorrelated, we can approximate expectation values fairly well. However, I have included the autocorrelation here for the sake of illustration.

This type of proposal distribution is not the only one that is commonly used. Another choice that appears often is called an independence sampler. Here the proposal distribution is chosen to be independent of the current location x of the chain. This gives us an algorithm that resembles the importance sampling method and also shares some of the difficulties associated with it – in my notes on Markov chain Monte Carlo methods, I have included a short discussion and a few examples. These notes also contain further references and a short discussion of why and when the Markov chain underlying a Metropolis-Hastings sampler converges.

Other variants of the algorithm work by updating – in a high-dimensional space – either only one variable at a time or entire blocks of variables that are known to be independent.

Finally, if we are dealing with a state space that can be split as a product X_1 \times X_2, we can use the conditional probability given either x1 or x2 as a proposal distribution. Thus, we first fix x2, and draw a new value for x1 from the conditional probability for x1 given the current value of x2. Then we move to this new coordinate, fix x1, draw from the conditional distribution of x2 given x1 and set the new value of x2 accordingly. It can be shown (see for example [5]) that the acceptance probability is one in this case. So we end up with the Gibbs sampling algorithm that we have already used in the previous post on Ising models.
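As an illustration, here is a minimal sketch of a Gibbs sampler for a bivariate standard normal distribution with correlation rho, where the conditional distribution of each coordinate given the other one is again a normal distribution:

import numpy as np

rho = 0.7
x1, x2 = 0.0, 0.0
samples = []
for _ in range(10000):
  # draw x1 from its conditional distribution given x2, then draw
  # x2 from its conditional distribution given the new value of x1
  x1 = np.random.normal(loc=rho * x2, scale=np.sqrt(1 - rho**2))
  x2 = np.random.normal(loc=rho * x1, scale=np.sqrt(1 - rho**2))
  samples.append((x1, x2))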

Monte Carlo sampling methods are a broad field, and even though this has already been a long post, we have only scratched the surface. I invite you to consult some of the references below and / or my notes for more details. As always, you will also find the sample code on GitHub and might want to play with this to reproduce the examples above and see how different settings impact the result.

In a certain sense, this post is the last post in the series on restricted Boltzmann machines, as it provides (at least some of) the mathematical background behind the Gibbs sampling approach that we used there. Boltzmann machines are examples of stochastic neural networks that can be applied to unsupervised learning, i.e. they allow a model to learn from a sample distribution without the need for labeled data. In the next few posts on machine learning, I will take a closer look at some other algorithms that can be used for unsupervised learning.

References

1. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys. Vol. 21, No. 6 (1953), pp. 1087-1092
2. S. Brooks, A. Gelman, G.L. Jones, X.L. Meng (ed.), Handbook of Markov chain Monte Carlo, Chapman & Hall / CRC Press, Boca Raton 2011
3. W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, Vol. 57, No. 1 (1970), pp. 97-109
4. R.M. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993
5. C.P. Robert, G. Casella, Monte Carlo Statistical Methods, Springer, New York 1999

The difficulty in the bitcoin protocol

Mining is a challenge – a miner has to solve a mathematical puzzle to create a valid block and is rewarded with a certain bitcoin amount (12.5 at the time of writing) for investing electricity and computing power. But technology evolves, and the whole mining process would be pointless if the challenge could not be adapted dynamically to make sure that mining power and the difficulty of the challenge stay in balance. In this post, we will look in a bit more detail into this mechanism.

We have already seen that a block header contains – at character 144 – a four byte long number that is called bits in the source code and which seems to be related to the difficulty. As an example, let us take the block with the hash

0000000000000000000595a828f927ec02763d9d24220e9767fe723b4876b91e

You can get a human readable description of the block at this URL and will see the following output.

[Figure: BitcoinBlockExample]

This output has a field bits that has the value 391129783. If you convert this into a hexadecimal string, you obtain 0x17502ab7, using for instance Python's hex function. Now this function will give you the big-endian representation of the integer value. To convert to little-endian as it can be found in the serialized format of a bitcoin block, you have to reverse the order byte-wise and get b72a5017. If you now display the raw format of the block using this link and search for that string, you will actually find it at character 144, as expected.
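As a quick sketch in Python, the conversion works as follows (the value 391129783 is the bits field from the block above):

bits = 391129783
print(hex(bits))                          # 0x17502ab7, big-endian
print(bits.to_bytes(4, "little").hex())   # b72a5017, little-endian as serialized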

Now on the screenshot above, you also see the difficulty displayed, which is a floating point number and is 3511060552899.72 in this case. How is that difficulty related to the bits?

As so often, we can find the answer in the bitcoin core source code, starting at the JSON RPC interface. So let us look at the function blockheaderToJSON in rpc/blockchain.cpp that is responsible for transforming a block in the internal representation into a human readable JSON format. Here we find the line

result.push_back(Pair("difficulty", GetDifficulty(blockindex)));

So let us visit the function GetDifficulty that we find in the same file, actually just a few lines up. Here we see that the bits are actually split into two parts. The first part, which I will call the exponent, is just the value of the top 8 bits. The second part, the value of the remaining 24 bits, is called the coefficient. The difficulty is then calculated by dividing the constant 0xFFFF by the coefficient and scaling by a factor of 256 for each byte that the exponent differs from 29, i.e. according to the formula

\text{difficulty} = 256^{(29 - \text{exponent})} \cdot (65535.0 / \text{coefficient})

In Python, we can therefore extract the difficulty from a block header in raw format as follows.

# raw contains the serialized block header as a hex string
nBits = int.from_bytes(bytes.fromhex(raw[144:152]), "little")
difficulty = 256**(29 - (nBits >> 24)) * (65535.0 / float(nBits & 0xFFFFFF))

If you try this out for the block above, you will get 3511060552899.7197. Up to rounding at the second decimal digit, this is exactly what we also see in the screenshot of blockchain.info. So far this is quite nice…

However, we are not yet done – this is not the only way to represent the difficulty. The second quantity we need to understand is the target. Recall that the mathematical puzzle that a miner needs to solve is to produce a block with a hash that does not exceed a given number. This number is the target. The lower the target, the smaller the range of 256 bit numbers that do not exceed the target, and the more difficult to solve that puzzle. So a low target implies high difficulty and vice versa.

In fact, there are two target values. One target is part of every block and stored in the bits field – we will see in a minute how this works. The second target is a global target that all nodes in the bitcoin network share – this value is updated independently by each node, but based on the same data (the blockchain) and the same rules, so that all nodes arrive at the same conclusion. A block is considered valid if it solves the mathematical puzzle according to the target that is stored in the block itself and if, in addition, this target matches the global target.

To understand how this works and the exact relation between target and difficulty, it is again useful to take a look at the source code. The relevant function is the function CheckProofOfWork in pow.cpp which is surprisingly short.

bool CheckProofOfWork(uint256 hash, unsigned int nBits, const Consensus::Params& params)
{
    bool fNegative;
    bool fOverflow;
    arith_uint256 bnTarget;

    bnTarget.SetCompact(nBits, &fNegative, &fOverflow);

    // Check range
    if (fNegative || bnTarget == 0 || fOverflow || bnTarget > UintToArith256(params.powLimit))
        return false;

    // Check proof of work matches claimed amount
    if (UintToArith256(hash) > bnTarget)
        return false;

    return true;
}

Let us try to understand what is going on here. First, a new 256 bit integer bnTarget is instantiated and populated from the bits (that are taken from the block) using the method SetCompact. In addition to a range check, it is then verified that the hash of the block does not exceed the target – this is the mathematical puzzle that we need to solve.

The second check – verifying that the target in the block matches the global target – happens in validation.cpp in the function ContextualCheckBlockHeader. At this point, the function GetNextWorkRequired is called, which determines the correct value for the global target. Then this is compared to the bits field in the block, and the validation fails if they do not match. So for a block to be accepted as valid, the miner needs to:

  • Actually solve the puzzle, i.e. provide a block with a hash value equal to or smaller than the current global target
  • Include this target as bits in the generated block

This leaves a couple of questions open. First, we still have to understand how the second step – encoding the target in the bits field – works. And then we have to understand how the global target is determined.

Let us start with the first question. We have seen that the target is determined from the bits using the SetCompact method of the class arith_uint256. This function is in fact quite similar to GetDifficulty studied before. Ignoring overflows and signs for a moment, we find that the target is determined from the bits as

\text{target} = \text{coefficient}\cdot 2^{8(\text{exponent}-3)}

If we compare this to the formula for the difficulty that we have obtained earlier, we find that there is a simple relation between them:

\text{target} = \text{difficulty}^{-1}\cdot \text{std\_target}

where std_target is the target corresponding to difficulty 1.0 and given by 0xFFFF shifted by 208 bits to the left, i.e. by 0xFFFF followed by 52 hexadecimal zeros. So up to a constant, target and difficulty are simply inverse to each other.
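To see the decoding of the target in action, here is a short sketch in Python, again using the bits value 0x17502ab7 from the example block above:

nBits = 0x17502ab7
exponent = nBits >> 24
coefficient = nBits & 0xFFFFFF
# target = coefficient * 256^(exponent - 3)
target = coefficient * 2**(8 * (exponent - 3))
print(hex(target))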

Now let us turn to the second question – how can the nodes agree on the correct global target? To see this, we need to look at the function GetNextWorkRequired. In most cases, this function will simply return the bits field of the current last block in the blockchain. However, every 2016 blocks, it will adjust the difficulty according to the following rule.

First, it will use the time stamps in the blocks to calculate how much time has passed since the block which is 2016 blocks back in the chain was created – this value is called the actual time span. Then this is compared with the target time span, which is two weeks. The new target is then obtained by the formula (ignoring some range checks that are done to avoid too massive jumps)

\text{target}_{new} = \text{target}_{old} \cdot \frac{\text{actual time span}}{\text{target time span}}

Thus if the actual time span is smaller than the target time span, indicating that the mining power is too high and therefore the rate of block creation is too high, the target decreases and consequently the difficulty increases. This should make it more difficult for the miners to create new blocks, and the block creation rate should go down again. Similarly, if the difficulty is too high, this mechanism will make sure that it is adjusted downwards. Thus the target rate is 2016 blocks being created in two weeks, i.e. 336 hours, which is a rate of six blocks per hour, i.e. 10 minutes for one block.
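As a sketch, the retargeting rule could be written as follows (omitting the range checks of the actual implementation and using hypothetical values for the time spans):

old_target = 0xFFFF << 208                # target at difficulty 1.0
actual_timespan = 10 * 24 * 60 * 60       # suppose the last 2016 blocks took only ten days
target_timespan = 14 * 24 * 60 * 60       # the target time span of two weeks
new_target = old_target * actual_timespan // target_timespan
# the blocks were found too quickly, so the new target is lower
# and the difficulty increases accordingly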

Finally, it is worth mentioning that there is a lower limit for the difficulty, corresponding to an upper limit for the target. This limit is (as most of the other parameters that we have discussed) contained in chainparams.cpp and specific for the three networks main, test and regtest. For instance, for the main network, this is

consensus.powLimit = uint256S("00000000ffffffffffffffffffffffffffffffffffffffffffffffffffffffff");

whereas the same parameter for the regression testing network is

consensus.powLimit = uint256S("7fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff");

This, together with the bits field of the genesis block (0x207fffff), which determines the target for the first 2016 blocks, fixes the initial difficulty in a regression test network. Let us verify this (I knew I would somehow find a reason to add a notebook to this post…).

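As a minimal sketch of the calculation done in the notebook, we can apply the difficulty formula derived above to the genesis bits:

nBits = 0x207fffff
difficulty = 256**(29 - (nBits >> 24)) * (65535.0 / float(nBits & 0xFFFFFF))
print(difficulty)   # roughly 4.6565e-10 - the difficulty we saw in the regtest getblock output earlier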

We are now well equipped to move towards our final project in this series – building a Python based miner. The last thing that we need to understand is how the miner operates and communicates with the bitcoin server, and this will be the topic of the next post.

Recurrent and ergodic Markov chains

Today, we will look in more detail into convergence of Markov chains – what does it actually mean and how can we tell, given the transition matrix of a Markov chain on a finite state space, whether it actually converges.

So suppose that we are given a Markov chain on a finite state space, with transition probabilities described by a matrix K as in the previous post on this topic.

We have seen that the distribution of Xn is given by \mu K^n where \mu is the initial distribution. So if K^n converges, we can expect the distribution of Xn to converge as well. If there is a matrix K^\infty such that

\lim_{n \rightarrow \infty} K^n = K^\infty

then of course

K^\infty K = (\lim_{n \rightarrow \infty} K^n) K = \lim_{n \rightarrow \infty} K^{n+1} = K^\infty

In other words, each row \pi of K^\infty will have the property that \pi K = \pi. Interpreting K as transition probabilities, this implies that if the distribution of Xn is \pi, the distribution of Xn+1 will again be \pi. Any distribution, described by a row vector \pi with row sum 1, for which this holds is called an invariant distribution and traditionally denoted by \pi. In other words, invariant distributions correspond to eigenvectors of the transposed matrix K^T with eigenvalue 1, and our argument has shown that if K^n converges to K^\infty, then every row of K^\infty will be an invariant distribution.

So convergence implies the existence of an invariant distribution. Let us next try to understand whether this invariant distribution is unique. There are obvious examples where this is not the case, the most trivial one being the unit matrix

K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}

Obviously K^n = K and the chain converges, but every vector with row sum one is an invariant distribution. This chain is too rigid, because in whatever state we start, we will stay in this state forever. It turns out that in order to ensure uniqueness of the invariant distribution, we need a certain property that makes sure that the chain can move around freely in the state space, which is called irreducibility.

Intuitively, we say that a Markov chain is irreducible if any state can be reached from any other state in finitely many steps. For a finite state space, this is rather easy to formalize. A finite Markov chain is called irreducible if, given two states i and j, we can find some power n such that K^n_{ij} > 0. In other words, given a row index i and a column index j, we can find a power n such that the element of K^n at (i,j) is not zero. It turns out that if a chain is not irreducible, we can split the state space into smaller areas that the chain – once it has entered one of them – does not leave again and on which it is irreducible. So irreducible Markov chains are the building blocks of more general Markov chains, and the study of many properties of Markov chains can be reduced to the irreducible case.
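For a finite chain, irreducibility can be checked numerically. A standard fact is that a finite chain with n states is irreducible if and only if (I + K)^{n-1} has only strictly positive entries, so a minimal sketch of such a check could look like this:

import numpy as np

def is_irreducible(K):
  # (I + K)^(n-1) is positive everywhere exactly if every state
  # can be reached from every other state in at most n-1 steps
  n = K.shape[0]
  M = np.linalg.matrix_power(np.eye(n) + K, n - 1)
  return bool(np.all(M > 0))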

Now let us assume that our chain is in fact irreducible. What else do we have to ask for to make sure that it converges? Again, let us look at an obvious example.

K = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}

All even powers of this matrix are the unit matrix and all odd powers are K itself, so the Markov chain described by this transition matrix is clearly irreducible. However, there is still something wrong. If, for instance, we are in state 1 at time t, then the probability to still be in state 1 at time t+1 is zero. Thus we are actually forced to move to state 2. In a certain sense, there is still an additional constraint on the behaviour of the chain – the state space can be split in two pieces and we cycle through these pieces with every step. Such a chain is called periodic. If a chain is periodic, it is again obvious that we can think of it as two different chains, one chain given by the random variables X0, X2,… at even times and the other chain given by the random variables X1, X3, … at odd times. Thus again, there is a way to split the chain into parts.

So let us now focus on chains that are irreducible and aperiodic. Can we tell whether the chain converges? The answer is surprisingly simple and one of the most powerful results in the theory of Markov chains: every finite Markov chain which is irreducible and aperiodic converges. Thus once we have verified aperiodicity and irreducibility, we can be sure that the limit K^\infty exists. We can say even more – we have seen that the rows of the limit are invariant distributions for K. However, one can show that an irreducible Markov chain can only have one invariant distribution. Thus we can conclude that all rows of the matrix K^\infty are identical. In fact, they are all equal to the (unique) invariant distribution that again corresponds to an eigenvector of K^T with eigenvalue one.

This gives us two different approaches to calculating the invariant distribution and the limit K^\infty. First, we can form high powers K^n to approximate the limit K^\infty. Or, alternatively, we can search for eigenvectors of the transposed matrix K^T with eigenvalue 1, using one of the known algorithms to calculate eigenvectors and eigenvalues.

Let us do this for the example of the random walk on a circle that we have studied earlier. For simplicity, we will use N = 4 this time. The transition matrix with p = 0.8 is

K = \begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}

We can then easily implement both approaches in Python using the numpy library.

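A minimal sketch of both approaches could look as follows – np.linalg.eig returns the eigenvectors of a matrix as columns, so we pick the column corresponding to the eigenvalue one and normalize it to row sum one:

import numpy as np

K = np.array([[0.2, 0.4, 0.0, 0.4],
              [0.4, 0.2, 0.4, 0.0],
              [0.0, 0.4, 0.2, 0.4],
              [0.4, 0.0, 0.4, 0.2]])

# First approach: approximate the limit by a high matrix power
print(np.linalg.matrix_power(K, 20))

# Second approach: eigenvector of the transposed matrix for eigenvalue 1
w, v = np.linalg.eig(K.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
print(pi / pi.sum())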

So we see that the approach to take the eigenvalues and eigenvectors gives us – in this case – the exact result (and a direct computation confirms that this is really an eigenvector), whereas the matrix power with n=20 gives a decent approximation. Both results confirm what we have observed visually in the last post – the distribution converges, and it does in fact converge to the uniform distribution. So roughly speaking, after sufficiently many steps, we end up everywhere in the state space with equal probability.

So far, we have only discussed the case of a finite state space. In the general case, the situation is more complicated. First, we need a better definition of irreducibility, as on a general state space, the probability measure of a single point tends to be zero. It turns out that the solution is to define irreducibility with respect to a measure on the state space, which is called \psi-irreducibility. Once we have that, we find two potential behaviours that do not show up in the finite case.

First, it might happen that even though the chain is irreducible, it does not properly cycle through the state space and revisit every region over and over again, but – intuitively speaking – heads off to infinity. Such a chain is called transient and does not converge.

Second, even if the chain is not transient, we might be able to find an invariant measure, but this measure might not be normalizable to a probability distribution because its total mass is infinite. Chains with this property are called null recurrent.

And finally, the concept of convergence needs to be made more precise, and it turns out to have a subtle dependency on the starting point, which can be fixed by assuming so-called Harris recurrence. But all this can be done and delivers a very similar result – a convergence theorem for a large class of chains (in this case aperiodic Harris recurrent Markov chains). If you would like to get deeper into this and also see proofs for the claims made in this post and references, I invite you to take a look at my short introduction to Markov chains.

We close this post with one remark which, however, is of crucial importance for applications. Suppose we have a convergent Markov chain Xt, and a function f on the state space. If the Xt were independent and identically distributed, we could approximate the expectation value of f as

\int_X f(x) dP = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)

where dP is the (common) distribution of the Xt. For Markov chains, however, we know that the Xt are not independent – the whole point of having a Markov chain is that Xt+1 does actually depend on Xt. However, in our example above, we have seen that the dependency becomes trivial for large t, as all entries in the transition matrix become equal in the limit.

Of course this is a special case, but it turns out that even in general, the approximation of the expectation value by averages across the sample is still possible. Intuitively, this is not so surprising at all. If the distributions K^n of the X_n converge to some invariant measure \pi, the X_n will, for large n, be approximately identically distributed, namely according to \pi. Moreover, heuristically we have for large n, m and a fixed starting point s:

P(X_{n+m}=i, X_n=j) = K^m_{ji} K^n_{sj} \approx \pi_i \pi_j

On the other hand, this product is again approximately the product

\pi_i \pi_j  \approx P(X_{n+m}=i) P(X_n=j)

Thus, intuitively, Xn and Xm are almost independent if n and m are large, which makes us optimistic that the law of large numbers could actually hold.

In fact, if Xt converges (again I refer to my notes for a more precise definition of convergence in this case), it turns out that the law of large numbers remains valid if we integrate f with respect to the invariant distribution \pi:

\int_X f(x) d\pi = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)

This is extremely useful in applications, where often the primary purpose is not so much to obtain a sample, but to approximate otherwise intractable integrals! Once we have found a Markov chain that converges to the invariant distribution \pi, we can calculate integrals over \pi by running a long Markov chain until convergence and then taking the average value of the function that we want to integrate over a large number of subsequent points from this chain. In fact, this is how most applications work – and this is more or less what we also did when we applied Markov chains to the problem of calculating gradients in the PCD algorithm, where the chain is given by the state of the negative particles in subsequent iterations.

So given a distribution \pi, we are now led to the question how we can possibly find a Markov chain for which this distribution is invariant and which converges. The most general answer to this question is a class of algorithms known as Metropolis-Hastings algorithms, which we study in the next post in this series.