christianb93

Recurrent and ergodic Markov chains

Today, we will look in more detail into convergence of Markov chains – what does it actually mean and how can we tell, given the transition matrix of a Markov chain on a finite state space, whether it actually converges.

So suppose that we are given a Markov chain on a finite state space, with transition probabilities described by a matrix K as in the previous post on this topic.

We have seen that the distribution of X_n is given by $\mu K^n$ where $\mu$ is the initial distribution. So if Kⁿ converges, we can expect the distribution of X_n to converge as well. If there is a matrix $K^\infty$ such that

$\lim_{n \rightarrow \infty} K^n = K^\infty$

then of course

$K^\infty K = (\lim_{n \rightarrow \infty} K^n) K = \lim_{n \rightarrow \infty} K^{n+1} = K^\infty$

In other words, each row $\pi$ of $K^\infty$ will have the property that $\pi K = \pi$ . Interpreting K as transition probabilities, this implies that if the distribution of X_n is $\pi$ , the distribution of X_n+1 will again be $\pi$ . Any distribution, described by a row vector $\pi$ with row sum 1, for which this holds is called an invariant distribution and traditionally denoted by $\pi$ . In other words, invariant distributions correspond to eigenvectors of the transposed matrix K^T with eigenvalue 1, and our argument has shown that if $K^n$ converges to $K^\infty$ , then every row of $K^\infty$ will be an invariant distribution.

So convergence implies the existence of an invariant distribution. Let us next try to understand whether this invariant distribution is unique. There are obvious examples where this is not the case, the most trivial one being the unit matrix

$K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

Obviously $K^n = K$ and the chain converges, but every vector with row sum one is an invariant distribution. This chain is too rigid, because in whatever state we start, we will stay in this state forever. It turns out that in order to ensure uniquess of an invariant distribution, we need a certain property that makes sure that the states can move around freely which is called irreducibility.

Intuitively, we say that a Markov chain is irreducible if any state can be reached from any other state in finitely many steps. For a finite state space, this is rather easy to formalize. A finite Markov chain is called irreducible if, given two states i and j, we can find some power n such that $K^n_{ij} > 0$ . In other words, given a row index i and a column index j, we can find a power n such that the element of Kⁿ at (i,j) is not zero. It turns out that if a chain is not irreducible, we can split the state space into smaller areas that the chain – once it has entered one of them – does not leave again and on which it is irreducible. So irreducible Markov chains are the buildings blocks of more general Markov chains, and the study of many properties of Markov chains can be reduced to the irreducible case.

Now let us assume that our chain is in fact irreducible. What else do we have to ask for to make sure that it converges? Again, let us as look at an obvious example.

$K = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

All even powers of this matrix are the unit matrix and all odd powers are K itself, so the Markov chain described by this transition matrix is clearly irreducible. However, there is still something wrong. If, for instance, we are in state 1 at the time t, then the probability to still be in state 1 at time t+1 is zero. Thus we are actually forced to move to state 2. In a certain sense, there is still an additional constraint on the the behaviour of the chain – the state space can be split in two pieces and we cycle through these pieces with every step. Such a chain is called periodic. If a chain is periodic, it is again obvious that we can think of it as two different chains, one chain given by the random variables X₀, X₂,… at even times and the other chain given by the random variables X₁, X₃, … at odd times. Thus again, there is a way to split the chain into parts.

So let us now focus on chains that are irreducible and aperiodic. Can we tell whether the chain converges? The answer is surprisingly simple and one of the most powerful results in the theory or Markov chains: every finite Markov chain which is irreducible and aperiodic converges. Thus once we have verified aperiodicity and irreducibility, we can be sure that the limit $K^\infty$ exists. We can say even more – we have seen that the rows of the limit are invariant distributions for K. However, one can show that an irreducible Markov chain can only have one invariant distribution. Thus we can conclude that all rows of the matrix $K^\infty$ are identical. In fact, they are all equal to the (unique) invariant distribution that again corresponds to an eigenvector of K^T with eigenvalue one.

This gives us two different approaches to calculating the invariant distribution and the limit $K^\infty$ . First, we can form high powers Kⁿ to approximate the limit $K^\infty$ . Or, alternatively, we can search for eigenvectors of the transposed matrix K^T with eigenvalue 1 using one of the known algorithms to calculate eigenvectors and eigenvalues.

Let us do this for the example of the random walk on a circle that we have studied earlier. For simplicity, we will use N = 4 this time. The transition matrix with p = 0.8 is

$K = \begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}$

We can then easily implement both approaches in Python using the numpy library.

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

view raw TransitionMatrixMarkovChain.ipynb hosted with ❤ by GitHub

So we see that the approach to take the eigenvalues and eigenvectors gives us – in this case – the exact result (and a direct computation confirms that this is really an eigenvector), whereas the matrix power with n=20 gives a decent approximation. Both results confirm what we have observed visually in the last post – the distribution converges, and it does in fact converge to the uniform distribution. So roughly speaking, after sufficiently many steps, we end up everyhwere in the state space with equal probability.

So far, we have only discussed the case of a finite state space. In the general case, the situation is more complicated. First, we need a better definition of irreducibility, as on a general state space, the probability measure of a single point tends to be zero. It turns out that the solution is to define irreducibility with respect to a measure on the state space, which is called $\psi$ -irreducibility. Once we have that, we find two potential behaviours that do not show up in the finite case.

First, it might happen that even though the chain is irreducible, it does not properly cycle through the state space and revisits every point with finite probability, but – intuitively speaking – heads off to infinity. Such a chain is called transient and does not converge.

Second, even if the chain is not transient, we might be able to find an invariant distribution (a measure in this case), but this measure might not be a probability distribution as it is not finite. Chains with this property are called null recurrent.

And finally, the concept of converge needs to be made more precise, and it turns out to have a subtle dependency on the starting point which can be fixed by assuming so called Harris recurrence. But all this can be done and delivers a very similar result – a convergence theorem for a large class of chains (in this case aperiodic Harris recurrent Markov chains). If you would like to get deeper into this and also see proofs for the claims made in this post and references, I invite you to take a look at my short introduction to Markov chains.

We close this post with one remark which, however, is of crucial importance for applications. Suppose we have a convergent Markov chain X_t, and a function f on the state space. If the X_t were independent and identically distributed, we could approximate the expectation value of f as

$\int_X f(x) dP = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)$

where dP is the (common) distribution of the X_t. For Markov chains, however, we know that the X_t are not independent – the whole point of having a Markov chain is that X_t+1 does actually depend on X_t. However, in our example above, we have seen that the dependency gets trivial for large t as in this example, all entries in the transition matrix become equal in the limit.

Of course this is a special case, but it turns out that even in general, the approximation of the expectation value by averages across the sample is still possible. Intuitively, this is not so surprising at all. If the distributions $K^n$ of the $X_n$ converge to some invariant measure $\pi$ , the $X_n$ will, for large n, be approximately identically distributed, namely according to $\pi$ . Moreover, heuristically we have for large n, m and a fixed starting point s:

$P(X_{n+m}=i, X_n=j) = K^m_{ji} K^n_{sj} \approx \pi_i \pi_j$

On the other hand, this product is again approximately the product

$\pi_i \pi_j \approx P(X_{n+m}=i) P(X_n=j)$

Thus, intuitively, X_n and X_m are almost independent if n and m are large, which makes us optimistic that the law of large numbers could actually hold.

In fact, if X_t converges (again I refer to my notes for a more precise definition of convergence in this case), it turns out that the law of large numbers remains valid if we integrate f with respect to the invariant distribution $\pi$ :

$\int_X f(x) d\pi = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i=1}^N f(X_i)$

This is extremely useful in applications, where often the primary purpose is not so much to obtain a sample, but to approximate otherwise intractable integrals! Once we have found a Markov chain that converges to the invariant distribution $\pi$ , we can calculate integrals over $\pi$ by running a long Markov chain until converge and then taking the average value of the function that we want to integrate for a large number of subsequent points from this chain. In fact, this is how most applications work – and this is more or less what we also did when we applied Markov chains to the problem of calculating gradients in the PCD algorithm, where this chain is given by the state of the negative particles in subsequent iterations.

So given a distribution $\pi$ , we are now lead to the question how we can possibly find a Markov chain for which this distribution is invariant and which converges. The most general answer to this question is a class of algorithms known as Metropolis-Hastings algorithms which we study in the next post in this series.

Merkle trees

Suppose you are given a long file and wanted to use hashing to verify the integrity of the file. How would you do this in the most efficient way?

Of course, the first thing that might come to your mind is to simply take a hash of the entire file. This is very easy, but has one major disadvantage – if the file is damaged, there is no way to tell which part of the file has been damaged. If you use such a mechanism for instance to verify the integrity of a large download, you will have to go back and download the entire file again, which is very time consuming.

A smarter approach is to split the file into data blocks. You then calculate a hash for each data block to obtain a hash list. Finally, you could concatenate all the hashes into one large string and take the hash of that string. If a recipient of the file has access to only this top level hash, she could already verify the integrity of the file by repeating this procedure and comparing the top level hash with the hash created by the sender.

If that check fails, there is no need to re-transmit the entire file. Instead, the recipient could obtain a copy of the hash list itself and compare the hashes block by block until the faulty block is detected. This block can then be exchanged and the procedure can be repeated to verify that the replacement is valid. The bittorrent protocol uses this mechanism to detect inconsistencies in transferred files.

With a simple hash list like this, you would have to search the entire hash list in order to determine the faulty piece, requiring a time linear in the number of blocks. Searching can be improved greatly by organizing the hashes not into a list, but into a (binary) tree – and this data structure is called a hash tree or Merkle tree.

To illustrate the principle, suppose you wanted to build a hash tree for a file consisting of eight blocks, labeled 1 – 8.

You would then proceed as follows, as indicated in the image above. First, you compute the hash value for all blocks and assign them as values to the eight leaves of the tree. Then, you concatenate the values of leaves 1 and 2 and take the hash value of the result. This gives you the value for the parent node – called a in the picture above – of 1 and 2. You repeat the same procedure for the leaves 3 and 4, this gives you the value for the node b. Then you concatenate the values of a and b and take the hash value of the result, this gives you the value of the node c. Repeat this procedure for the right part of the tree and assign the result to node d. Finally, concatenate the values of node c and d and take the hash – this will be the value of the root node called the Merkle root. Thus, in a binary Merkle tree, the value of each node is defined to be the hash of the concatenation of the values of its two child nodes.

How is a Merkle tree realised in the bitcoin reference implementation? The currently relevant code is in consensus/merkle.cpp. This implementation is already optimized, the original implementation used to be in primitives/block.cpp and is still available on GitHub.

As the comments in the source code explain, at any level of the tree, the last node is added at the end again if needed to make sure that the number of nodes is even at each level – the algorithm that I have illustrated above only works if the number of leaves is a power of two. The actual implementation first builds a list of the leaves. It then goes through the list and for any two subsequent elements, it adds an element at the end which is the hash of the concatenation of these two elements. The last element is then the Merkle root that we are looking for.

Let us now see how a straightforward implementation looks like in Python.

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

view raw MerkleRoot.ipynb hosted with ❤ by GitHub

Most of the code should be self explaining, but a few remarks are in order. First, the hash function that we use could of course by any hash function, but, to be consistent with what bitcoin is doing, we use a double SHA256 hash. Then, byte ordering is again to be observed. The bitcoin reference client uses the method GetHash() on the involved transactions to populate the leaves, and we know that this is providing the transaction ID in little endian byte order. The entire calculation then proceeds in little endian byte order. The JSON output, however, display the transaction IDs in big endian order and the same holds for the Merkle root. Thus we need to reverse the byte order before we build the tree and, when comparing our result with the Merkle root which is part of the block, reverse the byte order again.

The actual calculation of the Merkle root uses a very straightforward approach – instead of maintaining one long list as the original bitcoin reference representation did, we represent the nodes at each level of the tree by a separate list and use recursion. This is probably a bit less efficient, but maybe easier to understand.

In order to test our implementation, we finally use the Python requests library to get a block in JSON format from blockchain.info using their REST API. We then build the leaves of the tree by walking through the list of transaction IDs in the block, build the Merkle root and compare the result to the value which is part of the JSON block structure. Nice to see that they match!

What happens if a client wants to verify that a given transaction is really included in a block without downloading the full block? In this case, it can ask the bitcoin server (either via the RPC call gettxoutproof or as part of the GETDATA network message) to provide a Merkle block, which is a serialized instance of CMerkleBlock. Essentially, a Merkle block consists of a block header plus a branch of the Merkle tree that is required to verify that a specific transaction is part of the block. This eliminates the need to transfer all transactions (which can be several thousand) while still allowing the client to check that a transaction is really included in a block.

This completes our short discussion of Merkle trees. We are now able to calculate the Merkle root for inclusion in a block header given the list of transactions that a block contains. To be able to mine a block, there is one more thing that we need to understand and that we will cover in the next post in this series – mining difficulty and the proof-of-work target.

Finite Markov chains

In this post, we will look in more detail into an important class of Markov chains – Markov chains on finite state spaces. Many of the subtleties that are present when studying Markov chains in general state spaces do not appear in the finite case, while most of the key ideas and features of Markov chains are still visible, so this is a good starting point if you want to grasp the key points.

So let us assume that our state space X is finite. For simplicity, we label our states as $\{1, 2, \cdots N \}$ where N is the number of states. We also assume that all our points are measurable, i.e we consider our state space as a discrete probability space.

Now consider a sequence of random variables X₀, X₁, …. How can we formalize the idea that X_t+1 depends on X_t in a randomized way?

As so often in probability theory, let us model the dependency as a conditional probability. The conditional probability for X_t+1 to take on a value i given X_t

$P(X_{t+1} = i | X_t)$

considered as a function of X_t will assign a conditional probability to each of the states i and for each value of X_t. Therefore we can write this as a matrix

$P(X_{t+1} = i | X_t = j) = K_{ji}$

To obtain a time homogeneous Markov chain, we also assume that the matrix K does not depend on the time t. Therefore we define a Markov chain on X to be a sequence $\{X_t\}_t$ of random variables taking values in X with the Markov property saying that for all times t, the conditional distribution for X_t+1 given all previous values $X_0, X_1, \dots, X_t$ only depends on X_t, i.e.

$P(X_{t+1} | X_t, X_{t-1}, \cdots ) = P(X_{t+1} | X_t)$

We also require that this conditional probability is independent of t and is therefore given by a matrix K as in the formula above.

Markov chains on finite state spaces are often visualized as a graph. Suppose, for instance, that our state space contains only two elements: $X = \{ 1 , 2\}$ . We can think of the combined values $X_t$ for all times t as a history of states or as a random walk in the state space. The Markov property then means that the probability to transition into a next state does not depend on the full history, but only on the current state – Markov chains do not have a memory.

In our state space with two elements only, the Markov chain is then described by four transition probabilities: the probability to stay in state 1 when the chain is in state 1, the probability to move to state 2 when being in state 1, the probability to stay in state 2 and the probability to move to state 1 after being in state 2. This can be visualized as follows.

Let us now calculate a few probabilities to get an idea for the relevant quantities in such a model. First, suppose that at time 0, the model is in state j with probability $\mu_j$ . What is the probability to be in state j after one step? Of course we can write

$P(X_1 = i) = \sum_j P(X_1 = i, X_0 = j)$

Now, according to the rules of conditional probabilities, we can express the joint probability as follows.

$P(X_1 = i, X_0 = j) = P(X_1 = i | X_0 = j) P(X_0 = j)$

Plugging this into our previous expression and using the definitions of K and $\pi$ , we now obtain

$P(X_1 = i) = \sum_j K_{ji} \mu_j$

Thus if we think of $\mu$ as a row vector, then the probability after one step is described by the matrix product $\mu K$ .

Now let us calculate a slightly different quantity. Assume that we know that a chain starts at j. What is the probability to be at i after two steps? Using once more the rules of conditional probability and the Markov property, we can write

$P(X_2=i | X_0=j) = \sum_k P(X_2=i | X_1=k)P(X_1=k | X_0=j)$

Intuitively, this is very appealing. To get from j to i in two steps, we can take the way via any intermediate state k. To get the total probability, we simply sum up all these different probabilities! If you have ever seen path integrals in quantum mechanics, this idea will look familiar.

Again, we can write this in matrix notation. Each of the conditional probabilities on the right hand side of the expression above is a matrix element, and we find that

$P(X_2 = i | X_0 = j) = \sum_k K_{ki} K_{jk}$

so that the probability to get in two steps from j to i is simply given by the elements of the matrix K². Similarly, the n-step transition probabilities are the entries of the matrix Kⁿ.

Let us look at an example to understand what is going on here, which is known as a finite random walk on a circle. Our state space consists of N distinct points which we place arbitrarily on a circle. We then define a Markov chain as follows. We start at some arbitrary point x. In each step, we move along the circle to a neighbored point – one point to the left with probability 1/2 and one point to the right with probability 1/2. By the very definition, the transition probabilities do not change over time and depend only on the current state, so is a finite Markov chain. If we label the points on the circle by $1, 2, \dots, N$ , then the transition matrix is given by a matrix of the form (for instance for N=4)

$K = \begin{pmatrix} 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \end{pmatrix}$

More generally, we can also allow the process to stay where it is with probability 1 – p, where p is then the probability to move, which leads us to the matrix

$K = \begin{pmatrix} 1-p & \frac{p}{2} & 0 & \frac{p}{2} \\ \frac{p}{2} & 1-p & \frac{p}{2} & 0 \\ 0 & \frac{p}{2} & 1-p & \frac{p}{2} \\ \frac{p}{2} & 0 & \frac{p}{2} & 1 - p \end{pmatrix}$

Let us try to figure out whether the target distribution, given by the matrix Kⁿ for large n, somehow converges.

To see this, we do two numerical experiments. First, we can easily simulate a random walk. Suppose that we have a function draw which accepts a distribution (given by a vector p whose elements add up to one) and draws a random value according to that distribution, i.e. it returns 1 with probability p₀, 2 with probability p₁ and so on. We can then simulate a random walk on the circle as follows.

def simulate_chain(N, p, steps=100, start=5):
    chain = []
    x = start % N
    chain.append(x)
    for i in range(steps):
        x = (x  + draw([p/2.0, 1.0 - p, p / 2.0]) - 2) % N
        chain.append(x)
    return chain

Here N is the number of points on the circle, steps is the number of simulation steps that we run and start is the starting point. The function draw then returns 1 with probability p/2, 3 with probability p/2 and 2 with probability 1 – p. Thus we move with probability p/2 to the right, with probability p/2 to the left and stay were we are with probability 1 – p. If we set p = 0.8, this results in the following transition matrix.

$\begin{pmatrix} 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 \\ 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 & 0.40 \\ 0.40 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.40 & 0.20 \\ \end{pmatrix}$

We can visualize the results if we execute a large number of runs and draw histograms of the resulting positions 100, 200, 300, … steps. The result will look roughly like this (for this image, I have used p = 0.8 and N=10):

We see that after a few hundred steps, this seems to converge – actually this starts to look very much like a uniform distribution. As we already know that the distribution after n steps is given by the matrix Kⁿ, it makes sense to look at high powers of this matrix as well. To visualize this, I will use a method that I have seen in MacKays excellent book and that works as follows.

Visualizing one row of a matrix is not difficult. We can plot the entries in a two-dimensional diagram, where the x-axis corresponds to the column index and the y-axis corresponds to the value. A flat line is then a row in which all entries have the same value.

To display a full matrix, we use colors – we assign one color to each row index and plot the individual rows as just described. If we do this for the powers of the matrix K with the parameters p = 0.8 and N = 10 (and only chose some rows, for instance 1,4,7, to not run out of colors…) we obtain an image like the one below.

We can clearly see that the powers of the matrix K converge towards a matrix

$K^\infty = \lim_{n \rightarrow \infty} K^n$

where all entries in each row seem to have the same value. We have seen that the distribution after n steps assuming an initial distribution described by a row vector $\mu$ is $\mu K^n$ . Therefore the limit distribution is $\mu K^\infty$ . If we take $\mu$ to be a unit vector, we find that the rows of the matrix $K^\infty$ do actually represent the limit distribution of all chains that have started at a specific point i of the state space. Therefore the entries in the rows need to sum up to one, and thus, if they are all equal, need to be equal to 1 / N. If you calculate and print out high powers of the matrix K, you will in fact see that they approach the matrix where all entries are 0.1 (as we have chosen N = 10 in this example).

To completes our short introduction into Markov chains. We have seen that Markov chains model stochastic processes in discrete time in which the state at step t+1 depends only on the state at step t and the dependency is given by a function independent of the current time. In finite state spaces, Markov chains are described by a transition matrix K. The i-th row of the matrix Kⁿ is the distribution of chains starting at point i after n steps. Consequently, converge properties of the Markov chains can be related to converge of high powers Kⁿ and the apparatus of linear algebra can be applied.

In the next post, we will learn more about convergence – how it can be made precise, and how we can tell whether a given Markov chain converges. This will then allow us to construct Markov chains that converge towards a given target distribution and use them for sampling.

Understanding bitcoin blocks

In order to analyse a block, obviously the first thing that we need is – a block. Fortunately, this is not difficult. We can either use the block explorer hosted by blockchain.info to get the second block in the blockchain in raw or JSON or use curl to retrieve the block from the command line, say in raw format.

$ curl https://blockchain.info/de/block/000000006a625f06636b8bb6ac7b960a8d03705d1ace08b1a19da3fdcc99ddbd?format=hex

010000004860eb18bf1b1620e37e9490fc8a427514416f
d75159ab86688e9a8300000000d5fdcc541e25de1c7a5a
ddedf24858b8bb665c9f36ef744ee42c316022c90f9bb0
bc6649ffff001d08d2bd61010100000001000000000000
0000000000000000000000000000000000000000000000
000000ffffffff0704ffff001d010bffffffff0100f205
2a010000004341047211a824f55b505228e4c3d5194c1f
cfaa15a456abdf37f9b9d97a4040afc073dee6c8906498
4f03385237d92167c13e236446b417ab79a0fcae412ae3
316b77ac00000000

For this post, I have chosen a block from the very early days of bitcoin, as this block contains only one transaction and is therefore rather small – current blocks typically contain more than 2.000 transactions and are too long to be displayed easily. This block is actually the third block on the blockchain and has height two. The height is the position of a block on the blockchain (ignoring forks). The block with height 0 is the very first block and actually hardcoded into the bitcoin software, this block is called the genesis block. The first blocks have all been mined (according to their time stamp) in the early morning hours of January 9th 2009.

If you want to follow this post step by step, I recommend to download and run the Jupyter notebook containing the respective code.

In order to understand a bitcoin block, let us again look at the source code of the reference implementation. If you have followed me through the dissection of a bitcoin transaction, you will not have any difficulties to understand the code in primitives/block.h and find that a block is made up of a block header (an instance of CBlockHeader) and the block data which is an instance of the derived class CBlock and contains, in addition to the header fields, the transactions that make up the block. Let us now go through the header fields one by one.

class CBlockHeader
{
public:
    // header
    int32_t nVersion;
    uint256 hashPrevBlock;
    uint256 hashMerkleRoot;
    uint32_t nTime;
    uint32_t nBits;
    uint32_t nNonce;

The first field is the version number, in the usual little endian encoding. Here, the version is simply 1 – not really surprising for the second block that was ever mined. In more recently mined blocks, the version field looks different and is most often 0x20000000 – this is not just a version number any more but a bit mask used to indicate any soft forks that the miner is willing to support, as specified in BIP 009.

The next field is the hash of the previous block. Similar to transactions, this in practice identifies a block uniquely (and is a better identifier than the position in the block chain since it also works for blocks in a fork). Actually, this is not really the hash of the entire block, but of the block header only (GetHash is a method of CBlockHeader).

If we compare the previous blocks hash ID with the JSON output, we find that compared to the JSON output, the order of the bytes is reversed. We have already seen that byte reversal when looking at transactions – the ID of the previous transaction also appears reversed in the serialization compared to the output of a blockchain explorer or the bitcoin-cli tool Time to look at why this happens in a bit more detail.

If you look at primitives/block.h, you will find that the hash of the previous transaction is a uint256. This data type – a 256 bit integer – is defined in uint256.h and is a number. As every number on the x86 platform (on which the bitcoin software was developed), this number is internally stored in little endian byte order, i.e. the least significant bit first (look for instance at the method base_blob::GetUint64 to see this) and is serialized as a little endian byte string (see base_blob::Serialize). However, if it is displayed, it is converted to a big endian representation in the the function base_blob::GetHex(), which simply reverts the order bytewise.

An example where this happens is the JSON RPC server. In blockheaderToJSON in rpc/blockchain.cpp, for instance, we find the line

if (blockindex->pprev)
        result.push_back(Pair("previousblockhash", blockindex->pprev->GetBlockHash().GetHex()));

So we see that GetHex() is called explicitly to convert the little endian internal representation of the hash into the big endian display format. This explains why the order of the hash appears to be reversed in the serialized raw format.

Having clarified this, let us now proceed with our analysis of the raw block data. The next field is called the Merkle root. Essentially, this is a hash value of the transactions which are part of the block, aggregated in a clever way, a so called Merkle tree. We will look at Merkle trees in a separate post, for the time being let us simply accept that this is a hash of the underlying transactions. Again, note that the order of the bytes is reversed between the JSON representation (big endian) and the serialized format (little endian).

The next field is again easy, it is the creation time of the block as announced by the miner who did build the block. As usual in the Unix world, the time is stored as POSIX time, i.e. essentially in seconds since the first of January 1970.

Again, byte order is significant – the hex string b0bc6649 in our case first needs to be reversed and then transformed into an integer:

nTime = nTime = int.from_bytes(bytes.fromhex('b0bc6649'), "little")
import time
print(time.ctime(nTime))

This should give you Friday, January 9th 2009 at 03:55:44.

The next field is called the bits – a name which would probably not win a prize for being a unique, meaningful and descriptive variable name. If you search the source code a bit, you will find that in fact, those bits represent the difficulty which controls the mining process – again, this is a matter for a separate post.

The last field in the header is again relevant for the mining process. It is called the nonce and is used as a placeholder that a miner can set to an arbitrary value in order to produce blocks with different hash values (actually, there is a second nonce hidden in the signature script of the coinbase transaction, we get to this point later when we discuss the mining in detail).

This completes the fields of the block header. In addition to these fields, a full block contains a vector holding the transactions that are contained in the block. As every vector, this vector is serialized by first writing out the number of elements – one in this case – followed by the individual transactions, encoded in the usual serialized format.

With this, we have completed our short tour through the layout of a block and are ready to summarize our findings in the following diagram.

In the next few posts, we will dig deeper into some of these fields and learn how Merkle trees are used to validate integrity and how the difficulty and the nonce play together in the mining process.

Quantum computing and the Blockchain

Browsing the MIT Technology Review, I recently found an article on the potential threat that quantum computing means for the bitcoin protocol in its current form. This article is in fact a review of a paper that appeared last October on the arxiv preprint server. If you use your favorite search engine and search for combinations of words like quantum computing, threat, bitcoin and cryptography, you find many articles on that topic with a greatly varying degree of accuracy – so it is definitely worth taking a look at that paper and see what it really says.

In the bitcoin protocol, computing power enters the picture at two vital points. The first point is the mining process. Recall that in order to create a new block, a mining node has to solve a mathematical challenge called the proof-of-work which is to adjust a part of the new block – the nonce – in such a way that the first few hexadecimal digits of the block hash become zero. The number of digits that need to be zero is controlled by the so called difficulty, a parameter of the network which is dynamically adjusted based on the mining rate, aimining at keeping the mining rate constant at roughly 6 blocks per hour.

So finding the nonce effectively amounts to inverting the used hash function (a double SHA256 hash). Currently, the only (at least publicly) known approach is brute force – simply try out a large number of values for the nonce in parallel and stop when you find a value that solves the puzzle.

On the site blockchain.info, you can find, among much additional information, a graphical history of the difficulty over time. At the time of writing, the development since the birth of the bitcoin looks as follows.

It appears that the difficulty has increased exponentially over the last years (the same web page also offers a version of that graph with a logarithmic scale which demonstrates that starting at 2015, the difficulty did in fact develop essentially exponentially over time). This can be used to estimate the number of hashes per second that are computed by all miners in the network together, and on blockchain.info, there is a nice graph illustrating the result. At the moment, the hash rate is estimated to be roughly 25.000 tera hashes per second, i.e. 2,5 * 10¹⁶ hashes per second, growing exponentially as well.

This dramatic increase in computing power devoted to mining is not just the result of more mining nodes participating in the network. It is also due to an evolution of the technology used to mine bitcoin. In 2010, miners started to use graphical processing units (GPUs), followed by FPGAs about one year later and finally by ASICS, i.e. custom manufactured integrated circuits which can only do one thing – computing hashes – but do this amazingly fast.

In [2], the authors investigate the question at which point in time quantum computing might be developed to the point where a quantum computer can solve the same challenge at a speed that is comparable to currently existing devices. Essentially, their conclusion is that it will easily take between 10 and fifteen years for this to pose a real threat for the bitcoin network. They estimate that quantum computers that can implement the required algorithms will not appear before 2018, and even then the speed of the quantum gates will be much lower than the speed of the custom built ASICS so that it will take in addition at least ten years until a single quantum computers reaches the range of hashing power that the entire bitcoin network represents. Thus, it is unlikely that we will soon see that a single quantum computer is able to run an attack against the bitcoin network based on hashing power.

However, there is a second threat that is much more realistic, targeting the second point in the bitcoin protocol where computing power is essential – the signature algorithm based on elliptic curve cryptography.

In fact, elliptic curve cryptography is based on the difficulty to calculate a discrete logarithm on a classical computer, and was one of the first problems for which it could be shown (see [3]) that it could be solved using a quantum computer much faster, i.e. polynomial in the number of inputs.

In [2], a few attack scenarios are described and one specific scenario is analysed in detail. To understand this scenario, suppose that Alice is creating a transaction to transfer bitcoin to Bob. When she uses the standard P2PKH signature scheme, that transaction will contain the hash value of Bobs public key, but the full public key of Alice. So when this transaction appears on the network, the public key is known. If an attacker is able to solve the discrete logarithm that is required to derive the private key from the public key (see this post for the relation between public and private key), the attacker can use the private key to create an own transaction to spend the UTXO that Alice is using. That attack would have to be executed before the original transaction is added to a block, thus within less than 10 minutes.

The authors then determine how many qubits a quantum computer would need to solve the discrete logarithm and again predict, based on the current rate of improvement of quantum computing technology, how long it would take until a first quantum computer becomes available that can solve this in less than 10 minutes. Different scenarios are considered, but in the most aggressive scenario, this will happen in 2027, i.e. less than ten years from now.

So the conclusion is that the most serious threat to the bitcoin protocol in its current form comes from the anticipated ability of quantum computers to quickly break the ECDSA algorithm in less than ten years from now. Fortunately, the authors also propose countermeasures. In fact, there are public key signature algorithms that appear to be quantum safe, i.e. for which no quantum algorithm is known which can solve the underlying mathematical challenge faster than on traditional hardware. So it is possible to adapt the bitcoin protocol in a way that makes it less vulnerable to attacks by quantum computers.

Thus the key take aways from the paper seem to be that quantum computing will most likely not kill the bitcoin protocol, but force it to change – and that we still have a few years to go even under optimistic assumptions on the rate of further improvement of quantum computing technology.

References

1. L. Cocco, M. Marchesi, Modeling and Simulation of the Economics of Mining in the Bitcoin Market, PLoS One 11 (10) (2016)
2. D. Aggarwal et al., Quantum attacks on Bitcoin, and how to protect against them
3. P. W. Shor. Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. SIAM Review, 41(2):303–332, 1999 (see also the preprint on the arxiv)

Monte Carlo methods and Markov chains – an introduction

In our short series on machine learning, we have already applied sampling methods several times. We have used and implemented Gibbs sampling, and so far we have simply accepted that the approach works. Time to look at this in a bit more detail in order to understand why it works and what the limitations of the algorithm are.

Regardless of whether you want to simulate ferromagnetic behavior in an Ising model, run a Hopfield network or train a Boltzmann machine, the fundamental problem that we have to solve is always the same. We are given a probability distribution P living on some state space X, and we are trying to create a sample, i.e. a set of points in the state space such that the probability for a point x to appear in this sample is equal to the probability P(x) given by the probability distribution.

The naive approach to this is simple: visit every point x in the state space and include that point with probability P(x). However, it is clear that with a large state space, this approach is not computationally feasible. In the example of a Boltzmann machine trained on handwritten digits with 28 x 28 pixels, our state space has 2⁷⁸⁴ elements, and there is no way we can visit them all one by one. Instead, we would need something like a randomized walk through the state space. We could start with same randomly chosen state X₀, then – using a randomized transition rule – move on to a point X₁ and so forth. Intuitively, we want to select our transition rule in such a way that the state space elements X_i selected in this way form a sample, i.e. such that our chain of state space locations visits regions with large values for P(x) more often than regions with low values of P(x). Thus we would systematically ignore regions of the state space with low probability which would greatly reduce the number of states that we have to visit to obtain a valid sample.

So, from a mathematical point of view, we consider a sequence of random variables X_i such that X_i+1 is related to X_i by some randomized transition rule. We also assume that this rule does not depend on the index i which is usually called the time. Thus we have a sequence of random variables X_i which is not independent, but almost independent – X_i+1 depends only on X_i and in way that itself does not vary with i. This is called a Markov chain (more precisely, a time homogeneous Markov chain).

Let us consider an example to illustrate the idea. As our state space, we choose the space or real numbers. We fix a starting value, say X₀=0, and we obtain the next value by adding a number that we draw from a standard normal distribution. Thus, mathematically, we assume that W_n are identically distributed and independent random variables, all distributed according to the standard normal distribution, and set

$X_{n+1} = X_n + W_n$

This is a Markov chain: the value X_n+1 depends only on X_n, not on any earlier elements of the chain. The transition rule is randomized, but itself does not depend on the time step n – all W_n have the same distribution. Let us implement this in Python to see how it works (the full notebook can be downloaded here).

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

view raw GaussianRandomWalk1.ipynb hosted with ❤ by GitHub

Here we have created and displayed three different random walks. All of them start at the same point (zero), and all of them follow the same transformation rule, but as the transformation rule is stochastic in nature, they all develop differently.

Now let us try to turn the view on this upside down. This time, we execute a larger number – 1000 – of random walks with 5000 steps each. But instead of plotting the sequence of points X_i for every walk, we display the distribution of the last point of each walk, i.e. we plot the distribution of the random variable X₄₉₉₉.

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

view raw GaussianRandomWalk2.ipynb hosted with ❤ by GitHub

This does in fact look more familiar. We see that most walks end up being close to zero at the end – steps in the positive direction and steps in the negative direction cancel each other. Only very few walks end up at an extreme position close to plus or minus 200 – this is not surprising as well, to arrive at an extreme point, we would need to draw 5000 times in a row an extreme value from the random normal distribution, which is a rather unlikely chain of events.

In this case, the distribution does actually not converge if we increase the number of steps – you can try this out and play with different values, i.e. replace 5000 by 50000 (this will run some time) and look at the distribution of X₄₉₉₉₉ – you will see that this is now spread out to roughly plus / minus 750 (in fact, the distribution is obtained as a multiple convolution of the standard normal distribution with itself and thus is again a normal distribution).

Even though the distribution does not converge, we have been able to sample from a specific distribution – in this case the distribution after 5000 steps – using only the ability to sample from a different distribution – in this case the random normal distribution. Obviously, in this special case, the result is trivial, but the principle that we have found looks interesting. Can we generalize this approach to obtain sampling methods for target distributions that are otherwise intractable?

Now this is exactly the idea behind the sampling approach that is commonly known as Markov chain Monte Carlo (MCMC) and which has become very popular, with applications to complex simulations in theoretical physics, to machine learning and even asset pricing and value-at-risk calculations.

So let us summarize how the MCMC approach works. Given a target distribution P(x), we first construct a Markov chain that converges to that target distribution. Once we have that, we can simulate a large number of runs and use the resulting points as our sample (in fact, in many cases we can also do with one run only, as we will see later). Thus in order to utilize Markov chains for sampling, we would need to understand under what conditions a Markov chain converges and if it converges, how we can relate the target distribution to the transformation rule. We will look into these points in more detail in future posts in this series.

Bitcoin mining – an overview

In my recent posts in this series, I have walked you in detail through two of the basic objects that make up the bitcoin blockchain – participants, represented by public / private key pairs, and transactions. Time to look at the third major ingredient – blocks.

What are blocks and why are they necessary? When the bitcoin protocol was invented in 2008 by a person (or group of persons) acting under the pseudonym Satoshi Nakamoto (see [1]), the real novelty was not the usage of a chain of transactions in order to model payments. The real novelty was the proposed solution for a major issue with a cryptocurrency that is based solely on such a chain: the problem of double spending. To understand this, suppose that Eve agrees on a deal with Alice: Eve will transfer a certain amount of bitcoin to Alice in exchange for another currency like USD. She would pick an unspent transaction output of the correct amount, create and sign a transaction and publish that transaction in the network. Alice could see and validate the transaction appearing in the network and would then transfer the USD amount to Eve.

Unfortunately, Eve could, at virtually the same time, agree on a similar deal with Bob and create another transaction transferring the same unspent output to Bob.

Now we have two transactions in the network that both claim to spend the same previously unspent transaction output. However, what if, due to some network latency, Bob and Alice detect the second transaction only after they have initiated the transfer of the USD amount? And if they detect the problem, how can they – and the entire network – agree on which of these transactions should be considered valid and which should be discarded? This is the double spend problem.

To solve this, a mechanism is needed that allows all nodes of the network to agree on an order of transactions. If we have that order, the solution to our problem is easy – the first transaction, i.e. the transaction with the lowest order wins. Satoshis proposal was to implement an ordering by what he called a timestamp service and what is now known as the blocks of the blockchain.

Essentially, blocks are groups of transactions. Nodes in a bitcoin network collect transactions that have been posted, but not yet included in a block. They then calculate hash values of these transactions and organize them into a larger data structure – the block. But similar to transactions, blocks are linked into a chain – Satoshi calls this proof-of-work chain and in fact the term blockchain does not appear at all in his paper but was introduced later. Each (full) node in the network maintains a local copy of this chain. Whenever a node creates a block, it includes the hash value of the current last block in the chain in the block data structure. It then publishes the block on the network. If other nodes pick up the block, they can verify that it is valid – in particular they can verify that none of the transactions included in the block lead to double spending with respect to any other transaction in this or earlier blocks. If this validation is successful, the block is added to the copy of the chain maintained by the server.

The fact that the blocks in the chain are linked makes it more difficult to alter a block (and the transactions contained in it) that is already buried below some new blocks. In fact, in order to do this, we would have to recreate all blocks between the current last block and the block that we want to alter, as the blocks contain hash values of their predecessor. In order to make sure that this is virtually undoable, every node that creates a block needs to solve a mathematical puzzle – a so called proof of work. The difficulty of this puzzle can be adjusted dynamically and is kept at a level where approximately 10 minutes of work are required to create a new block. This also makes it less likely that new blocks are created at exactly the same point in time.

In the bitcoin block chain, this proof of work is designed as follows. The block data structure contains hash values of the transactions that are part of the block and – in addition – a field called the nonce that the node creating the block can determine at will. As the nonce enters the hash value of the block, this implies that, simply by trying out different values for the nonce, a node can create blocks whose hash value has certain properties. The mathematical challenge that a node has to solve to create a block that is accepted by the network is to adjust the nonce such that the hash value of the block is zero in the first few hexadecimal digits. The number of leading digits that need to be zero is the difficulty which is adjusted over time to make sure that the block creation rate is kept at one block per ten minutes.

Thus a node that wishes to create a new block will collect transactions not yet included in a block, verify them (and will thereby avoid double spend), add them to a block along with the hash of the current last block, try out some value of the nonce, calculate the hash value, check whether its first digits are zero, try a different nonce if that is not the case and so on until a matching nonce is found. Then the block is published, verified by the other nodes and – if it passes the checks – added to the block chain as the next block. Nodes that create new blocks are called miners.

Let us now trace the full history of a sample transaction throughout this process.

Suppose Alice wants to transfer 50 BTC to Bob. She creates and signs a transaction according to the rules that we have seen in the previous posts. That transaction is then published in the network. A mining node picks up the transaction and assembles it – along with many other transactions – in a new block. That block is then distributed on the network. Each node on the network validates the block and adds it to the copy of the blockchain that this node maintains. Finally, Bob can listen on the network as well. As soon as Bob finds that the transaction has been added to a block and this block has been included in the network, he can be relatively sure that the transaction is valid. Bob can now be confident to be protected against double spend and use this transaction in turn as unspent output (as a rule of thumb, a transaction is considered to be confirmed if it is included in a block and at least five additional blocks have been added on top of that, i.e. after roughly one hour).

There are still a few open ends. Why, for instance, should a mining node invest CPU power in the creation of blocks? And what happens if two different blocks are mined at the same time?

Let us first start with the question of incentivation. In order to make sure that miners participate in the process, a miner is rewarded for every block that he creates. In fact, each new block contains a special transaction called a coinbase transaction. Such a transaction has no inputs, and the mining node can determine the address to which the output is credited. Thus, effectively, mining creates bitcoins, and the miner earns the newly created bitcoins.

The amount of a coinbase transaction used to be 50 BTC in the first days of the bitcoin protocol, and is halved every 210.000 blocks. At the moment, the reward is 12.5 BTC, which is the equivalent of roughly 10.000 USD at the current exchange rate. This is the incentive for the mining node to invest energy and hardware cost into the process.

Now let us see what happens if two valid blocks are mined at the same time. In this case, we create a fork of the chain.

Suppose, for instance, that two different nodes find a solution for the mathematical puzzle at exactly the same point in time and publish two blocks that could fit into position N of the chain. Let us call these blocks N and N’. Now all the other nodes in the network have to take a decision which chain they consider to be the main chain – the chain continued with N or the chain continued with N’. There is one basic rule in this game – the longest chain wins. However, at this point in time, both chains have the same length. So some of the nodes – say roughly half of them – will opt for the first chain, the other part of the nodes will opt for the second chain. Now each mining node will continue to work on the chain that this node considers to be valid. So roughly 50% of the miners will start to work on a block which could follow block N and the other nodes will start to search for a block that could follow N’.

Now it is rather unlikely that again, both puzzles are solved at exactly the same point in time. So let us assume that the next block which is mined is based on N. This block will then be published. Now the chain based on N is longer than the chain based on N’, and all the other nodes will – according the rule that the longest chain is the winning chain – discard the chain based on N’ and continue to mine blocks based on the new chain that included N. Thus the fork is resolved and all chains agree again on one copy of the chain.

That completes our short overview of the mining process. In the next couple of posts, we will again analyze the structure of a block in detail and see whether we will eventually be able to mine and publish a block in our test network.

References

1. Satoshi Nakamoto, Bitcoin: A Peer-to-Peer Electronic Cash System, 2008
2. Bitcoin History: Pre-Blockchain Digital Currencies

Training a restricted Boltzmann machine on a GPU with TensorFlow

During the second half of the last decade, researchers have started to exploit the impressive capabilities of graphical processing units (GPUs) to speed up the execution of various machine learning algorithms (see for instance [1] and [2] and the references therein). Compared to a standard CPU, modern GPUs offer a breathtaking degree of parallelization – one of NVIDIAs current flagships, the Tesla V100, offers more than 5.000 CUDA cores that can perform work in parallel. As training and evaluating neural networks involves many floating operations on large matrices, they can benefit heavily from the special capabilities that a GPU provides.

So how can we make our code execute on a GPU? Of course you could program directly against the CUDA interface or similar interfaces like OpenCL. But specifically for the purposes of machine learning, there are easier options – over the last years, several open source frameworks like Theano, Torch, MXNet or TensorFlow have become available that make it comparatively easy to leverage a GPU for machine learning. In this post, I will use the TensorFlow framework, simply because so far this is the only one of these frameworks that I have used (though MXNet looks very interesting too and I might try that out and create a post on it at some point in the future).

It takes some time to get used to the programming model of TensorFlow which is radically different from the usual imparative programming style. As an example, let us suppose we wanted to add two matrices. In Python, using numpy, this would look as follows.

import numpy as np
a = np.matrix([[0, 1], [1, 0]])
b = np.matrix([[1, 0], [0, 1]])
c = a + b
print(c)

This program is described by a sequence of instructions (let us ignore the fact for a moment that these are of course functions that we call – ultimately, functions are composed of instructions). When we execute this program, the instructions are processed one by one. First, we assign a value to the variable a, then we assign a value to a variable b, then we add these two values and assign the result to a variable c and finally we print out the value of c.

The programming model behind TensorFlow (and other frameworks like Theano) is fundamentally different. Instead of describing a program as a sequence of instructions, the calculations are organized as a graph. The nodes in this graph correspond to operations. The edges joining the nodes represent the flow of data between the operations. In TensorFlow, data is always represented as a tensor, so the edges in the graph are tensors. An operation consumes data from its inputs, processes it and forwards it to the next operation in the graph as its output.

A program using TensorFlow typically consists of two phases. In the first phase, we build the graph, i.e. we define the operations and their inputs and outputs that make up the calculation that we want to perform. However, in this phase, no calculations are actually performed. Instead, this happens in the second phase when we actually run the graph. For that purpose, we create a session. Roughly speaking, a session defines an environment in which a graph can be executed. Once the session has been defined, we can invoke its run method. To the run method, we pass as an argument the operation in the graph that we want to execute. The run method will then trace the graph backwards and evaluate all operations that provide input to our target operation recursively, i.e. it will identify the subgraph that needs to be executed to evaluate our target operation.

Let us again use the example of a simple addition to illustrate this. The source code looks as follows.

import tensorflow as tf
#
# Build the model
#
a = tf.constant([[0, 1], [1, 0]], name="a")
b = tf.constant([[1, 0], [0, 1]], name="b")
c = tf.add(a,b, name="c")

#
# Create a session and run it
#
session = tf.Session()
print(session.run(c))

First, we import the tensorflow library itself. Then, in the next three lines, we build the graph. We define three nodes in the graph. The first two nodes are special operations that output simply a constant value. The third operation is the operation that performs the actual addition and uses the previously defined operations as input. Thus our final graph has three nodes and two edges, as shown below.h

In the next line, we create a TensorFlow session which we then run. The argument specifies which operation we want to execute and therefore determines which part of the graph we will actually run. The output of the run method is an ordinary numpy array which we then print out.

Let us now look at an example which is slightly more complicated. In the PCD algorithm, we can compute the contribution of the negative phase to the weight updates as follows.

E = expit(self.beta*(np.matmul(S0, self.W) + self.c))
pos = np.tensordot(S0, E, axes=((0),(0)))

Here S0 is a batch from the sample set, W is the current value of the weights and c is the current value of the bias. In TensorFlow, the code to build the corresponding part of the model looks quite similar.

S0 = tf.placeholder(name="S0", shape=[batch_size, self.visible], 
                    dtype=tf.float32)
W = tf.get_variable(name="W", 
                        dtype=tf.float32, 
                        shape=[self.visible, self.hidden],
                        initializer = tf.zeros_initializer(),
                        trainable=False)
c = tf.get_variable(name="c", 
                        dtype=tf.float32, 
                        shape=[1, self.hidden],
                        initializer = tf.zeros_initializer(),
                        trainable=False)
E = tf.sigmoid(self.beta*(tf.matmul(S0, W) + c), name="E")
pos = tf.tensordot(S0, E, axes=[[0],[0]], name="pos")

The first element that we define – S0 – is a so called placeholder. This is a bit like a constant, with the difference that its value can be specified per run, using an additional argument called feed dictionary to the Session.run method. The next two elements that we define are variables. Variables are similar to operations – they represent nodes in the network and provide an output, but have no input. Instead, they have a certain value and feed that value as outputs to other operations. We then use the built-in tensorflow operations sigmoid and tensordot to calculate the expectation values of the visible units and the positive phase.

The full model to train a restricted Boltzmann machine is of course a bit more complicated. TensorFlow comes with a very useful device called TensorBoard that can be used to visualize a graph constructed in TensorFlow. The image below has been created using TensorFlow and shows the full graph of our restricted Boltzmann machine.

TensorBoard offers the option to combine operations into groups which are then collapsed in the visual representation. In the image above, all groups are collapsed except the group representing the contribution from the positive phase. We can clearly see the flow of data as described above – we first multiply S0 and W, then add c to the result, multiply this by a constant (the inverse temperature, called x in the diagram) and then apply the sigmoid operation that we have called E. The result is then fed into other, collapsed groups like the group delta which holds the part of the model responsible for calculating the weight updates.

I will not go through the full source code that you can find on GitHub as usual – you will probably find the well written tutorial on the TensorFlow homepage useful when going through this. Instead, let us play around a bit with the result.

As the PC that is under my desk is almost seven years old and does not have a modern GPU, I did use a p2.xlarge instance from Amazon EC2 which gave me access to a Tesla K80 GPU and four Intel Xeon E5-2686 cores running at 2.3 GHz (be careful – this instance type is not covered by the free usage tier, so that will cost you a few dollars). I used the Amazon provided Deep Learning AMI based on Ubuntu 16.04. After logging into the instance, we first have to complete a few preparational steps.

$ source activate tensorflow_p36
$ git clone http://www.github.com/christianb93/MachineLearning.git
$ cd MachineLearning
$ export MPLBACKEND="AGG"
$ conda install scikit-learn
$ python3 RBM.py --algorithm=PCDTF

Here we activate the pre-configured TensorFlow environment, download the source code from GitHub, set the environment variable to define our Matplotlib backend, and download and install some required packages. Then we do a first run with the BAS dataset to verify that everything works. If that is the case, we can run the actual MNIST training and sampling.

$ python3 RBM.py --N=28 --data=MNIST --save=1 --hidden=128 --pattern=256 --batch_size=128 --epochs=40000 --run_samples=1 --sample_size=6,6 --beta=1.0 --sample=200000 --algorithm=PCDTF --precision=32

This produced the following sample of 6 x 6 digits.

The execution took roughly 5 minutes – 2 minutes for the training phase and 3 minutes for the sampling phase. During the training, the GPU utilization (captured with nvidia-smi -l 2) was at around 57% and stayed in that range during the sampling phase.

A second run using the switch --precision=64 to set the floating point precision to 64 bits did not substantially change the outcome or the performance.

Then a run with the same parameters was done in pure Python running on the four CPU cores provided by the p2.xlarge instance (--algorithm=PCD). During the training phase, the top command showed a CPU utilization of 400%, i.e. all four cores where at 100%. The utilization stayed in that range during the sampling phase. The training took 10:20 minutes, the sampling 8 minutes. Thus the total run time was 18 minutes compared to 5 minutes – a factor of 360%.

Following the advice on this post, I then played a bit with the settings of the GPU and adjusted the clock rates and the auto boost mode as follows.

sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac 2505,875

That brought the GPU utilization down to a bit less than 50%, but had a comparatively small impact on the run times which now were 1:40 min (instead of 2 min) for training and 2:30 min (instead of 3 min) for sampling. So the total run time was now a bit more than 4 minutes, which is a speed up of roughly 20% compared to the default settings. Compared to the CPU, we have now reached a speed up of almost 4,5.

Next, let us compare this to the run time on two CPUs only. To measure that, I grabbed an instance of the t2.large machine type that comes with 2 CPUs – according to /proc/cpuinfo, it is equipped with two Intel Xeon E5-2676 CPUs at 2.40GHz. Interestingly, the training phase only took roughly 8 minutes on that machine, which is even a bit faster than on the p2.xlarge which has four cores. The sampling phase was faster as well, taking only 6 minutes instead of 8 minutes. It seems that adding more CPUs increases the overhead for the synchronisation between the cores drastically so that it results in a performance penalty instead of a performance improvement. To verify this, I did a run on a p2.8xlarge with 32 CPUs and got a similar result – training took 9 minutes, sampling 6:50 minutes.

Finally, I could not resist the temptation to try this out on a more advanced GPU enabled machine. So I got a p3.2xlarge instance which contains one of the relatively new Tesla V100 GPUs. I did again adjust the application clocks using

sudo nvidia-smi -ac 877,1530

With these settings, one execution now took only about 1:20 minutes for the training and 1:50 min for the sampling. However, the GPU utilization was only at 30% – so we have reached a point where just having a faster GPU does not lead to a significant speed advantage any more. The following table summarizes the results of the various measurements.

Instance	Run time training	Run time sampling
p3.2xlarge (Tesla V100)	1:20 min	1:40 min
p2.large (Tesla K80)	1:40 min	2:30 min
p2.large (4 x CPU)	10 min	8 min
p2.8xlarge (32 x CPU)	9 min	6:50 min
t2.large (2 x CPU)	8 min	6 min

Of course we could now start to optimize the implementation. For the training phase, I assume that the bottleneck that limits GPU utilization is the use of the feed dictionary mechanism which could be replaced by queues to avoid overhead of switching back between CPU and GPU. During the sampling phase, we could also try to reduce the relative overhead of the run method by combining a certain number of steps – say 10 – into the graph and thus reducing the number of iterations that happen outside of the model. It would be interesting to play with this and see whether we can improve the performance significantly. But this is already a long post, so I will leave this for later…

References

1. R. Raina, A. Madhavan, A. Ng, Large-scale Deep Unsupervised Learning using Graphics Processors, Proceedings of the 26 th International Conference on Machine Learning (2009)
2. K. Chellapilla, S. Puri , P. Simard, High Performance Convolutional Neural Networks for Document Processing, International Workshop on Frontiers in Handwriting Recognition (2006)

Creating and signing a bitcoin transaction from scratch

In the one of the the last posts in this series, we have been able to successfully verify an existing bitcoin transaction. With the understanding of how this works, we should now be able to conversely create a transaction from scratch, sign it and publish it in the bitcoin network.

Given an amount that we want to transfer and the target, we will first need to determine unspent transaction outputs which are available and select those UTXOs that we will be using. This process is called coin selection. Coin selection is a non-trivial task, as it needs to consider several variables like the number of coins in your wallet, the size of the resulting transaction, and the required fee to make sure that the transaction is picked up and included in a block by a miner. In the bitcoin reference implementation, coin selection is done in the method CWallet::SelectCoins in the file wallet.cpp. The logic is nicely described in this post on StackExchange. Essentially, it will first try to spend all those outputs that are smaller than the target amount. If that does not give an exact match, it will try to find a single output larger than the target amount. Otherwise, it will select a random subset of those outputs smaller than the target. Other algorithms exists, and I even found a masters thesis on this on the web (see [2]).

For our purpose, we will implement an easier algorithm that I have seen in the excellent book [1]. Again, we will be using the btc package that is available in my GitHub repository. The first thing that we do is to retrieve a list of all unspent transaction outputs that our wallet knows. To do this, we again use the RPC interface to access the wallet built into the bitcoin core client and call the method listunspent. We then split the resulting list of coins in two lists – one containing all entries smaller than the amount to spend, the other one containing all other entries.

#
# First we make an RPC call to retrieve unspent transaction output and 
# select the outputs that we are going to spend
#
listunspent = btc.utils.rpcCall("listunspent")
#
# Now extract transaction IDs and store that in a list of 
# dictionaries
#
# We split this list into one list of entries that are greater
# than the amount we want to transfer and one list of entries
# that are smaller
#
#
smaller = []
greater = []
amount_to_spend = float(args.amount)
for _ in listunspent:
    if _['spendable']:
        txid = _['txid']
        vout = _['vout']
        amount = float(_['amount'])
        address = _['address']
        coin = {'txid': txid,
                  'vout': vout,
                  'amount': amount,
                  'address' : address}
        if amount > amount_to_spend:
            greater.append(coin)
        else:
            smaller.append(coin)

Next we sort these two lists and build a list to_be_spent of those coins that we want to spend. If there is a coin that is greater than the amount to be spend, we take this entry and are done. If not, we try to sum up available coins from the other list to reach our target sum.

#
# Next we sort the lists. 
#
greater.sort(key=lambda entry: entry['amount'])
smaller.sort(key=lambda entry: entry['amount'], reverse=True)
#
# If greater is not emtpy, take the smallest (i.e. now first)
# element
#
if len(greater) > 0:
    amount_funded = greater[0]['amount']
    to_be_spent = [greater[0]]
else:
    #
    # We need to combine more than one transaction output
    #
    to_be_spent = []
    amount_funded = 0
    for _ in smaller:
        if amount_funded < amount_to_spend:
            to_be_spent.append(_)
            amount_funded += _['amount']
    if (amount_funded < amount_to_spend):
        # Failed, clean up list
        to_be_spent = []

Now that we have selected the UTXOs that we will use to build our transaction, we need to retrieve the corresponding private keys so that we can use them for the needed signatures. At the same time, we will build a list of the transaction outputs that we use in our standard format (i.e. as Python objects).

txos = []
privateKeys = []
for _ in to_be_spent:
    tx = btc.txn.txn()
    raw = btc.utils.rpcCall("getrawtransaction", [_['txid']])
    tx.deserialize(raw)
    #
    # Get private key using again an RPC call 
    #
    privKey = btc.utils.rpcCall("dumpprivkey", [_['address']])
    privKey = btc.keys.wifToPayloadBytes(privKey)
    privKey = int.from_bytes(privKey, "big")
    privateKeys.append(privKey)
    txos.append(tx.getOutputs()[_['vout']])

We can now assemble our transaction. We will only use one transaction output for simplicity. WARNING: this will imply that the difference between the amount represented by our unspent transaction outputs and the amount that we want to transfer counts as fees and will be lost! In our example, we use an UTXO worth 30 bitcoin and transfer only one bitcoin, i.e. we will loose almost 29 bitcoin which is almost 30 kUSD at the current bitcoin / USD exchange rate! So do not do this with real bitcoin.

Here is the code to assemble our transaction.

#
# Next we create our transaction. First we create the transaction 
# inputs. We leave the signature scripts empty for the time
# being
#
txn = btc.txn.txn()
for _ in to_be_spent:
    txin = btc.txn.txin(prevTxid = _['txid'], vout = _['vout'])
    txn.addInput(txin)

#
# Next we do the outputs. For the time being, we use only one output
# So we need to convert the address to a public key hash
#
publicKeyHash = btc.keys.ecAddressToPKH(args.target)
publicKeyHash = binascii.hexlify(publicKeyHash).decode('ascii')
#
# Create locking script
#
lockingScript = btc.script.scriptPubKey(scriptType = btc.script.SCRIPTTYPE_P2PKH, 
                                        pubKeyHash = publicKeyHash)
#
# and output
#
txout = btc.txn.txout(value = int(amount_to_spend * 100000000), 
                      scriptPubKey = lockingScript)
txn.addOutput(txout)

Next we need to sign our transaction, i.e. we need to go through all transaction inputs and add the correct signatures. We already know how to do this – simply revert the process that we have studied in my previous post. I have put the required code into the function signTransaction in the module btc.script. The only subtlety that we have to observe is that the ECDSA signature algorithm might return a pair (r,s) where s is greater than n/2 – such a signature would be rejected by the bitcoin client (see the check in interpreter.cpp/IsLowDERSignature). Therefore we need to replace s by n – s if needed. With this function, signing and sending is now very easy.

#
# Sign it
#
txn = btc.script.signTransaction(txn, txos, privateKeys)
#
# and send it
#
raw = txn.serialize()
s = btc.utils.rpcCall("sendrawtransaction", [raw, True])
print("Done, resulting transaction ID: ")
print(s)

Now let us actually run this and try it out. To be a bit more realistic, we will first create a third node called Bob in our test network. Using our stored containers, this is very easy. We can use the following command to bring up a container called Bob and – in addition – an instance of our previously created Alice container.

$ docker run -d --rm -p 18332:18332 --name=alice alice
$ docker run -it --rm --name=bob bitcoin-alpine

Now let us add a new address to Bob’s wallet that we will use as target address for our transaction. We can use the bitcoin-cli client if we first attach to the container bob

$ docker exec -it bob sh
# bitcoin-cli --rpcuser=user --rpcpassword=password -regtest getnewaddress "Bob"
n37U5uEL2h9FcZbumpoAxNYT9ACGjZCkCi

The output is the new address that we have just generated. If you now run the listaccounts command in the same terminal, you will see the newly created account and its balance which is still zero.

Now switch back to a terminal on your host. Clone my repository and run the script SendMoney.

$ git clone http://www.github.com/christianb93/bitcoin.git
$ cd bitcoin
$ python SendMoney.py --target=n37U5uEL2h9FcZbumpoAxNYT9ACGjZCkCi

Note that the argument --target is the newly created address. We are now almost done, but two steps remain. First, we will again have to connect our two nodes. As in the previous post, run docker network inspect bridge to find out the IP address of the node alice on the bridge network. In my case, this is again 172.17.0.2. Then reattach the terminal to the container bob and connect to this peer.

$ docker exec -it bob sh
# bitcoin-cli --rpcuser=user --rpcpassword=password -regtest addnode "172.17.0.2" add

Finally, we need to include our newly created transaction in the blockchain, i.e. we have to go through the process of mining again. For the sake of simplicity, we will use the container of Alice as its port 18332 is mapped into the host network and so we can reach it from our host and use the bitcoin CLI client there. So on the host (!) open a terminal and run

$ bitcoin-cli --rpcuser=user --rpcpassword=password -regtest generate 6

As a result, you should see the IDs of the six newly created blocks. Now let us switch back to the terminal attached to Bob and check the balance.

# bitcoin-cli --rpcuser=user --rpcpassword=password -regtest listaccounts
{
  "": 0.00000000,
  "Bob": 1.00000000
}

Aaaahh, the sweet taste of success – we made it. Our transaction has been accepted by the network, enclosed into a block and correctly interpreted by Bobs node as contributing one bitcoin to his balance.

This post completes our in-depth analysis of transactions in the bitcoin blockchain. With the next post, we will switch our focus and start to investigate the process of mining and the structure of blocks in detail.

1. A. Antonopoulos, Mastering Bitcoin, O’Reilly 2015
2. M. Erhardt, An evaluation of coin selection strategies, Masters Thesis, Karlsruhe Institute of Technology, available at this URL

Training restricted Boltzmann machines with persistent contrastive divergence

In the last post, we have looked at the contrastive divergence algorithm to train a restricted Boltzmann machine. Even though this algorithm continues to be very popular, it is by far not the only available algorithm. In this post, we will look at a different algorithm known as persistent contrastive divergence and apply it to the BAS data set and eventually to the MNIST data set.

Recall that one of the ideas of contrastive divergence is to use a pattern from the sample set as a starting point for a Gibbs sampler to calculate the contribution of the negative phase to the weight update. The idea behind persistent contrastive divergence (PCD), proposed first in [1], is slightly different. Instead of running a (very) short Gibbs sampler once for every iteration, the algorithm uses the final state of the previous Gibbs sampler as the initial start for the next iteration. Thus, in every iteration, we take the result from the previous iteration, run one Gibbs sampling step and save the result as starting point for the next iteration.

This amounts to running one long chain of states that are related by Gibbs sampling steps. Of course this is not exactly one longs Gibbs sampler, as the weights and therefore the probability distribution changes with each step. However, the idea is that when the learning rate is small, the weight change during two subsequent iterations is neglegible, and we effectively create one long Gibbs sampler which provides a good approximation to the actual distribution.

In practice, one often uses several chains that are run in parallel. Such a chain is sometimes called a negative particle. It is recommended in [1] to chose the number of particles to be equal to the batch size. In an implementation in Python, we can store the state of the negative particles in a matrix N where each row corresponds to one particle.

The idea to form one long Markov chain obviously works best if the learning rate is very small. On the other hand, this slows down the convergence of the gradient descent algorithm. In order to solve this, it is common to reduce the learning rate over time, for instance linearly with the number of iterations.

A second additional improvement that is usually implemented is a weight decay. Essentially, a weight decay is an additional penalty that is applied to avoid that the weights grow too large which would slow down the sampling procedure.

Let us now see how the PCD algorithm can be coded in Python. We will again store the model parameters and the state in a Python class. In the __init__ method of that class, we initialize the weights and the bias vectors and also set the particles to some randomly chosen initial value.

class PCDRBM (Base.BaseRBM):

    def __init__(self, visible = 8, hidden = 3, particles = 10, beta=2.0):
        self.visible= visible
        self.hidden = hidden
        self.beta = beta
        self.particles = particles
        #
        # Initialize weights with a random normal distribution
        #
        self.W = np.random.normal(loc=0.0, scale=0.01, size=(visible, hidden))
        #
        # set bias to zero
        #
        self.b = np.zeros(dtype=float, shape=(1, visible))
        self.c = np.zeros(dtype=float, shape=(1, hidden))
        #
        # Initialize the particles
        #
        self.N = np.random.randint(low=0, high=2, size=(particles,self.visible))
        self.global_step = 0

Assuming that we have a method runGibbsStep that runs one Gibbs sampling step with the given weight starting at some initial state, one iteration of the PCD algorithm now looks as follows.

#
# Update step size - we do this linearly over time
#
step = initial_step_size * (1.0 -(1.0*self.global_step)/(1.0*iterations*epochs))
#
# First we compute the negative phase. We run the
# Gibbs sampler for one step, starting at the previous state
# of the particles self.N
#
self.N, _ = self.runGibbsStep(self.N, size=self.particles)
#
# and use this to calculate the negative phase
#
Eb = expit(self.beta*(np.matmul(self.N, self.W) + self.c))
neg = np.tensordot(self.N, Eb, axes=((0),(0)))
#
# Now we compute the positive phase. We need the
# expectation values of the hidden units
#
E = expit(self.beta*(np.matmul(V, self.W) + self.c))
pos = np.tensordot(V, E, axes=((0),(0)))
#
# Now update weights
#
dW = step*self.beta*(pos -neg) / float(batch_size) - step*weight_decay*self.W / float(batch_size)
self.W += dW
self.b += step*self.beta*np.sum(V - self.N, 0) / float(batch_size)
self.c += step*self.beta*np.sum(E - Eb, 0) / float(batch_size)
self.global_step +=1

As always, the full source code is available from my machine learning GitHub repository. I have enhanced the code in RBM.py so that it accepts a command line parameter --algorithm that lets you choose between ordinary contrastive divergence and the PCD algorithm.

Let us now run a few trials. First, we will again use the BAS data set. You can download and run the code from the GitHub repository as follows.

$ git clone http://www.github.com/christianb93/MachineLearning.git
$ cd MachineLearning
$ python RBM.py --algorithm=PCD --run_reconstructions=1 --show_metrics=1

When the script completes, you should again see the two images. The first image displays how the reconstruction errors and weight changes behave during the training.

We see that the reconstruction error (the diagram on the right) decreases slower than it did for the ordinary contrastive divergence algorithm. On the left hand side, where the change of the weights is displayed, we can clearly see the impact of the linearly decreasing step size. The second picture shows again the result of a reconstruction attempt of slightly distorted patterns.

Let us now try out a different application of restricted Boltzmann machines – sampling. After a successful training phase, the model distribution given by the weights should be close to the empirical distribution of the training data. Thus, if we sample from the model distribution, using for instance Gibbs sampling, we should be able to obtain patterns. that somehow resemble the training data.

We will use this to generate handwritten digits based on the well known MNIST data set, more precisely the copy available at mldata.org. To download and read the data set, we use the method fetch_mldata provided by the scikit learn library. We will then train our network for 40.000 epochs using 60 images out of this data set and 128 hidden units and subsequently run 200.000 Gibbs sampling steps starting from a random pattern.

$ python RBM.py --algorithm=PCD --data=MNIST --N=28 --epochs=40000 --pattern=60 --hidden=128 --run_samples=1 --sample=200000 --save=1

Note that when you run this for the first time, the MNIST data set will be downloaded and stored in a folder in your home directory, so this might take some time (the file has a bit less than 60 MBytes).

The results are already very encouraging. Most patterns resemble a digit closely, only the image at the top left corner did obviously not converge properly. However, we still see a strong bias – only very few of the 9 digits that the data set contains appear. So we probably need to fine tune the parameters like number of hidden units, learning rate, weight decay or the number of epochs to obtain better results.

Unfortunately, when you start to play around to optimize this further, you will see that the run time of the algorithm has reached a point where quick iterations to try out different parameters become virtually impossible. I have been running this on my PC that has an Intel Core i7 CPU, and Python was able to distribute this nicely across all four physical cores, taking them to 100% utilization, but still the script was already running for 7 minutes. If we want to increase the number of iterations or the number of hidden units to be able to learn more pattern, the run time can easily go up to almost 30 minutes.

Of course professional training of neuronal networks is nowadays no longer been done on a CPU. Instead, modern frameworks use the power of graphical processing units (GPUs) that are optimized for exactly the type of work that we need – highly parallel processing of floating point matrices. Therefore, I will show you in the next post in this series how you can use the TensorFlow framework to move the workload to a GPU.

1. T. Tieleman, Training restricted Boltzmann machines using approximations to the likelihood gradient, International Conference on Machine Learning (ICML), 2008
2. A. Fischer, C. Igel, Training restricted Boltzmann machines: an introduction, Pattern Recognition Vol. 47 (2014), pp 25–39