Training restricted Boltzmann machines with persistent contrastive divergence

In the last post, we looked at the contrastive divergence algorithm to train a restricted Boltzmann machine. Even though this algorithm continues to be very popular, it is by no means the only available algorithm. In this post, we will look at a different algorithm known as persistent contrastive divergence and apply it to the BAS data set and eventually to the MNIST data set.

Recall that one of the ideas of contrastive divergence is to use a pattern from the sample set as a starting point for a Gibbs sampler to calculate the contribution of the negative phase to the weight update. The idea behind persistent contrastive divergence (PCD), proposed first in [1], is slightly different. Instead of restarting a (very) short Gibbs sampler from a training pattern in every iteration, the algorithm uses the final state of the previous Gibbs sampler as the initial state for the next iteration. Thus, in every iteration, we take the result from the previous iteration, run one Gibbs sampling step and save the result as the starting point for the next iteration.

This amounts to running one long chain of states that are related by Gibbs sampling steps. Of course this is not exactly one long Gibbs sampler, as the weights and therefore the probability distribution change with each step. However, the idea is that when the learning rate is small, the weight change between two subsequent iterations is negligible, and we effectively create one long Gibbs sampler which provides a good approximation to the actual distribution.

In practice, one often uses several chains that are run in parallel. Such a chain is sometimes called a negative particle. It is recommended in [1] to choose the number of particles to be equal to the batch size. In an implementation in Python, we can store the state of the negative particles in a matrix N where each row corresponds to one particle.

The idea to form one long Markov chain obviously works best if the learning rate is very small. On the other hand, this slows down the convergence of the gradient descent algorithm. In order to solve this, it is common to reduce the learning rate over time, for instance linearly with the number of iterations.
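The linear schedule can be sketched in a few lines. The function name linear_step_size and its arguments are illustrative assumptions and not taken from the repository.

```python
def linear_step_size(initial_step_size, global_step, total_steps):
    """Step size that falls linearly from initial_step_size towards zero."""
    # Fraction of the training run that has already been completed
    progress = float(global_step) / float(total_steps)
    return initial_step_size * (1.0 - progress)


# Step size at the start, halfway through and at the last of 1000 updates
schedule = [linear_step_size(0.1, s, 1000) for s in (0, 500, 999)]
```

The step size starts at its initial value and approaches zero towards the end of the training, so the chain of negative particles is disturbed less and less by weight changes.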

A second improvement that is usually implemented is weight decay. Essentially, weight decay is an additional penalty that keeps the weights from growing too large, which would slow down the sampling procedure.
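As a rough illustration, weight decay can be written as an extra term in the update that pulls every weight towards zero. This is a minimal sketch with an assumed helper apply_update; the exact scaling of the penalty (for instance by the batch size) is a design choice that differs between implementations.

```python
import numpy as np


def apply_update(W, gradient, step, weight_decay):
    """One gradient step with an L2 penalty pulling the weights towards zero."""
    return W + step * (gradient - weight_decay * W)


W = np.array([[1.0, -2.0], [0.5, 4.0]])
# With a vanishing gradient, only the decay term acts and every
# weight shrinks by the factor (1 - step * weight_decay)
W_new = apply_update(W, np.zeros_like(W), step=0.1, weight_decay=0.5)
```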

Let us now see how the PCD algorithm can be coded in Python. We will again store the model parameters and the state in a Python class. In the __init__ method of that class, we initialize the weights and the bias vectors and also set the particles to some randomly chosen initial value.

import numpy as np

class PCDRBM (Base.BaseRBM):

    def __init__(self, visible = 8, hidden = 3, particles = 10, beta=2.0):
        self.visible= visible
        self.hidden = hidden
        self.beta = beta
        self.particles = particles
        # Initialize weights with a random normal distribution
        self.W = np.random.normal(loc=0.0, scale=0.01, size=(visible, hidden))
        # set bias to zero
        self.b = np.zeros(dtype=float, shape=(1, visible))
        self.c = np.zeros(dtype=float, shape=(1, hidden))
        # Initialize the particles
        self.N = np.random.randint(low=0, high=2, size=(particles,self.visible))
        self.global_step = 0

Assuming that we have a method runGibbsStep that runs one Gibbs sampling step with the given weight starting at some initial state, one iteration of the PCD algorithm now looks as follows.

# expit is the logistic sigmoid function, available as scipy.special.expit
# Update step size - we do this linearly over time
step = initial_step_size * (1.0 -(1.0*self.global_step)/(1.0*iterations*epochs))
# First we compute the negative phase. We run the
# Gibbs sampler for one step, starting at the previous state
# of the particles self.N
self.N, _ = self.runGibbsStep(self.N, size=self.particles)
# and use this to calculate the negative phase
Eb = expit(self.beta*(np.matmul(self.N, self.W) + self.c))
neg = np.tensordot(self.N, Eb, axes=((0),(0)))
# Now we compute the positive phase. We need the
# expectation values of the hidden units
E = expit(self.beta*(np.matmul(V, self.W) + self.c))
pos = np.tensordot(V, E, axes=((0),(0)))
# Now update weights
dW = step*self.beta*(pos -neg) / float(batch_size) - step*weight_decay*self.W / float(batch_size)
self.W += dW
self.b += step*self.beta*np.sum(V - self.N, 0) / float(batch_size)
self.c += step*self.beta*np.sum(E - Eb, 0) / float(batch_size)
self.global_step +=1
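The method runGibbsStep is only assumed above. A minimal sketch of what such a step could look like is shown below, written as a standalone function. The name run_gibbs_step, its signature and its return values are assumptions made for illustration, and the actual method in the repository may differ.

```python
import numpy as np


def sigmoid(x):
    # Logistic function, playing the role of scipy.special.expit
    return 1.0 / (1.0 + np.exp(-x))


def run_gibbs_step(V, W, b, c, beta, rng):
    """One full Gibbs step: sample hidden units from V, then visible units."""
    # Probability that each hidden unit is on, given the visible units
    p_hidden = sigmoid(beta * (np.matmul(V, W) + c))
    H = (rng.random(p_hidden.shape) < p_hidden).astype(int)
    # Probability that each visible unit is on, given the sampled hidden units
    p_visible = sigmoid(beta * (np.matmul(H, W.T) + b))
    V_new = (rng.random(p_visible.shape) < p_visible).astype(int)
    return V_new, H


rng = np.random.default_rng(0)
W = rng.normal(loc=0.0, scale=0.01, size=(8, 3))
b = np.zeros((1, 8))
c = np.zeros((1, 3))
# Ten negative particles, advanced by one step
N = rng.integers(low=0, high=2, size=(10, 8))
N, H = run_gibbs_step(N, W, b, c, beta=2.0, rng=rng)
```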

As always, the full source code is available from my machine learning GitHub repository. I have enhanced the code so that it accepts a command line parameter --algorithm that lets you choose between ordinary contrastive divergence and the PCD algorithm.

Let us now run a few trials. First, we will again use the BAS data set. You can download and run the code from the GitHub repository as follows.

$ git clone
$ cd MachineLearning
$ python --algorithm=PCD --run_reconstructions=1 --show_metrics=1

When the script completes, you should again see the two images. The first image displays how the reconstruction errors and weight changes behave during the training.


We see that the reconstruction error (the diagram on the right) decreases more slowly than it did for the ordinary contrastive divergence algorithm. On the left hand side, where the change of the weights is displayed, we can clearly see the impact of the linearly decreasing step size. The second picture shows again the result of a reconstruction attempt of slightly distorted patterns.


Let us now try out a different application of restricted Boltzmann machines – sampling. After a successful training phase, the model distribution given by the weights should be close to the empirical distribution of the training data. Thus, if we sample from the model distribution, using for instance Gibbs sampling, we should be able to obtain patterns that somehow resemble the training data.
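To make the sampling idea concrete, here is a small sketch that runs a few independent Gibbs chains starting from random patterns. The function sample_from_rbm and the toy parameters are assumptions chosen so that the outcome is easy to predict; this is not the code used to produce the images in this post.

```python
import numpy as np


def sample_from_rbm(W, b, c, steps, n_samples, beta=1.0, seed=0):
    """Run n_samples independent Gibbs chains for the given number of steps."""
    rng = np.random.default_rng(seed)
    # Start every chain from a random binary pattern
    V = rng.integers(low=0, high=2, size=(n_samples, W.shape[0]))
    for _ in range(steps):
        p_h = 1.0 / (1.0 + np.exp(-beta * (V @ W + c)))
        H = (rng.random(p_h.shape) < p_h).astype(int)
        p_v = 1.0 / (1.0 + np.exp(-beta * (H @ W.T + b)))
        V = (rng.random(p_v.shape) < p_v).astype(int)
    return V


# Toy model: a strongly negative visible bias switches units 0 and 1 off,
# a strongly positive one switches units 2 and 3 on
W = np.zeros((4, 2))
b = np.array([[-20.0, -20.0, 20.0, 20.0]])
c = np.zeros((1, 2))
samples = sample_from_rbm(W, b, c, steps=50, n_samples=5)
```

For a trained model, W, b and c would be the learned parameters, and the rows of the result would be the generated patterns.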

We will use this to generate handwritten digits based on the well-known MNIST data set, more precisely the copy available at To download and read the data set, we use the method fetch_mldata provided by the scikit-learn library. We will then train our network for 40,000 epochs using 60 images out of this data set and 128 hidden units and subsequently run 200,000 Gibbs sampling steps starting from a random pattern.

$ python --algorithm=PCD --data=MNIST --N=28 --epochs=40000 --pattern=60 --hidden=128 --run_samples=1 --sample=200000 --save=1

Note that when you run this for the first time, the MNIST data set will be downloaded and stored in a folder in your home directory, so this might take some time (the file is a bit smaller than 60 MB).


The results are already very encouraging. Most patterns resemble a digit closely; only the image at the top left corner obviously did not converge properly. However, we still see a strong bias: only very few of the ten digits that the data set contains appear. So we probably need to fine-tune parameters like the number of hidden units, the learning rate, the weight decay or the number of epochs to obtain better results.

Unfortunately, when you start to play around to optimize this further, you will see that the run time of the algorithm has reached a point where quick iterations to try out different parameters become virtually impossible. I have been running this on my PC that has an Intel Core i7 CPU, and Python was able to distribute this nicely across all four physical cores, taking them to 100% utilization, but the script still took about 7 minutes to complete. If we want to increase the number of iterations or the number of hidden units to be able to learn more patterns, the run time can easily go up to almost 30 minutes.

Of course professional training of neural networks is nowadays no longer done on a CPU. Instead, modern frameworks use the power of graphics processing units (GPUs) that are optimized for exactly the type of work that we need – highly parallel processing of floating point matrices. Therefore, I will show you in the next post in this series how you can use the TensorFlow framework to move the workload to a GPU.

1. T. Tieleman, Training restricted Boltzmann machines using approximations to the likelihood gradient, International Conference on Machine Learning (ICML), 2008
2. A. Fischer, C. Igel, Training restricted Boltzmann machines: an introduction, Pattern Recognition Vol. 47 (2014), pp 25–39

Docker internals: networking part II

In this post, we will look in more detail at networking with Docker if communication between a Docker container and either the host or the outside world is involved.

It turns out that in these cases, the Linux Netfilter/iptables facility comes into play. This post is not meant to be an introduction to iptables, and I will assume that the reader is aware of the basics (for more information, I found the tutorial on frozentux and the overview on DigitalOcean very helpful).

Setup and basics

To simplify the setup, we will only use one container in this post. So let us again designate a terminal to be the container terminal and in this terminal, enter

$ docker run --rm -d --name "container1" httpd:alpine
$ docker exec -it container1 "/bin/sh"

You should now see a shell prompt inside the container. When you run netstat -a from that prompt, you should see a listening socket being bound to port 80 within the container.

Now let us print out the iptables configuration on the host system.

$ sudo iptables -S
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT

Here we see that Docker has added three new chains and a couple of rules. The first chain that Docker has added is the DOCKER chain. In our configuration, that chain is empty, but we will see later that this changes once we expose ports to the outside world.

The second chain that we see is the DOCKER-ISOLATION chain. I have not been able to find out much about this chain so far, but it appears that Docker uses this chain to add rules that isolate containers when you do not use the default bridge device but connect your containers to user defined bridges.

Finally, there is the chain DOCKER-USER that Docker adds, but otherwise leaves alone, so that an administrator can add firewall rules with less risk of clashing with the manipulations that Docker performs.

All these chains are empty or just consist of a RETURN statement, so we can safely ignore them for the time being.

Host-to-container traffic

As a first use case, let us now try to understand what happens when an application (curl in our case) in the host namespace wants to talk to the web server running in our container. To be able to better see what is going on, let us add two logging rules to the iptables configuration to log traffic coming in via the docker0 bridge and going out via the docker0 bridge.

$ sudo iptables -A INPUT -i docker0 -j LOG --log-prefix "IN: " --log-level 3
$ sudo iptables -A OUTPUT -o docker0 -j LOG --log-prefix "OUT: " --log-level 3

With these rules in place, let us now create some traffic. In the host terminal, enter

$ curl

We can now inspect the content of /var/log/syslog to see what happened. The first two entries should look like this (stripping off host name and time stamps):

OUT: IN= OUT=docker0 SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=3460 DF PROTO=TCP SPT=34322 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 
IN: IN=docker0 OUT= PHYSIN=veth376d25c MAC=02:42:25:b7:e5:38:02:42:ac:11:00:02:08:00 SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=80 DPT=34322 WINDOW=28960 RES=0x00 ACK SYN URGP=0 

So we see that the first logging rule that has been triggered is the rule in the OUTPUT chain. Let us try to understand in detail how this log entry was created.

When curl asks the kernel to establish a connection with, i.e. to send a TCP SYN request, a TCP packet will be generated and handed over to the kernel. The kernel will consult its routing table, find the route via docker0 and send the packet to the bridge device.

At this point, the packet leaves the namespace for which this set of iptables rules is responsible, so the OUTPUT chain is traversed and our log entry is created.

What happens next? The packet is picked up by the container namespace and processed, and the answer goes back. We can see the answer coming in again, this time triggering the logging rule in the INPUT chain: this is the second line, the SYN ACK packet.

Except for our logging rules, no other rules are defined in the INPUT and OUTPUT chains, so the default policies apply to our packets. As both policies are set to ACCEPT, netfilter will allow our packets to pass and the connection works.
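When you look at many of these log entries, it can help to split them into their parts. The following sketch does this for the entries shown above; the helper parse_log_entry is my own illustration and not part of any tool.

```python
def parse_log_entry(line):
    """Split a netfilter LOG entry into prefix, KEY=VALUE fields and flags."""
    tokens = line.split()
    # Everything before the first KEY=VALUE token is the --log-prefix
    first = next(i for i, t in enumerate(tokens) if "=" in t)
    prefix = " ".join(tokens[:first])
    fields, flags = {}, []
    for token in tokens[first:]:
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
        else:
            # Bare tokens such as DF, SYN or ACK are flags
            flags.append(token)
    return prefix, fields, flags


# First log entry from above, with the address fields left empty
entry = ("OUT: IN= OUT=docker0 SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=64 "
         "ID=3460 DF PROTO=TCP SPT=34322 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0")
prefix, fields, flags = parse_log_entry(entry)
```

For this entry, the empty IN field and the OUT field docker0 show that the packet was generated locally and leaves via the bridge, and the SYN flag marks it as the first packet of the TCP handshake.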

Getting out of the container

The story is a bit different if we are trying to talk to a web server on the host or on the LAN from within the container. From the kernel's point of view, we are now dealing with traffic involving more than one interface, and in addition to the INPUT and OUTPUT chains, the FORWARD chain becomes relevant. To be able to inspect this traffic, let us therefore add two logging rules to the FORWARD chain.

$ sudo iptables -I FORWARD -i docker0 -j LOG --log-prefix "IN_FORWARD: " --log-level 3
$ sudo iptables -I FORWARD -o docker0 -j LOG --log-prefix "OUT_FORWARD: " --log-level 3

Now let us generate some traffic. The first thing that I have tried is to reach my SAN on the same network which is a Synology diskstation listening on port 5000 of So in the container window, I did a

# telnet

and entered some nonsense (it does not matter so much what you enter here, it will most likely result in a “bad request” message, but it generates traffic; do not forget to hit return). This will again produce some logging output, the first two lines being

IN_FORWARD: IN=docker0 OUT=enp4s0 PHYSIN=veth376d25c MAC=02:42:25:b7:e5:38:02:42:ac:11:00:02:08:00 SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=28280 DF PROTO=TCP SPT=54062 DPT=5000 WINDOW=29200 RES=0x00 SYN URGP=0 
OUT_FORWARD: IN=enp4s0 OUT=docker0 MAC=1c:6f:65:c0:c9:85:00:11:32:77:fe:46:08:00 SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=0 DF PROTO=TCP SPT=5000 DPT=54062 WINDOW=14480 RES=0x00 ACK SYN URGP=0

Let us again try to understand what happened. An application (telnet in our case) wants to reach the IP address The kernel will first consult the routing table in the namespace of the container and decide to use the default route via the eth0 device. Thus the packet will go to the bridge device docker0. There, it will be picked up by the netfilter chain in the host namespace. As the destination address is not one of the IP addresses of the host, it will be handled by the FORWARD chain, which will trigger our logging rules.

Let us now inspect the other rules in the forward chain once more using iptables -S FORWARD. We see that in addition to the rules pointing to the docker generated subchains and in addition to our own logging rules, there are two rules relevant for our case.

$ sudo iptables  -S  FORWARD
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT

The first rule will accept all traffic that comes in at any network device and is routed towards the bridge device if that traffic belongs to an already established connection. This allows the answer to our TCP request to travel back from the network interface connected to the local network (enp4s0 in my case) to the container. However, unsolicited requests, i.e. new connection requests targeted towards the bridge device will be left to the default policy of the FORWARD chain and therefore dropped.

The second rule will allow outgoing traffic – all packets coming from the docker0 bridge device targeted towards any other interface will be accepted and hence forwarded. As there is no filter on the connection state, this allows an application inside the container to establish a new connection to the outside world.
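The interplay of these two rules and the default policy can be mimicked by a small toy function. This is only a sketch: it ignores the jumps to the Docker chains and our logging rules, and it assumes a default policy of DROP for the FORWARD chain as described above.

```python
def forward_verdict(in_if, out_if, ctstate, policy="DROP"):
    """Toy evaluation of the two FORWARD rules discussed above."""
    # -A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
    if out_if == "docker0" and ctstate in ("RELATED", "ESTABLISHED"):
        return "ACCEPT"
    # -A FORWARD -i docker0 ! -o docker0 -j ACCEPT
    if in_if == "docker0" and out_if != "docker0":
        return "ACCEPT"
    # No rule matched, so the default policy of the chain decides
    return policy


outgoing_syn = forward_verdict("docker0", "enp4s0", "NEW")
reply = forward_verdict("enp4s0", "docker0", "ESTABLISHED")
unsolicited = forward_verdict("enp4s0", "docker0", "NEW")
```

The outgoing SYN and the reply are accepted, whereas a new connection attempt from the LAN towards the bridge falls through to the default policy.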

However, I have been cheating a bit and skipped one important point. Suppose our SYN request happily leaves our local network adapter and travels through the LAN. The request comes from within the container, so from the IP address If that IP address still appeared in the IP header, the external server (the disk station in my case) would try to send the answer back to this address. However, this address is not known in the LAN, only locally on my machine, and the response would get lost.

To avoid this, Docker will in fact add one more rule to the NAT table. Let us try to locate this rule.

$ sudo iptables  -S -t nat 
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN

Again, we see that Docker is adding a new chain and some rules pointing to this chain. In addition, there is a rule being added to the POSTROUTING chain which is invoked immediately before a packet leaves the host. This is a so-called masquerading rule which will replace the source IP address in the IP header of the outgoing packet by the IP address of the device through which the packet is sent. Thus, from the point of view of my brave diskstation, the packet will look as if it originated from the host, and the answer will therefore be sent back to the host. When the response comes in, netfilter will reverse the process and forward the packet to the correct destination, in our case the bridge device.
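The following toy model illustrates the bookkeeping behind masquerading. It is a strong simplification (real connection tracking also considers the protocol and may rewrite the source port), and all IP addresses are hypothetical values chosen for this example.

```python
class Masquerade:
    """Toy model of source NAT with connection tracking."""

    def __init__(self, host_ip):
        self.host_ip = host_ip
        # Maps (remote ip, remote port, local port) to the original source ip
        self.connections = {}

    def outgoing(self, src, sport, dst, dport):
        # Remember who really sent the packet, then rewrite the source
        self.connections[(dst, dport, sport)] = src
        return (self.host_ip, sport, dst, dport)

    def incoming(self, src, sport, dst, dport):
        # A reply to a tracked connection gets its destination restored
        original = self.connections.get((src, sport, dport))
        return (src, sport, original or dst, dport)


# Hypothetical addresses: container, host and disk station
nat = Masquerade(host_ip="192.168.178.27")
syn = nat.outgoing("172.17.0.2", 54062, "192.168.178.28", 5000)
syn_ack = nat.incoming("192.168.178.28", 5000, "192.168.178.27", 54062)
```

The SYN leaves with the host address as its source, and the SYN ACK that comes back to the host is redirected to the container address.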

Reaching a server from the outside world

This was already fairly complicated, but now let us try to see what happens if we want to connect to the web server running in our container from the outside world.

Now if I simply ssh into my diskstation and run curl there to reach, this will of course fail. The diskstation does not have a route to that destination, and the default gateway cannot help either as this is a private class B network. If I replace the IP address with the IP address of the host, it will not work either: in this case, the request reaches the host, but on the host network address, no process is listening on port 80. So we somehow need to map the port from the container into the host networking system.

If you consult the docker documentation on this case, you will learn that in order to do this, you have to run the container with the -p switch. So let us stop and restart our container and apply that option.

$ docker stop container1
$ docker run --rm -d --name "container1" -p 80:80 httpd:alpine
$ docker exec -it container1 "/bin/sh"

If we now inspect the chains and see what has changed, we can find the following new rule which has been added to the filter table.

-A DOCKER -d ! -i docker0 -o docker0 -p tcp -m tcp --dport 80 -j ACCEPT

This rule will apply to all incoming traffic that is targeted towards and not coming from the bridge device, and accept it. In addition, two new rules have been added to the NAT table.

-A POSTROUTING -s -d -p tcp -m tcp --dport 80 -j MASQUERADE
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 80 -j DNAT --to-destination

The first rule will again apply a network address translation (SNAT, i.e. manipulating the source address) as we have already seen it before and applies to traffic within the virtual network to which the bridge belongs. The second rule is more interesting. This rule has been added to the DOCKER chain and requests DNAT (i.e. destination NAT, meaning that the target address is replaced) for all packets that are not coming from the bridge device, but have destination port 80. For these packets, the target address is rewritten to be, so all traffic directed towards port 80 is now forwarded to the container network.

Let us again go through one example step by step. For that purpose, it is useful to add some more logging rules, this time to the NAT table.

$ sudo iptables -t nat -I PREROUTING  -j LOG --log-prefix "PREROUTING: " --log-level 3
$ sudo iptables -t nat -I POSTROUTING  -j LOG --log-prefix "POSTROUTING: " --log-level 3

When we now submit a request from another machine on the local network directed towards (i.e. towards the IP address of the host!), we find the following log entries in /var/log/syslog.

PREROUTING: IN=enp4s0 OUT= MAC=1c:6f:65:c0:c9:85:00:11:32:77:fe:46:08:00 SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57720 DF PROTO=TCP SPT=54075 DPT=80 WINDOW=14600 RES=0x00 SYN URGP=0 
OUT_FORWARD: IN=enp4s0 OUT=docker0 MAC=1c:6f:65:c0:c9:85:00:11:32:77:fe:46:08:00 SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=57720 DF PROTO=TCP SPT=54075 DPT=80 WINDOW=14600 RES=0x00 SYN URGP=0 
POSTROUTING: IN= OUT=docker0 SRC= DST= LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=57720 DF PROTO=TCP SPT=54075 DPT=80 WINDOW=14600 RES=0x00 SYN URGP=0

Thus the first packet of the connection arrives (note that NAT rules are not evaluated for subsequent packets of a connection any more) and will first be processed by the PREROUTING chain of the NAT table. As we added our logging rule here, we see the log output. We can also see that at this point, the target address is still

The next rule – still in the PREROUTING chain – that is being evaluated is the jump to the DOCKER chain. Here, the DNAT rule kicks in and changes the destination address to, the IP address of the container.

Then the kernel routing decision is taken based on the new address and the forwarding mechanism starts. The packet will appear in the forward chain. As the routing has already determined the target interface to be docker0, our OUT_FORWARD logging rule applies and the second log entry is produced, also confirming that the new target address is Then, the jump to the DOCKER chain matches, and within that chain, the packet is accepted as its destination port is 80.

Finally, the POSTROUTING chain in the NAT table is traversed. This produces our third log file entry. However, the SNAT rule does not apply, as the source address does not belong to the network – you can use tcpdump on the bridge device to see that when the packet leaves the device, it still has the source IP address belonging to the diskstation. So again, the configuration works and we can reach an application inside the container from the outside world.
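The whole journey of the first packet can be summarized in a toy function. This is a sketch of the logic described above, not a model of netfilter itself, and the addresses used for the disk station, the host and the container are hypothetical.

```python
import ipaddress


def trace_incoming(packet, container_ip, bridge_net):
    """Toy walk through the chains for a SYN packet arriving from the LAN."""
    network = ipaddress.ip_network(bridge_net)
    steps = []
    # nat PREROUTING -> DOCKER: DNAT for port 80 unless it came in via docker0
    if packet["in_if"] != "docker0" and packet["dport"] == 80:
        packet = dict(packet, dst=container_ip)
        steps.append("DNAT")
    # Routing decision, taken on the (possibly rewritten) destination address
    if ipaddress.ip_address(packet["dst"]) in network:
        packet = dict(packet, out_if="docker0")
    # filter FORWARD -> DOCKER: accept port 80 towards the container
    if packet.get("out_if") == "docker0" and packet["dport"] == 80:
        steps.append("FORWARD ACCEPT")
    # nat POSTROUTING: masquerading needs a source on the bridge network
    if ipaddress.ip_address(packet["src"]) in network:
        steps.append("MASQUERADE")
    return packet, steps


# Hypothetical addresses for the disk station, the host and the container
syn = {"in_if": "enp4s0", "src": "192.168.178.28",
       "dst": "192.168.178.27", "dport": 80}
result, steps = trace_incoming(syn, "172.17.0.2", "172.17.0.0/16")
```

In this run, the destination is rewritten and the packet is accepted in the forward chain, while the source address stays untouched because it does not belong to the bridge network.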

There are many other aspects of networking with Docker that I have not even touched upon – user-defined bridges, overlay networks or the famous docker-proxy, just to mention a few of them – but this post is already a bit lengthy, so let us stop here for today. I hope I could provide at least some insight into the internals of networking with Docker – and for me, this was actually a good opportunity to refresh and improve my still very basic knowledge of the Linux networking stack.

Docker internals: networking part I

In this post, we will investigate one of the more complex topics when working with Docker – networking.

We have already seen in the previous post that namespaces can be used to isolate networking resources used by different containers as well as resources used by containers from those used by the host system. However, by nature, networking is not only about isolating but also about connecting – how does this work?

So let us do some tests with the httpd:alpine image. First, get a copy of that image into your local repository:

$ docker pull httpd:alpine

Then open two terminals that we call Container1 and Container2 and attach to them (note that you need to switch to the second terminal after entering the second line).

$ docker run --rm -d  --name="Container1" httpd:alpine
$ docker exec -it Container1 "/bin/sh"

$ docker run --rm -d  --name="Container2" httpd:alpine
$ docker exec -it Container2 "/bin/sh"

Next, make curl available in both containers using apk update followed by apk add curl. Now let us switch to container 1 and inspect the network configuration.

/usr/local/apache2 # ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:02  
          inet addr:  Bcast:  Mask:
          RX packets:1867 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1017 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1831738 (1.7 MiB)  TX bytes:68475 (66.8 KiB)

lo        Link encap:Local Loopback  
          inet addr:  Mask:
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

We see that there is an ethernet interface eth0 with IP address If you do the same in the second container, you should see a similar output with a different IP address, say

Now let us make a few connection tests. First, go to the terminal running inside container 1 and enter


You should now see a short HTML snippet containing the text “It works!”. So apparently we can reach container 2 from container 1 – and similarly, you should be able to reach container 1 from container 2. Next, try the same from a terminal attached to the host – you should be able to reach both containers from there. Finally, if you also have a web server or a similar server running on the host, you will see that you can also reach that from within the containers. In my case, I have a running tomcat being bound to on my local host, and was able to connect using


from within the container. How does this work?

To solve the puzzle, go back to a terminal on the host system and take a look at the routing table.

$ ip route show
default via dev enp4s0  proto static  metric 100
dev enp4s0  scope link  metric 1000
dev docker0  proto kernel  scope link  src
dev enp4s0  proto kernel  scope link  src  metric 100

We see that Docker has apparently added an additional routing table entry and has created an additional networking device, docker0, to which all packets with destination in the class B network are sent.
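How the kernel picks a device for a destination can be sketched with Python's ipaddress module. The routing table below is a hypothetical example modelled on the output above (the networks are assumptions), and the lookup simply prefers the most specific matching route.

```python
import ipaddress

# Hypothetical routing table: (destination network, device)
ROUTES = [
    ("0.0.0.0/0", "enp4s0"),         # default route
    ("172.17.0.0/16", "docker0"),    # route added by Docker
    ("192.168.178.0/24", "enp4s0"),  # local network
]


def lookup(destination):
    """Return the device of the most specific route matching destination."""
    address = ipaddress.ip_address(destination)
    matches = [(ipaddress.ip_network(net), dev) for net, dev in ROUTES
               if address in ipaddress.ip_network(net)]
    # The longest prefix wins
    return max(matches, key=lambda m: m[0].prefixlen)[1]


to_container = lookup("172.17.0.3")
to_internet = lookup("93.184.216.34")
```

A destination inside the Docker network is sent to docker0, whereas everything else falls back to the physical interface.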

This device is a so-called bridge. A (software) bridge is very similar to an ethernet bridge in hardware. It connects two or more devices: a packet that arrives at one device is forwarded to the other devices connected to the bridge (more precisely, a bridge learns which MAC address sits behind which device and, once the destination is known, forwards a packet only to that device). The Linux kernel offers the option to establish a bridge in software which does exactly the same thing.

Let us list all existing bridges using the brctl utility.

$ brctl show
bridge name	bridge id		STP enabled	interfaces
docker0		8000.0242fd67b17b	no		veth1f8d78c

Let us compare this with the output of the ifconfig command (again on the host). The corresponding output is

veth1f8d78c Link encap:Ethernet  HWaddr 56:e5:92:e8:77:1e  
          inet6 addr: fe80::54e5:92ff:fee8:771e/64 Scope:Link
          RX packets:1034 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1964 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:70072 (70.0 KB)  TX bytes:1852757 (1.8 MB)

vetha1692e1 Link encap:Ethernet  HWaddr ca:73:3e:20:36:f7  
          inet6 addr: fe80::c873:3eff:fe20:36f7/64 Scope:Link
          RX packets:1133 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2033 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:78715 (78.7 KB)  TX bytes:1860499 (1.8 MB)

These two devices also appear in the output of brctl show and are the two devices that are connected to the bridge. These devices are called virtual ethernet devices. They are always created in pairs and act like a pipe: traffic flowing in at one of the two devices appears to come out of the second device and vice versa – like a virtual network cable connecting the two devices.

We just said that virtual devices are always created in pairs. We have two containers, and if you start them one by one and look at the output of ifconfig, we see that each of the two containers contributes one device. Can that be correct? After starting the first container we should already see a pair, and after starting the second one we should see four instead of just two devices. So one in each pair is missing. Where did it go?

The answer is that Docker did create it, but moved it into the namespace of the container, so that it is no longer visible on the host. To verify this, we can see the connection between the two interfaces as follows. First, enter

ip link

on the host. The output should look similar to the following lines

$ ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp4s0:  mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 1c:6f:65:c0:c9:85 brd ff:ff:ff:ff:ff:ff
3: docker0:  mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 02:42:fd:67:b1:7b brd ff:ff:ff:ff:ff:ff
25: vetha1692e1@if24:  mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default 
    link/ether ca:73:3e:20:36:f7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
27: veth1f8d78c@if26:  mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default 
    link/ether 56:e5:92:e8:77:1e brd ff:ff:ff:ff:ff:ff link-netnsid 1

This will show you (the very first number in each line) the so-called ifindex which is a unique identifier for each network device within the kernel. In our case, the virtual ethernet devices visible in the host namespace have the indices 25 and 27. After each name, after the “at” symbol, you see a second number – 24 and 26. Now let us execute the same command in the first container.

/usr/local/apache2 # ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
24: eth0@if25:  mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff

Here suddenly the device with index 24 appears, and we see that it is connected to device if25, which is displayed as vetha1692e1 in the host namespace! Similarly, we find the device if26 inside container two:

/usr/local/apache2 # ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
26: eth0@if27:  mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff
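Matching the indices by hand gets tedious with more containers. As a small illustration, the pairing can be extracted from the output of ip link with a regular expression; the helper veth_peers is my own sketch, fed with a shortened copy of the host output shown above.

```python
import re

# Matches lines such as "25: vetha1692e1@if24:" and captures
# the interface index, the name and the index of the peer
LINK = re.compile(r"^(\d+): ([^@:\s]+)@if(\d+):")


def veth_peers(ip_link_output):
    """Map each interface name to (own ifindex, peer ifindex)."""
    peers = {}
    for line in ip_link_output.splitlines():
        match = LINK.match(line)
        if match:
            index, name, peer = match.groups()
            peers[name] = (int(index), int(peer))
    return peers


host_output = """\
3: docker0:  mtu 1500 qdisc noqueue state UP mode DEFAULT group default
25: vetha1692e1@if24:  mtu 1500 qdisc noqueue master docker0 state UP
    link/ether ca:73:3e:20:36:f7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
27: veth1f8d78c@if26:  mtu 1500 qdisc noqueue master docker0 state UP
"""
peers = veth_peers(host_output)
```

The result pairs vetha1692e1 (index 25) with index 24 and veth1f8d78c (index 27) with index 26, which are exactly the eth0 devices inside the two containers.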

This gives us now a rather complete picture of the involved devices. Ignoring the loopback devices for a moment, the following picture emerges.


Now we can understand what happens when an application in container 1 sends a packet to container 2. First, the kernel will inspect the routing table in container 1. This table looks as follows.

/usr/local/apache2 # ip route
default via dev eth0
dev eth0  src

So the kernel will determine that the packet should be sent to the interface known as eth0 in container 1 – this is the interface with the unique index 24. As this is part of a virtual ethernet device pair, it appears on the other side of the pair, i.e. at the device vetha1692e1. This device in turn is connected to the bridge docker0. Being a bridge, it will distribute the packet to all other attached devices, so it will reach veth1f8d78c. This is now one endpoint of the second virtual ethernet device pair, and so the packet will finally end up at the interface with the unique index 26, i.e. the interface that is called eth0 in container 2. On this interface, the HTTP daemon is listening, receives the message, prepares an answer and that answer goes the same way back to container 1.
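The role of the bridge in this journey can be illustrated with a toy model. The class below is a sketch of a learning bridge and not the kernel implementation; the third port vethX is a hypothetical addition to make the flooding visible.

```python
class Bridge:
    """Toy software bridge that learns which MAC address sits behind which port."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out on."""
        # Learn where the sender lives
        self.mac_table[src_mac] = in_port
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]
        # Unknown destination: flood to every other port
        return [p for p in self.ports if p != in_port]


bridge = Bridge(["vetha1692e1", "veth1f8d78c", "vethX"])
# Container 1 sends to container 2 before the bridge has seen container 2
first = bridge.receive("vetha1692e1", "02:42:ac:11:00:02", "02:42:ac:11:00:03")
# The answer goes straight back to the learned port
answer = bridge.receive("veth1f8d78c", "02:42:ac:11:00:03", "02:42:ac:11:00:02")
```

The first frame is sent out on all other ports, while the answer only travels to the port behind which container 1 was seen.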

Thus, effectively, it appears from inside the containers as if all container network interfaces were attached to the same network segment. To complete the picture, we can actually follow the trace of a packet going from container 1 to container 2 using arp and traceroute.

/usr/local/apache2 # arp
? ( at 02:42:fd:67:b1:7b [ether]  on eth0
? ( at 02:42:ac:11:00:03 [ether]  on eth0
/usr/local/apache2 # traceroute
traceroute to (, 30 hops max, 46 byte packets
 1 (  0.016 ms  0.013 ms  0.010 ms

We can now understand how applications in different containers can communicate with each other. However, what we have discussed so far is not yet sufficient to explain how an application on the host can access an HTTP server running inside a container, and what additional setup we need to access an HTTP server running in a container from the LAN. This will be the topic of my next post.



Setting up and testing our bitcoin network

In the last post in my series on bitcoin and the blockchain, we have successfully “dockerized” the bitcoin core reference implementation and have built the bitcoin-alpine docker container that contains a bitcoin server node on top of the Alpine Linux distribution. In this post, we will use this container image to build up and initialize a small network with three nodes.

First, we start a node that will represent the user Alice and manage her wallet.

$ docker run -it --name="alice" -h="alice" bitcoin-alpine

Note that we have started the container with the option -it so that it stays attached to the current terminal and we can see its output. Let us call this terminal terminal 1. We can now open a second terminal and attach to the running server and use the bitcoin CLI to verify that our server is running properly.

$ docker exec -it alice  "sh"
# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password getinfo

If you run the bitcoin CLI inside the container as shown above, but with the getbalance RPC call, you will find that at this point, the balance is still zero. This is not surprising, as our network does not contain any bitcoins yet. To create bitcoins, we will now bring up a miner node and mine a certain amount of bitcoin. So in a third terminal, enter

$ docker run -it --name="miner" -h="miner" bitcoin-alpine

This will create a second node called miner and start a second instance of the bitcoin server within this network. Now, however, our two nodes are not yet connected. To change this, we can use the addnode command in the bitcoin CLI. So let us figure out how they can reach each other on the network. We have started both containers without any explicit networking options, so they will be connected to the default bridge network that Docker creates on your host. To figure out the IP addresses, run

$ docker network inspect bridge

on your host. This should produce a list of network devices in JSON format and spit out the IP addresses of the two nodes. In my case, the node alice has the IP address and the miner has the IP address So let us change back to terminal number 2 (attached to the container alice) and run

# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password addnode add
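
As an aside, the address lookup with docker network inspect can also be scripted. The sketch below assumes that the command prints a JSON list of networks, each with a "Containers" object whose entries carry a "Name" and an "IPv4Address" including a prefix length. The function names are chosen for this example only.

```python
import json
import subprocess

def container_ips(inspect_output):
    """Extract {container name: IPv4 address} from `docker network inspect` JSON."""
    result = {}
    for network in json.loads(inspect_output):
        for info in (network.get("Containers") or {}).values():
            # IPv4Address carries a prefix length such as "/16"; strip it
            result[info["Name"]] = info["IPv4Address"].split("/")[0]
    return result

def inspect_bridge():
    """Run the inspect command for the default bridge network and parse its output."""
    out = subprocess.run(
        ["docker", "network", "inspect", "bridge"],
        check=True, capture_output=True, text=True,
    ).stdout
    return container_ips(out)
```

On the host, inspect_bridge() should then return a dictionary with one entry each for alice and miner.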

In the two terminal windows that are displaying the output of the two bitcoin daemons on the nodes alice and miner, you should now see an output similar to the following.

2018-03-24 14:45:38 receive version message: /Satoshi:0.15.0/: version 70015, blocks=0, us=[::]:0, peer=0

This tells you that the two peers have exchanged a version message and recognized each other. Now let us mine a few bitcoins. First attach a terminal to the miner node – this will be terminal four.

$ docker exec -it miner "sh"

In that shell, enter the following command

# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password generate 101
# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password getbalance

You should now see a balance of 50 bitcoins. So we have mined 50 bitcoins (in fact, we have mined much more, but it takes an additional 100 blocks before a mined block reward counts towards the balance, and therefore only the 50 bitcoins corresponding to the very first block show up).
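
The arithmetic behind this number can be written down in a few lines. This is a simplified model, not what the wallet actually computes: it assumes a constant reward of 50 bitcoin per block (no halvings), a maturity of 100 blocks, and ignores fees and spending. The names are made up for this illustration.

```python
COINBASE_MATURITY = 100  # blocks that must follow before a reward counts
BLOCK_REWARD = 50        # reward per block, ignoring halvings

def mature_reward(blocks_mined):
    """Total reward from blocks that already have enough successors."""
    mature_blocks = max(0, blocks_mined - COINBASE_MATURITY)
    return mature_blocks * BLOCK_REWARD

print(mature_reward(101))  # 50
```

After generate 101, only the first block has 100 successors, so exactly one reward is counted.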

This is already nice, but when you execute the RPC method getbalance in terminal two, you will still find that Alice's balance is zero – which is not surprising, after all the bitcoins created so far have been credited to the miner, not to Alice. To change this, we will now transfer 30 bitcoin to Alice. As a first step, we need to figure out what Alice's address is. Each wallet maintains a default account, but we can add as many accounts as we like. So let us now add an account called “Alice” on the alice node. Execute the following command in terminal two which is attached to the container alice.

# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password getnewaddress "Alice"

The output should be something like mkvAYmgqrEFEsJ9zGBi9Z87gP5rGNAu2mx which is the bitcoin address that the wallet has created. Of course you will get a different address, as the private key behind that address will be chosen randomly by the wallet. Now switch to the terminal attached to the miner node (terminal four). We will now transfer 30 bitcoin from this node to the newly created address of Alice.

# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password sendtoaddress "mkvAYmgqrEFEsJ9zGBi9Z87gP5rGNAu2mx" 30.0

The output will be the transaction ID of the newly created transaction, in my case this was


If you now display the balances on both nodes again, you will see that the balance in the default account on the miner node has dropped to something a bit below 20 (not exactly 20, because the wallet has added a fee to the transaction). The balance of Alice, however, is still zero. The reason is that the transaction does not yet count as confirmed. To confirm it, we need to mine six additional blocks. So let us switch to the miner's terminal once more and enter

# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password generate 6

If you now display the balance on the node alice, you should find that the balance is now 30.

Congratulations – our setup is complete, we have run our first transaction and now have plenty of bitcoin in a known address that we can use. To preserve this wonderful state of the world, we will now create two new images that contain snapshots of the nodes so that we have a clearly defined state to which we can always return when we run some tests and mess up our test data. So switch back to a terminal on your host and enter the following four commands.

$ docker stop miner
$ docker stop alice
$ docker commit miner miner
$ docker commit alice alice

If you now list the available images using docker images, you should see two new images alice and miner. If you want, you can now remove the stopped container instances using docker rm.

Now let us see how we can talk to a running server in Python. For that purpose, let us start a detached container using our brand new alice image.

$ docker run  -d --rm --name=alice -p 18332:18332 alice

We can now communicate with the server using RPC requests. The bitcoin RPC interface follows the JSON RPC standard. Using the powerful Python package requests, it is not difficult to write a short function that submits a request and interprets the result.

import requests

def rpcCall(method, params = None, port=18332, host="localhost", user="user", password="password"):
    # Create request header
    headers = {'content-type': 'application/json'}
    # Build URL from host and port information
    url = "http://" + host + ":" + str(port)
    # Assemble payload as a Python dictionary
    payload = {"method": method, "params": params, "jsonrpc": "2.0", "id": 1}
    # Create and send POST request
    r = requests.post(url, json=payload, headers=headers, auth=(user, password))
    # and interpret result
    response = r.json()
    if 'result' in response and response['result'] is not None:
        return response['result']
    elif 'error' in response:
        raise ConnectionError("Request failed with RPC error", response['error'])
    else:
        raise ConnectionError("Request failed with HTTP status code ", r.status_code)

I have added this function to my btc bitcoin library within the module utils. We can use this function in a Python program or an interactive ipython session similar to the bitcoin CLI client. The following ipython session demonstrates a few calls, assuming that you have cloned the repository.

In [1]: import btc.utils
In [2]: r = btc.utils.rpcCall("listaccounts")
In [3]: r
Out[3]: {'': 0.0, 'Alice': 30.0}
In [4]: r = btc.utils.rpcCall("dumpprivkey", ["mkvAYmgqrEFEsJ9zGBi9Z87gP5rGNAu2mx"])
In [5]: r
Out[5]: 'cQowgjRpUocje98dhJrondLbHNmgJgAFKdUJjCTtd3VeMfWeaHh7'

We are now able to start our test network at any time from the saved docker images and communicate with it using RPC calls in Python. In the next post, we will see how a transaction is created, signed and propagated into the test network.

Learning algorithms for restricted Boltzmann machines – contrastive divergence

In the previous post on RBMs, we have derived the following gradient descent update rule for the weights.

\Delta W_{ij} = \beta \left[ \langle v_i \sigma(\beta a_j) \rangle_{\mathcal D} - \langle v_i \sigma(\beta a_j) \rangle_{P(v)} \right]

In this post, we will see how this update rule can be efficiently implemented. The first thing that we note is that the term \sigma(\beta a_j) that appears several times is simply the conditional probability for the hidden unit j to be “on” and, as only the values 0 and 1 are possible, at the same time the conditional expectation value of that unit given the values of the visible units – let us denote this quantity by e_j. Our update rule now reads

\Delta W_{ij} = \beta \left[ \langle v_i e_j \rangle_{\mathcal D} - \langle v_i e_j \rangle_{P(v)} \right]

Theoretically, we know how to calculate this. The first term – the positive phase – is easy, this is just the average over the sample set.
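
As a quick illustration of the positive phase, the average over the sample set can be computed with numpy in two lines. This is only a sketch: it assumes that the samples are stored as the rows of a matrix V, that W has shape (visible, hidden) and that c is the bias of the hidden units. The function names are chosen for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def positive_phase(V, W, c, beta=1.0):
    """Average of v_i * e_j over the rows of V, with shape (visible, hidden)."""
    # e_j for every sample, shape (samples, hidden)
    E = sigmoid(beta * (V @ W + c))
    # Sum the outer products v e^T over all samples and divide by their number
    return V.T @ E / V.shape[0]
```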

The second term is more challenging. Theoretically, we would need a Gibbs sampler to calculate it using a Monte Carlo approach. One step of this sampler would proceed as follows.

  1. Given the values v of the visible units, calculate the resulting expectation values e
  2. Set hidden unit j to one with probability e_j
  3. For each visible unit i, calculate the conditional probability p_i to be one given the new values of the hidden units
  4. Set v_i to 1 with probability p_i

After some burn-in phase, we would then calculate the product v_i e_j after each step and take the average of these values.

The crucial point is that for a naive implementation, we would start the Gibbs sampling procedure during each gradient descent iteration from scratch, i.e. with some randomly initialized values for the visible units. One of the ideas behind the algorithm known as contrastive divergence that was proposed by G. Hinton in [1] is to restart the Gibbs sampler not at a random value, but at a randomly chosen vector from the data set! The idea behind this is that if we have been running the training for some time, the model distribution should be close to the empirical distribution of the data, so sampling a vector from the data should give us something close to the equilibrium state of the Gibbs sampling Markov chain (if you do not know what a Markov chain is – do not worry and just read on, I will cover Markov chains and the mathematics behind all this in a later post).

The second approximation that the contrastive divergence algorithm makes is to replace the expectation values in the positive and negative phase by a point estimate. For the positive phase, that means we simply calculate the value at one point from the data set. For the negative phase, we run the Gibbs sampling procedure – starting as explained above with a vector from the data set – and then simply compute the product v_i e_j for the result.

It now turns out that, based on empirical observations, these approximations work extremely well – in fact, it turns out that instead of running a full Gibbs sampler with a few hundred or even a few thousand steps, one step is often sufficient! This is surprising, but open to an intuitive explanation – we run all this within the outer loop provided by the gradient descent algorithm, and if we choose the learning rate sufficiently small, the parameters do not change a lot between these steps, so that we effectively do something that is close to one long Gibbs sampling Markov chain.

With these simplifications, the contrastive divergence algorithm now looks as follows.

FOR EACH iteration DO
    Sample a vector v from the data set
    SET e = \sigma(\beta( W^T v + c))
    FOR EACH hidden unit DO
        SET h_j = 1 with probability e_j
    FOR EACH visible unit DO
        SET \bar{v}_i = 1 with probability \sigma(\beta (W h + b))_i
    SET \bar{e} = \sigma(\beta (W^T \bar{v} + c))
    SET W = W + \lambda \beta \left[ v e^T - \bar{v} \bar{e}^T \right]
    SET b = b + \lambda \beta \left[ v - \bar{v} \right]
    SET c = c + \lambda \beta \left[ e - \bar{e} \right]


The first six lines within an iteration constitute one Gibbs sampling step, starting with a value for the visible units from the data set, sampling the hidden units from the visible units and sampling the visible units from the hidden units. In the next line, we recalculate the expectation values of the hidden units given the (updated) values of the visible units. The value \bar{v}_i \bar{e}_j is then the contribution of the negative phase to the update of W_{ij}. We can summarize the contributions for all pairs of indices as the matrix \bar{v} \bar{e}^T. Similarly, the positive phase contributes with v e^T. In the next line, we update W with both contributions, where \lambda is the learning rate. We then apply similar update rules to the bias for visible and hidden units – the derivation of these update rules from the expression for the likelihood function is similar to the derivation of the update rules for the weights as shown in my last post.

Let us now implement this in Python. To have a small data set for our tests, we will use an artificial data set called bars and stripes that I first saw in [3]. Given a number N, we can create an image with N x N pixels for every number x smaller than 2^N as follows. Each row corresponds to one binary digit of x. If this digit is one, the entire row is black, i.e. we have one black horizontal stripe, otherwise the entire row is white. A second set of patterns is obtained by coloring the columns similarly instead of the rows. Thus we obtain 2^{N+1} possible patterns, more than enough for our purposes. I have written a helper class BAS in Python that creates these patterns.
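
To make the construction concrete, here is a minimal sketch of how such patterns could be generated with numpy. This is not the BAS helper class from the repository, and the function name bas_patterns is made up for this example. Note that the all-white and the all-black image appear once as a row pattern and once as a column pattern.

```python
import numpy as np

def bas_patterns(N):
    """Return a list of N x N arrays: for each x, one row pattern and one column pattern."""
    patterns = []
    for x in range(2 ** N):
        # Binary digits of x, one digit per row
        bits = np.array([(x >> k) & 1 for k in range(N)])
        # Row k is filled entirely with digit k of x
        rows = np.tile(bits.reshape(N, 1), (1, N))
        patterns.append(rows)
        # Transposing colors the columns instead of the rows
        patterns.append(rows.T)
    return patterns
```

For N = 2, this yields eight images of shape (2, 2).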

Next, let us turn to the actual RBM. We store the current state of the RBM in a class RBM that is initialized as follows.

import numpy as np
from scipy.special import expit

class RBM:

    def __init__ (self, visible = 8, hidden = 3, beta = 1):
        self.visible = visible
        self.hidden = hidden
        self.beta = beta
        self.W = np.random.normal(loc = 0, scale = 0.01, size = (visible, hidden))
        self.b = np.zeros(shape = (1,visible))
        self.c = np.zeros(shape = (1,hidden))

Here W is the weight matrix, beta is the inverse temperature, and b and c are the bias vectors for the visible and hidden units.

Next we need a method that runs one step in a Gibbs sampling chain, starting with a state of the visible units captured in a matrix V (we calculate this in a mini-batch for more than one sample at a time, each row in the matrix represents one sample vector). Using once more the numpy library, this can be done as follows.

def runGibbsStep(self, V, size = 1):
    # Sample hidden units from visible units
    E = expit(self.beta*(np.matmul(V, self.W) + self.c))
    U = np.random.random_sample(size=(size, self.hidden))
    H = (U <= E).astype(int)
    # and now sample visible units from hidden units
    P = expit(self.beta*(np.matmul(H, np.transpose(self.W)) + self.b))
    U = np.random.random_sample(size=(size, self.visible))
    return (U <= P).astype(int), E

With this method at hand – which returns the new value for the visible units but the old value for the conditional expectation of the hidden units – we can now code our training routine.

def train(self,  V, iterations = 100, step = 0.01):
    batch_size = V.shape[0]
    # Do the actual training. First we calculate the expectation
    # values of the hidden units given the visible units. The result
    # will be a matrix of shape (batch_size, hidden)
    for _ in range(iterations):
        # Run one Gibbs sampling step and obtain new values
        # for visible units and previous expectation values
        Vb, E = self.runGibbsStep(V, batch_size)
        # Calculate new expectation values
        Eb = expit(self.beta*(np.matmul(Vb, self.W) + self.c))
        # Calculate contributions of positive and negative phase
        # and update weights and bias
        pos = np.tensordot(V, E, axes=((0),(0)))
        neg = np.tensordot(Vb, Eb, axes=((0),(0)))
        dW = step*self.beta*(pos -neg) / float(batch_size)
        self.W += dW
        self.b += step*self.beta*np.sum(V - Vb, 0) / float(batch_size)
        self.c += step*self.beta*np.sum(E - Eb, 0) / float(batch_size)

Let us now play around with this network a bit and visualize the training results. To do this, clone my repository and then run the simulation using

$ git clone
$ cd MachineLearning
$ python  --run_reconstructions=1 --show_metrics=1

This will train a restricted Boltzmann machine on 20 images out of the BAS dataset with N=6. For the training, I have used standard parameters (which you can change using the various command line switches, use --help to see which parameters are available). The learning rate was set to 0.05. The number of iterations during training was set to 30,000, and 16 hidden units are used. The inverse temperature \beta is set to 2.0. In each iteration, a mini-batch of 10 patterns is trained.

After every 500 iterations, the script prints out the current value of the reconstruction error. This is defined to be the norm of the difference between the value of the visible units when the Gibbs sampling step starts and the value after completing the Gibbs sampling step, i.e. this quantity measures how well the network is able to reconstruct the value of the visible units from the hidden units alone.
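
In code, this quantity could be computed as follows. The sketch assumes that V holds the visible units before the Gibbs sampling step and Vb the values afterwards, with one sample per row, and it uses the Frobenius norm over the whole batch. Which norm and normalization the script actually applies is not spelled out here, so treat this as one possible reading.

```python
import numpy as np

def reconstruction_error(V, Vb):
    """Norm of the difference between the visible units before and after a Gibbs step."""
    return np.linalg.norm(V - Vb)
```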

After the training phase is completed, the script will select eight patterns randomly. For each of these patterns, it will flip a few bits and then run 100 Gibbs sampling steps. If the training was successful, we expect that the result will be a reconstruction of the original image, i.e. the network would be able to match the distorted images to the original patterns.

When all the calculations have been completed, the network will display two images. The first image should roughly look like the image below.


This matrix visualizes the result of the reconstruction process described above. Each of the rows shows the outcome for one of the eight selected patterns. The first image in each row is the original pattern from the BAS data set. The second one is the distorted image in which some pixels have been flipped. The third image shows the result of the reconstruction run after 50 Gibbs iterations, and the last image shows the result after the full 100 iterations.

We see that in most cases, the network is able to correctly reconstruct the original image. However, there are also a few rows that look suspicious. In the first row, we could hope that the network eventually converges if we execute more sampling steps. In the third row, however, the network converges to a member of the BAS data set, but to the wrong one.

The second diagram that the script produces displays the change to the weights after each iteration and the reconstruction error.


We see that both quantities quickly get smaller, but never stabilize at exactly zero. This is not really surprising – as we work with a non-zero temperature, we will always have some thermal fluctuations and the reconstruction error will never be constantly zero, but oscillate around a small value.

I invite you to play around with the parameters a bit to see how the network behaves. We can change the value of the inverse temperature with the parameter --beta, the number of hidden units with the parameter --hidden, the number of Gibbs steps used during the reconstruction with --sample and the step size with --step. If, for instance, you raise the temperature, the fluctuations of the reconstruction error will increase. If, on the other hand, we choose a very small temperature, the network converges very slowly. Making the step size too small or too large can also lead to non-convergence etc.

That completes this post on contrastive divergence. In the next post, I will show you an alternative algorithm that has gained a lot of popularity called persistent contrastive divergence (PCD), before we finally set out to implement a restricted Boltzmann machine on a GPU using the TensorFlow framework.
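
To get a feeling for the role of the temperature, it helps to look at the probability \sigma(\beta a) for a single unit with a fixed activation a. The small sketch below is independent of the training script, and the function name is made up for this example.

```python
import math

def on_probability(activation, beta):
    """Probability sigma(beta * a) that a unit with activation a is switched on."""
    return 1.0 / (1.0 + math.exp(-beta * activation))

# Same activation, increasing inverse temperature
for beta in (0.5, 2.0, 8.0):
    print(beta, round(on_probability(1.0, beta), 3))
```

For a small \beta, i.e. a high temperature, the probability stays close to 0.5 and the unit behaves almost randomly. For a large \beta, it approaches a deterministic decision.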

[1] G. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation Vol. 14, No. 8 (2002), 1771–1800
[2] G. Hinton, A practical guide to training restricted Boltzmann machines, Technical Report University of Toronto TR-2010-003 (2010)
[3] D. MacKay, Information Theory, Inference and Learning Algorithms, section 43, available online

Docker internals: process isolation with namespaces and cgroups

A couple of years back, when I first looked into Docker in more detail, I put together a few pages on how Docker is utilizing some Linux kernel technologies to realize process isolation. Recently I have been using Docker again, so I thought it would be a good point in time to dig out some of that and create two or maybe three posts on some Docker internals. Let's get started.

Container versus virtual machines

You probably have seen the image below or a similar image before, but for the sake of completeness let us quickly recap what the main difference between a container like Docker and a virtual machine is.


On the left hand side, we see a typical stack when full virtualization is used. The exact setup will depend on the virtualization model that is being used, but in many cases (like running VirtualBox on Linux), you will have the actual hardware, a host operating system like Linux that consists of the OS kernel and on top of that a file system, libraries, configuration files etc. On these layers, the virtual machine is executing as an application. Inside the virtual machine, the guest OS is running. This could again be Linux, but could be a different distribution, a different kernel or even a completely different operating system. Inside each virtual machine, we then again have an operating system kernel, all required libraries and finally the applications.

This is great for many purposes, but also introduces some overhead. If you decide to slice and dice your applications into small units like microservices, your actual applications can be rather small. However, for every application, you still need the overhead of a full operating system. In addition, a full virtualization will typically also consume a few resources on the host system. So full virtualization might not always be the perfect solution.

Enter containers. In a container solution, there is only one kernel – in fact, all containers and the applications running in them use the same kernel, namely the kernel of the host OS. At least logically, however, they all have their own root file system, libraries and so on. Thus containers still have the benefit of a certain isolation, i.e. different applications running in different containers are still isolated on the file system level, can use networking resources like ports and sockets without conflicting and so forth, while reducing the overhead by sharing the kernel. This makes containers a good choice if you can live with the fact that all applications run on one OS and kernel version.

But how exactly does the isolation work? How does a container create the illusion for a process running inside it that it is the exclusive user of the host operating system? It turns out that Docker uses some technologies built into the Linux kernel to do this. Let us take a closer look at those core technologies one by one.

Core technologies

Let us start with namespaces. If you know one or more programming languages, you have probably heard that term before – variables and other objects are assigned to namespaces so that a module can use a variable x without interfering with a variable of the same name in a different module. So namespaces are about isolation, and that is also their role in the container world.

Let us look at an example to understand this. Suppose you want to run a web server inside a container. Most web servers will try to bind to port 80 on startup. Now at any point in time, only one application can listen on that port (with the same IP address). So if you start two containers that both run a web server, you need a mechanism to make sure that both web servers can – inside their respective container – bind to port 80 without a conflict.
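
The conflict is easy to reproduce without any containers. The following Python sketch binds two sockets within the same network namespace to the same address and port and reports whether the second bind is refused. It uses a free unprivileged port picked by the kernel instead of port 80, and the function name is made up for this illustration.

```python
import errno
import socket

def second_bind_fails(host="127.0.0.1"):
    """Bind two sockets to the same address and port; True if the second bind is refused."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))          # port 0: let the kernel pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same address and port again
        except OSError as e:
            return e.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()
```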

A similar issue exists for process IDs. In the “good old days”, a Linux process was uniquely identified by its process ID, so there was exactly one process with ID 1 – usually the init process. ID 1 is a bit special, for instance when it comes to signal handling, and usually different copies of the user space OS running in different containers will all try to start their own init process and might rely on it having process ID one. So again, there is a clash of resources and we need some magic to separate them between containers.

That is exactly what Linux namespaces can do for you. Citing from the man page,

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.

In fact, Linux offers namespaces for different types of resources – networking resources, but also mount points, process trees or the good old System V inter-process communication devices. Individual processes can join a namespace or leave a namespace. If you spawn a new process using the clone system call, you can either ask the kernel to assign the new process to the same namespaces as the parent process, or you can create new namespaces for the child process.

Linux exposes the existing namespaces as symbolic links in the directory /proc/XXXX/ns, where XXXX is the process id of the respective process. Let us try this out. In a terminal, enter (recall that $$ expands to the PID of the current process)

$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 net -> net:[4026531957]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 pid -> pid:[4026531836]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 user -> user:[4026531837]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 uts -> uts:[4026531838]

Here you see that each namespace to which the process is assigned is represented by a symbolic link in this directory (you could use ls -Li to resolve the link and display the actual inode to which it is pointing). If you compare this with the content of /proc/1/ns (you will need to be root to do this), you will probably find that the inodes are the same, so your shell lives in the same namespace as the init process. We will later see how this changes when you run a shell inside a container.
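
The same comparison can be done from Python by reading the symbolic links with os.readlink. This is a minimal sketch, and the function names and the proc_root parameter are chosen for this example. Reading the links of processes owned by other users, like PID 1, typically requires root.

```python
import os

def read_namespaces(pid="self", proc_root="/proc"):
    """Return {namespace type: link target} for one process, e.g. {'net': 'net:[4026531957]'}."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

def shared_namespaces(a, b):
    """Namespace types for which two processes point to the same namespace."""
    return sorted(name for name in a if a[name] == b.get(name))
```

For a shell running directly on the host, shared_namespaces(read_namespaces(), read_namespaces(1)) would be expected to list all namespace types.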

Namespaces already provide a great deal of isolation. There is one namespace, the mount namespace, which even allows different processes to mount different volumes as root directory, and we will later see how Docker uses this to realize file system isolation. Now, if every container really had its own, fully independent root file system, this would again introduce a high overhead. If, for instance, you run two containers that both use Ubuntu Linux 16.04, a large part of the root file system will be identical and therefore duplicated.

To be more efficient, Docker therefore uses a so called layered file system or union mount. The idea behind this is to merge different volumes into one logical view. Suppose, for instance, that you have a volume containing a file /fileA and another volume containing a file /fileB. With a traditional mount, you could mount any of these volumes and would then see either file A or file B. With a union mount, you can mount both volumes and would see both files.

That sounds easy, but is in fact quite complex. What happens, for instance, if both volumes contain a file called /fileA? To make this work, you have to add layers, where each layer will overlay files that already exist in lower layers. So your mounts will start to form a stack of layers, and it turns out that this is exactly what we need to efficiently store container images.

To understand this, let us again look at an example. Suppose you run two containers which are both based on the same image. What then essentially happens is that Docker will create two union mounts for you. The lowest layer in both mounts will be identical – it will simply be the common image. The second layer, however, is specific to the respective container. When you now add or modify a file in one container, this operation changes only the layer specific to this container. The files which are not modified in any of the containers continue to be stored in the common base layer. Thus, unless you execute heavy write operations in the container, the specific layers will be comparatively small, reducing the overhead greatly. We will see this in action soon.
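
The lookup logic of such a layered file system can be mimicked in a few lines of Python using collections.ChainMap, where a dictionary stands in for each layer. This is only an analogy to illustrate the idea, not how a union file system is implemented. Deleting files from lower layers, for instance, is not modeled.

```python
from collections import ChainMap

# Read-only base layer shared by both containers (path -> content)
image = {"/etc/os-release": "Alpine", "/bin/sh": "busybox"}

# Each container gets its own, initially empty, writable layer on top
container1 = ChainMap({}, image)
container2 = ChainMap({}, image)

container1["/test"] = "created in container 1"   # lands in container1's layer only
container1["/etc/os-release"] = "modified"       # shadows the base file, base untouched

print(container1["/etc/os-release"])  # modified
print(container2["/etc/os-release"])  # Alpine
print("/test" in container2)          # False
print(container1.maps[0])             # the small container-specific layer
```

Writes always go to the topmost layer, so the shared image stays unchanged and each container-specific layer only holds what was added or modified.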

Finally, the last ingredient that we need are control groups, abbreviated as cgroups. Essentially, cgroups provide a way to organize Linux processes into hierarchies in order to manage resource limits. Being hierarchies, cgroups are again exposed as part of the file system. On my machine, this looks as follows.

chr:~$ ls /sys/fs/cgroup/
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer  hugetlb  memory  net_cls  net_cls,net_prio  net_prio  perf_event  pids  systemd

We can see that there are several directories, each representing a specific type of resources that we might want to manage. Each of these directories can contain an entire file system tree, where each directory represents a node in the hierarchy. Processes can be assigned to a node by adding their process ID to the file tasks that you will find in each of these nodes. Again, the man page turns out to be a helpful resource and explains the meaning of the different entries in the /sys/fs/cgroup directories.
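
Assigning a process to a node therefore boils down to a file write. The sketch below assumes the cgroup layout shown above with a tasks file in every node. The function names are made up, and on a real system the write usually requires root privileges. On the cgroup file system, writing a PID adds that process to the node rather than replacing the existing entries.

```python
import os

def assign_to_cgroup(node_dir, pid):
    """Attach a process to a cgroup node by writing its PID to the node's tasks file."""
    with open(os.path.join(node_dir, "tasks"), "w") as f:
        f.write(str(pid))

def tasks_in_cgroup(node_dir):
    """List the PIDs currently attached to a cgroup node."""
    with open(os.path.join(node_dir, "tasks")) as f:
        return [int(entry) for entry in f.read().split()]
```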


Let us now see how all this works in practice. For that purpose, let us open two terminals. In one of the terminals – let me call this the container terminal – start a container running the Alpine distribution using

docker run --rm -it alpine

In the second window which I will call the host window, we can now use ps -axf to inspect the process tree and then look at the directories in /proc to browse the namespaces.

What you will find is that there are three processes involved. First, there is the docker daemon itself, called dockerd. In my case, this process has PID 1496. Then, there is a child process called docker-containerd, which is the actual container runtime within the Docker architecture stack. This process in turn calls a process called docker-containerd-shim (PID 10126) which then spawns the shell (PID 10142 in my case) inside the container.



Now let us inspect the namespaces associated with these processes first. We start with the shell itself.

$ sudo ls -Lil /proc/10142/ns
total 0
4026531835 -r--r--r-- 1 root root 0 Apr  9 11:20 cgroup
4026532642 -r--r--r-- 1 root root 0 Apr  9 11:20 ipc
4026532640 -r--r--r-- 1 root root 0 Apr  9 11:20 mnt
4026532645 -r--r--r-- 1 root root 0 Apr  9 10:48 net
4026532643 -r--r--r-- 1 root root 0 Apr  9 11:20 pid
4026531837 -r--r--r-- 1 root root 0 Apr  9 11:20 user
4026532641 -r--r--r-- 1 root root 0 Apr  9 11:20 uts

Let us now compare this with the namespaces to which the containerd-shim process is assigned.

$ sudo ls -Lil /proc/10126/ns
total 0
4026531835 -r--r--r-- 1 root root 0 Apr  9 11:21 cgroup
4026531839 -r--r--r-- 1 root root 0 Apr  9 11:21 ipc
4026531840 -r--r--r-- 1 root root 0 Apr  9 11:21 mnt
4026531957 -r--r--r-- 1 root root 0 Apr  9 08:49 net
4026531836 -r--r--r-- 1 root root 0 Apr  9 11:21 pid
4026531837 -r--r--r-- 1 root root 0 Apr  9 11:21 user
4026531838 -r--r--r-- 1 root root 0 Apr  9 11:21 uts

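We see that Docker did in fact create new namespaces for almost all namespace types (ipc, mnt, net, pid, uts). Only the cgroup and user namespaces have the same inode numbers in both listings and are therefore shared with the containerd-shim process.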
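The comparison of the two listings can also be scripted. The following is only a sketch: the helper names namespace_ids and differing_namespaces are made up for this illustration, and reading /proc/<pid>/ns for processes of other users typically requires root. The sample values are the inode numbers copied from the two listings above.

```python
import os


def namespace_ids(pid):
    """Map namespace name to inode number for the given PID (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    # os.stat follows the symbolic link, like the -L switch of ls
    return {name: os.stat(os.path.join(ns_dir, name)).st_ino
            for name in sorted(os.listdir(ns_dir))}


def differing_namespaces(ids_a, ids_b):
    """Return the sorted names of the namespaces whose inode numbers differ."""
    common = ids_a.keys() & ids_b.keys()
    return sorted(name for name in common if ids_a[name] != ids_b[name])


# Inode numbers of the container shell (PID 10142 in the listing above)
shell_ns = {"cgroup": 4026531835, "ipc": 4026532642, "mnt": 4026532640,
            "net": 4026532645, "pid": 4026532643, "user": 4026531837,
            "uts": 4026532641}
# Inode numbers of the containerd-shim process (PID 10126)
shim_ns = {"cgroup": 4026531835, "ipc": 4026531839, "mnt": 4026531840,
           "net": 4026531957, "pid": 4026531836, "user": 4026531837,
           "uts": 4026531838}

print(differing_namespaces(shell_ns, shim_ns))
# ['ipc', 'mnt', 'net', 'pid', 'uts']
```

On a live system, differing_namespaces(namespace_ids(10142), namespace_ids(10126)) should give the same answer, with the PIDs replaced by the ones on your machine.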

Next, let us compare mount points. Inside the container, run mount to see the existing mount points. Usually at the very top of the output, you should see the mount for the root filesystem. In my case, this was

none on / type aufs (rw,relatime,si=e92adf256343919e,dio,dirperm1)

Running the same command on the host system, I got a line like

none on /var/lib/docker/aufs/mnt/a9c5d26a45307d4e168b3936bd65d301c8dd039336083a324ed1a0b7c2bd0c52 type aufs (rw,relatime,si=e92adf256343919e,dio,dirperm1)

The identical si attribute tells you that this is in fact the same mount. You can also verify this directly. Inside the container, create a file test using touch test. If you then use ls to display the contents of the mount point as seen on the host, you should actually see this file. So the host process and the process inside the container see different mount points – made possible by the namespace technology! You can now access the files from inside the container or from outside the container without having to use docker exec (though I am not sure I would recommend this).
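If you want to compare the si attribute programmatically, a few lines of Python are enough. This is a minimal sketch: parse_mount_line is a hypothetical helper, and it assumes the output format of mount shown above, with no spaces in the mount point.

```python
import re

# Matches lines of the form "<source> on <target> type <fstype> (<options>)"
MOUNT_LINE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) "
    r"\((?P<options>[^)]*)\)$")


def parse_mount_line(line):
    """Split one line of mount output into target, type and an options dict."""
    match = MOUNT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected mount line: {line!r}")
    options = {}
    for item in match.group("options").split(","):
        key, _, value = item.partition("=")
        options[key] = value  # flags without a value map to ""
    return match.group("target"), match.group("fstype"), options


# The two lines quoted above, as seen in the container and on the host
container_line = "none on / type aufs (rw,relatime,si=e92adf256343919e,dio,dirperm1)"
host_line = ("none on /var/lib/docker/aufs/mnt/"
             "a9c5d26a45307d4e168b3936bd65d301c8dd039336083a324ed1a0b7c2bd0c52"
             " type aufs (rw,relatime,si=e92adf256343919e,dio,dirperm1)")

container_si = parse_mount_line(container_line)[2]["si"]
host_si = parse_mount_line(host_line)[2]["si"]
print(container_si == host_si)
```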

If you want, you can even trace the individual layers of this file system on the host system by using ls /sys/fs/aufs/si_e92adf256343919e/ and printing out the contents of the various files that you will find there – you will find that there are in fact two layers, one of them being a read-only layer and the second on top being a read-write layer.

ls /sys/fs/aufs/si_e92adf256343919e/
br0  br1  br2  brid0  brid1  brid2  xi_path
root:~# cat /sys/fs/aufs/si_e92adf256343919e/xi_path
root:~# cat /sys/fs/aufs/si_e92adf256343919e/br0
root:~# cat /sys/fs/aufs/si_e92adf256343919e/br1

You can even “enter the container” using the nsenter Linux command to manually attach to defined namespaces of a process. To see this, enter

$ sudo nsenter -t 10142 -m -p -u "/bin/sh"
/ # ls
bin    dev    etc    home   lib    media  mnt    proc   root   run    sbin   srv    sys    test   tmp    usr    var
/ #

in a host terminal. This will attach to the mount, PID and UTS namespaces of the target process specified via the -t parameter (in our case the PID of the shell inside the container) and run the specified command /bin/sh. As a result, you will see the same filesystem that is visible inside the container, including the file test created there.

Finally, let us take a look at the cgroups docker has created for this container. The easiest way to find them is to search for the first few characters of the container ID that you can figure out using docker ps (I have cut off some lines at the end of the output).

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
7c0f142bfbfd        alpine              "/bin/sh"           About an hour ago   Up About an hour                        quizzical_jones
$ find /sys/fs/cgroup/ -name "*7c0f142bfbfd*"

If you now inspect the tasks files in each of the newly created directories, you will find the PID of the container shell as seen from the root namespace, i.e. 10142 in this case.
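The search and the inspection of the tasks files can be combined in a short script. The sketch below assumes the cgroup v1 layout described in this post, where every node contains a file called tasks (on systems using cgroup v2 the corresponding file is cgroup.procs). The function name find_container_tasks is made up for this example.

```python
import os


def find_container_tasks(cgroup_root, container_id):
    """Map each cgroup directory whose name contains the container ID
    to the list of PIDs found in its tasks file."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(cgroup_root):
        if container_id in os.path.basename(dirpath) and "tasks" in filenames:
            with open(os.path.join(dirpath, "tasks")) as tasks_file:
                result[dirpath] = [int(line) for line in tasks_file
                                   if line.strip()]
    return result
```

Calling find_container_tasks("/sys/fs/cgroup", "7c0f142bfbfd") on the host should then return one entry per cgroup hierarchy, each containing the PID of the container shell.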

This closes this post which is already a bit lengthy. We have seen how Docker uses union file systems, namespaces and cgroups to manage and isolate containers and how we can link the resources as seen from within the container to resources on the host system. In the next posts, we will look in more detail at networking in Docker. Until then, you might want to consult the following links that contain additional material.

  • J. Pettazone has put slides from a talk online
  • The overview from the official Docker documentation briefly explains the core technologies
  • There are of course already many excellent blog posts on Docker internals, like the one by Julia Evans
  • The post from the Docker Saigon community

In case you are interested in container platforms in general, you also might want to take a look at my series on Kubernetes that I have just started.

Building a bitcoin container with Docker

When we are playing with bitcoin transactions, we need some playground where making a mistake does not cost us real bitcoins and therefore money. In addition, we might want to play around with more than one bitcoin server to see how networking works and how the messages are exchanged in the bitcoin peer-to-peer network.

There are several ways to do this. First, we could run more than one bitcoin server locally, and use different data directories and different configuration files and ports to separate them. However, there is a different option – we can use Docker containers. This will make it easy to spin up as many instances as needed while only having to put together the configuration once, supports networking and allows us to easily clean up after we have tried something and restart our scenarios from a well defined state. In this post, I will guide you through the steps needed to build and install the bitcoin core software inside a container. This post is not meant to be a general introduction into container technology and Docker, but do not worry – I will not assume that you are familiar with the concepts, but we will start from scratch. However, if you want to understand the basics, there are many good posts out there, like this introduction or the Docker overview page which is part of the official documentation.

First, we will need to make sure that we have the latest Docker CE installed. On Ubuntu 16.04, this requires the following steps. First, we need to remove any potentially existing docker versions.

$ sudo apt-get remove docker docker-engine

Next we will add the Docker repository to the repository list maintained by APT to be able to install from there. So we first use curl to get the public key from the docker page, feed this into the apt-key tool and then add the repository.

$ curl -fsSL | sudo apt-key add - 
$ sudo add-apt-repository "deb [arch=amd64] xenial stable"

We can then install docker from there and make sure that it is started at boot time using

$ sudo apt-get update
$ sudo apt-get install docker-ce
$ sudo systemctl enable docker

You also should make sure that there is a docker group to which your user is added – use the command groups to verify that this is the case and follow the instructions on the Docker page if not.

When everything is done, let us test our installation. We will ask docker to download and run the hello-world minimal image.

$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
ca4f61b1923c: Pull complete
Digest: sha256:97ce6fa4b6cdc0790cda65fe7290b74cfebd9fa0c9b8c38e979330d547d22ce1
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:

For more examples and ideas, visit:

Now we can start with the actual installation. At the end, of course, there will be a Dockerfile – but I will guide you through the individual steps one by one and we will create the final file iteratively. If you would only like to see the result, you can skip to the last paragraph of this post that explains how to retrieve this file from my GitHub repository and to use it … but of course you are more than welcome to join me on the way there.

Actually, we will even do a bit more than we would need if we only wanted to have a running bitcoind, but this post is also meant as a refresher if you have built docker images before and as a hands-on introduction if you have never done this.

As a basis, we will use the Alpine Linux distribution. From there, we have several options to proceed. Of course, the Alpine Linux distribution has a bitcoin package, but simply using that or any other precompiled binary would not give us the flexibility that we need – it is for instance extremely useful to be able to add debugging output to the source code if something is not as expected. Besides, it is more fun to compile from scratch. So our target will be to pull the source code, compile and install in a container.

Based on the Alpine Linux container, we will first create several container images with separate docker files that are based on each other, which will make it easier to iterate and check things manually as we go. In the end, we will assemble everything into one docker file.

As a first step, we add a few libraries to the Alpine image that we need to be able to fetch and compile the bitcoin source code. We call the target image alpine-dev as it can be reused as a development environment for other purposes as well. Here is the docker file

FROM alpine
RUN apk update && apk add git \
                          make \
                          file \
                          autoconf \
                          automake \
                          build-base \
                          libtool \
                          db-c++ \
                          db-dev \
                          boost-system \
                          boost-program_options \
                          boost-filesystem \
                          boost-dev \
                          libressl-dev

Save this file somewhere in a local directory (ideally, this directory is otherwise empty as during the build process, its context will be transferred to the docker daemon and used as build context, so we better keep this small) and then build the image using

docker build -f -t alpine-dev .

If you now check your local repository with docker images, you should see that the image has been added successfully (it is big, but do not worry, we fix that later).

The next image that we will be building is an image that is based on alpine-dev but contains the specific version of the bitcoin source code that we want, i.e. it pulls the code from the bitcoin directory. So its docker file is very short.

FROM alpine-dev
RUN git clone --branch v0.15.0 --single-branch

You can build this image with

docker build -f -t bitcoin-alpine-build .

Next we write the docker file that performs the actual build. This is again not difficult (but here it comes in handy that we can run a container from the image that we have just generated and play around with the configuration to arrive at the final set of options below).

FROM bitcoin-alpine-build
RUN (cd bitcoin  && ./ && \
                      ./configure --disable-tests \
                      --disable-bench --disable-static  \
                      --without-gui --disable-zmq \ 
                      --with-incompatible-bdb \
                      CFLAGS='-w' CXXFLAGS='-w' && \
                      make -j 4 && \
                      strip src/bitcoind && \
                      strip src/bitcoin-cli && \
                      strip src/bitcoin-tx && \
                      make install )

Again, we build the image and place it into our local repository using

docker build -f Dockerfile.install -t bitcoin-alpine-bin .

Let us now test our installation. We first bring up an instance of our new container.

docker run -it bitcoin-alpine-bin

Within the container, we can now run the bitcoin daemon.

bitcoind -server=1 -rest=1 -regtest -txindex=1 -daemon

Once the daemon has started, we can verify that it is up and running and ready to accept commands. We can for instance – from within the container – run bitcoin-cli -regtest getinfo and should see some status information.

So let us now move on to write the docker file that will create the run time environment for our daemon. There are still a couple of things we need to think about. First, we will have to communicate with our bitcoin server using JSON-RPC, so we need to expose the RPC port towards the host.

Second, we need a configuration file. We could generate this on the fly, but the easiest approach is to place this in the build context and copy it when we run the daemon.

Finally, we have to think about the RPC authorization mechanism. Typically, the bitcoin server writes a cookie file into the configuration directory which is then picked up by the client, but this does not work if the server is running inside the container and the client locally on our host system. Probably the safest way would be to use the option to provide a hashed password to the server and keep the actual password secure. We will use a different approach which is of course not recommended for production use as it could allow the world access to your wallet – we specify the username and password in the configuration file. Our full configuration file is


Note that this opens the RPC port to the world and is probably not secure, but for use in a test setup this will do. Our docker file for the last stage is now

FROM bitcoin-alpine-bin

# Copy the bitcoin.conf file from
# the build context into the container
COPY bitcoin.conf /bitcoin.conf

# Expose the port for the RPC interface
EXPOSE 18332/tcp

# Start the bitcoin server
ENTRYPOINT ["/usr/local/bin/bitcoind"]
CMD ["-conf=/bitcoin.conf", "-regtest", "-rest=1", "-server=1", "-printtoconsole", "-txindex=1"]

We can now test our client. First, we build and run the container.

$ docker build -f -t bitcoin-alpine-run .
$ docker run --rm -it -p 18332:18332 bitcoin-alpine-run

Then, in a second terminal, let us connect. We assume that you have an installation of the bitcoin-cli client on the host machine as well. Run

$ bitcoin-cli -regtest -rpcuser=user -rpcpassword=password getinfo

You should now see the usual output of the getinfo command similar to the test that we have done before directly in the container.
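If you do not have bitcoin-cli on the host, the same call can be made directly over JSON-RPC. The following sketch uses only the Python standard library. The function names build_rpc_request and call_rpc are made up for this example, and the default credentials are the ones used in the bitcoin-cli call above.

```python
import base64
import json
import urllib.request


def build_rpc_request(method, params=None, user="user", password="password",
                      url="http://127.0.0.1:18332/"):
    """Build an HTTP POST request carrying a JSON-RPC call with basic auth."""
    payload = json.dumps({"jsonrpc": "1.0", "id": "sketch",
                          "method": method, "params": params or []})
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    headers = {"Content-Type": "application/json",
               "Authorization": "Basic " + token.decode("ascii")}
    return urllib.request.Request(url, data=payload.encode("utf-8"),
                                  headers=headers)


def call_rpc(method, params=None, **kwargs):
    """Send the request to the server and return the result field."""
    request = build_rpc_request(method, params, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["result"]
```

With the container from the previous step running, call_rpc("getinfo") should return the same status information as the bitcoin-cli call.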

That is nice, but there is a serious issue. If you look at the image that we have just created using docker images, you will most likely be shocked – our image is more than 800 MB in size. This is a problem. The reason is that our image contains everything that we have left behind during the build process – the development environment, header files, the source code, object files and so on. It would be much nicer if we could build a clean image that only contains the executables and libraries needed at runtime.

Fortunately, current versions of Docker offer a feature called multi-stage build which is exactly what we need. Let us take a look at the following docker file to explain this.

FROM bitcoin-alpine-bin as build

RUN echo "In build stage"

FROM alpine

# Copy the binaries from the build to our new container
COPY --from=build /usr/local/bin/bitcoind /usr/local/bin

# Install all dependencies
RUN apk update && apk add boost boost-filesystem \
            boost-program_options \
            boost-system boost-thread busybox db-c++ \
            libevent libgcc libressl2.6-libcrypto \ 
            libstdc++ musl

# Copy the bitcoin.conf file from
# the build context into the container
COPY bitcoin.conf /bitcoin.conf

# Expose the port for the RPC interface
EXPOSE 18332/tcp

# Start the bitcoin server
ENTRYPOINT ["/usr/local/bin/bitcoind"]
CMD ["-conf=/bitcoin.conf", "-regtest", "-rest=1", "-server=1", "-printtoconsole", "-txindex=1"]

We see that this docker file has two FROM statements. When it is executed, it starts with the image specified by the first FROM statement. In our case, this is the bin image that already contains the executable binaries. In this stage, we do nothing in our simple example but print a message. Then, with the second FROM command, Docker will start a new image based on the raw Alpine image from scratch. However – and this is the magic – we still have access to the files from the first image, and we can copy them to our new image using the --from specifier. This is what we do here – we copy the executable into our new container. Then we add only the runtime libraries that we really need and the configuration file.

This gives us a nice and small container – when I checked it was below 30MB!

In further posts, I will assume that you have built this container and made it available in your local Docker repository as bitcoin-alpine. If you have not followed all the steps of this post, you can simply pull a docker file that merges all the files explained above in one file and use it to build this container as follows.

$ git clone
$ cd bitcoin/docker
$ docker build --rm -f Dockerfile -t bitcoin-alpine .

In the version of the dockerfile on GitHub, I have also added a startup script that overwrites the default credentials if the environment variables BC_RPC_USER and BC_RPC_PASSWORD are set – this makes it easier to use custom credentials, for instance in a Kubernetes environment.

That was it for today. Starting with the next post, we will use this test installation to see how we can create and publish a full transaction in the bitcoin network.

Restricted Boltzmann machines

In the previous post, we have seen that a Boltzmann machine as studied so far suffers from two deficiencies. First, training is very slow as we have to run a Gibbs sampler until convergence for every iteration of the gradient descent algorithm. Second, we can only see the second moments of the data distribution and the learning rule ignores higher moments.

A class of networks called Restricted Boltzmann machines (RBM) has been designed to overcome these problems. An RBM is a Boltzmann machine with two additional architectural features. First, it has hidden units. This simply means that we split the set of all units in the network into two disjoint sets called visible units and the said hidden units. When we train the network, we connect the data samples only to the visible units. The hidden units, however, also follow the dynamical rules of the network and serve as latent variables – you can think of them as additional parameters of the network which are adapted during training but are not directly prescribed by the training set, similar to a hidden layer in a feed-forward neural network.

Second, in a restricted Boltzmann machine, certain restrictions on the weights are in effect. Specifically, we only allow hidden units to be connected to visible units and vice versa, so there are no connections between hidden units and no connections between visible units. Effectively, a restricted Boltzmann machine is therefore organised in two layers – one layer containing the hidden units and one layer containing the visible units, as shown below.


What does this imply for the mathematical description of the network? In fact, we will see that this simplifies things considerably. First, corresponding to the differentiation between hidden and visible units, our index set can be written as

\{ 1, \dots, N \} = I_v \cup I_h

so that unit i is a visible unit if i is in the set I_v and a hidden unit if i is in the set I_h. Second, it is common to use 0 and 1 as states instead of -1 and +1. Our state space then splits

\{ 0, 1\}^N = {\mathcal S} = {\mathcal V} \times \mathcal {H}

and correspondingly we can write any state as

s = (v,h)

where v specifies the state of the visible units and h the state of the hidden units. As only visible units correspond to actual input, the purpose of the training phase is now to adjust the marginal distribution

P(v) = \sum_h P(v,h) = \frac{1}{Z} \sum_h e^{-\beta E(v,h)}

such that it is as close as possible to the empirical distribution of the training data.

The expression for the energy also simplifies greatly, as all terms involving only hidden units and only visible units disappear. If we replace the matrix W that contains all connections by a reduced matrix – that we again call W – that only contains the remaining connections between visible and hidden units, we can express the energy as

E(v,h) = - \sum_{i \in I_v, j \in I_h} W_{ij} v_i h_j

In addition, we will now also add an explicit bias to both the hidden and visible units, so that our full energy is

E(v,h) = - \sum_{i \in I_v, j \in I_h} W_{ij} v_i h_j - \sum_i v_i b_i - \sum_j h_j c_j

Of course the matrix W is now no longer symmetric and not even square (as the number of hidden units will in general not be the same as the number of visible units).

We can now again calculate the update rules as before. First, we write down the likelihood function

l({\mathcal D} | W) = - \frac{1}{K} \ln P({\mathcal D} | W) = - \frac{1}{K} \sum_k \ln \sum_h e^{-\beta E(v^{(k)},h)}+ \ln Z

where now v^{(k)} is the k-th sample point corresponding to a set of values for the visible units.

Again we will need the derivatives of this with respect to the weights. For the second term – the logarithm of the partition function – we have already seen in the last post how this works. Recalling the results from this post, we easily find that

\frac{\partial}{\partial W_{ij}} \ln Z = - \beta \langle \frac{\partial E}{\partial W_{ij}} \rangle_P = \beta \langle v_i h_j \rangle_P

so that the derivative is again an expectation value which we could try to approximate using a sample of the model distribution. The first term requires a bit more work. Let us first calculate

\frac{\partial }{\partial W_{ij}} \ln \sum_h e^{-\beta E(v,h)} = \frac{1}{Z P(v)} \sum_h \frac{\partial }{\partial W_{ij}} e^{-\beta E(v,h)}= - \beta \sum_h \frac{\partial E(v,h)}{\partial W_{ij}} P(h | v)

But this is again an expectation value, this time it is an expectation value with respect to the conditional distribution of the hidden units given the visible units.

\frac{\partial }{\partial W_{ij}} \ln \sum_h e^{-\beta E(v,h)} = - \beta \langle \frac{\partial E(v,h)}{\partial W_{ij}} \rangle_{P(\cdot | v)}

The derivative of the energy with respect to the weights is as above, and we finally obtain the following update rule for the weights:

\Delta W_{ij} = \lambda \beta \left[ \langle \langle v_i h_j \rangle_{P(\cdot | v)} \rangle_{\mathcal D} - \langle v_i h_j \rangle_P \right]

Note that the first term is a double expectation value – for each sample v^{(k)} for the visible units, we use the expectation value under the conditional distribution over the hidden units given this value for the visible units.

Now let us start to simplify this expression a bit further, leveraging the restrictions on the geometry of the network. Let us first try to find an expression for the conditional probability

P(h_j = 1 | v)

This is in fact easy to calculate in our situation. As the state of a hidden unit does not depend on the other hidden units, but only on the visible units, we find that

P(h_j = 1 | v)= \sigma(\beta (\sum_i W_{ij} v_i + c_j)) = \sigma(\beta a_j)

where

a_j = \sum_i W_{ij} v_i + c_j

is the activation of the hidden unit j. Using this, we can already simplify the first term in the update rule as follows:

\langle v_i h_j \rangle_{P(\cdot | v)} = \sum_h P(h | v) v_i h_j = v_i \sum_{h : h_j = 1} P(h | v)

But this is of course nothing but

v_i P(h_j = 1 | v)

so that we eventually find

\langle v_i h_j \rangle_{P(\cdot | v)} = v_i \sigma (\beta a_j)
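This identity is easy to check numerically for a tiny network by summing over all hidden states explicitly. The following sketch does this in plain Python. The weights, biases and helper names are arbitrary choices made for this illustration, with W[i][j] connecting visible unit i to hidden unit j as in the formulas above.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, W, b, c):
    """Energy E(v,h) of the RBM including both bias terms."""
    interaction = sum(W[i][j] * v[i] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return (-interaction
            - sum(vi * bi for vi, bi in zip(v, b))
            - sum(hj * cj for hj, cj in zip(h, c)))


def conditional_expectation(i, j, v, W, b, c, beta=1.0):
    """<v_i h_j> under P(h | v), computed by brute force over all h."""
    states = list(itertools.product([0, 1], repeat=len(c)))
    weights = [math.exp(-beta * energy(v, h, W, b, c)) for h in states]
    total = sum(weights)
    return sum(w * v[i] * h[j] for w, h in zip(weights, states)) / total


def closed_form(i, j, v, W, c, beta=1.0):
    """The simplified expression v_i * sigma(beta * a_j)."""
    a_j = sum(W[k][j] * v[k] for k in range(len(v))) + c[j]
    return v[i] * sigmoid(beta * a_j)


# A small example with three visible and two hidden units
W = [[0.5, -1.0], [0.3, 0.8], [-0.7, 0.2]]
b = [0.1, -0.2, 0.0]
c = [0.4, -0.3]
v = [1, 0, 1]

for i in range(len(b)):
    for j in range(len(c)):
        brute = conditional_expectation(i, j, v, W, b, c)
        assert abs(brute - closed_form(i, j, v, W, c)) < 1e-12
```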

A similar argument works for the second term in the update rule. We have

\langle v_i h_j \rangle_P = \sum_v \sum_h v_i h_j P(v,h) = \sum_v v_i P(v) \sum_h h_j P(h | v)

Now the second sum is again the conditional probability for h_j to be one given v, so that this turns into

\langle v_i h_j \rangle_P = \sum_v v_i P(v) \sigma(\beta a_j) = \langle v_i \sigma(\beta a_j) \rangle_{P(v)}

We therefore finally obtain the following simplified update rule.

\Delta W_{ij} = \lambda \beta \left[ \langle v_i \sigma(\beta a_j) \rangle_{\mathcal D} - \langle v_i \sigma(\beta a_j) \rangle_{P(v)} \right]

Thus again, we see that the gradient is composed of two terms, which we call the positive phase and the negative phase. In each phase, we sample the same expression, once over the data distribution and once over the marginal distribution.

How do we actually calculate these terms? The positive phase is easy – we have written this as an expectation value, but it is nothing but an ordinary sum. For each vector in the sample, we calculate the activation of the hidden unit j, apply the multiplication by \beta and the sigmoid function and multiply the result with the value of the visible unit. So this is in fact an easily calculated analytical expression.
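To make this concrete, here is a minimal sketch of the positive phase in plain Python. The function name positive_phase and the list-of-lists representation of W are assumptions made for this example. A real implementation would rather use matrix operations.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def positive_phase(samples, W, c, beta=1.0):
    """Return the matrix of averages <v_i sigma(beta a_j)> over the samples.

    W[i][j] connects visible unit i to hidden unit j, c holds the hidden
    biases and every sample is a list of 0/1 values for the visible units.
    """
    n_visible, n_hidden = len(W), len(c)
    result = [[0.0] * n_hidden for _ in range(n_visible)]
    for v in samples:
        for j in range(n_hidden):
            # Activation of hidden unit j for this sample
            a_j = sum(W[i][j] * v[i] for i in range(n_visible)) + c[j]
            p_j = sigmoid(beta * a_j)
            for i in range(n_visible):
                result[i][j] += v[i] * p_j / len(samples)
    return result
```

For instance, with all weights and biases set to zero, every hidden unit is on with probability 1/2, so the entry for a visible unit is simply half the fraction of samples in which this unit is on.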

Whereas we have found an analytic expression for the positive phase, there is no obvious analytic expression for the negative phase, so we again need a sampling procedure to calculate this term. At this point, the special structure of the network again helps to make the sampling easier. Suppose we wanted to apply an ordinary Gibbs sampler, where instead of choosing the neuron that we update next randomly, we cycle sequentially through all the neurons. We could then do all the hidden neurons first and then continue with the visible units. Now, as the visible units only depend on the hidden units and vice versa, we could as well update all hidden units in parallel and then all visible units in parallel, using that as in the case of hidden units, the conditional probability for a visible unit to be one can be expressed as

P(v_i = 1 | h) = \sigma(\beta (\sum_j W_{ij} h_j + b_i))

This procedure is called Gibbs sampling with block updates. It is also obvious that sampling from the joint distribution P(v,h) in this way and then ignoring the values of the hidden units gives a sampler for the marginal distribution.

Therefore our algorithm to calculate the second term of the update rule would be as follows. We would start with some value for the visible units. Then we would calculate the probability that each hidden unit is on given these values for the visible units and update the hidden units according to this distribution. We would then use the new values for the hidden units, calculate the conditional distribution of the visible units and update the visible units according to this distribution. This would constitute a full Gibbs sampling step. We would repeat this process until convergence is reached and then sample for a few steps to calculate the expectation values above. Plugging this into the update rule and calculating the first term analytically, we would then obtain the needed update for the weights.
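One full Gibbs sampling step with block updates, as just described, could be sketched as follows. This is an illustration in plain Python with made-up helper names, not the implementation used later in this series. W[i][j] again connects visible unit i to hidden unit j.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(v, W, c, beta=1.0):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(beta * (sum(W[i][j] * v[i] for i in range(len(v))) + c[j]))
            for j in range(len(c))]


def visible_probabilities(h, W, b, beta=1.0):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(beta * (sum(W[i][j] * h[j] for j in range(len(h))) + b[i]))
            for i in range(len(b))]


def sample_units(probabilities, rng):
    """Set each unit to one with the given probability."""
    return [1 if rng.random() < p else 0 for p in probabilities]


def gibbs_step(v, W, b, c, rng, beta=1.0):
    """Update all hidden units in parallel, then all visible units."""
    h = sample_units(hidden_probabilities(v, W, c, beta), rng)
    v_new = sample_units(visible_probabilities(h, W, b, beta), rng)
    return v_new, h


# Run a short chain on a tiny network with arbitrary parameters
rng = random.Random(0)
W = [[0.5, -1.0], [0.3, 0.8], [-0.7, 0.2]]
b = [0.1, -0.2, 0.0]
c = [0.4, -0.3]
v = [1, 0, 1]
for _ in range(100):
    v, h = gibbs_step(v, W, b, c, rng)
```

Averaging v_i \sigma(\beta a_j) over the visible states visited after a burn-in period then gives an estimate for the negative phase.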

So it looks like we are back to our old problem – to calculate one weight update during the gradient descent procedure, we have to run a Gibbs sampler to convergence. Fortunately, it turns out that several approximations exist that make this calculation feasible. Next, we will look at two of these approaches – contrastive divergence and its companion persistent contrastive divergence (PCD). We will then implement both algorithms in Python and try them out, first on a small sample set and then finally on the MNIST data set. But this post has already grown a bit lengthy – so let us save this for the next post in this series.


Signing and verifying bitcoin transactions

If you have followed my blockchain posts so far, you know how to create bitcoin transactions by assembling transaction inputs and transaction outputs into a transaction data structure and serializing it. However, there is one subtlety that we have ignored so far – what exactly do you sign?

You cannot sign the entire bitcoin transaction, simply because the signature is part of it! So there is a sort of chicken-or-egg problem. Instead, you need to build a transaction without the signature, serialize it, create a hash value, sign this hash value, then add the signature to the transaction and publish this transaction.

This is essentially how signing actually works. However, there are a few details that we still miss. So let us look at the source code of the reference client once more to make sure that we really understand how this works.

As a starting point, remember that there is a script opcode called OP_CHECKSIG that verifies a signature. So let us look at the corresponding method which we find in script/interpreter.cpp. When this opcode is executed, the function EvalScript delegates to the method TransactionSignatureChecker::CheckSig. This method removes the last byte of the signature script (which is the hash type) and then calls the function SignatureHash. This function receives the following parameters (you might have to trace back a bit starting at EvalScript to verify this):

  • the script code of the public key script
  • the transaction that is to be signed
  • the index of the transaction input within that transaction to which the signature refers
  • the hash type
  • the amount
  • a version number which is zero in our case and indicates whether the segregated witness feature is in effect
  • a data structure that serves as a cache

We have not yet talked about the hash type. Essentially the hash type controls which parts of a transaction are included in the signature and therefore protected against being changed. The most common case is SIGHASH_ALL = 1, and this is the value for this field that we assume in this post (if you understand this case, it will not be difficult to figure out the other cases as well).

In any case, the function SignatureHash is what we are searching for. It serializes the transaction and creates a hash value which is then passed to the cryptographic subroutines to actually verify the signature. This function again delegates the actual work to an instance of the helper class CTransactionSignatureSerializer, more precisely to its method Serialize. This method is in fact not very surprising.

void Serialize(S &s) const {
    // Serialize nVersion
    ::Serialize(s, txTo.nVersion);
    // Serialize vin
    unsigned int nInputs = fAnyoneCanPay ? 1 : txTo.vin.size();
    ::WriteCompactSize(s, nInputs);
    for (unsigned int nInput = 0; nInput < nInputs; nInput++)
         SerializeInput(s, nInput);
    // Serialize vout
    unsigned int nOutputs = fHashNone ? 0 : (fHashSingle ? nIn+1 : txTo.vout.size());
    ::WriteCompactSize(s, nOutputs);
    for (unsigned int nOutput = 0; nOutput < nOutputs; nOutput++)
         SerializeOutput(s, nOutput);
    // Serialize nLockTime
    ::Serialize(s, txTo.nLockTime);
}

Ignoring the flag fAnyoneCanPay, this method behaves similarly to ordinary serialization – first the version number is written, then the number of inputs, followed by the actual inputs, then the number of outputs, the actual outputs and finally the lock time.

The actual difference to an ordinary transaction serialization is hidden in the method SerializeInput.

void SerializeInput(S &s, unsigned int nInput) const {
    // In case of SIGHASH_ANYONECANPAY, only the input being signed is serialized
    if (fAnyoneCanPay)
        nInput = nIn;
    // Serialize the prevout
    ::Serialize(s, txTo.vin[nInput].prevout);
    // Serialize the script
    if (nInput != nIn)
        // Blank out other inputs' signatures
        ::Serialize(s, CScript());
    else
        SerializeScriptCode(s);
    // Serialize the nSequence
    if (nInput != nIn && (fHashSingle || fHashNone))
        // let the others update at will
        ::Serialize(s, (int)0);
    else
        ::Serialize(s, txTo.vin[nInput].nSequence);
}

Again, this starts out as usual by serializing the reference to the spent transaction output (transaction ID and index). However, we soon find the first difference – if the index of the transaction input to be signed differs from the input that is currently serialized, an empty script is serialized instead of the signature script – which translates into a single byte 0x0 which is the length of the empty script.
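To see why the empty script turns into a single zero byte, recall that a script is serialized as its length in the compact size encoding, followed by the script bytes. Here is a small sketch of this encoding in Python. The function names are chosen for this example and are not taken from my btc package.

```python
import struct


def compact_size(n):
    """Encode a non-negative integer in the compact size format."""
    if n < 0xfd:
        return struct.pack("<B", n)
    if n <= 0xffff:
        return b"\xfd" + struct.pack("<H", n)
    if n <= 0xffffffff:
        return b"\xfe" + struct.pack("<I", n)
    return b"\xff" + struct.pack("<Q", n)


def serialize_script(script):
    """Serialize a script as its length followed by its raw bytes."""
    return compact_size(len(script)) + script


# The empty script consists of the length byte only
print(serialize_script(b"").hex())
# 00
```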

The second difference is located in the method SerializeScriptCode. Here, instead of serializing the not yet existing signature, the content of the variable scriptCode is serialized. Going back a bit, we see that this is the public key script of the corresponding transaction output! So basically, this script takes the position of the signature script in this serialization step.
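To make these two substitutions a bit more tangible, here is a minimal Python sketch for the simplest case (all inputs are serialized, no SIGHASH_ANYONECANPAY). The class TxIn and the function names are hypothetical and are neither taken from the reference client nor from the btc package; only the byte layout of an input (previous transaction ID, index, length-prefixed script, sequence number) follows the usual serialization format.

```python
import struct
from dataclasses import dataclass


@dataclass
class TxIn:
    # Hypothetical minimal model of a transaction input
    prev_txid: bytes          # 32 bytes, in the order used in the serialized transaction
    prev_index: int           # index of the spent output
    script_sig: bytes         # signature script (ignored when signing)
    sequence: int = 0xFFFFFFFF


def compact_size(n: int) -> bytes:
    # Variable-length integer; this sketch only covers values up to 0xFFFF
    if n < 0xFD:
        return bytes([n])
    if n <= 0xFFFF:
        return b"\xfd" + struct.pack("<H", n)
    raise ValueError("larger sizes are not handled in this sketch")


def serialize_input_for_signing(txin: TxIn, is_signed_input: bool, script_code: bytes) -> bytes:
    # The input being signed carries scriptCode, every other input an empty script
    script = script_code if is_signed_input else b""
    return (txin.prev_txid
            + struct.pack("<I", txin.prev_index)
            + compact_size(len(script)) + script
            + struct.pack("<I", txin.sequence))


# Two dummy inputs and a dummy P2PKH-style public key script (all values are placeholders)
inputs = [TxIn(b"\x11" * 32, 0, b"\xaa" * 70), TxIn(b"\x22" * 32, 1, b"\xbb" * 70)]
script_code = bytes.fromhex("76a914") + b"\x00" * 20 + bytes.fromhex("88ac")

signed = serialize_input_for_signing(inputs[0], True, script_code)
other = serialize_input_for_signing(inputs[1], False, script_code)
print(len(signed), len(other))
```

Note that the existing signature scripts of both inputs do not show up in the result at all, which is exactly the point of the exercise.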

But we are not yet done – let us return to the function SignatureHash and see what it does close to the end.

CTransactionSignatureSerializer txTmp(txTo, scriptCode, nIn, nHashType);

// Serialize and hash
CHashWriter ss(SER_GETHASH, 0);
ss << txTmp << nHashType;
return ss.GetHash();

Using the operator <<, we see that the method Serialize that we have just studied is invoked. However, we see that in addition, the hash type – which was stripped off at the beginning – is now added again at the end. All this is routed into a CHashWriter which is defined in hash.h and is essentially a wrapper around a 256 bit hashing routine (SHA-256 applied twice). This hash is then returned to TransactionSignatureChecker::CheckSig which then tries to verify the signature with that message.
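In Python, this last step could be sketched as follows. The function names are my own and not part of any library; the sketch assumes that the modified transaction has already been serialized into a byte string and that the hash type is appended as a four byte little endian integer before the double SHA-256 hash is taken.

```python
import hashlib
import struct

SIGHASH_ALL = 1


def double_sha256(data: bytes) -> bytes:
    # SHA-256 applied twice
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def signature_hash(modified_tx: bytes, hash_type: int = SIGHASH_ALL) -> bytes:
    # Append the hash type as a 4-byte little-endian integer, then hash twice
    return double_sha256(modified_tx + struct.pack("<I", hash_type))


# Placeholder bytes standing in for a serialized, modified transaction
digest = signature_hash(b"serialized transaction goes here")
print(digest.hex())
```

The 32 bytes returned are what is then interpreted as an integer and fed into the ECDSA verification.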

After that tour through the reference client, let us try to verify the signature of a real transaction from the bitcoin network. In the library btc/, I have implemented a function signatureHash which does – for the special case discussed – the same as the function that we have just looked at. Using that function, it is now not difficult to verify a signature, given a transaction input and the corresponding spent output. Here is what we need to do in an ipython session, again assuming that you have downloaded my btc package.

First, we get our sample transaction that we have already used before and the transaction output spent by the first transaction input.

In [1]: import btc.utils
In [2]: import btc.txn
In [3]: raw = btc.utils.getRawTransaction("ed70b8c66a4b064cfe992a097b3406fa81ff09641fe55a709e4266167ef47891")
In [4]: txn = btc.txn.txn()
In [5]: txn.deserialize(raw)
In [6]: txin = txn.getInputs()[0]
In [7]: raw = btc.utils.getRawTransaction(txin.prevTxid)
In [8]: prev = btc.txn.txn()
In [9]: prev.deserialize(raw)
In [10]: spentTxo = prev.getOutputs()[txin.getVout()]

Next, we determine the signature hash and interpret it as a big endian encoded integer.

In [11]: h = int.from_bytes(btc.script.signatureHash(txn, 0, spentTxo), "big")

Next we need to determine the other ingredients of the signature – the r and s values.

In [12]: r = txin.getScriptSig().getSignatureR()
In [13]: s = txin.getScriptSig().getSignatureS()

Now we need the ECDSA cryptography library. We get the SECP256K1 curve and its parameters from there and determine the point on the curve that corresponds to the public key.

In [14]: import ecdsa
In [15]: curve = ecdsa.curves.SECP256k1
In [16]: G = curve.generator
In [17]: p = curve.curve.p()
In [18]: txin.getScriptSig().getScriptType()
Out[18]: 'P2PKH'
In [19]: pubKey = txin.getScriptSig().getPubKeyHex()
In [20]: pubKey
Out[20]: '038c2d1cbe4d731c69e67d16c52682e01cb70b046ead63e90bf793f52f541dafbd'

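We see that the first byte of the public key is 0x03. If you remember what we have learned about the public key encoding, this means that y is odd. We can therefore determine the x and y coordinates as follows: x is read from the remaining 32 bytes, and y is a square root of x^3 + 7 modulo p. As the root returned by the library happens to be even here, we replace it by p - y to obtain the odd root.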

In [21]: x = int.from_bytes(bytes.fromhex(pubKey[2:66]), 'big')
In [22]: y = ecdsa.numbertheory.square_root_mod_prime((x**3 +7) % p , p)
In [23]: y
Out[23]: 91238655223056717466178782248030327689649326505413694215614940048183357985838
In [24]: y = (p - y) % p
In [25]: y
Out[25]: 24553434014259477957392202760657580163620658160226869823842643959725476685825
In [26]: assert(curve.curve.contains_point(x, y))
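If you prefer to see these steps in one place and without the ecdsa package, the following sketch does the same with plain Python integers. The function name decompress_pubkey is made up for this example. The sketch relies on the fact that the prime p of SECP256K1 leaves a remainder of 3 when divided by 4, so that a square root can be computed with a single modular exponentiation.

```python
# secp256k1 field prime; the curve equation is y^2 = x^3 + 7 (mod P)
P = 2**256 - 2**32 - 977


def decompress_pubkey(pubkey_hex):
    raw = bytes.fromhex(pubkey_hex)
    if len(raw) != 33 or raw[0] not in (2, 3):
        raise ValueError("not a compressed public key")
    x = int.from_bytes(raw[1:], "big")
    rhs = (pow(x, 3, P) + 7) % P
    # P % 4 == 3, so a square root (if one exists) is rhs^((P + 1) / 4)
    y = pow(rhs, (P + 1) // 4, P)
    if (y * y) % P != rhs:
        raise ValueError("x is not the coordinate of a curve point")
    # Prefix 0x02 means even y, prefix 0x03 means odd y
    if y % 2 != raw[0] % 2:
        y = P - y
    return x, y


x, y = decompress_pubkey("038c2d1cbe4d731c69e67d16c52682e01cb70b046ead63e90bf793f52f541dafbd")
print(x, y)
```

The value printed for y can be compared with the output of the ipython session above.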

Now we are almost done.

In [27]: Q = ecdsa.ellipticcurve.Point(curve.curve, x, y)
In [28]: pKey = ecdsa.ecdsa.Public_key(G, Q)
In [29]: signature = ecdsa.ecdsa.Signature(r,s)
In [30]: pKey.verifies(h, signature)
Out[30]: True

So our signature is valid, as expected – after all this is a real transaction from the bitcoin network. The full source code for this exercise is also contained in btc/ in the function verifySignature.

Of course we could now reverse this process – given a transaction, we could use the function signatureHash to create a hash value which we then sign using the ECDSA library. This puts us in a position to actually create a transaction that we can then push into the bitcoin network. Now, this is clearly something you do not want to try in the real bitcoin network. Therefore I will show you in my next post how you can set up a small test network at home to be able to play with this without the risk of losing actual bitcoins.

Bitcoin security and the Mt Gox incident

In February 2014, the bitcoin exchange Mt Gox – at that time one of the largest trading platforms in the market – had to suspend withdrawals completely, apparently in an attempt to recover from a massive attack. A few days later, the company filed for bankruptcy, claiming that it had lost 850,000 BTC, at that time worth roughly 500 million USD. Earlier press releases issued by Mt Gox indicated that a weakness of the bitcoin protocol called transaction malleability had been leveraged by an attacker to steal the bitcoins – a claim that could not be confirmed by an independent analysis carried out by two researchers from the ETH Zürich a few weeks later. Still, that research also showed that this weakness was actively exploited, even though not at the scale claimed by Mt Gox. The purpose of this post is to explain this weakness in a bit more detail and show how it is related to the scripting language used in the bitcoin protocol.

We have seen that signatures and public keys in a bitcoin transaction come as parts of scripts, and that verifying a signature is equivalent to executing a script built out of a transaction input and the corresponding transaction output. Theoretically, this approach is extremely flexible – we could use any valid script and obtain a valid bitcoin transaction.

However, in practice, the bitcoin reference implementation has a few built-in checks that restrict the set of scripts that are accepted. The scripting language itself is also limited; there are, for instance, no loops. There are several reasons for this.

The first reason is the possibility of a denial-of-service attack on a bitcoin node. Suppose the scripting language were powerful enough to build a program that never terminates. When a transaction containing such a script is validated by a node, it starts the scripting engine to validate the transaction and gets stuck in an infinite loop, thus blocking at least one thread forever. Even worse, a very powerful and complicated script could leverage errors in the implementation of the scripting engine, so that malicious code could be injected into a bitcoin node as part of a transaction. Thus, from a security point of view, it makes sense to place some limitations on the set of allowed scripts.

However, there is another, more subtle reason which is a consequence of the fact that, if a bitcoin transaction is signed, the signature script is not part of the data to which the signature is actually applied, as we will see in a later post. To illustrate why this could be a problem, suppose Alice owes Bob 50 bitcoins and makes a corresponding payment. So she would assemble a transaction – let me call that A – and sign it with her private key. She would then publish that transaction in the network. For bookkeeping purposes, her wallet would probably remember the ID of the transaction. By convention, the ID of a bitcoin transaction is simply its hash value.

Bob could now try the following attack. He could obtain the transaction from the network before it is added to a block, change its signature script by adding some code to the script that essentially does nothing, and publish that transaction as transaction B, as indicated below.


If the new, updated signature script is functionally equivalent to the script contained in A, both transactions would be valid. Moreover, both refer to the same unspent transaction output. However, as they have different signature scripts, they have different hash values and therefore different transaction IDs.
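The effect on the transaction ID is easy to reproduce with a toy model. In the sketch below, toy_serialize is a hypothetical, heavily simplified stand-in for a real transaction serialization with one input and no outputs, and the signature and public key are dummy bytes. Instead of adding code that does nothing, the example uses a closely related trick: the first data push in the signature script is encoded once as a direct push and once via OP_PUSHDATA1 (0x4c), which puts the same data on the stack.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def txid(serialized_tx: bytes) -> str:
    # Transaction IDs are conventionally displayed in reversed byte order
    return double_sha256(serialized_tx)[::-1].hex()


def toy_serialize(script_sig: bytes) -> bytes:
    # Hypothetical layout: version, one input, no outputs, lock time.
    # The script length is written as one byte, so scripts must be shorter than 253 bytes.
    prefix = b"\x01\x00\x00\x00" + b"\x01" + b"\x11" * 32 + b"\x00\x00\x00\x00"
    suffix = b"\xff\xff\xff\xff" + b"\x00" + b"\x00\x00\x00\x00"
    return prefix + bytes([len(script_sig)]) + script_sig + suffix


signature = b"\xaa" * 71   # dummy bytes, not a real signature
pubkey = b"\xbb" * 33      # dummy bytes, not a real public key

# Script of transaction A: direct push of 71 bytes, direct push of 33 bytes
script_a = bytes([71]) + signature + bytes([33]) + pubkey
# Script of transaction B: same data, but the first push uses OP_PUSHDATA1
script_b = b"\x4c" + bytes([71]) + signature + bytes([33]) + pubkey

id_a = txid(toy_serialize(script_a))
id_b = txid(toy_serialize(script_b))
print(id_a)
print(id_b)
```

Both scripts leave the same two items on the stack, but the two IDs printed have nothing in common.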

What Bob is hoping for is that the manipulated transaction B created by him is picked up by a miner first and added to a block. Bob would then obtain the bitcoin payment. However, Alice's transaction A would be rejected as a double spend! Therefore Bob could now approach Alice and claim that he never received the payment. If Alice checks the status of the transaction ID that she knows – that of transaction A – she would in fact learn that the transaction has been rejected by the network. In the worst case, she would feel obliged to initiate another payment, and Bob would receive the amount twice.

A similar situation could arise if Bob simply alters the signature created by Alice by replacing s by -s, taken modulo the order n of the curve, i.e. by n - s. This would still be a valid signature, but it also changes the hash value and therefore the transaction ID.
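That the modified signature is still accepted can be checked with a few lines of textbook ECDSA on SECP256K1. The following sketch is deliberately self-contained and uses only plain Python integers (Python 3.8 or later for the modular inverse via pow); the private key, the nonce k and the message are arbitrary toy values, and the code is meant for illustration only, not for use with real keys.

```python
import hashlib

# secp256k1 domain parameters
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)


def add(a, b):
    # Affine point addition; None represents the point at infinity
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1, -1, P) % P
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P


def mul(k, point):
    # Double-and-add scalar multiplication
    result = None
    while k:
        if k & 1:
            result = add(result, point)
        point = add(point, point)
        k >>= 1
    return result


def sign(h, d, k):
    # Textbook ECDSA with an explicit nonce k (never fix or reuse k in real code)
    r = mul(k, G)[0] % N
    s = pow(k, -1, N) * (h + r * d) % N
    return r, s


def verify(h, r, s, Q):
    if not (0 < r < N and 0 < s < N):
        return False
    w = pow(s, -1, N)
    point = add(mul(h * w % N, G), mul(r * w % N, Q))
    return point is not None and point[0] % N == r


d = 0x1234567890ABCDEF                      # toy private key
Q = mul(d, G)                               # corresponding public key
h = int.from_bytes(hashlib.sha256(b"pay Bob 50 BTC").digest(), "big")
r, s = sign(h, d, k=0xCAFEBABE)

print(verify(h, r, s, Q))       # original signature
print(verify(h, r, N - s, Q))   # signature with s replaced by n - s
```

Both checks succeed, because replacing s by n - s only flips the sign of the point computed during verification, which leaves its x coordinate unchanged.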

This is the weakness that has been termed transaction malleability – the ability to modify a transaction without having to sign it again.

Can that be exploited to actually steal bitcoins? Well, maybe. Suppose that Alice is in fact a bitcoin exchange like Mt Gox, managing a large amount of bitcoins for different users. Suppose further that this exchange has implemented a wallet that automatically re-issues a transaction if the first transaction is rejected by the network. In this case, a malicious attacker like Bob could request withdrawals from the exchange, i.e. transfer to his own bitcoin address, and could then create and post modified transactions for each of the resulting transactions generated by the exchange. Even if only a certain percentage of the modified transactions are actually included in the blockchain, this could in fact lead to multiple payments triggered in favor of Bob, so that Bob can actually steal a significant amount of bitcoins from the exchange.

With BIP 62 and BIP 66, certain standards for scripts and the encoding of the signature are enforced to avoid these issues. With the segregated witness feature (BIP 141), additional protection is implemented as the signature is no longer part of the data from which the transaction ID is computed. This at least addresses the currently known sources of transaction malleability. However, as always, this does not guarantee that similar flaws will not be detected and actively exploited in the future. The lesson is obvious – like any other complex distributed system, the blockchain is not immune to this kind of problem and never will be – developers and users should be aware of this.