Kubernetes 101 – creating pods and deployments

In the last posts, we have seen how we can set up a Kubernetes cluster on Amazon's EKS platform and spin up our first nodes. Today, we will create our first workloads and see pods and deployments in action.

Creating pods

We have already introduced pods in an earlier post as the smallest units that Kubernetes manages. To create pods, there are several options. First, we can use the kubectl command line tool and simply pass the description of the pod as arguments. Second, we can use a so-called manifest file which contains the specification of the pod. This has the obvious advantage that the file can be reused, put under version control, developed and tested, following the ideas of the “Infrastructure as code” approach. A manifest file can be provided either in JSON format or in YAML. And of course, we can again use the Kubernetes API directly, for instance by programming against the Python API.

In this post, we will demonstrate how to create pods using a manifest file in YAML format and using Python. We start with a YAML file which looks as follows.

apiVersion: v1
kind: Pod
metadata:
  name: naked-pod-demo
  namespace: default
spec:
  containers:
  - name: naked-pod-demo-ctr
    image: nginx

Let us go through this file step by step to understand what it means. First, this is a YAML file, and YAML is essentially a format for specifying key-value pairs. Our first key is apiVersion, with value v1. This is the first line in all manifest files and simply specifies the API version that we will use.

The next key – kind – specifies the type of Kubernetes resource that we want to access, in this case a pod. The next key is metadata. The value of this key is a dictionary that again has two keys – name and namespace. The name is the name that our pod will have. The namespace key specifies the namespace in which the pod will start (namespaces are a way to segment Kubernetes resources and are a topic for a future post, for the time being we will simply use the so-called default namespace).

The next key – spec – now contains the actual specification of the pod. This is where we tell Kubernetes what our pod should be running. Remember that a pod can contain one or more application containers. Therefore there is a key containers which, as its value, has a list of containers. Each container has a name and further attributes.

In our case, we create one container and specify that the image should be nginx. At this point, we can specify every image that we could also specify if we did a docker run locally. In this case, Kubernetes will pull the nginx image from the default docker registry and run it.
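
As an aside, the same pod could also be created directly from Python with the official Kubernetes client – here is a minimal sketch, assuming that the kubernetes package is installed and that your kubeconfig points to the cluster (we will come back to the Python API in more detail towards the end of this post).

from kubernetes import client, config

# Load the same configuration that kubectl uses (typically ~/.kube/config)
config.load_kube_config()

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="naked-pod-demo", namespace="default"),
    spec=client.V1PodSpec(
        containers=[client.V1Container(name="naked-pod-demo-ctr", image="nginx")]))

# Submit the pod to the API server - roughly equivalent to the kubectl apply command shown below
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)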

So let us now trigger the creation of a pod by passing this specification to kubectl. To do this, we use the following command (assuming that you have saved the file as naked_pod.yaml, the reason for this naming will become clear later).

kubectl apply -f naked_pod.yaml

After a few seconds, Kubernetes should have had enough time to start up the container, so let us use kubectl to check whether the pod has been created.

$ kubectl get pods 
NAME             READY   STATUS    RESTARTS   AGE
naked-pod-demo   1/1     Running   0          38s

Nice. Let us take a closer look at this pod. If you run kubectl get pods -o wide, kubectl will give you a bit more information on the pod, including the name of the node on which the pod is running. So let us SSH into this node. To easily log into one of your nodes (the first node in this case, change Node to 1 to log into the second node), you can use the following sequence of commands. This will open the SSH port of all instances in the cluster for incoming connections from your current IP address. Note that we use the cluster name as a filter, so you might need to replace myCluster in the commands below by the name of your cluster.

$ Node=0
$ IP=$(aws ec2 describe-instances --output text --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --query Reservations[$Node].Instances[0].PublicIpAddress)
$ SEC_GROUP_ID=$(aws ec2 describe-instances  --output text --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --query Reservations[0].Instances[0].SecurityGroups[0].GroupId)
$ myIP=$(wget -q -O- https://ipecho.net/plain)
$ aws ec2 authorize-security-group-ingress --group-id $SEC_GROUP_ID --port 22 --protocol tcp --cidr $myIP/32
$ ssh -i ~/eksNodeKey.pem ec2-user@$IP

Use this – or your favorite method (if you use my script up.sh to start the cluster, it will already open ssh connections to both nodes for you) – to log into the node on which the pod has been scheduled by Kubernetes and execute a docker ps there. You should see that, among some management containers that Kubernetes brings up, a container running nginx appears.

Now let us try a few things. First, apply the YAML file once more. This will not bring up a second pod. Instead, Kubernetes realizes that a pod with this name is already running and tells you that no change has been applied. This makes sense – in general, a manifest file specifies a target state, and we are already in the target state, so there is no need for action.

Now, on the EKS worker node on which the pod is running, use docker kill to actually stop the container. Then wait for a few seconds and do a docker ps again. Surprisingly, you will see the container again, but with a different container ID. What happened?

The answer is hidden in the logfiles of the component of Kubernetes that controls a single node – the kubelet. On the node, the kubelet is started as a systemd service (you might want to verify this in the AWS-provided bootstrap.sh script). So to access its logs, we need

$ journalctl -u kubelet

Close to the end of the logfile, you should see a line stating that … container … is dead, but RestartPolicy says that we should restart it. In fact, the kubelet constantly monitors the running containers and looks for deviations from the target state. The restart policy is a part of the pod specification that tells the kubelet how to handle the case that a container has died. Letting a container stay down might be perfectly fine (for batch jobs), but the default restart policy is Always, so the kubelet will restart our container.
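
If you want to double-check which restart policy applies to our pod, you can read the pod back from the API server – here a small sketch using the Python client, assuming the kubernetes package is installed and configured (kubectl get pod naked-pod-demo -o yaml shows the same information).

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Read the pod object back from the API server and print its restart policy
pod = v1.read_namespaced_pod(name="naked-pod-demo", namespace="default")
print(pod.spec.restart_policy)    # "Always" unless specified otherwise in the manifest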

This is nice, but now let us simulate a more drastic event – the node dies. So on the AWS console, locate the node on which the pod is running and terminate this instance.

After a few seconds, you will see that a new node is created, thanks to the AWS auto-scaling group that controls our nodes. However, if you check the status of the pods using kubectl get pods, you will see that Kubernetes did not reschedule the pod on a different node, nor does it restart the pod once the replacement node is up and running again. So Kubernetes takes care (via the kubelet) of the containers that belong to a pod on a per-node level, but not on a per-cluster level.

In production, this is of course typically not what you want – instead, it would be nice if Kubernetes could monitor the pods for you and restart them automatically in case of failure. This is what a deployment does.

Replica sets and deployments

A deployment is an example of a more general concept in Kubernetes – controllers. Basically, a controller is a component that constantly monitors the cluster state and makes changes if needed to get back into the target state. One thing that a deployment does is to automatically bring up a certain number of instances – called replicas – of a Docker image and distribute the resulting pods across the cluster. Once the pods are running, the deployment controller will monitor them and take action if the number of running pods deviates from the desired number of pods. As an example, let us consider the following manifest file.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alpine
spec:
  selector:
    matchLabels:
      app: alpine
  replicas: 2
  template:
    metadata:
      labels:
        app: alpine
    spec:
      containers:
        - name: alpine-ctr
          image: httpd:alpine

The first line specifies the API version as usual, though this time with a different value – the prefix apps is needed because the Deployment API is part of the API group apps. The second line designates the object that we want to create as a deployment. We then again have a name which is part of the metadata, followed by the actual specification of the deployment.

The first part of this specification is the selector. To understand its role, recall that a deployment is supposed to make sure that a specified number of pods of a given type are running. Now, the term “of a given type” has to be made precise, and this is what the selector is used for. All pods which are created by our deployment will have a label with key “app” and value “alpine”. The selector field specifies that all pods with that label are to be considered as pods controlled by the deployment, and our controller will make sure that there are always exactly two pods with this label.

The second part of the specification is the template that the deployment uses to create the pods controlled by this deployment. This looks very much like the definition of a pod as we have seen it earlier. However, the label that is specified here of course needs to match the label in the selector field (kubectl will actually warn you if this is not the case).

Let us now delete our existing pod, run this deployment and see what is happening. Save the manifest file as deployment.yaml, apply it and then list the running pods.

$ kubectl delete -f naked_pod.yaml
$ kubectl apply -f deployment.yaml
deployment.apps/alpine created
$ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
alpine-8469fc798f-rq92t   1/1     Running   0          21s
alpine-8469fc798f-wmsqc   1/1     Running   0          21s

So two pods have been created and scheduled to nodes of our cluster. To inspect the container further, let us ask kubectl to provide all details as JSON and use the wonderful jq to process the output and extract the container information from it (you might need to install jq to run this).

$ kubectl get pods --output json | jq ".items[0].spec.containers[0].image"
"httpd:alpine"

So we see that the pods and the containers have been created according to our specification and run the httpd:alpine docker image.
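
We can also verify that these are exactly the pods matched by the selector by filtering on the label – a sketch using the Python client (kubectl get pods -l app=alpine gives the same result).

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# List all pods in the default namespace carrying the label app=alpine,
# i.e. exactly the pods that our deployment considers as its own
pods = v1.list_namespaced_pod(namespace="default", label_selector="app=alpine")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase, pod.spec.node_name)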

Let us now repeat our experiment from before and terminate one of the nodes. To do this, we first extract the instance ID of the first instance using the AWS CLI and then use the AWS CLI once more to terminate this instance.

$ instanceId=$(aws ec2 describe-instances --output text --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --query Reservations[0].Instances[0].InstanceId)
$ aws ec2 terminate-instances --instance-ids $instanceId

After some time, Kubernetes will find that the instance has died, and if you run a kubectl get nodes, you will find that the node has disappeared. If you now run kubectl get pods -o wide, however, you will still find two pods, but now both are running on the remaining node. So the deployment controller has replaced the pod that went down with the node by a new pod on the remaining node. This behaviour is in fact not realized by the deployment itself, but by the underlying ReplicaSet. We could also create a replica set directly, but a deployment offers some additional features, like the possibility to run a rolling upgrade automatically.
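
As a side remark, the desired number of replicas is itself just part of the target state, so we can scale the deployment at any time by patching its specification. Here is a sketch using the Python client; kubectl scale deployment alpine --replicas=3 achieves the same.

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Raise the desired number of replicas from two to three - the deployment
# controller will detect the deviation and create one additional pod
apps.patch_namespaced_deployment(
    name="alpine",
    namespace="default",
    body={"spec": {"replicas": 3}})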

Creating deployments in Python

Having discussed how to create a deployment from a YAML file, let us again do the same thing in Python. The easiest approach is to walk our way upwards through the YAML file. After the usual preparations (imports, loading the configuration), we therefore start with the specification of the container.

from kubernetes import client
container = client.V1Container(
              name="alpine-ctr",
              image="httpd:alpine")

This is similar to our YAML file – in each pod, we want to run a container with the httpd:alpine image. Having the container object, we can now create the template section.

template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(
                   labels={"app": "alpine"}),
        spec=client.V1PodSpec(containers=[container]))

Again, note the label that we will use later to select the pods that our deployment controller is supposed to watch. Now we put the template into the specification part of our controller.

selector = client.V1LabelSelector(
              match_labels={"app" : "alpine"})
spec = client.V1DeploymentSpec(
        replicas=2,
        template=template, 
        selector=selector)

And finally, we can create the actual deployment object and ask Kubernetes to apply it – the complete Python script is available for download here.

deployment = client.V1Deployment(
       api_version="apps/v1",
       kind="Deployment",
       metadata=client.V1ObjectMeta(name="alpine"),
       spec=spec)

# We need an instance of the AppsV1Api client to talk to the apps API group
apps = client.AppsV1Api()
apps.create_namespaced_deployment(
              namespace="default", body=deployment)
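
To verify that the controller has actually brought up the desired number of pods, we can read the deployment back from the API server and look at its status – a short, self-contained sketch.

import time
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Give the controller a few seconds to create the pods, then read the
# deployment back and inspect its status
time.sleep(10)
deployment = apps.read_namespaced_deployment(name="alpine", namespace="default")
print("desired replicas:", deployment.spec.replicas)
print("ready replicas:", deployment.status.ready_replicas)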

This completes our post for today. We now know how to create pods and deployments to run Docker images in our cluster. In the next few posts, we will look at networking in Kubernetes – services, load balancers, ingress and all that.

Kubernetes – an overview

Docker containers are nice and offer a very lean and structured way to package and deploy applications. However, if you really want to run large scale containerized applications, you need more – you need to deploy, orchestrate and manage all your containers in a highly customizable and automated way. In other words, you need a container management platform like Kubernetes.

What is a container management platform?

Imagine you have a nice, modern application based on microservices, and you have decided to deploy each microservice into a separate container. Thus your entire application deployment will consist of a relatively large number, maybe more than a hundred, individual containers. Each of these containers needs to be deployed, configured and monitored. You need to make sure that if a container goes down, it is restarted and traffic is diverted to a second instance. You need to distribute your load evenly. You need to figure out which container should run on which virtual or physical machine. When you upgrade or deploy, you need to take dependencies into account. If traffic reaches a certain threshold, you would like to scale, either vertically or horizontally. And so forth…

Thus managing a large number of containers in a production-grade environment requires quite some effort. It would be nice to have a platform that is able to

  • Offer load balancing, so that incoming traffic is automatically routed to several instances of a service
  • Easily set up container-level networks
  • Scale out automatically if load increases
  • Monitor running services and restart them if needed
  • Distribute containers evenly across all nodes in your network
  • Spin up new nodes or shut down nodes depending on the traffic
  • Manage your volumes and storage
  • Offer mechanisms to upgrade your application and the platform and / or rollback upgrades

In other words, you need a container management platform. In this post – and in a few following posts – I will focus on Kubernetes, which is emerging as a de facto standard for container management platforms. Kubernetes is now the underlying platform for industry-grade products like Red Hat OpenShift or Rancher, but can of course be used stand-alone as well. In addition, all major cloud providers like Amazon (EKS), Google (GKE), Azure (AKS) and IBM (IBM Cloud Kubernetes Service) offer Kubernetes as a service on their respective platforms.

Basic Kubernetes concepts

When you first get in touch with the world of Kubernetes, the terminology can be a bit daunting – there are pods, clusters, services, containers, namespaces, deployments, and many other terms that will be thrown at you. Time to explain some of the basic concepts at a very high level – we will get in closer contact with most of them over the next few posts.

The first object we need to understand is a Kubernetes node. Essentially, a node is a compute resource, i.e. some sort of physical or virtual host on which Kubernetes can run things. In a Kubernetes cluster, there are two types of nodes. A master node is a node on which the Kubernetes components themselves are running. A worker node is a node on which application workloads are running.

On each worker node, Kubernetes will start several agents, for instance the kubelet which is the primary agent responsible for controlling a node. In addition to the agents, there are of course the application workloads. One might suspect that these workloads are managed on a container basis, however, this is not quite the case. Instead, Kubernetes groups containers into objects called pods. A pod is the smallest unit handled by Kubernetes. It encompasses one or more containers that share resources like storage volumes and network namespaces and have the same lifecycle. It is quite common to see pods with only one application container, but one might also decide to put, for instance, a microservice and a database into one pod. In this case, when the pod is started, both containers (the container containing the microservice and the container containing the database) are brought up on the same node, can see each other in the network under “localhost” and can access the same volumes. In a certain sense, a pod is therefore a sort of logical host on which our application containers run. Kubernetes assigns IP addresses to pods, and applications running in different pods need to use these IP addresses to communicate via the network.

When a user asks Kubernetes to run a pod, a Kubernetes component called the scheduler identifies a node on which the pod is then brought up. If the pod dies and needs to be restarted, it might be restarted on a different node. The same applies if the node goes down – in this case, Kubernetes will restart the pods running on this node on a different node. There are also other situations in which a pod can be migrated from one node to a different node, for instance for the purpose of scaling. Thus, a pod is a volatile entity, and an application needs to be prepared for this.

[Diagram: Kubernetes basics]

Pods themselves are relatively dumb objects. A large part of the logic to handle pods is located in components called controllers. As the name suggests, these are components that control and manage pods. There are, for instance, special controllers that are able to autoscale the number of pods or to run a certain number of instances of the same pod – called replicas. In general, pods are not started directly by a Kubernetes user; instead, controllers are defined that manage the pod lifecycle.

In Kubernetes, a volume is also defined in the context of a pod. When the pod ceases to exist, so does the volume. However, a volume can of course be backed by persistent storage, in which case the data on the volume will outlive the pod. On AWS, we could for instance mount an EBS volume as a Kubernetes volume, in which case the data will persist even if we delete the pod, or we could use local storage, in which case the data will be lost when either the pod or the host on which the pod is running (an AWS virtual machine) is terminated.

There are many other objects that are important in Kubernetes and that we have not yet discussed – networks, services, policies, secrets and so forth. This post is the first in a series which will cover most of these topics in detail.

In the next post, we will learn how to set up and configure AWS EKS to see some pods in action (I will start with EKS for the time being, but will also cover other platforms like DigitalOcean later on).

Controlling Docker container with Python

In the last few posts on the bitcoin blockchain, I have already extensively used Docker containers to quickly set up test environments. However, it turned out to be a bit tiresome to run the containers, attach to them, execute commands etc. to get into a defined state. Time to learn how this can be automated easily using our beloved Python – thanks to the wonderful Docker Python SDK.

This package uses the Docker REST API and offers an intuitive object model to represent containers, images, networks and so on. The API can be made available via a TCP port in the Docker configuration, but be very careful if you do this – everybody who has access to that port will have full control over your Docker engine. Fortunately, the Python package can also connect via the standard Unix domain socket on the file system, which is not exposed to the outside world.

As always, you need to install the package first using

$ pip install docker

Let us now go through some objects and methods in the API one by one. At the end of the post, I will show you a complete Python notebook that orchestrates the bitcoind container that we have used in our tests in the bitcoin series.

The first object we have to discuss is the client. Essentially, a client encapsulates a connection to the Docker engine. Creating a client with the default configuration is very easy.

import docker
client = docker.client.from_env()

The client object has only very few methods like client.info() or client.version() that return global status information. The more interesting parts of this object are the collections attached to it. Let us start with images, which can be accessed via client.images. To retrieve a specific image, we can use client.images.list(), passing as an argument a name or a filter. For instance, when we know that there is exactly one image called “alice:latest”, we can get a reference to it as follows.

alice = client.images.list("alice:latest")[0]

Other commands, like pull or push, are the equivalents of the corresponding docker CLI commands.
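
For instance, to pull the alpine image that we will use in a second, something along the following lines should work (a sketch, equivalent to docker pull alpine:latest).

import docker

client = docker.from_env()

# Pull the image from the default registry and print the tags of the resulting image object
image = client.images.pull("alpine", tag="latest")
print(image.tags)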

Let us now turn to the client.containers collection. Maybe the most important method that this collection offers is the run method. For instance, to run a container and capture its output, use

output = client.containers.run("alpine", "ls", auto_remove=True)

This will run a container based on the alpine image with “ls” as the command (so ls will be executed inside the container) and return the output as a sequence of bytes. The container will be removed after execution is complete.

By setting detach=True, the container will run in detached mode and the call will return immediately. In this case, the returned object is not the output, but a reference to the created container which you can use later to work with the container. If, for instance, you wanted to start an instance of the alice container, you could do that using

alice = client.containers.run("alice:latest", auto_remove=True, detach=True)

You could then use the returned handle to inspect the logs (alice.logs()), to commit, to exec code in the container similar to the docker exec command (alice.exec_run) and so forth.
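
As a small illustration, here is a sketch that uses a detached httpd:alpine container instead of the alice image – note that in recent versions of the SDK, exec_run returns a tuple consisting of the exit code and the output.

import docker

client = docker.from_env()

# Start a long-running container in detached mode
ctr = client.containers.run("httpd:alpine", detach=True, auto_remove=True)

# Inspect the logs and run a command inside the container, similar to
# docker logs and docker exec on the command line
print(ctr.logs().decode())
exit_code, output = ctr.exec_run("ls /usr/local/apache2")
print(exit_code, output.decode())

# Stop the container again - with auto_remove=True it is removed automatically
ctr.stop()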

To demonstrate the possibilities that you have, let us look at an example. I have put together a notebook that starts two instances (alice, bob) of the bitcoin-alpine image that you hopefully have built when following my series on bitcoin. It then uses the collection client.networks to figure out to which IP address on the bridge network bob is connected. Then, we attach to the alice container and run bitcoin-cli in this container to instruct the bitcoind to connect to the instance running in container bob.

We then use the bitcoin-cli running inside the container alice to move the blockchain into a defined state – we mine a few blocks, import a known private key into the wallet, transfer a certain amount to a defined address and mine a few additional blocks to confirm the transfer.
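
If you prefer not to use the notebook, a condensed sketch of these first steps could look like this – it assumes the bitcoin-alpine image and the RPC credentials user / password from the bitcoin series (adapt these to your image).

import time
import docker

client = docker.client.from_env()

# Start two detached containers from the bitcoin-alpine image built in the bitcoin series
alice = client.containers.run("bitcoin-alpine", name="alice", detach=True, auto_remove=True)
bob = client.containers.run("bitcoin-alpine", name="bob", detach=True, auto_remove=True)
time.sleep(2)    # give the two bitcoind instances a moment to start up

# Use the default bridge network to find the IP address assigned to bob
bridge = client.networks.get("bridge")
bridge.reload()
bob_ip = bridge.attrs["Containers"][bob.id]["IPv4Address"].split("/")[0]

# Ask the bitcoind inside alice to connect to bob
cmd = "bitcoin-cli -regtest --rpcuser=user --rpcpassword=password addnode %s add" % bob_ip
exit_code, output = alice.exec_run(cmd)
print(exit_code, output.decode())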


Make sure to stop all containers again when you are done – it is comparatively easy to produce a large number of stopped containers if you are not careful, especially when you use this for automated tests. I usually run the containers with the --rm flag on the command line or the auto_remove=True flag in Python to make sure that they are removed by the Docker engine automatically when they stop.

Of course nobody would use this just to run a few Docker containers with a defined network setup – there are much better tools like Docker Swarm or other container management solutions for this. However, the advantage of using the Python SDK is that we can interact with the containers, run commands, perform tests etc. And all this can be integrated nicely into integration tests using test fixtures, for instance those provided by pytest. A fixture could bring up the environment, could be defined on module level or test level depending on the context, and can add a finalizer to shut down the environment again after the test has been executed. This allows for a very flexible test setup and offers a wide range of options for automated testing.

This post could only give a very brief introduction to the Python Docker SDK, and we did not discuss pytest and fixtures at all – but I invite you to browse the Docker SDK documentation and the pytest fixtures documentation, and I hope you enjoy playing with this!

Docker internals: networking part II

In this post, we will look in more detail at networking with Docker if communication between a Docker container and either the host or the outside world is involved.

It turns out that in these cases, the Linux Netfilter/iptables facility comes into play. This post is not meant to be an introduction to iptables, and I will assume that the reader is aware of the basics (for more information, I found the tutorial on frozentux and the overview on DigitalOcean very helpful).

Setup and basics

To simplify the setup, we will only use one container in this post. So let us again designate a terminal to be the container terminal and in this terminal, enter

$ docker run --rm -d --name "container1" httpd:alpine
$ docker exec -it container1 "/bin/sh"

You should now see a shell prompt inside the container. When you run netstat -a from that prompt, you should see a listening socket being bound to port 80 within the container.

Now let us print out the iptables configuration on the host system.

$ sudo iptables -S
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT
-N DOCKER
-N DOCKER-ISOLATION
-N DOCKER-USER
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER-ISOLATION -j RETURN
-A DOCKER-USER -j RETURN

Here we see that Docker has added three new chains and a couple of rules. The first chain that Docker has added is the DOCKER chain. In our configuration, that chain is empty; we will see later that this changes once we expose ports to the outside world.

The second chain that we see is the DOCKER-ISOLATION chain. I have not been able to find out much about this chain so far, but it appears that Docker uses this chain to add rules that isolate containers when you do not use the default bridge device but connect your containers to user defined bridges.

Finally, there is the chain DOCKER-USER that Docker adds but otherwise leaves alone, so that an administrator can add firewall rules there with less risk of clashing with the manipulations that Docker performs.

All these chains are empty or just consist of a RETURN statement, so we can safely ignore them for the time being.

Host-to-container traffic

As a first use case, let us now try to understand what happens when an application (curl in our case) in the host namespace wants to talk to the web server running in our container. To be able to better see what is going on, let us add two logging rules to the iptables configuration to log traffic coming in via the docker0 bridge and going out via the docker0 bridge.

$ sudo iptables -A INPUT -i docker0 -j LOG --log-prefix "IN: " --log-level 3
$ sudo iptables -A OUTPUT -o docker0 -j LOG --log-prefix "OUT: " --log-level 3

With these rules in place, let us now create some traffic. In the host terminal, enter

$ curl 172.17.0.2

We can now inspect the content of /var/log/syslog to see what happened. The first two entries should look like this (stripping off host name and time stamps):

OUT: IN= OUT=docker0 SRC=172.17.0.1 DST=172.17.0.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=3460 DF PROTO=TCP SPT=34322 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 
IN: IN=docker0 OUT= PHYSIN=veth376d25c MAC=02:42:25:b7:e5:38:02:42:ac:11:00:02:08:00 SRC=172.17.0.2 DST=172.17.0.1 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=80 DPT=34322 WINDOW=28960 RES=0x00 ACK SYN URGP=0 

So we see that the first logging rule that has been triggered is the rule in the OUTPUT chain. Let us try to understand in detail how this log entry was created.

When curl asks the kernel to establish a connection with 172.17.0.2, i.e. to send a TCP SYN request, a TCP packet will be generated and handed over to the kernel. The kernel will consult its routing table, find the route via docker0 and send the packet to the bridge device.

At this point, the packet leaves the namespace for which this set of iptables rules is responsible, so the OUTPUT chain is traversed and our log entry is created.

What happens next? The packet is picked up by the container namespace, processed and the answer goes back. We can see the answer coming in again, this time triggering the logging in the INPUT rule – this is the second line, the SYN ACK packet.

Apart from our logging rules, no other rules are defined in the INPUT and OUTPUT chains, so the default policies apply to our packets. As both policies are set to ACCEPT, netfilter will allow our packets to pass and the connection works.

Getting out of the container

The story is a bit different if we are trying to talk to a web server on the host or on the LAN from within the container. From the kernel's point of view, we are now dealing with traffic involving more than one interface, and in addition to the INPUT and OUTPUT chains, the FORWARD chain becomes relevant. To be able to inspect this traffic, let us therefore add two logging rules to the FORWARD chain.

$ sudo iptables -I FORWARD -i docker0 -j LOG --log-prefix "IN_FORWARD: " --log-level 3
$ sudo iptables -I FORWARD -o docker0 -j LOG --log-prefix "OUT_FORWARD: " --log-level 3

Now let us generate some traffic. The first thing that I have tried is to reach my NAS on the same network – a Synology diskstation listening on port 5000 at 192.168.178.28. So in the container window, I did a

# telnet 192.168.178.28 5000

and entered some nonsense (it does not matter so much what you enter here, it will most likely result in a “bad request” message, but it generates traffic – do not forget to hit return). This will again produce some logging output, the first two lines being

IN_FORWARD: IN=docker0 OUT=enp4s0 PHYSIN=veth376d25c MAC=02:42:25:b7:e5:38:02:42:ac:11:00:02:08:00 SRC=172.17.0.2 DST=192.168.178.28 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=28280 DF PROTO=TCP SPT=54062 DPT=5000 WINDOW=29200 RES=0x00 SYN URGP=0 
OUT_FORWARD: IN=enp4s0 OUT=docker0 MAC=1c:6f:65:c0:c9:85:00:11:32:77:fe:46:08:00 SRC=192.168.178.28 DST=172.17.0.2 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=0 DF PROTO=TCP SPT=5000 DPT=54062 WINDOW=14480 RES=0x00 ACK SYN URGP=0

Let us again try to understand what happened. An application (telnet in our case) wants to reach the IP address 192.168.178.28. The kernel will first consult the routing table in the namespace of the container and decide to use the default route via the eth0 device. Thus the packet will go to the bridge device docker0. There, it will be picked up by the netfilter chain in the host namespace. As the destination address is not one of the IP addresses of the host, it will be handled by the FORWARD chain, which will trigger our logging rules.

Let us now inspect the other rules in the FORWARD chain once more using iptables -S FORWARD. We see that in addition to the rules pointing to the Docker-generated subchains and to our own logging rules, there are two rules relevant for our case.

$ sudo iptables  -S  FORWARD
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT

The first rule will accept all traffic that comes in at any network device and is routed towards the bridge device if that traffic belongs to an already established connection. This allows the answer to our TCP request to travel back from the network interface connected to the local network (enp4s0 in my case) to the container. However, unsolicited requests, i.e. new connection requests targeted towards the bridge device will be left to the default policy of the FORWARD chain and therefore dropped.

The second rule will allow outgoing traffic – all packets coming from the docker0 bridge device targeted towards any other interface will be accepted and hence forwarded. As there is no filter on the connection state, this allows an application inside the container to establish a new connection to the outside world.

However, I have been cheating a bit and skipped one important point. Suppose our SYN request happily leaves our local network adapter and travels through the LAN. The request comes from within the container, i.e. from the IP address 172.17.0.2. If that IP address still appeared in the IP header, the external server (the diskstation in my case) would try to send the answer back to this address. However, this address is not known in the LAN, only locally on my machine, and the response would get lost.

To avoid this, Docker will in fact add one more rule to the NAT table. Let us try to locate this rule.

$ sudo iptables  -S -t nat 
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N DOCKER
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN

Again, we see that docker is adding a new chain and some rules pointing to this chain. In addition, there is a rule being added to the POSTROUTING chain which is invoked immediately before a packet leaves the host. This is a so called masquerading rule which will replace the source IP address in the IP header of the outgoing packet by the IP address of the device through which the packet is sent. Thus, from the point of view of my brave diskstation, the packet will look as if it originated from the host and will therefore be sent back to the host. When the response comes in, netfilter will revert the process and forward the packet to the correct destination, in our case the bridge device.

Reaching a server from the outside world

This was already fairly complicated, but now let us try to see what happens if we want to connect to the web server running in our container from the outside world.

Now if I simply ssh into my diskstation and run curl there to reach 172.17.0.2, this will of course fail. The diskstation does not have a route to that destination, and the default gateway cannot help either as this is a private class B network. If I replace the IP address with the IP address of the host, it will not work either – in this case, the request reaches the host, but no process is listening on port 80 of the host network address. So we somehow need to map the port from the container into the host networking system.

If you consult the docker documentation on this case, you will learn that in order to do this, you have to run the container with the -p switch. So let us stop and restart our container and apply that option.

$ docker stop container1
$ docker run --rm -d --name "container1" -p 80:80 httpd:alpine
$ docker exec -it container1 "/bin/sh"
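
As a side note, if you are using the Python Docker SDK, the same port mapping can be requested with the ports argument – a sketch, assuming the SDK is installed.

import docker

client = docker.from_env()

# Run the httpd container and publish container port 80 on host port 80,
# equivalent to "docker run --rm -d --name container1 -p 80:80 httpd:alpine"
ctr = client.containers.run(
    "httpd:alpine",
    name="container1",
    detach=True,
    auto_remove=True,
    ports={"80/tcp": 80})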

If we now inspect the chains and see what has changed, we can find the following new rule which has been added to the filter table.

-A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 80 -j ACCEPT

This rule will apply to all incoming traffic that is targeted towards 172.17.0.2:80 and not coming from the bridge device, and accept it. In addition, two new rules have been added to the NAT table.

-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 80 -j MASQUERADE
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 80 -j DNAT --to-destination 172.17.0.2:80

The first rule will again apply a network address translation (SNAT, i.e. manipulating the source address) as we have already seen it before and applies to traffic within the virtual network to which the bridge belongs. The second rule is more interesting. This rule has been added to the DOCKER chain and requests DNAT (i.e. destination NAT, meaning that the target address is replaced) for all packets that are not coming from the bridge device, but have destination port 80. For these packets, the target address is rewritten to be 172.17.0.2:80, so all traffic directed towards port 80 is now forwarded to the container network.

Let us again go through one example step by step. For that purpose, it is useful to add some more logging rules, this time to the NAT table.

$ sudo iptables -t nat -I PREROUTING  -j LOG --log-prefix "PREROUTING: " --log-level 3
$ sudo iptables -t nat -I POSTROUTING  -j LOG --log-prefix "POSTROUTING: " --log-level 3

When we now submit a request from another machine on the local network directed towards 192.168.178.27:80 (i.e. towards the IP address of the host!), we find the following log entries in /var/syslog.

PREROUTING: IN=enp4s0 OUT= MAC=1c:6f:65:c0:c9:85:00:11:32:77:fe:46:08:00 SRC=192.168.178.28 DST=192.168.178.27 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57720 DF PROTO=TCP SPT=54075 DPT=80 WINDOW=14600 RES=0x00 SYN URGP=0 
OUT_FORWARD: IN=enp4s0 OUT=docker0 MAC=1c:6f:65:c0:c9:85:00:11:32:77:fe:46:08:00 SRC=192.168.178.28 DST=172.17.0.2 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=57720 DF PROTO=TCP SPT=54075 DPT=80 WINDOW=14600 RES=0x00 SYN URGP=0 
POSTROUTING: IN= OUT=docker0 SRC=192.168.178.28 DST=172.17.0.2 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=57720 DF PROTO=TCP SPT=54075 DPT=80 WINDOW=14600 RES=0x00 SYN URGP=0

Thus the first packet of the connection arrives (note that NAT rules are not evaluated for subsequent packets of a connection any more) and will first be processed by the PREROUTING chain of the NAT table. As we added our logging rule here, we see the log output. We can also see that at this point, the target address is still 192.168.178.27:80.

The next rule – still in the PREROUTING chain – that is being evaluated is the jump to the DOCKER chain. Here, the DNAT rule kicks in and changes the destination address to 172.17.0.2, the IP address of the container.

Then the kernel routing decision is taken based on the new address and the forwarding mechanism starts. The packet will appear in the FORWARD chain. As the routing has already determined the target interface to be docker0, our OUT_FORWARD logging rule applies and the second log entry is produced, also confirming that the new target address is 172.17.0.2:80. Then, the jump to the DOCKER chain matches, and within that chain, the packet is accepted as its target port is 80.

Finally, the POSTROUTING chain in the NAT table is traversed. This produces our third log file entry. However, the SNAT rule does not apply, as the source address does not belong to the network 172.17.0.2/32 – you can use tcpdump on the bridge device to see that when the packet leaves the device, it still has the source IP address belonging to the diskstation. So again, the configuration works and we can reach an application inside the container from the outside world.

There are many other aspects of networking with Docker that I have not even touched upon – user defined bridges, overlay networks or the famous docker-proxy, just to mention a few of them – but this post is already a bit lengthy, so let us stop here for today. I hope I could provide at least some insight into the internals of networking with Docker – and for me, this was actually a good opportunity to refresh and improve my still very basic knowledge of the Linux networking stack.

Docker internals: networking part I

In this post, we will investigate one of the more complex topics when working with Docker – networking.

We have already seen in the previous post that namespaces can be used to isolate networking resources used by different containers as well as resources used by containers from those used by the host system. However, by nature, networking is not only about isolating but also about connecting – how does this work?

So let us do some tests with the httpd:alpine image. First, get a copy of that image into your local repository:

$ docker pull httpd:alpine

Then open two terminals that we call Container1 and Container2 and attach to them (note that you need to switch to the second terminal after entering the second line).

$ docker run --rm -d  --name="Container1" httpd:alpine
$ docker exec -it Container1 "/bin/sh"

$ docker run --rm -d  --name="Container2" httpd:alpine
$ docker exec -it Container2 "/bin/sh"

Next, make curl available in both containers using apk update followed by apk add curl. Now let us switch to container 1 and inspect the network configuration.

/usr/local/apache2 # ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:02  
          inet addr:172.17.0.2  Bcast:172.17.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1867 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1017 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1831738 (1.7 MiB)  TX bytes:68475 (66.8 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

We see that there is an ethernet interface eth0 with IP address 172.17.0.2. If you do the same in the second container, you should see a similar output with a different IP address, say 172.17.0.3.
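
If you prefer to read this address from the host instead, docker inspect will show it, and the same information is available via the Python Docker SDK – a small sketch, assuming the SDK is installed.

import docker

client = docker.from_env()

# Look up the container by name and print the IP address that Docker has
# assigned to it on the default bridge network
ctr = client.containers.get("Container1")
print(ctr.attrs["NetworkSettings"]["IPAddress"])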

Now let us make a few connection tests. First, go to the terminal running inside container 1 and enter

curl 172.17.0.3

You should now see a short HTML snippet containing the text “It works!”. So apparently we can reach container 2 from container 1 – and similarly, you should be able to reach container 1 from container 2. Next, try the same from a terminal attached to the host – you should be able to reach both containers from there. Finally, if you also have a web server or a similar server running on the host, you will see that you can reach that from within the containers as well. In my case, I have a Tomcat bound to 0.0.0.0:8080 on my local host, and was able to connect using

curl 192.168.178.27:8080

from within the container. How does this work?

To solve the puzzle, go back to a terminal on the host system and take a look at the routing table.

$ ip route show
default via 192.168.178.1 dev enp4s0  proto static  metric 100 
169.254.0.0/16 dev enp4s0  scope link  metric 1000 
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1 
192.168.178.0/24 dev enp4s0  proto kernel  scope link  src 192.168.178.27  metric 100 

We see that Docker has apparently added an additional routing table entry and has created an additional networking device – docker0 – to which all packets with a destination in the class B network 172.17.0.0 are sent.

This device is a so called bridge. A (software) bridge is very similar to an ethernet bridge in hardware. It connects two or more devices – each packet that goes to one device is forwarded to all other devices connected to the bridge. The Linux kernel offers the option to establish a bridge in software which does exactly the same thing.

Let us list all existing bridges using the brctl utility.

$ brctl show
bridge name	bridge id		STP enabled	interfaces
docker0		8000.0242fd67b17b	no		veth1f8d78c
							vetha1692e1

Let us compare this with the output of the ifconfig command (again on the host). The corresponding output is

veth1f8d78c Link encap:Ethernet  HWaddr 56:e5:92:e8:77:1e  
          inet6 addr: fe80::54e5:92ff:fee8:771e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1034 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1964 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:70072 (70.0 KB)  TX bytes:1852757 (1.8 MB)

vetha1692e1 Link encap:Ethernet  HWaddr ca:73:3e:20:36:f7  
          inet6 addr: fe80::c873:3eff:fe20:36f7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1133 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2033 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:78715 (78.7 KB)  TX bytes:1860499 (1.8 MB)

These two devices also appear in the output of brctl show as the two devices connected to the bridge. They are called virtual ethernet devices. They are always created in pairs and act like a pipe: traffic flowing in at one of the two devices appears to come out of the second device and vice versa – like a virtual network cable connecting the two devices.

We just said that virtual ethernet devices are always created in pairs. We have two containers, and if you start them one by one and look at the output of ifconfig, you will see that each of the two containers contributes one device on the host. Can that be correct? After starting the first container we should already see a pair, and after starting the second one we should see four instead of just two devices. So one device in each pair is missing. Where did it go?

The answer is that Docker did create it, but moved it into the namespace of the container, so that it is no longer visible on the host. To verify this, we can trace the connection between the two interfaces as follows. First, enter

ip link

on the host. The output should look similar to the following lines

$ ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp4s0:  mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 1c:6f:65:c0:c9:85 brd ff:ff:ff:ff:ff:ff
3: docker0:  mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 02:42:fd:67:b1:7b brd ff:ff:ff:ff:ff:ff
25: vetha1692e1@if24:  mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default 
    link/ether ca:73:3e:20:36:f7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
27: veth1f8d78c@if26:  mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default 
    link/ether 56:e5:92:e8:77:1e brd ff:ff:ff:ff:ff:ff link-netnsid 1

This will show you (the very first number in each line) the so called ifindex which is a unique identifier for each network device within the kernel. In our case, the virtual ethernet devices visible in the host namespace have the indices 25 and 27. After each name, after the “at” symbol, you see a second number – 24 and 26. Now let us execute the same command in the first container.

/usr/local/apache2 # ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
24: eth0@if25:  mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff

Here suddenly the device with index 24 appears, and we see that it is connected to the device with index 25, which is displayed as vetha1692e1 in the host namespace! Similarly, we find the device with index 26 inside container two:

/usr/local/apache2 # ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
26: eth0@if27:  mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff

This gives us now a rather complete picture of the involved devices. Ignoring the loopback devices for a moment, the following picture emerges.

[Diagram: Docker bridge networking]

Now we can understand what happens when an application in container 1 sends a packet to container 2. First, the kernel will inspect the routing table in container 1. This table looks as follows.

/usr/local/apache2 # ip route
default via 172.17.0.1 dev eth0 
172.17.0.0/16 dev eth0  src 172.17.0.2

So the kernel will determine that the packet should be sent to the interface known as eth0 in container 1 – this is the interface with the unique index 24. As this is part of a virtual ethernet device pair, it appears on the other side of the pair, i.e. at the device vetha1692e1. This device in turn is connected to the bridge docker0. Being a bridge, it will distribute the packet to all other attached devices, so it will reach veth1f8d78c. This is now one endpoint of the second virtual ethernet device pair, and so the packet will finally end up at the interface with the unique index 26, i.e. the interface that is called eth0 in container 2. On this interface, the HTTP daemon is listening, receives the message, prepares an answer and that answer goes the same way back to container 1.

Thus, effectively, it appears from inside the containers as if all container network interfaces would be attached to the same network segment. To complete the picture, we can actually follow the trace of a packet going from container 1 to container 2 using arp and traceroute.

/usr/local/apache2 # arp
? (172.17.0.1) at 02:42:fd:67:b1:7b [ether]  on eth0
? (172.17.0.3) at 02:42:ac:11:00:03 [ether]  on eth0
/usr/local/apache2 # traceroute 172.17.0.3
traceroute to 172.17.0.3 (172.17.0.3), 30 hops max, 46 byte packets
 1  172.17.0.3 (172.17.0.3)  0.016 ms  0.013 ms  0.010 ms

We can now understand how applications in different containers can communicate with each other. However, what we have discussed so far is not yet sufficient to explain how an application on the host can access a HTTP server running inside a container, and what additional setup we need to access a HTTP server running in a container from the LAN. This will be the topic of my next post.


Setting up and testing our bitcoin network

In the last post in my series on bitcoin and the blockchain, we have successfully “dockerized” the bitcoin core reference implementation and have built the bitcoin-alpine docker container that contains a bitcoin server node on top of the Alpine Linux distribution. In this post, we will use this container image to build up and initialize a small network with three nodes.

First, we start a node that will represent the user Alice and manage her wallet.

$ docker run -it --name="alice" -h="alice" bitcoin-alpine

Note that we have started the container with the option -it so that it stays attached to the current terminal and we can see its output. Let us call this terminal terminal 1. We can now open a second terminal and attach to the running server and use the bitcoin CLI to verify that our server is running properly.

$ docker exec -it alice  "sh"
# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password getinfo

If you run the bitcoin CLI inside the container as shown above, but this time with the getbalance RPC call, you will find that at this point, the balance is still zero. This is not surprising – our network does not contain any bitcoins yet. To create bitcoins, we will now bring up a miner node and mine a certain amount of bitcoin. So in a third terminal, enter

$ docker run -it --name="miner" -h="miner" bitcoin-alpine

This will create a second node called miner and start a second instance of the bitcoin server within this network. Now, however, our two nodes are not yet connected. To change this, we can use the addnode command in the bitcoin CLI. So let us figure out how they can reach each other on the network. We have started both containers without any explicit networking options, so they will be connected to the default bridge network that Docker creates on your host. To figure out the IP addresses, run

$ docker network inspect bridge

on your host. This should produce a description of the bridge network in JSON format, including the IP addresses of the attached containers. In my case, the node alice has the IP address 172.17.0.2 and the miner has the IP address 172.17.0.3. So let us change back to terminal number 2 (attached to the container alice) and run

# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password addnode 172.17.0.3 add

In the two terminal windows that are displaying the output of the two bitcoin daemons on the nodes alice and miner, you should now see an output similar to the following.

2018-03-24 14:45:38 receive version message: /Satoshi:0.15.0/: version 70015, blocks=0, us=[::]:0, peer=0

This tells you that the two peers have exchanged a version message and recognized each other. Now let us mine a few bitcoins. First attach a terminal to the miner node – this will be terminal four.

$ docker exec -it miner "sh"

In that shell, enter the following command

# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password generate 101
# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password getbalance

You should now see a balance of 50 bitcoins. So we have mined 50 bitcoins (in fact, we have mined much more, but a mined block requires an additional 100 blocks on top of it before its reward is considered to be part of the balance, and therefore only the 50 bitcoins corresponding to the very first block show up).

This is already nice, but when you execute the RPC method getbalance in terminal two, you will still find that Alice's balance is zero – which is not surprising, as all the bitcoins created so far have been credited to the miner, not to Alice. To change this, we will now transfer 30 bitcoin to Alice. As a first step, we need to figure out what Alice's address is. Each wallet maintains a default account, but we can add as many accounts as we like. So let us now add an account called “Alice” on the alice node. Execute the following command in terminal two, which is attached to the container alice.

# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password getnewaddress "Alice"

The output should be something like mkvAYmgqrEFEsJ9zGBi9Z87gP5rGNAu2mx which is the bitcoin address that the wallet has created. Of course you will get a different address, as the private key behind that address will be chosen randomly by the wallet. Now switch to the terminal attached to the miner node (terminal four). We will now transfer 30 bitcoin from this node to the newly created address of Alice.

# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password sendtoaddress "mkvAYmgqrEFEsJ9zGBi9Z87gP5rGNAu2mx" 30.0

The output will be the transaction ID of the newly created transaction, in my case this was

2bade389ac5b3feabcdd898b8e437c1d464987b6ac1170a06fcb24ecb24986a8

If you now display the balances on both nodes again, you will see that the balance in the default account on the miner node has dropped to something a bit below 20 (not exactly 20, because the wallet has added a fee to the transaction). The balance of Alice, however, is still zero. The reason is that the transaction does not yet count as confirmed. To confirm it, we need to mine six additional blocks. So let us switch to the miners terminal once more and enter

# bitcoin-cli -regtest --rpcuser=user --rpcpassword=password generate 6

If you now display the balance on the node alice, you should find that the balance is now 30.

Congratulations – our setup is complete, we have run our first transaction and now have plenty of bitcoin in a known address that we can use. To preserve this wonderful state of the world, we will now create two new images that contain snapshots of the nodes so that we have a clearly defined state to which we can always return when we run some tests and mess up our test data. So switch back to a terminal on your host and enter the following four commands.

$ docker stop miner
$ docker stop alice
$ docker commit miner miner
$ docker commit alice alice

If you now list the available images using docker images, you should see two new images alice and miner. If you want, you can now remove the stopped container instances using docker rm.
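
As a side note, the same snapshots can also be taken programmatically with the Python Docker SDK – a sketch, assuming the SDK is installed and the containers have been stopped as above.

import docker

client = docker.from_env()

# Create an image from the current state of each stopped container,
# equivalent to "docker commit miner miner" and "docker commit alice alice"
client.containers.get("miner").commit(repository="miner")
client.containers.get("alice").commit(repository="alice")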

Now let us see how we can talk to a running server in Python. For that purpose, let us start a detached container using our brand new alice image.

$ docker run  -d --rm --name=alice -p 18332:18332 alice

We can now communicate with the server using RPC requests. The bitcoin RPC interface follows the JSON RPC standard. Using the powerful Python package requests, it is not difficult to write a short function that submits a request and interprets the result.

import requests

def rpcCall(method, params = None, port=18332, host="localhost", user="user", password="password"):
    #
    # Create request header
    #
    headers = {'content-type': 'application/json'}
    #
    # Build URL from host and port information
    #
    url = "http://" + host + ":" + str(port)
    #
    # Assemble payload as a Python dictionary
    #
    payload = {"method": method, "params": params, "jsonrpc": "2.0", "id": 1}
    #
    # Create and send POST request
    #
    r = requests.post(url, json=payload, headers=headers, auth=(user, password))
    #
    # and interpret result
    #
    json = r.json()
    if 'result' in json and json['result'] is not None:
        return json['result']
    elif 'error' in json:
        raise ConnectionError("Request failed with RPC error", json['error'])
    else:
        raise ConnectionError("Request failed with HTTP status code", r.status_code)

I have added this function to my btc bitcoin library within the module utils. We can use this function in a Python program or an interactive ipython session, similar to the bitcoin CLI client. The following ipython session demonstrates a few calls, assuming that you have cloned the repository.

In [1]: import btc.utils
In [2]: r = btc.utils.rpcCall("listaccounts")
In [3]: r
Out[3]: {'': 0.0, 'Alice': 30.0}
In [4]: r = btc.utils.rpcCall("dumpprivkey", ["mkvAYmgqrEFEsJ9zGBi9Z87gP5rGNAu2mx"])
In [5]: r
Out[5]: 'cQowgjRpUocje98dhJrondLbHNmgJgAFKdUJjCTtd3VeMfWeaHh7'
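
To tie this together with the command line session from the beginning of this post, here is a short sketch that uses the rpcCall function above to mine a few blocks, send some coins to a fresh address and confirm the transaction. This is only an illustration – it assumes that the container built from the alice image is up and running on port 18332, it changes the state of the wallet, and the account label "Bob" is just a made-up example.

import btc.utils

# mine 101 blocks so that the reward for the first of them matures
btc.utils.rpcCall("generate", [101])
# create a fresh address with the (hypothetical) label "Bob" in the wallet
address = btc.utils.rpcCall("getnewaddress", ["Bob"])
# transfer 10 bitcoin to that address and mine six more blocks to confirm
txid = btc.utils.rpcCall("sendtoaddress", [address, 10.0])
btc.utils.rpcCall("generate", [6])
# the account "Bob" should now show up with a balance of 10
print(btc.utils.rpcCall("listaccounts"))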

We are now able to start our test network at any time from the saved docker images and communicate with it using RPC calls in Python. In the next post, we will see how a transaction is created, signed and propagated into the test network.

Docker internals: process isolation with namespaces and cgroups

A couple of years back, when I first looked into Docker in more detail, I put together a few pages on how Docker is utilizing some Linux kernel technologies to realize process isolation. Recently I have been using Docker again, so I thought it would be a good point in time to dig out some of that material and create two or maybe three posts on some Docker internals. Let's get started…

Container versus virtual machines

You probably have seen the image below or a similar image before, but for the sake of completeness let us quickly recap what the main difference between containers (as provided by Docker) and virtual machines is.

DockerVsVirtualMachine

On the left hand side, we see a typical stack when full virtualization is used. The exact setup will depend on the virtualization model that is being used, but in many cases (like running VirtualBox on Linux), you will have the actual hardware, a host operating system like Linux that consists of the OS kernel and on top of that a file system, libraries, configuration files etc. On these layers, the virtual machine is executing as an application. Inside the virtual machine, the guest OS is running. This could again be Linux, but could be a different distribution, a different kernel or even a completely different operating system. Inside each virtual machine, we then again have an operating system kernel, all required libraries and finally the applications.

This is great for many purposes, but also introduces some overhead. If you decide to slice and dice your applications into small units like microservices, your actual applications can be rather small. However, for every application, you still need the overhead of a full operating system. In addition, a full virtualization will typically also consume a few resources on the host system. So full virtualization might not always be the perfect solution.

Enter containers. In a container solution, there is only one kernel – in fact, all containers and the applications running in them use the same kernel, namely the kernel of the host OS. At least logically, however, they all have their own root file system, libraries and so on. Thus containers still have the benefit of a certain isolation, i.e. different applications running in different containers are still isolated on the file system level, can use networking resources like ports and sockets without conflicting and so forth, while reducing the overhead by sharing the kernel. This makes containers a good choice if you can live with the fact that all applications run on one OS and kernel version.

But how exactly does the isolation work? How does a container create the illusion for a process running inside it that it is the exclusive user of the host operating system? It turns out that Docker uses some technologies built into the Linux kernel to do this. Let us take a closer look at those core technologies one by one.

Core technologies

Let us start with namespaces. If you know one or more programming languages, you have probably heard that term before – variables and other objects are assigned to namespaces so that a module can use a variable x without interfering with a variable of the same name in a different module. So namespaces are about isolation, and that is also their role in the container world.

Let us look at an example to understand this. Suppose you want to run a web server inside a container. Most web servers will try to bind to port 80 on startup. Now at any point in time, only one application can listen on that port (with the same IP address). So if you start two containers that both run a web server, you need a mechanism to make sure that both web servers can – inside their respective container – bind to port 80 without a conflict.

A similar issue exists for process IDs. In the “good old days”, a Linux process was uniquely identified by its process ID, so there was exactly one process with ID 1 – usually the init process. ID 1 is a bit special, for instance when it comes to signal handling, and usually different copies of the user space OS running in different containers will all try to start their own init process and might rely on it having process ID one. So again, there is a clash of resources and we need some magic to separate them between containers.

That is exactly what Linux namespaces can do for you. Citing from the man page,

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.

In fact, Linux offers namespaces for different types of resources – networking resources, but also mount points, process trees or the good old System V inter-process communication devices. Individual processes can join a namespace or leave a namespace. If you spawn a new process using the clone system call, you can either ask the kernel to assign the new process to the same namespaces as the parent process, or you can create new namespaces for the child process.

Linux exposes the existing namespaces as symbolic links in the directory /proc/XXXX/ns, where XXXX is the process id of the respective process. Let us try this out. In a terminal, enter (recall that $$ expands to the PID of the current process)

$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 net -> net:[4026531957]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 pid -> pid:[4026531836]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 user -> user:[4026531837]
lrwxrwxrwx 1 chr chr 0 Apr  9 09:36 uts -> uts:[4026531838]

Here you see that each namespace to which the process is assigned is represented by a symbolic link in this directory (you could use ls -Li to resolve the link and display the actual inode to which it is pointing). If you compare this with the content of /proc/1/ns (you will need to be root to do this), you will probably find that the inodes are the same, so your shell lives in the same namespace as the init process. We will later see how this changes when you run a shell inside a container.
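
If you prefer to do this comparison programmatically, a small Python sketch could look like this (reading /proc/1/ns again requires root, and the helper function namespaces is my own invention for this example, not part of any Docker API):

import os

def namespaces(pid):
    # return a dictionary mapping each namespace type to the identifier
    # that the symbolic link in /proc/<pid>/ns points to
    ns_dir = "/proc/{}/ns".format(pid)
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in os.listdir(ns_dir)}

# compare the namespaces of the current process with those of the init process
mine = namespaces(os.getpid())
init = namespaces(1)
for name in sorted(mine):
    status = "same" if mine[name] == init.get(name) else "different"
    print(name, mine[name], "->", status)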

Namespaces already provide a great deal of isolation. There is one namespace, the mount namespace, which even allows different processes to mount different volumes as root directory, and we will later see how Docker uses this to realize file system isolation. Now, if every container really had its own, fully independent root file system, this would again introduce a high overhead. If, for instance, you run two containers that both use Ubuntu Linux 16.04, a large part of the root file system will be identical and therefore duplicated.

To be more efficient, Docker therefore uses a so called layered file system or union mount. The idea behind this is to merge different volumes into one logical view. Suppose, for instance, that you have a volume containing a file /fileA and another volume containing a file /fileB. With a traditional mount, you could mount any of these volumes and would then see either file A or file B. With a union mount, you can mount both volumes and would see both files.

That sounds easy, but is in fact quite complex. What happens, for instance, if both volumes contain a file called /fileA? To make this work, you have to add layers, where each layer will overlay files that already exist in lower layers. So your mounts will start to form a stack of layers, and it turns out that this is exactly what we need to efficiently store container images.

To understand this, let us again look at an example. Suppose you run two containers which are both based on the same image. What then essentially happens is that Docker will create two union mounts for you. The lowest layer in both mounts will be identical – it will simply be the common image. The second layer, however, is specific to the respective container. When you now add or modify a file in one container, this operation changes only the layer specific to this container. The files which are not modified in any of the containers continue to be stored in the common base layer. Thus, unless you execute heavy write operations in the container, the specific layers will be comparatively small, reducing the overhead greatly. We will see this in action soon.
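
To get a feeling for these lookup rules, here is a little toy model in Python – this only illustrates the semantics of a layered lookup with copy-on-write, it is of course not how aufs or overlayfs are actually implemented. Each layer is a dictionary mapping file names to contents, lookups walk the layers from top to bottom, and writes only ever touch the topmost layer.

from collections import ChainMap

# the read-only image layer shared by all containers
image_layer = {"/fileA": "content from the image", "/fileB": "more content"}

# each container gets its own, initially empty, writable layer on top
container_layer = {}

# ChainMap searches its maps from left to right, so the container layer
# overlays the image layer; writes always go to the first map
union_view = ChainMap(container_layer, image_layer)

print(union_view["/fileA"])         # still served from the image layer
union_view["/fileA"] = "modified"   # copy-on-write: only the top layer changes
print(union_view["/fileA"])         # now served from the container layer
print(image_layer["/fileA"])        # the shared image layer is untouched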

Finally, the last ingredient that we need are control groups, abbreviated as cgroups. Essentially, cgroups provide a way to organize Linux processes into hierarchies in order to manage resource limits. Being hierarchies, cgroups are again exposed as part of the file system. On my machine, this looks as follows.

chr:~$ ls /sys/fs/cgroup/
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer  hugetlb  memory  net_cls  net_cls,net_prio  net_prio  perf_event  pids  systemd

We can see that there are several directories, each representing a specific type of resource that we might want to manage. Each of these directories can contain an entire file system tree, where each directory represents a node in the hierarchy. Processes can be assigned to a node by adding their process ID to the file tasks that you will find in each of these nodes. Again, the man page turns out to be a helpful resource and explains the meaning of the different entries in the /sys/fs/cgroup directories.
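
To see how a process ends up in a cgroup, the following sketch creates a new node in the memory hierarchy shown above, puts a limit on it and assigns the current process to it. This assumes the cgroup v1 layout from the listing above and has to be run as root; on newer systems that use cgroup v2, the file names are different.

import os

# create a new node in the memory hierarchy (cgroup v1, run as root)
group = "/sys/fs/cgroup/memory/demo"
os.makedirs(group, exist_ok=True)

# limit the memory available to members of this group to 64 MB
with open(os.path.join(group, "memory.limit_in_bytes"), "w") as f:
    f.write(str(64 * 1024 * 1024))

# assign the current process by writing its PID into the tasks file
with open(os.path.join(group, "tasks"), "w") as f:
    f.write(str(os.getpid()))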

Practice

Let us now see how all this works in practice. For that purpose, let us open two terminals. In one of the terminals – let me call this the container terminal – start a container running the Alpine distribution using

docker run --rm -it alpine

In the second window which I will call the host window, we can now use ps -axf to inspect the process tree and then look at the directories in /proc to browse the namespaces.

What you will find is that there are three processes involved. First, there is the docker daemon itself, called dockerd. In my case, this process has PID 1496. Then, there is a child process called docker-containerd, which is the actual container runtime within the Docker architecture stack. This process in turn calls a process called docker-containerd-shim (PID 10126) which then spawns the shell (PID 10142 in my case) inside the container.

ProcessTree

 

Now let us inspect the namespaces associated with these processes first. We start with the shell itself.

$ sudo ls -Lil /proc/10142/ns
total 0
4026531835 -r--r--r-- 1 root root 0 Apr  9 11:20 cgroup
4026532642 -r--r--r-- 1 root root 0 Apr  9 11:20 ipc
4026532640 -r--r--r-- 1 root root 0 Apr  9 11:20 mnt
4026532645 -r--r--r-- 1 root root 0 Apr  9 10:48 net
4026532643 -r--r--r-- 1 root root 0 Apr  9 11:20 pid
4026531837 -r--r--r-- 1 root root 0 Apr  9 11:20 user
4026532641 -r--r--r-- 1 root root 0 Apr  9 11:20 uts

Let us now compare this with the namespaces to which the containerd-shim process is assigned.

$ sudo ls -Lil /proc/10126/ns
total 0
4026531835 -r--r--r-- 1 root root 0 Apr  9 11:21 cgroup
4026531839 -r--r--r-- 1 root root 0 Apr  9 11:21 ipc
4026531840 -r--r--r-- 1 root root 0 Apr  9 11:21 mnt
4026531957 -r--r--r-- 1 root root 0 Apr  9 08:49 net
4026531836 -r--r--r-- 1 root root 0 Apr  9 11:21 pid
4026531837 -r--r--r-- 1 root root 0 Apr  9 11:21 user
4026531838 -r--r--r-- 1 root root 0 Apr  9 11:21 uts

We see that Docker did in fact create new namespaces for almost all possible namespaces (ipc, mnt, net, pid, uts).

Next, let us compare mount points. Inside the container, run mount to see the existing mount points. Usually at the very top of the output, you should see the mount for the root file system. In my case, this was

none on / type aufs (rw,relatime,si=e92adf256343919e,dio,dirperm1)

Running the same command on the host system, I got a line like

none on /var/lib/docker/aufs/mnt/a9c5d26a45307d4e168b3936bd65d301c8dd039336083a324ed1a0b7c2bd0c52 type aufs (rw,relatime,si=e92adf256343919e,dio,dirperm1)

The identical si attribute tells you that this is in fact the same mount. You can also verify this directly. Inside the container, create a file test using touch test. If you then use ls to display the contents of the mount point as seen on the host, you should actually see this file. So the host process and the process inside the container see the same file system, but via different mount points in different mount namespaces – made possible by the namespace technology! You can now access the files from inside the container or from outside the container without having to use docker exec (though I am not sure I would recommend this).

If you want, you can even trace the individual layers of this file system on the host system by using ls /sys/fs/aufs/si_e92adf256343919e/ and printing out the contents of the various files that you will find there – you will find that there are in fact two layers, one of them being a read-only layer and the second one on top being a read-write layer.

ls /sys/fs/aufs/si_e92adf256343919e/
br0  br1  br2  brid0  brid1  brid2  xi_path
root:~# cat /sys/fs/aufs/si_e92adf256343919e/xi_path
/dev/shm/aufs.xino
root:~# cat /sys/fs/aufs/si_e92adf256343919e/br0
/var/lib/docker/aufs/diff/a9c5d26a45307d4e168b3936bd65d301c8dd039336083a324ed1a0b7c2bd0c52=rw
root:~# cat /sys/fs/aufs/si_e92adf256343919e/br1
/var/lib/docker/aufs/diff/a9c5d26a45307d4e168b3936bd65d301c8dd039336083a324ed1a0b7c2bd0c52-init=ro+wh

You can even “enter the container” using the nsenter Linux command to manually attach to defined namespaces of a process. To see this, enter

$ sudo nsenter -t 10142 -m -p -u "/bin/sh"
/ # ls
bin    dev    etc    home   lib    media  mnt    proc   root   run    sbin   srv    sys    test   tmp    usr    var
/ #

in a host terminal. This will attach to the mount, PID and UTS namespaces of the target process specified via the -t parameter – in our case this is the PID of the shell inside the container – and run the specified command /bin/sh. As a result, you will now see the file test created inside the container and see the same file system that is also visible inside the container.
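
As an aside, you can do what nsenter does from within Python as well by calling the setns system call via ctypes. The sketch below joins only the mount namespace of the container shell (again using the example PID 10142 from above – replace it with the PID on your system); it has to run as root and is meant purely as an illustration, not as a robust tool.

import ctypes
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)

# open the mount namespace of the container shell and join it via setns
fd = os.open("/proc/10142/ns/mnt", os.O_RDONLY)
if libc.setns(fd, 0) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))
os.close(fd)

# we are now inside the container's mount namespace and should see the
# file test that we created earlier in the container's root directory
print(os.listdir("/"))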

Finally, let us take a look at the cgroups docker has created for this container. The easiest way to find them is to search for the first few characters of the container ID that you can figure out using docker ps (I have cut off some lines at the end of the output).

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
7c0f142bfbfd        alpine              "/bin/sh"           About an hour ago   Up About an hour                        quizzical_jones
$ find /sys/fs/cgroup/ -name "*7c0f142bfbfd*"
/sys/fs/cgroup/pids/docker/7c0f142bfbfdb9320166f92d17ecbf9d462e9c234916632f23ec1b454fb6eb52
/sys/fs/cgroup/pids/system.slice/var-lib-docker-containers-7c0f142bfbfdb9320166f92d17ecbf9d462e9c234916632f23ec1b454fb6eb52-shm.mount
/sys/fs/cgroup/freezer/docker/7c0f142bfbfdb9320166f92d17ecbf9d462e9c234916632f23ec1b454fb6eb52
/sys/fs/cgroup/net_cls,net_prio/docker/7c0f142bfbfdb9320166f92d17ecbf9d462e9c234916632f23ec1b454fb6eb52
/sys/fs/cgroup/cpuset/docker/7c0f142bfbfdb9320166f92d17ecbf9d462e9c234916632f23ec1b454fb6eb52

If you now inspect the tasks file in each of the newly created directories, you will find the PID of the container shell as seen from the root namespace, i.e. 10142 in this case.
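
If you want to collect this information programmatically, a few lines of Python are sufficient – this sketch simply walks the cgroup file system, picks the directories whose name contains the container ID and prints the PIDs from their tasks files (the container ID below is just the example from my session, take yours from docker ps).

import os

# the first few characters of the container ID as shown by docker ps
container_id = "7c0f142bfbfd"

for root, dirs, files in os.walk("/sys/fs/cgroup"):
    if container_id in os.path.basename(root) and "tasks" in files:
        with open(os.path.join(root, "tasks")) as f:
            pids = f.read().split()
        print(root, "->", pids)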

This closes this post, which is already a bit lengthy. We have seen how Docker uses union file systems, namespaces and cgroups to manage and isolate containers, and how we can link the resources as seen from within a container to resources on the host system. In the next posts, we will look in more detail at networking in Docker. Until then, you might want to consult the following links that contain additional material.

  • J. Petazzoni has put slides from a talk online
  • The overview from the official Docker documentation briefly explains the core technologies
  • There are of course already many excellent blog posts on Docker internals, like the one by Julia Evans
  • The post from the Docker Saigon community

In case you are interested in container platforms in general, you also might want to take a look at my series on Kubernetes that I have just started.

Building a bitcoin container with Docker

When we are playing with bitcoin transactions, we need some playground where making a mistake does not cost us real bitcoins and therefore money. In addition, we might want to play around with more than one bitcoin server to see how networking works and how the messages are exchanged in the bitcoin peer-to-peer network.

There are several ways to do this. First, we could run more than one bitcoin server locally, and use different data directories and different configuration files and ports to separate them. However, there is a different option – we can use Docker containers. This makes it easy to spin up as many instances as needed while only having to put together the configuration once, supports networking and allows us to easily clean up after we have tried something and restart our scenarios from a well defined state. In this post, I will guide you through the steps needed to build and install the bitcoin core software inside a container. This post is not meant to be a general introduction into container technology and Docker, but do not worry – I will not assume that you are familiar with the concepts, and we will start from scratch. However, if you want to understand the basics, there are many good posts out there, like this introduction or the Docker overview page which is part of the official documentation.

First, we will need to make sure that we have the latest Docker CE installed. On Ubuntu 16.04, this requires the following steps. To start with, we remove any potentially existing docker versions.

$ sudo apt-get remove docker docker-engine docker.io

Next we will add the Docker repository to the repository list maintained by APT to be able to install from there. So we first use curl to get the public key from the docker page, feed this into the apt-key tool and then add the repository.

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - 
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable"

We can then install docker from there and make sure that it is started at boot time using

$ sudo apt-get update
$ sudo apt-get install docker-ce
$ sudo systemctl enable docker

You also should make sure that there is a docker group to which your user is added – use the command groups to verify that this is the case and follow the instructions on the Docker page if not.

When everything is done, let us test our installation. We will ask docker to download and run the hello-world minimal image.

$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
ca4f61b1923c: Pull complete
Digest: sha256:97ce6fa4b6cdc0790cda65fe7290b74cfebd9fa0c9b8c38e979330d547d22ce1
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://cloud.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/engine/userguide/

Now we can start with the actual installation. At the end, of course, there will be a Dockerfile – but I will guide you through the individual steps one by one and we will create the final file iteratively. If you would only like to see the result, you can skip to the last paragraph of this post that explains how to retrieve this file from my GitHub repository and how to use it … but of course you are more than welcome to join me on the way there.

Actually, we will even do a bit more than we would need if we only wanted to have a running bitcoind, but this post is also meant as a refresher if you have built docker images before and as a hands-on introduction if you have never done this.

As a basis, we will use the Alpine Linux distribution. From there, we have several options to proceed. Of course, we could simply use a precompiled bitcoin binary, but that would not give us the flexibility that we need – it is for instance extremely useful to be able to add debugging output to the source code if something is not as expected. Besides, it is more fun to compile from scratch. So our target will be to pull the source code, compile and install in a container.

Based on the Alpine Linux container, we will first create several container images with separate docker files that are based on each other, which will make it easier to iterate and check things manually as we go. In the end, we will assemble everything into one docker file.

As a first step, we add a few libraries to the Alpine image that we need to be able to fetch and compile the bitcoin source code. We call the target image alpine-dev as it can be reused as a development environment for other purposes as well. Here is the docker file Dockerfile.dev

FROM alpine
RUN apk update && apk add git \ 
                          make \
                          file \
                          autoconf \
                          automake \ 
                          build-base \
                          libtool \
                          db-c++ \
                          db-dev \
                          boost-system \
                          boost-program_options \
                          boost-filesystem \ 
                          boost-dev \ 
                          libressl-dev \ 
                          libevent-dev 

Save this file somewhere in a local directory (ideally, this directory is otherwise empty, as its entire content will be transferred to the docker daemon as build context during the build process, so we better keep it small) and then build the image using

docker build -f Dockerfile.dev -t alpine-dev .

If you now check your local repository with docker images, you should see that the image has been added successfully (it is big, but do not worry, we fix that later).

The next image that we will be building is based on alpine-dev but contains the specific version of the bitcoin source code that we want, i.e. it clones the code from the bitcoin GitHub repository. So its docker file is very short.

FROM alpine-dev
RUN git clone https://github.com/bitcoin/bitcoin --branch v0.15.0 --single-branch

You can build this image with

docker build -f Dockerfile.build -t bitcoin-alpine-build .

Next we write the docker file that performs the actual build. This is again not difficult (but here it comes in handy that we can run a container from the image that we have just generated and play around with the configuration to arrive at the final set of options below).

FROM bitcoin-alpine-build
RUN (cd bitcoin  && ./autogen.sh && \
                      ./configure --disable-tests \
                      --disable-bench --disable-static  \
                      --without-gui --disable-zmq \ 
                      --with-incompatible-bdb \
                      CFLAGS='-w' CXXFLAGS='-w' && \
                      make -j 4 && \
                      strip src/bitcoind && \
                      strip src/bitcoin-cli && \
                      strip src/bitcoin-tx && \
                      make install )

Again, we build the image and place it into our local repository using

docker build -f Dockerfile.install -t bitcoin-alpine-bin .

Let us now test our installation. We first bring up an instance of our new container.

docker run -it bitcoin-alpine-bin

Within the container, we can now run the bitcoin daemon.

bitcoind -server=1 -rest=1 -regtest -txindex=1 -daemon

Once the daemon has started, we can verify that it is up and running and ready to accept commands. We can for instance – from within the container – run bitcoin-cli -regtest getinfo and should see some status information.

So let us now move on to write the docker file that will create the run time environment for our daemon. There are still a couple of things we need to think about. First, we will have to communicate with our bitcoin server using JSON-RPC, so we need to expose this port towards the host.

Second, we need a configuration file. We could generate this on the fly, but the easiest approach is to place it in the build context and copy it into the image at build time.

Finally, we have to think about the RPC authorization mechanism. Typically, the bitcoin server writes a cookie file into the configuration directory which is then picked up by the client, but this does not work if the server is running inside the container and the client locally on our host system. Probably the safest way would be to use the option to provide a hashed password to the server and keep the actual password secure. We will use a different approach which is of course not recommended for production use as it could allow the world access to your wallet – we specify the username and password in the configuration file. Our full configuration file is

regtest=1
server=1
rpcbind=0.0.0.0:18332
rpcallowip=0.0.0.0/0
rpcport=18332
rpcuser=user
rpcpassword=password
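
As an aside, if you would rather not keep the plain text password in this file, bitcoind also accepts an rpcauth entry containing a salted hash of the password instead of the rpcuser and rpcpassword lines. The following sketch shows how such an entry could be generated – it mirrors the rpcauth.py helper script that ships with the Bitcoin Core sources, so please double-check the output format against the version you are actually running.

import hmac
import os
from binascii import hexlify

def rpcauth_entry(username, password):
    # a random, hex encoded salt
    salt = hexlify(os.urandom(16)).decode()
    # the password is hashed with HMAC-SHA256, keyed with the salt
    digest = hmac.new(salt.encode(), password.encode(), "SHA256").hexdigest()
    return "rpcauth={}:{}${}".format(username, salt, digest)

print(rpcauth_entry("user", "password"))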

Note that the configuration above opens the RPC port to the entire world and is certainly not secure, but for use in a test setup it will do. Our docker file for the last stage is now

FROM bitcoin-alpine-bin

#
# Copy the bitcoin.conf file from
# the build context into the container
#
COPY bitcoin.conf /bitcoin.conf

#
# Expose the port for the RPC interface
#
EXPOSE 18332/tcp

#
# Start the bitcoin server
#
ENTRYPOINT ["/usr/local/bin/bitcoind"]
CMD ["-conf=/bitcoin.conf", "-regtest", "-rest=1", "-server=1", "-printtoconsole", "-txindex=1"]

We can now test our client. First, we build and run the container.

$ docker build -f Dockerfile.run -t bitcoin-alpine-run .
$ docker run --rm -it -p 18332:18332 bitcoin-alpine-run

Then, in a second terminal, let us connect. We assume that you have an installation of the bitcoin-cli client on the host machine as well. Run

$ bitcoin-cli -regtest -rpcuser=user -rpcpassword=password getinfo

You should now see the usual output of the getinfo command similar to the test that we have done before directly in the container.
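
If you prefer to run this check from Python instead of the CLI, a minimal test could look like the following sketch (it assumes the requests package and uses the credentials and port from the configuration file above).

import requests

payload = {"method": "getinfo", "params": [], "jsonrpc": "2.0", "id": 1}
r = requests.post("http://localhost:18332",
                  json=payload, auth=("user", "password"))
# print the status information returned by the bitcoin daemon
print(r.json()["result"])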

That is nice, but there is a serious issue. If you look at the image that we have just created using docker images, you will most likely be shocked – our image is more than 800 MB in size. This is a problem. The reason is that our image contains everything that we have left behind during the build process – the development environment, header files, the source code, object files and so on. It would be much nicer if we could build a clean image that only contains the executables and libraries needed at runtime.

Fortunately, current versions of Docker offer a feature called multi-stage build which is exactly what we need. Let us take a look at the following docker file to explain this.

FROM bitcoin-alpine-bin as build

RUN echo "In build stage"

FROM alpine

#
# Copy the binaries from the build to our new container
#
COPY --from=build /usr/local/bin/bitcoind /usr/local/bin

#
# Install all dependencies
#
RUN apk update && apk add boost boost-filesystem \
            boost-program_options \
            boost-system boost-thread busybox db-c++ \
            libevent libgcc libressl2.6-libcrypto \ 
            libstdc++ musl

#
# Copy the bitcoin.conf file from
# the build context into the container
#
COPY bitcoin.conf /bitcoin.conf

#
# Expose the port for the RPC interface
#
EXPOSE 18332/tcp

#
# Start the bitcoin server
#
ENTRYPOINT ["/usr/local/bin/bitcoind"]
CMD ["-conf=/bitcoin.conf", "-regtest", "-rest=1", "-server=1", "-printtoconsole", "-txindex=1"]

We see that this docker file has two FROM statements. When it is executed, it starts with the image specified by the first FROM statement. In our case, this is the bitcoin-alpine-bin image that already contains the executable binaries. In this stage, our simple example does nothing but print a message. Then, with the second FROM command, Docker will start a new image based on the raw Alpine image from scratch. However – and this is the magic – we still have access to the files from the first image, and we can copy them to our new image using the --from specifier. This is what we do here – we copy the executable into our new image. Then we add only the runtime libraries that we really need and the configuration file.

This gives us a nice and small container – when I checked it was below 30MB!

In further posts, I will assume that you have built this container image and made it available in your local Docker repository as bitcoin-alpine. If you have not followed all the steps of this post, you can simply pull a docker file that merges all the files explained above into one file and use it to build this container as follows.

$ git clone https://github.com/christianb93/bitcoin
$ cd bitcoin/docker
$ docker build --rm -f Dockerfile -t bitcoin-alpine .

In the version of the dockerfile on Github, I have also added a startup script that overwrites the default credentials if the environment variables BC_RPC_USER and BC_RPC_PASSWORD are set – this makes it easier to use custom credentials, for instance in a Kubernetes environment.

That was it for today. Starting with the next post, we will use this test installation to see how we can create and publish a full transaction in the bitcoin network.