Docker internals: networking part II

In this post, we will look in more detail at networking with Docker if communication between a Docker container and either the host or the outside world is involved.

It turns out that in these cases, the Linux Netfilter/iptables facility comes into play. This post is not meant to be an introduction into iptables, and I will assume that the reader is aware of the basics (for more information, I found the tutorial on frozentux and the overview on digitalozean very helpful).

Setup and basics

To simplify the setup, we will only use one container in this post. So let us again designate a terminal to be the container terminal and in this terminal, enter

$ docker run --rm -d --name "container1" httpd:alpine
$ docker exec -it container1 "/bin/sh"

You should now see a shell prompt inside the container. When you run netstat -a from that prompt, you should see a listening socket being bound to port 80 within the container.

Now let us print out the iptables configuration on the host system.

$ sudo iptables -S
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT
-N DOCKER
-N DOCKER-ISOLATION
-N DOCKER-USER
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER-ISOLATION -j RETURN
-A DOCKER-USER -j RETURN

Here we see that Docker has added three new chains and a couple of rules. The first chain that Docker has added is the DOCKER chain. In our configuration, that chain is empty, we will see later that this will change once we expose ports to the outside world.

The second chain that we see is the DOCKER-ISOLATION chain. I have not been able to find out much about this chain so far, but it appears that Docker uses this chain to add rules that isolate containers when you do not use the default bridge device but connect your containers to user defined bridges.

Finally, there is the chain DOCKER-USER that Docker adds, but otherwise leaves alone, so that firewall rules can be added by an administrator with a bit less conflict of clashing with the manipulations that Docker performs.

All these chains are empty or just consist of a RETURN statement, so we can safely ignore them for the time being.

Host-to-container traffic

As a first use case, let us now try to understand what happens when an application (curl in our case) in the host namespace wants to talk to the web server running in our container. To be able to better see what is going on, let us add two logging rules to the iptables configuration to log traffic coming in via the docker0 bridge and going out via the docker0 bridge.

$ sudo iptables -A INPUT -i docker0 -j LOG --log-prefix "IN: " --log-level 3
$ sudo iptables -A OUTPUT -o docker0 -j LOG --log-prefix "OUT: " --log-level 3

With these rules in place, let us now create some traffic. In the host terminal, enter

$ curl 172.17.0.2

We can now inspect the content of /var/log/syslog to see what happened. The first two entries should like like this (stripping of host name and time stamps):

OUT: IN= OUT=docker0 SRC=172.17.0.1 DST=172.17.0.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=3460 DF PROTO=TCP SPT=34322 DPT=80 WINDOW=29200 RES=0x00 SYN URGP=0 
IN: IN=docker0 OUT= PHYSIN=veth376d25c MAC=02:42:25:b7:e5:38:02:42:ac:11:00:02:08:00 SRC=172.17.0.2 DST=172.17.0.1 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=80 DPT=34322 WINDOW=28960 RES=0x00 ACK SYN URGP=0 

So we see that the first logging rule that has been triggered is the rule in the OUTPUT chain. Let us try to understand in detail how this log entry was created.

When curl asks the kernel to establish a connection with 17.17.0.2, i.e. to send a TCP SYN request, a TCP packet will be generated and handed over to the kernel. The kernel will consult its routing table, find the route via docker0 and send the packet to the bridge device.

At this point, the packet leaves the namespace for which this set of iptables rules is responsible, so the OUTPUT chain is traversed and our log entry is created.

What happens next? The packet is picked up by the container namespace, processed and the answer goes back. We can see the answer coming in again, this time triggering the logging in the INPUT rule – this is the second line, the SYN ACK packet.

Except our logging rule, no other rules are defined in the INPUT and OUTPUT chains, so the default policies apply for our packets. As both policies are set to ACCEPT, netfilter will allow our packets to pass and the connection works.

Getting out of the container

The story is a bit different if we are trying to talk to a web server on the host or on the LAN from within the container. Thus, from the kernels point of view, we are now dealing with traffic involving more than one interface, and in addition to the INPUT and OUTPUT chains, the FORWARD chain becomes relevant. To be able to inspect this traffic, let us therefore add two logging rules to the FORWARD chain.

$ sudo iptables -I FORWARD -i docker0 -j LOG --log-prefix "IN_FORWARD: " --log-level 3
$ sudo iptables -I FORWARD -o docker0 -j LOG --log-prefix "OUT_FORWARD: " --log-level 3

Now let us generate some traffic. The first thing that I have tried is to reach my SAN on the same network which is a Synology diskstation listening on port 5000 of 192.168.178.28. So in the container window, I did a

# telnet 192.168.178.28:5000

and entered some nonsens (it does not matter so much what you enter here, it will most likely result in a “bad request” message, but it generates traffic – do not forget to hit return). This will again produce some logging output, the first two lines being

IN_FORWARD: IN=docker0 OUT=enp4s0 PHYSIN=veth376d25c MAC=02:42:25:b7:e5:38:02:42:ac:11:00:02:08:00 SRC=172.17.0.2 DST=192.168.178.28 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=28280 DF PROTO=TCP SPT=54062 DPT=5000 WINDOW=29200 RES=0x00 SYN URGP=0 
OUT_FORWARD: IN=enp4s0 OUT=docker0 MAC=1c:6f:65:c0:c9:85:00:11:32:77:fe:46:08:00 SRC=192.168.178.28 DST=172.17.0.2 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=0 DF PROTO=TCP SPT=5000 DPT=54062 WINDOW=14480 RES=0x00 ACK SYN URGP=0

Let us again try to understand what happened. An application (telnet in our case) wants to reach the IP address 192.168.178.28. The kernel will first consult the routing table in the namespace of the container and decide to use the default route via the eth0 device. Thus the packet will go to the bridge device docker0. There, it will be picked up by the netfilter chain in the host namespace. As the destination address is not one of the IP addresses of the host, it will be handled by the FORWARD chain, which will trigger our logging rules.

Let us now inspect the other rules in the forward chain once more using iptables -S FORWARD. We see that in addition to the rules pointing to the docker generated subchains and in addition to our own logging rules, there are two rules relevant for our case.

$ sudo iptables  -S  FORWARD
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT

The first rule will accept all traffic that comes in at any network device and is routed towards the bridge device if that traffic belongs to an already established connection. This allows the answer to our TCP request to travel back from the network interface connected to the local network (enp4s0 in my case) to the container. However, unsolicited requests, i.e. new connection requests targeted towards the bridge device will be left to the default policy of the FORWARD chain and therefore dropped.

The second rule will allow outgoing traffic – all packets coming from the docker0 bridge device targeted towards any other interface will be accepted and hence forwarded. As there is no filter on the connection state, this allows an application inside the container to establish a new connection to the outside world.

However, I have been cheating a bit and skipped one important point. Suppose our SYN request happily leaves our local network adapter and travels through the LAN. The request comes from within the container, so from the IP address 172.17.0.2. If that IP address would still appear in the IP header, the external server (the disk station in my case) would try to send the answer back to this address. However, this address is not known in the LAN, only locally on my machine, and the response would get lost.

To avoid this, Docker will in fact add one more rule to the NAT table. Let us try to locate this rule.

$ sudo iptables  -S -t nat 
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N DOCKER
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN

Again, we see that docker is adding a new chain and some rules pointing to this chain. In addition, there is a rule being added to the POSTROUTING chain which is invoked immediately before a packet leaves the host. This is a so called masquerading rule which will replace the source IP address in the IP header of the outgoing packet by the IP address of the device through which the packet is sent. Thus, from the point of view of my brave diskstation, the packet will look as if it originated from the host and will therefore be sent back to the host. When the response comes in, netfilter will revert the process and forward the packet to the correct destination, in our case the bridge device.

Reaching a server from the outside world

This was already fairly complicated, but now let us try to see what happens if we want to connect to the web server running in our container from the outside world.

Now if I simply ssh into my diskstation and run curl there to reach 172.17.0.2, this will of course fail. The diskstation does not have a route to that destination, and the default gateway cannot help either as this is a private class B network. If I replace the IP address with the IP address of the host, it will not work either – in this case, the request reaches the host, but on the host network address, no process is listening in port 80. So we somehow need to map the port from the container into the host networking system.

If you consult the docker documentation on this case, you will learn that in order to do this, you have to run the container with the -p switch. So let us stop and restart our container and apply that option.

$ docker stop container1
$ docker run --rm -d --name "container1" -p 80:80 httpd:alpine
$ docker exec -it container1 "/bin/sh"

If we now inspect the chains and see what has changed, we can find the following new rule which has been added to the filter table.

A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 80 -j ACCEPT

This rule will apply to all incoming traffic that is targeted towards 172.17.0.2:80 and not coming from the bridge device, and accept it. In addition, two new rules have been added to the NAT table.

-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 80 -j MASQUERADE
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 80 -j DNAT --to-destination 172.17.0.2:80

The first rule will again apply a network address translation (SNAT, i.e. manipulating the source address) as we have already seen it before and applies to traffic within the virtual network to which the bridge belongs. The second rule is more interesting. This rule has been added to the DOCKER chain and requests DNAT (i.e. destination NAT, meaning that the target address is replaced) for all packets that are not coming from the bridge device, but have destination port 80. For these packets, the target address is rewritten to be 172.17.0.2:80, so all traffic directed towards port 80 is now forwarded to the container network.

Let us again go through one example step by step. For that purpose, it is useful to add some more logging rules, this time to the NAT table.

$ sudo iptables -t nat -I PREROUTING  -j LOG --log-prefix "PREROUTING: " --log-level 3
$ sudo iptables -t nat -I POSTROUTING  -j LOG --log-prefix "POSTROUTING: " --log-level 3

When we now submit a request from another machine on the local network directed towards 192.168.178.27:80 (i.e. towards the IP address of the host!), we find the following log entries in /var/syslog.

PREROUTING: IN=enp4s0 OUT= MAC=1c:6f:65:c0:c9:85:00:11:32:77:fe:46:08:00 SRC=192.168.178.28 DST=192.168.178.27 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57720 DF PROTO=TCP SPT=54075 DPT=80 WINDOW=14600 RES=0x00 SYN URGP=0 
OUT_FORWARD: IN=enp4s0 OUT=docker0 MAC=1c:6f:65:c0:c9:85:00:11:32:77:fe:46:08:00 SRC=192.168.178.28 DST=172.17.0.2 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=57720 DF PROTO=TCP SPT=54075 DPT=80 WINDOW=14600 RES=0x00 SYN URGP=0 
POSTROUTING: IN= OUT=docker0 SRC=192.168.178.28 DST=172.17.0.2 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=57720 DF PROTO=TCP SPT=54075 DPT=80 WINDOW=14600 RES=0x00 SYN URGP=0

Thus the first packet of the connection arrives (note that NAT rules are not evaluated for subsequent packets of a connection any more) and will first be processed by the PREROUTING chain of the NAT table. As we added our logging rule here, we see the log output. We can also see that at this point, the target address is still 192.168.178.27:80.

The next rule – still in the PREROUTING chain – that is being evaluated is the jump to the DOCKER chain. Here, the DNAT rule kicks in and changes the destination address to 172.17.0.2, the IP address of the container.

Then the kernel routing decision is taken based on the new address and the forwarding mechanism starts. The packet will appear in the forward chain. As the routing has already determined the target interface to be docker0, our OUT_FORWARD logging rule applies and the second log entry is produced, also confirming that the new target address is 172.17.0.2:80. Then, the jump to the DOCKER chain matches, and within that chain, the rule is accepted as its target port is 80.

Finally, the POSTROUTING chain in the NAT table is traversed. This produces our third log file entry. However, the SNAT rule does not apply, as the source address does not belong to the network 172.17.0.2/32 – you can use tcpdump on the bridge device to see that when the packet leaves the device, it still has the source IP address belonging to the diskstation. So again, the configuration works and we can reach an application inside the container from the outside world.

There are many other aspects of networking with Docker that I have not even touched upon – user defined bridges, overlay networks or the famous docker-proxy, just to mention a few fo them – but this post is already a bit lengthy, so let us stop here for today. I hope I could provide at least some insight into the internals of networking with Docker – and for me, this was actually a good opportunity to refresh and improve my still very basic knowledge of the Linux networking stack.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s