Virtual networking labs – Open vSwitch in practice

In the last post, we discussed the architecture of Open vSwitch and how it can be combined with a control plane to realize an SDN. Today, we will make this a bit more tangible by running two hands-on labs with OVS.

The labs in this post are modelled after some of the How-to documents that are part of the Open vSwitch documentation, but use a combination of virtual machines and Docker to avoid the need for more than one physical machine. In both labs, we bring up two virtual machines which are connected via a VirtualBox virtual network, and inside each machine, we bring up two Docker containers that will eventually interact via OVS bridges.

Lab 11: setting up an overlay network with Open vSwitch

In the first lab, we will establish interaction between the OVS bridges on the two involved virtual machines using an overlay network. Specifically, the Docker containers on each VM will be connected to an OVS bridge, and the OVS bridges will use VXLAN to talk to each other, so that effectively, all Docker containers appear to be connected to an Ethernet network spanning the two virtual machines.

OVSOverlayNetwork

Instead of going through all steps required to set this up, we will again bring up the machines automatically using a combination of Vagrant and Ansible, and then discuss the major steps and the resulting setups. To run the lab, you will again have to download the code from my repository and start Vagrant.

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab11
vagrant up

While this is running, let us quickly discuss what the scripts are doing. First, of course, we create two virtual machines, each running Ubuntu Bionic. In each machine, we install Open vSwitch and Docker. We then install the docker Python3 module to make Ansible happy.

Next, we bring up two Docker containers, each running an image which is based on NGINX but has some networking tools installed on top. For each container, we set up a VETH pair. One of the two devices is then moved into the networking namespace of the container, and the other one will later be added to our bridge, so that each VETH pair effectively operates like an Ethernet cable connecting the container to the bridge.

We then create the OVS bridge. In the Ansible script, we use the Ansible OVS module to do this, but if you wanted to create the bridge manually, you would use a command like

ovs-vsctl add-br myBridge \
           -- add-port myBridge web1_veth1 \
           -- add-port myBridge web2_veth1

This is actually a combination of three commands (i.e. updates of the OVSDB database) which will be run in one single transaction (the OVS CLI uses the double dashes to combine commands into one transaction). With the first part of the command, we create a virtual OVS bridge called myBridge. With the second and third line, we then add two ports, connected to the two VETH pairs that we have created earlier.

Once the bridge exists and is connected to the containers, we add a third port, a VXLAN port, which, in a manual setup, would be created with the following command.

ovs-vsctl add-port myBridge vxlan0 \
          -- set interface vxlan0 type=vxlan options:remote_ip=192.168.50.5 options:dst_port=4789 options:ttl=5

Again, we atomically add the port to the bridge and pass the VXLAN options. We set up the VTEP as a point-to-point connection to the second virtual machine, using the standard UDP port and a TTL of five so that the UDP packets do not get dropped on the way.

Finally, we configure the various devices and assign IP addresses. To configure the devices in the container namespaces, we could attach to the containers, but it is easier to use nsenter to run the required commands within the container namespaces.
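
As a sketch (the actual device names and addresses are defined in the Ansible scripts; I assume here that the container web1 uses the VETH endpoint web1_veth0 and is supposed to get the address 172.16.0.1), such a configuration step could look like this.

web1PID=$(sudo docker inspect --format='{{.State.Pid}}' web1)
sudo nsenter -t $web1PID -n ip addr add 172.16.0.1/24 dev web1_veth0
sudo nsenter -t $web1PID -n ip link set web1_veth0 up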

Once the setup is complete, we are ready to explore the newly created machines. First, use vagrant ssh boxA to log into boxA. From there, use docker exec to attach to the first container.

sudo docker exec -it web1 "/bin/bash"

You should now be able to ping all other containers, using the IP addresses 172.16.0.2 – 172.16.0.4. If you run arp -n inside the container, you will also find that all three IP addresses are directly resolved into MAC addresses and are actually present on the same Ethernet segment.
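
For instance, assuming that web1 itself has received the address 172.16.0.1, you could run the following commands inside the container.

ping -c 3 172.16.0.2
ping -c 3 172.16.0.3
ping -c 3 172.16.0.4
arp -n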

To inspect the bridges that OVS has created, exit the container again so that we are now back in the SSH session on boxA and use the command line utility ovs-vsctl to list all bridges.

sudo ovs-vsctl list-br

This will show us one bridge called myBridge, as expected. To get more information, run

sudo ovs-vsctl show

This will print out the full configuration of the current OVS node. The output should look similar to the following snippet.

af3a2230-3d2a-4364-a0b9-1da4e32211e4
    Bridge myBridge
        Port "web2_veth1"
            Interface "web2_veth1"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {dst_port="4789", remote_ip="192.168.50.5", ttl="5"}
        Port "web1_veth1"
            Interface "web1_veth1"
        Port myBridge
            Interface myBridge
                type: internal
    ovs_version: "2.9.2"

We can see that the output nicely reflects the structure of our network. There is one bridge, with three ports – the two VETH ports and the VXLAN port – plus the internal default port named after the bridge. We also see the parameters of the VXLAN port that we have specified during creation. It is also possible to list the content of the OVSDB tables that correspond to the various objects directly.

sudo ovs-vsctl list bridge
sudo ovs-vsctl list port
sudo ovs-vsctl list interface
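
By default, ovs-vsctl prints these tables in a human-readable key/value format. If you prefer machine-readable output, you should (to the best of my knowledge) be able to use the --format switch, for instance

sudo ovs-vsctl --format=json list interface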

Lab 12: VLAN separation with Open vSwitch

In this lab, we will use a setup which is very similar to the previous one, but with the difference that we use layer 2 technology to span our network across the two virtual machines. Specifically, we establish two VLANs with ids 100 (containing web1 and web3) and 200 (containing the other two containers). On those two logical Ethernet networks, we establish two different layer 3 networks – 192.168.50.0/24 and 192.168.60.0/24.

OVSLab12

The first part of the setup – bringing up the containers and creating the VETH pairs – is very similar to the previous labs. Once this is done, we again set up the two bridges. On boxA, this would be done with the following sequence of commands.

sudo ovs-vsctl add-br myBridge
sudo ovs-vsctl add-port myBridge enp0s8
sudo ovs-vsctl add-port myBridge web1_veth1 tag=100
sudo ovs-vsctl add-port myBridge web2_veth1 tag=200

This will create a new bridge and first add the VM interface enp0s8 to it. Note that by default, every port added to OVS is a trunk port, i.e. the traffic will carry VLAN tags. We then add the two VETH ports with the additional parameter tag which will mark the port as an access port and define the corresponding VLAN ID.
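
To double-check the tag assigned to a port, you can query the OVSDB directly – this is not part of the lab scripts, but a useful sanity check.

sudo ovs-vsctl get port web1_veth1 tag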

Next we need to fix our IP setup. We need to remove the IP address from enp0s8, as this device is now part of our bridge, and set the IP addresses for the two VETH devices inside the containers.

sudo ip addr del 192.168.50.4 dev enp0s8
web1PID=$(sudo docker inspect --format='{{.State.Pid}}' web1)
sudo nsenter -t $web1PID -n ip addr add  192.168.50.1/24 dev web1_veth0
web2PID=$(sudo docker inspect --format='{{.State.Pid}}' web2)
sudo nsenter -t $web2PID -n ip addr add  192.168.60.2/24 dev web2_veth0

Finally, we need to bring up the devices.

sudo nsenter -t $web1PID -n ip link set  web1_veth0 up
sudo nsenter -t $web2PID -n ip link set  web2_veth0 up
sudo ip link set web1_veth1 up
sudo ip link set web2_veth1 up

The setup of boxB proceeds along the following lines. In the lab, we again use Ansible scripts to do all this, but if you wanted to do it manually, you would have to run the following on boxB.

sudo ovs-vsctl add-br myBridge
sudo ovs-vsctl add-port myBridge enp0s8
sudo ovs-vsctl add-port myBridge web3_veth1 tag=100
sudo ovs-vsctl add-port myBridge web4_veth1 tag=200
sudo ip addr del 192.168.50.5 dev enp0s8
web3PID=$(sudo docker inspect --format='{{.State.Pid}}' web3)
sudo nsenter -t $web3PID -n ip addr add  192.168.50.3/24 dev web3_veth0
web4PID=$(sudo docker inspect --format='{{.State.Pid}}' web4)
sudo nsenter -t $web4PID -n ip addr add  192.168.60.4/24 dev web4_veth0
sudo nsenter -t $web3PID -n ip link set  web3_veth0 up
sudo nsenter -t $web4PID -n ip link set  web4_veth0 up
sudo ip link set web3_veth1 up
sudo ip link set web4_veth1 up

Instead of manually setting up the machines, I have of course again composed a couple of Ansible scripts to do all this. To try this out, run

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab12
vagrant up 

Now log into one of the boxes, say boxA, attach to the web1 container and try to ping web3 and web4.

vagrant ssh boxA
sudo docker exec -it web1 /bin/bash
ping 192.168.50.3
ping 192.168.60.4

You should see that you can get a connection to web3, but not to web4. This is of course what we expect, as the VLAN tagging is supposed to separate the two networks. To see the VLAN tags, open a second session on boxA and enter

sudo tcpdump -e -i enp0s8

When you now repeat the ping, you should see that the traffic generated from within the container web1 carries the VLAN tag 100. This is because the port to which enp0s8 is attached has been set up as a trunk port. If you stop the dump and start it again, but this time listening on the device web1_veth1 which we have added to the bridge as an access port, you should see that no VLAN tag is present. Thus the bridge operates as expected by adding the VLAN tag according to the tag of the access port on which the traffic comes in.
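
The corresponding dump command is

sudo tcpdump -e -i web1_veth1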

In the next post, we will start to explore another important feature of OVS – controlling traffic using flows.

Virtual networking labs – a short introduction to Open vSwitch

In the previous posts, we have used standard Linux tools to establish and configure our network interfaces. This is nice, but becomes very difficult to manage if you need to run environments with hundreds or even thousands of machines. Open vSwitch (OVS) is an Open source software switch which can be integrated with SDN control planes and cloud management software. In this post, we will look a bit at the theoretical background of OVS, leaving the practical implementation of some examples to the next post.

Some terms from the world of software defined networks

It is likely that you have heard the magical word SDN before, and it is also quite likely that you have already found that giving a precise meaning to this term is hard. Still, there is a certain agreement that one of the core ideas of SDN is to separate the flow of data through your networking devices from the networking configuration.

In a traditional data center, your network would be implemented by a large number of devices like switches and routers. Each of these devices holds some configuration and typically has a way to change that configuration remotely. Thus, the configuration is tightly integrated with the networking infrastructure, and making sure that the entire configuration is consistent and matches the desired state of your network is hard.

With software-defined networking, you separate the configuration from the networking equipment and manage it centrally. Thus, the networking equipment handles the flow of data – and is referred to as the data plane or flow plane – while a central component called the control plane is responsible for controlling the flow of data.

This is still a bit vague, but becomes a bit more tangible when we look at an example. Enter Open vSwitch (OVS). OVS is a software switch that turns a Linux server (which we will call a node) into a switch. Technically, OVS is a set of server processes that are installed on each node and that handle the network flow between the interfaces of the node. These nodes together make up the data plane. On top of that, there is a control plane or controller. This controller talks to the individual nodes to make sure the rules that they use to manage traffic (called the flows) are set up accordingly.

To allow controllers and switch nodes to interact, an open standard called OpenFlow has been created which defines a common way to describe flows and to exchange data between the controller and the switches. OVS supports OpenFlow (multiple versions of the protocol, from 1.0 up to 1.5, though not all of them are enabled by default) and thus can be combined with OpenFlow based controllers like Faucet or OpenDaylight, creating a layered architecture as follows. Additionally, a switch can be configured to ask the controller how to handle a packet for which no matching flow can be found.

OVSOverview

Here, OVS uses OpenFlow to exchange flows with the controller. To exchange information on the underlying configuration of the virtual bridge (which ports are connected, how are these ports set up, …) OVS provides a second protocol called OVSDB (see below) which can also be used by the control plane to change the configuration of the virtual switch (some people would probably prefer to call the part of the control logic which handles this the management plane in contrast to the control plane, which really handles the data flow only).

Components of Open vSwitch

Let us now dig a little bit into the architecture of OVS itself. Essentially, OVS consists of three components plus a set of command-line interfaces to operate the OVS infrastructure.

First, there is the OVS virtual switch daemon ovs-vswitchd. This is a server process running on the virtual switch and is connected to a socket (usually a Unix socket, unless it needs to communicate with controllers not on the same machine). This component is responsible for actually operating the software defined switch.

Then, there is a state store, in the form of the ovsdb-server process. This process maintains the state that is managed by OVS, i.e. the objects like bridges, ports and interfaces that make up the virtual switch, and tables like the flow tables used by OVS. This state is usually kept in a file in JSON format in /etc/openvswitch. The switch daemon connects to the OVSDB server via a Unix domain socket and uses it to exchange information (in the database world, the switch daemon is a client to the OVSDB database server). Other clients can connect to the OVSDB using a JSON based protocol called the OVSDB protocol (which is described in RFC 7047) to retrieve and update information.

The third main component of OVS is a Linux kernel module openvswitch. This module is now part of the official Linux kernel tree and therefore is typically pre-installed. This kernel module handles one part of the OVS data path, sometimes called the fast path. Known flows are handled entirely in kernel space. New flows are handled once in the user space part of the datapath (slow path) and then, once the flow is known, subsequently in the kernel data path.

Finally, there are various command-line interfaces, the most important one being ovs-vsctl. This utility can be used to add, modify and delete the switch components managed by OVS like bridges, ports and so forth – more on this below. In fact, this utility operates by making updates to the OVSDB, which are then detected and realized by the OVS switch daemon. So the OVSDB is the authoritative source of the target state of the system.

The OVS data model

To understand how OVS operates, it is instructive to look at the data model that describes the virtual switches deployed by OVS. This model is verbally described in the man pages. If you have access to a server on which OVS is installed, you can also get a JSON representation of the data model by running

ovsdb-client get-schema Open_vSwitch 

At the top level of the hierarchy, there is a table called Open_vSwitch. This table contains a set of configuration items, like the supported interface types or the version of the database being used.

Next, there are bridges. A bridge has one or more ports and is associated with a set of tables, each table representing a protocol that OVS supports to obtain flow information (for instance NetFlow or OpenFlow). Note that the Flow_Table does not contain the actual OpenFlow flow table entries, but just additional configuration items for a flow table. In addition, there are mirror ports which are used to trace and monitor the network traffic (which we ignore in the diagram below).

Each port refers to one or more interfaces. In most situations, each port has one interface, but in case of bonding, for instance, one port is supported by two interfaces. In addition, a port can be associated with QoS settings and queues for traffic control.

OVSDBDataModel

Finally, there are controllers and managers. A controller, in OVS terminology, is some external system which talks to OVS via OpenFlow to control the flow of packets through a bridge (and thus is associated with a bridge). A manager, on the other hand, is an external system that uses the OVSDB protocol to read and update the OVSDB. As the OVS switch daemon monitors this database for changes, a manager can therefore change the setup, i.e. add or remove bridges, add or remove ports and so on – like a remote version of the ovs-vsctl utility. Therefore, managers are associated with the overall OVS instance.
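
If you wanted to wire up a controller or a manager manually, the commands would look roughly like the following sketch (the addresses and ports are just examples – we do not use a controller in this post).

sudo ovs-vsctl set-controller myBridge tcp:192.168.50.10:6653
sudo ovs-vsctl set-manager ptcp:6640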

Installation and first steps with OVS

Before we get into the actual labs in the next post, let us see how OVS can be installed, and let us use OVS to create a simple bridge in order to get used to the command line utilities.

On an Ubuntu distribution, OVS is available as a collection of APT packages. Usually, it should be sufficient to install openvswitch-switch, which will pull in a few additional dependencies. There are similar packages for other Linux distributions.
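
On Ubuntu, the installation therefore boils down to

sudo apt-get update
sudo apt-get install openvswitch-switch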

Once the installation is complete, you should see that two new server processes are running, called (as you might expect from the previous sections) ovsdb-server and ovs-vswitchd. To verify that everything worked, you can now run the ovs-vsctl utility to display the current configuration.

$ ovs-vsctl show
518a2ed6-56d0-433f-98f2-a575daf20f72
    ovs_version: "2.9.2"

The output is still very short, as we have not yet defined any objects. What it shows you is, in fact, an abbreviated version of the one and only entry in the Open_vSwitch table, which shows the unique row identifier (UUID) and the OVS version installed.

Now let us populate the database by creating a bridge, currently without any ports attached to it. Run

sudo ovs-vsctl add-br myBridge

When we now inspect the current state again using ovs-vsctl show, the output should look like this.

518a2ed6-56d0-433f-98f2-a575daf20f72
    Bridge myBridge
        Port myBridge
            Interface myBridge
                type: internal
    ovs_version: "2.9.2"

Note how the output reflects the hierarchical structure of the database. There is one bridge, and attached to this bridge one port (this is the default port which is added to every bridge, similarly to a Linux bridge where creating a bridge also creates a device that has the same name as the bridge). This port has only one interface of type “internal”. If you run ifconfig -a, you will see that OVS has in fact created a Linux networking device myBridge as well. If, however, you run ethtool -i myBridge, you will find that this is not an ordinary bridge, but simply a virtual device managed by the openvswitch driver.
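
To see this for yourself, run

ifconfig myBridge
ethtool -i myBridge

and check that the driver line in the ethtool output reads openvswitch.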

It is interesting to directly inspect the content of the OVSDB. You could either do this by browsing the file /etc/openvswitch/conf.db, or, a bit more conveniently, using the ovsdb-client tool.

sudo ovsdb-client dump Open_vSwitch

This will provide a nicely formatted dump of the current database content. You will see one entry in the Bridge table, representing the new bridge, and corresponding entries in the Port and Interface table.

This closes our post for today. In the next post, we will set up an example (again using Vagrant and Ansible to do all the heavy lifting) in which we connect containers on different virtual machines using OVS bridges and a VXLAN tunnel. In the meantime, you might want to take a look at the following references which I found helpful.

  • OpenVSwitch.pdf
  • https://github.com/openvswitch/ovs
  • https://tools.ietf.org/html/rfc7047

Virtual networking labs – overlay networks

In the last post, we have looked at virtual networking on the Ethernet level. In modern cloud environments, a second class of virtual networks has gained importance, which uses higher level protocols to tunnel Ethernet frames. These networks are called overlay networks, and we will start to look at them in this post.

VXLAN – the basics

The VLAN technology that we have looked at in the last post is useful, but has some limitations. First, there is the maximum number of possible VLANs (4096). In practice, certain VLAN ranges need to be reserved for internal purposes, further limiting the number of available VLANs. In cloud environments with a large number of tenants, this limit can easily be reached if we try to implement all virtual networks via VLAN. In addition, VLAN tags inserted by the tenants could conflict with the VLAN tags inserted by the host operating systems.

To solve these problems, a new standard called VXLAN was developed a couple of years back, which is described (though not defined, as this is an informational RFC) in RFC 7348. The basic idea of VXLAN is actually quite simple. On each host involved, we create a virtual network device. When an Ethernet frame needs to be transmitted via this device, the host creates a UDP packet, puts the Ethernet frame as payload into this packet and sends it to the target host. The target host receives the packet, strips off the headers, and re-injects the payload (i.e. the original Ethernet frame) into the networking stack of the target system. Thus Ethernet frames travel on top of UDP, and the virtual Ethernet network logically sits on top of the layer 3 IP network used to exchange the UDP packets, leading to the name overlay network.

To be able to isolate different VXLANs from each other, a 24 bit VXLAN network identifier (VNI) is used. The implementation needs to make sure that Ethernet frames are only delivered within the same VNI, thus isolating the different VXLAN networks from each other. A host that is able to provide VXLAN devices and to participate in the exchange of UDP packets is called a VXLAN tunnel endpoint (VTEP). Thus to send an Ethernet frame over VXLAN, a VTEP needs to

  • Add a VXLAN header that contains the VNI, so that the receiving VTEP can make sure that the frame is only delivered within the correct VNI
  • Pass the resulting data as payload to its own IP stack, which will add a UDP, IP and Ethernet header to be able to transmit the frame over an existing layer 2 network

VXLANFrame

To be able to locate the UDP target address to which we have to send an encapsulated Ethernet frame, each VTEP needs to maintain a table mapping MAC addresses to the IP addresses of the VTEPs behind which they are located. A VTEP typically learns the entries of this table from observed traffic and uses IP multicast to ask other VTEPs to resolve unknown MAC addresses, similar to the ARP protocol.

When VXLAN is used, there are a few points that should be kept in mind. First, we do of course add quite a bit of overhead. For every Ethernet frame that is being exchanged, we add a second Ethernet header, an IP header and a UDP header, plus the processing time it takes on the host to travel the networking stack up and down once more. In addition, there is a problem with the MTU (maximum transfer unit) configured for the VXLAN endpoints. As the Ethernet frames on the physical network are longer than the Ethernet frames on the overlay network (as we need the additional headers), we will have to increase the MTU on the physical network to account for this in order to avoid unnecessary fragmentation. Also, using VXLAN implies that your Ethernet frames flow in clear text over the IP connection, so if you want to use VXLAN across insecure network areas, then you should use some form of encryption like IPSec.
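
To make this a bit more concrete: for IPv4, the additional headers (outer Ethernet, IP, UDP and VXLAN header) add up to 50 bytes, so with a physical MTU of 1500 bytes you would either raise the MTU of the underlay network to at least 1550 or lower the MTU of the VXLAN device, for instance like this (not part of the labs).

sudo ip link set vxlan0 mtu 1450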

Lab 9: setting up a point-to-point VXLAN connection

To see this in action, let us first implement a very basic scenario. Assume that we have two hosts (virtual machines provided by VirtualBox in our case) that are part of the same layer 3 network. On each host, we ask the Linux kernel to create a virtual device of type VXLAN. To this virtual device, we can assign IP addresses as usual. Any Ethernet frames sent to the device will be encapsulated using the VXLAN protocol and will be sent to the peer, where the Linux kernel will strip off the outer header and re-inject the Ethernet frame. So the Linux kernel acts as a VTEP on both sides.

VXLANLab9

Again, I have automated the setup using Vagrant and Ansible. To run the example, simply enter the following commands

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab9
vagrant up

To inspect the setup, let us first SSH into boxA. If you run ifconfig -a, you will in fact see a new device called vxlan0. This device has been created and configured by our Ansible script using the following commands.

ip link add type vxlan id 100 remote 192.168.50.5 dstport 4789 dev enp0s8
ip addr add 192.168.60.4/24 dev vxlan0
ip link set vxlan0 up

The first command creates the device, specifying the VNI 100, the IP address of the peer, the port number to use for the UDP connection (we use the port number defined in RFC 7348) and the physical device to be used for the transmission. The second and third command then assign an IP address and bring the device up.

When you run netstat -a on boxA, you will also find that a UDP socket has been created on port 4789; this socket is ready to accept UDP packets from the peer carrying encapsulated Ethernet frames. The setup on boxB is similar, using of course a different IP address.
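
To see this socket, you can for instance run

sudo netstat -an | grep 4789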

Let us now try to exchange traffic and to display the packets that go forth and back. For that purpose, open an SSH session on boxB as well and start a tcpdump session listening on vxlan0.

sudo tcpdump -e -i vxlan0

Now ping boxB from boxA, using the IP address 192.168.60.5 assigned to the vxlan0 device on boxB. In the tcpdump session, you will see a sequence of ARP and IPv4 packets, with the source and target MAC addresses matching the MAC addresses of the vxlan0 devices on the respective hosts. Thus the device acts like an ordinary Ethernet device, as expected.

Now let us change the setup and start to dump traffic on the underlying physical interface.

sudo tcpdump -e -i enp0s8

When you now repeat the ping, you will see that the packets arriving at the physical interface are UDP packets. In fact, tcpdump properly recognizes these frames as VXLAN frames and also prints the inner headers. We see that the outer Ethernet headers contain the MAC addresses of the underlying network interfaces of boxA and boxB, whereas the inner headers contain the MAC addresses of the vxlan0 devices.

Lab 10: VXLAN and IP multicasting

So far, we have used a direct point-to-point connection between the two hosts involved in the VXLAN network. In reality, of course, things are more complicated. Suppose, for instance, that we have three hosts representing VTEP endpoints. If an Ethernet frame on one of the hosts reaches the VXLAN interface, the kernel needs to determine to which of the other hosts the resulting UDP packet should be sent.

Of course, we could simply send a copy of the packet to every host on the IP network, but this would be terribly inefficient. Instead, VXLAN uses IP multicast functionality. To this end, the administrator setting up VXLAN needs to associate an IP multicast address with each VNI. A VTEP will then join this group and will use the IP multicast address for all traffic directed to unknown, broadcast or multicast Ethernet destinations. In a local network, you would want to use one of the “private” IP multicast groups in the range 239.0.0.0 – 239.255.255.255 reserved by RFC 2365, for instance within the local scope 239.255.0.0/16.

To study this, I have created lab 10 which establishes a scenario in which three hosts serve as VTEP to span a VXLAN with VNI 100. As always, grab the code from GitHub, cd into the directory lab10 and run vagrant up to start the example.

VXLANMulticast

The setup is very similar to the setup for the point-to-point connection above, with the difference that when bringing up the VXLAN device, we have removed the remote parameter and replaced it by the group parameter to tie the VNI to the multicast group.

ip link \
   add type vxlan \
   id 100 \
   group 239.255.0.1 \
   ttl 5 \
   dstport 4789 \
   dev enp0s8

Note the parameter TTL which defines the initial TTL that will be set on the UDP packets sent out by the VTEP. When I tried this setup first, I did not set the TTL, resulting in the default of one. With this setup, however, ARP requests were not answered by the target host, and I had to increase the TTL by adding the additional parameter.

Let us test this setup. Open SSH connections to boxA and boxB, and an additional session on boxC. First, we can use ip maddr show enp0s8 to verify that on both boxA and boxB, the interface enp0s8 has joined the multicast group 239.255.0.1 that we specified when bringing up the VXLAN. Then, start a tcpdump session on enp0s8 on boxC and ping the VXLAN IP address 192.168.60.5 of boxB from boxA. As this is the first time we establish this connection, the Ethernet device should emit an ARP request. This ARP request is encapsulated and sent out as an IP multicast with IP target address 239.255.0.1. In tcpdump, the corresponding output (again displaying the outer and inner headers) looks as follows.

06:30:44.255166 08:00:27:fe:3b:d0 (oui Unknown) > 01:00:5e:7f:00:01 (oui Unknown), ethertype IPv4 (0x0800), length 92: 192.168.50.4.57732 > 239.255.0.1.4789: VXLAN, flags [I] (0x08), vni 100
b6:02:f0:c8:15:85 (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 42: Request who-has 192.168.60.5 tell 192.168.60.4, length 28

We can clearly see that the outer IP header has the multicast IP address as target address, and that the inner frame is an ARP request, looking to resolve the IP address of the VXLAN device on boxB.

The multicast mechanism is used to initially discover the mapping of IP addresses to Ethernet addresses. However, this is typically only required once, because the VTEP is able to learn this mapping by storing it in a forwarding database (FDB). To see this mapping, switch to boxA and run

bridge fdb show dev vxlan0

In the output, you should be able to locate the Ethernet address of the VXLAN device on boxC, being mapped to the IP address of boxC on the underlying network, i.e. 192.168.50.6.

Other overlay solutions

In this post, we have studied overlay networks based on VXLAN in some detail. However, VXLAN is not the only available overlay protocol. We just mention two alternative solutions without going into details.

First, there is GRE (Generic Routing Encapsulation), which is defined in RFC 2784. GRE is a generic protocol to encapsulate packets within other packets. It defines a GRE header, which is put between the headers of the outer protocol and the payload, similar to the VXLAN header. Unlike VXLAN, GRE allows different protocols both as payload protocols and as delivery (outer) protocols. Linux supports GRE tunneling using the device type gre for IP-over-IP tunnels and the device type gretap for Ethernet-over-IP tunneling.

Then, there is GENEVE, which is an attempt to standardize encapsulation protocols. It is very similar to VXLAN, tunneling Ethernet frames over UDP, but defines a header with optional fields to allow for future extensions.
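
Just to give you an idea how this looks in practice (these commands are not part of the labs, and the device names are made up), an Ethernet-over-GRE respectively a GENEVE tunnel between our two boxes could be created with commands like

sudo ip link add mygre type gretap local 192.168.50.4 remote 192.168.50.5
sudo ip link add mygnv type geneve id 100 remote 192.168.50.5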

And finally, Linux offers a few additional tunneling protocols like the IPIP module for tunneling of IP over IP traffic or SIT to tunnel IPv6 over IPv4 which have been present in the kernel for some time and predate some of the standards just discussed.

In this and the previous posts, we have mainly used Linux kernel technology to realize network virtualization. However, there are other options available. In the next post, I will start to explore Open vSwitch (OVS), which is an open-source software defined switching solution.

Virtual networking labs – virtual Ethernet networks with VLAN tags

In the previous posts, we have mainly been looking at virtual networking within one single physical host. This is nice, but to build cloud environments, we need to establish virtual networks across several physical hosts. In this post, we will start to look into technologies that make this possible and learn how VLAN tagging supports virtual Ethernet networks.

An introduction to virtual Ethernet networks

Today, essentially every Ethernet network you will come across is a switched network, where every server is more or less directly connected to a switch, and the switches are connected to each other to propagate traffic through your data center. A naive approach would be to use layer 2 switches to combine all Ethernet networks into one large broadcast domain, where every node is connected to every other node by a sequence of switches. This approach, however, creates a very large broadcast domain and is difficult to maintain as changes to the topology need to be done by a physical rearrangement. It might therefore be beneficial to have some way of dividing your physical Ethernet network into two or more logical (“virtual”) networks.

For servers that are connected to the same switch, this can be implemented by an approach known as port-based VLAN. To illustrate the idea, let us look at the following configuration, where four servers are connected to four different ports of one switch.

SwitchFlag

With this setup, a broadcast issued by one server will reach every other server, and all servers are part of one Ethernet network. To introduce virtualization, we could simply add some logic to the switch to divide the ports into two sets, where forwarding of Ethernet frames is only done within those two sets. If, for instance, we define one set to consist of the two ports connected to server 1 and server 2 (green), and the other consisting of the remaining two ports (red), and configure the switch such that it will only forward frames between ports with the same color, we will effectively have established two virtual networks.

SwitchVirtualNetworks

This is nice, as – if your switch supports it – no additional hardware is required and you can define and change the configuration entirely in software. But there is a problem. Typically, your data center will have more than one switch. How can you extend these virtual networks across multiple switches? Of course, you could add an additional connection for every virtual network between any two switches, but this will blow up your hardware requirements and again make changes in hardware necessary. To avoid this, a technology called VLAN trunking is needed.

With VLAN trunking, different virtual LANs (VLANs) can share the same physical connection. To enable this, Ethernet frames that travel on this shared part of your infrastructure are enhanced by adding a VLAN tag which contains a numerical ID identifying the VLAN to which they belong, as indicated in the following diagram.

VLANTrunking

Here, we have two switches, which both use port-based virtual networks as just discussed. The upper two ports of each switch belong to the green network which is assigned the ID 1 (VLAN ID or VID, note that in reality, this ID is often reserved) and the other set of ports is part of VLAN 2 (the red network). When a frame leaves, for instance, the server in the upper left corner and needs to be forwarded to the server in the upper right corner, the switch will add a VLAN tag to indicate that this frame is part of VLAN 1. Then the frame travels across the connection between the two switches. When the switch on the right hand side receives the frame, it strips off the VLAN tag again and, based on the tag, injects the frame back into its own VLAN 1, so that it can only reach the green ports on the right hand side.

Thus your network is divided into two parts. In the middle, on the connection between the two switches, frames carry the VLAN tag to flag them as being part of the red or green network. Thus the ports facing this part need to be aware of the VLAN tag – these ports are often called trunk ports. The parts of the network behind the switches, however, never see a VLAN tag, as it is added and removed by the switches when transmitting and receiving on trunk ports. These ports are called access ports. Thus the servers do not need to know to which VLAN they belong, and the configuration can be done entirely on the switches and in software.

The standard that describes all this and also defines how a VLAN tag is added to an Ethernet frame is called IEEE 802.1Q. This standard adds a 16-bit field called TCI – tag control information to the layout of an Ethernet frame. Four bits of this field are reserved for other purposes, so that 12 bits remain for the VLAN ID, allowing a maximum of 4096 different VLANs.

Lab 8: VLAN networking with Linux

Linux has the capability to create virtual Ethernet devices that are associated with a VLAN network. To see this in action, get lab 8 from my GitHub repository and run it.

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab8
vagrant up

The Vagrantfile and the three Ansible playbooks that are located in this directory will now execute and bring up three virtual machines. Here is a diagram summarizing the network configuration that the scripts create (we will see how this is done manually further below).

VLANLab8

We see that all three machines are connected to one virtual Ethernet cable (we use a VirtualBox internal network for that purpose). The three interfaces attached to this network are configured as part of the IP network 192.168.50.0/24.

However, in addition, we have set up two virtual networks – one network with VLAN ID 100 (green), and a second network with VLAN ID 200 (red). In each Linux machine, each virtual network to which the machine is attached is represented by a virtual device called a VLAN device.

Let us look at boxA to see how this works. On boxA, the Ansible playbook that got executed during the vagrant up ran the following command.

vconfig add enp0s8 100

This command is creating a new network interface enp0s8.100 sitting on top of enp0s8 but being associated with the VID 100. This device is an ordinary device from the point of view of the operating system, i.e. you can assign IP addresses, add routes and so forth.
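
As an aside, vconfig is considered legacy on current distributions, and the same device could also be created using the ip utility, for instance

sudo ip link add link enp0s8 name enp0s8.100 type vlan id 100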

Such a VLAN device operates as follows. When an Ethernet frame arrives on the underlying device, enp0s8 in our case, the kernel checks whether the frame contains a VLAN tag. If no, the processing is as usual. If yes, then the kernel next checks whether a VLAN device is associated with this VID. If there is one, it strips off the VLAN tag, changes the frame so that it appears to be coming from the virtual VLAN device and re-injects the frame into the networking stack. The frame then travels up the stack and can be processed by the higher layers, e.g. the IP layer. Conversely, if a frame needs to be transmitted on enp0s8.100, the kernel adds a VLAN tag with the VID 100 to the frame and redirects it to the physical device enp0s8.

Let us see this in action. Open two SSH connections, one to boxA, and one to boxB – if you use the Gnome terminal, simply run

for i in "A" "B" ; do gnome-terminal -e "vagrant ssh box$i"; done

In boxA, start a tcpdump session on the VLAN device.

sudo tcpdump -e -i enp0s8.100

On boxB, ping boxA, using the IP address 192.168.60.4 (the IP address of the VLAN device). You will see an ordinary frame coming in, with ethertype IPv4. There is no VLAN tag within this frame, and the VLAN device operates like a physical device with no VLAN tagging.

Now, stop the tcpdump session and start it again, but this time, use enp0s8 instead of enp0s8.100, i.e. the underlying physical device. If you now run a ping again, you will see that the ethertype of the incoming packets has changed and is now 802.1Q, indicating that the frame is tagged (tcpdump will also show you the VLAN ID 100).

When you ping boxA from boxB using the IP address 192.168.50.4, the traffic will be as expected, coming in on enp0s8 without any VLAN tag, and will not reach enp0s8.100. Thus even though you have put a VLAN device on top of the physical interface, you can still use the physical interface as usual.

It is instructive to check the ARP cache on boxB using arp -n after the pings have been exchanged. You will see that the MAC address of the enp0s8 device on boxA now appears twice, once with the IP address 192.168.50.4 and once with 192.168.60.4. So the MAC address is shared between the virtual VLAN device and the physical device.

Still, the traffic is separated by the Linux kernel. If, for instance, you try to ping 192.168.70.6 (one of the IP addresses of boxC) from boxA, you will not be successful, because this IP address is on the red network and not reachable from the green network. If you run the ping on boxB, however, it will work, because boxB participates in both virtual networks.

This closes today's lab. In the next lab, we will start to look at a completely different approach to building virtual networks – overlay networks.

Virtual networking labs – more on bridges

In the previous post, we have seen how a software-defined Linux bridge can be established and how it transparently connects two Ethernet devices. In this post, we will take a closer look at how to set up and monitor bridges and learn how VirtualBox uses bridges for virtual networking.

Lab 6: setting up and monitoring bridges

For this lab, we will start with the setup of lab 5 that we have gone through in the previous post. If you have destroyed your environments again, the easiest way to get back to the point where we left off is to let Vagrant and Ansible do the work. I have created a Vagrantfile and a set of playbooks to take care of this. So simply do

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab6
vagrant up

to bring up all machines and configure the network interfaces as in my last post. You can then use vagrant ssh to SSH into one of the three virtual machines.

First, let us go through the steps that we have used to set up boxB, the machine on which the bridge is running. Recall that, after installing the bridge-utils package, we used the following sequence of commands.

sudo brctl addbr myBridge
sudo ifconfig enp0s8 promisc 0.0.0.0
sudo ifconfig enp0s9 promisc 0.0.0.0
sudo brctl addif myBridge enp0s8
sudo brctl addif myBridge enp0s9
sudo ifconfig myBridge up

The first command is easy to understand. It uses the brctl command line utility to actually set up a bridge called myBridge.

Next, we re-configure the two devices that we will turn into bridge ports. As explained in chapter 10 of “Understanding Linux network internals”, if an Ethernet frame is received on an interface which has been added to a bridge, the usual processing of the frame (i.e. passing the frame to all registered layer 3 protocol handlers) is skipped, and the frame is handed over to the bridging code. Therefore, it does not make sense to have an IP address associated with our bridge ports enp0s8 and enp0s9 any more. In addition, we need to set the devices into promiscuous mode, i.e. we need to enable them to receive packets which are not directed towards their own Ethernet address. This becomes clear if you look at our network diagram once more.

Bridge

If an Ethernet frame is sent out by boxC, directed towards the interface of boxA, it will have the MAC address of this interface as target address in its Ethernet header. Still, it needs to be picked up by the enp0s9 device on boxB so that it can be handed over to the bridge. If we did not put the device into promiscuous mode, it would drop the frame, as its target MAC address does not match its own MAC address (strictly speaking, setting the device into promiscuous mode manually is not really needed, as the Linux kernel will do this automatically when we add the port to the bridge, but we do this here explicitly to highlight this point).

Once we have re-configured our two network devices, we add them to the bridge using brctl addif. We finally bring up the bridge using ifconfig.

Let us now look a bit into the details of our bridge. First, recall that a bridge usually operates by learning MAC addresses. For a Linux bridge, this holds as well, and in fact, a Linux bridge maintains a table of known MAC addresses and the ports behind which they are located. To display this table, open an SSH connection to boxB and run

sudo brctl showmacs myBridge

brctl_showmacs

If you look at the output, you will see that the bridge differentiates between local and non-local addresses. A local address is the MAC address of an interface which is attached to the bridge. In our case, these are the two interfaces enp0s9 and enp0s8 that are part of your bridge on boxB. A non-local address is the address of an Ethernet device on the local network which is not directly attached to the bridge. In our example, these are the Ethernet devices enp0s8 on boxA and boxC.

You also see that these entries are ageing, i.e. if no frames related to an interface that the bridge knows are seen for some time, the entry is dropped and recreated if the interface appears again. The reason for this behaviour is to avoid problems if you reconfigure your physical network so that maybe an Ethernet device that has been part of the network behind port 1 moves into a part of the network which is behind port 2.

You can also monitor the traffic that flows through the bridge. If, for instance, you run a sniffer like tcpdump on boxB using

sudo tcpdump -e -i myBridge

and then create some traffic using for instance ping, you will see that the packets cross the Ethernet bridge.

It is also instructive to run a traceroute on boxA targeted towards boxC. If you do this, you will find that there is no hop between the two devices, again confirming that our bridge operates on layer 2 and behaves like a direct connection between boxA and boxC.

Finally, let us quickly discuss the configuration of the bridge itself. If you look at the configuration using ifconfig myBridge, you will see that the bridge has a MAC address itself, which is the lowest MAC address of all devices added to the bridge (but can also be set manually). In fact, we will see in a second that it is also possible to assign an IP address to a bridge!

This is a bit confusing – after all, a bridge is logically simply a direct connection between the two ports, but nothing which can by itself emit and absorb Ethernet frames. However, on Linux, setting up a bridge also creates a “default port” on the bridge which is handled like any other network device. Technically speaking, the bridge driver is itself a network device driver, and you can ask it to transmit frames. I tend to think of the situation as in the following image.

BridgeDefaultPort

When the Linux kernel asks the bridge to transmit a frame, the bridge code will consult its table of known MAC addresses and send the frame to the correct port. Conversely, if a frame is received by any of the two ports enp0s8 or enp0s9 and forwarded to the bridge, the bridge does not only forward the frame to the correct port depending on the destination address, but also delivers the frame to the higher layers of the Linux networking stack if its Ethernet target address matches the MAC address of the bridge (or any of the local MAC address in the table of known MAC addresses).

Let us try this out. In our configuration so far, we have not been able to reach boxB via the bridged network, and, conversely, we could not reach boxA and boxC from boxB (try a ping to verify this). Let us now assign an IP address to the bridge device itself and add a route. On boxB, run

sudo ifconfig myBridge netmask 255.255.0.0 192.168.70.4

which will automatically add a route as well. Now, our network diagram has changed as follows (note the additional IP address on boxB).

BridgeIPAddress

You should now be able to ping boxB (192.168.70.4) from both boxA and boxC and vice versa. This capability allows one to use one Linux host as both an Ethernet bridge and a router at the same time.
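
Note that if you actually wanted boxB to route traffic between the bridged segment and other networks, you would in addition have to enable IP forwarding on boxB – this is not part of the lab, but would be done with

sudo sysctl -w net.ipv4.ip_forward=1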

Lab 7: bridged networking with VirtualBox

So far, we have used VirtualBox to create virtual machines, and have played with bridges inside these machines. Now we will turn this around and see how conversely, VirtualBox can use bridges to realize virtual networks.

It is tempting to assume that what is called bridged networking in the VirtualBox documentation actually uses bridges. This, however, is no longer the case. Instead, when you define a bridged network with VirtualBox, the vboxnetflt netfilter driver that also featured in our last post will be used to attach a “virtual Ethernet cable” to an existing device, and the device will be set into promiscuous mode so that it can pick up Ethernet frames targeted towards the virtual Ethernet card of the VM and redirect them to the VirtualBox networking engine. Effectively, this exposes the virtual device of the VM to the local network. This is the reason that this mode of operation is called public networking in Vagrant.

BridgedVirtualBoxNetworking

Let us try this out. Again, you can start the test setup using Vagrant. This time, the Vagrantfile contains several machines which we bring up one by one.

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab7
vagrant up boxA

When you start this script, it will first scan your existing network interfaces on the host and ask you to which it should connect. Choose the device which connects your machine to the LAN, for me this is eno1 which has the IP address 192.168.178.25 assigned to it.

To run these tests, you need a second machine connected to the same LAN to which your host is connected via the device that we have just used (eno1). In my case, this second machine has the IP address 192.168.178.28. According to the diagram above, this machine should now be able to see our VM on the local network. In fact, all we have to do is to establish the required routes. First, on your second machine, run

sudo route add -net 192.168.0.0 netmask 255.255.0.0 eth0

where eth0 needs to be replaced by the device which this machine uses to connect to the LAN. Now SSH into the virtual machine boxA and set up the corresponding route there.

sudo route add -net 192.168.0.0 netmask 255.255.0.0 enp0s8

In boxA, you should now be able to ping 192.168.178.28, and conversely, in your second machine, you should be able to ping 192.168.50.4. The setup is logically equivalent to the following diagram.

VirtualBoxExposedInterface

Of course this setup is broken as we work with two different subnets / netmasks on the same Ethernet network, but hopefully serves well to illustrate bridged networking with VirtualBox.

Now we stop this machine again, create a bridge on the host and bring up the second and third machine that are used in this lab.

vagrant destroy boxA --force
sudo brctl addbr myBridge
vagrant up boxB
vagrant up boxC

Here, both machines have a network device using the bridged networking mode. The difference to the previous setup, however, is now that the virtual machines are not attached to an existing physical device, but to a bridge, and both are attached to the same bridge.

VirtualBoxBridgedNetworking

This configuration is very flexible and leaves many options. We could, for instance, use an existing bridge created by some other virtualization engine or even Docker to interact with other virtual networks. We could also, as in the previous post, set up forwarding and NAT rules and assign an IP address to the bridge device to use the bridge as a gateway into the LAN. And we can attach additional interfaces like veth and tun/tap devices to the bridge. I invite you to play with this to try out some of these options.
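
As a sketch (the device name tap0 is just an example and not part of the lab scripts), attaching a tap device to the bridge on the host would look like this.

sudo ip tuntap add tap0 mode tap
sudo ip link set tap0 up
sudo brctl addif myBridge tap0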

We have now seen some of the typical networking technologies in virtual networks in action. However, there are additional approaches that we have not touched upon yet – network separation using VLAN tags and overlay networks. In the next post, we will take a look at VLANs in order to establish virtual networks on layer 2.

Virtual networking labs – VirtualBox internal networks and bridges

So far, we have been playing with virtual networking for one virtual machine, connected to the host. Now let us see how we can establish virtual networks connecting more than one machine.

Lab 3: VirtualBox host-only networking with more than one machine

In this lab, we will connect two virtual machines that both use host-only networking. To run the example, you can again clone my repository and use the prepared Vagrantfile.

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab3
vagrant up

This will bring up two virtual machines, boxA and boxB. When both of them are running, use vagrant ssh boxA and vagrant ssh boxB to connect to them.

When we inspect the network on the host, we see nothing which is really unexpected. Again, there is the virtual device vboxnet0 which has an IP address assigned to it, and there is a new entry in the routing table which sends all traffic for the network 192.168.50.0 to this device.

In each virtual machine, the situation is as in the last post. There is a virtual network interface enp0s3 which is connected to the NAT device, and there is a virtual interface enp0s8 which is connected to vboxnet0 via the mechanisms discussed in the previous post. However, the trick is that both machines are actually connected to the same virtual device, as in the following diagram.

HostOnlyNetworkingTwoNodes

So we should expect that the machines can talk to each other via this device, and in fact they can. You should be able to ping boxB as 192.168.50.5 from boxA and similarly boxA as 192.168.50.4 from boxB.

When you run ifconfig -a to get the MAC addresses of the enp0s8 interfaces on both machines and also run arp -n to display the ARP cache, you will see that the MAC address of boxA is known on boxB and vice versa. This demonstrates that the machines can see each other on the Ethernet level, i.e. on layer 2, not only layer 3, as if they were connected to the same Ethernet segment.

ARPResolution

Again, the virtual device has a MAC and an IP address and can be reached from the host. Via the route for the network 192.168.50.0 pointing to it, we can also reach both virtual machines from the host as in the case of an individual machine as before. So we could summarize the host-only network as a virtual network to which the machines are attached and which is also connected to the host networking stack.

Lab 4: VirtualBox internal networking

This is very useful for many purposes, but sometimes, you want a virtual network that is completely separated from the host network. This is what the VirtualBox internal networking mode provides.

This networking option does not require the virtual device vboxnet0, and to verify this, let us first remove it. To do this, open the VirtualBox GUI by running virtualbox, navigate to “Global Tools -> Host Network Manager”, locate vboxnet0 in the list and remove it.

Now let us bring up the virtual machines using Vagrant. If you have not yet done so, run vagrant destroy to tear down the machines from lab3. Then switch to lab4, start Vagrant there and open two additional terminals with SSH sessions on the machines.

cd ../lab4
vagrant up
gnome-terminal -e 'vagrant ssh boxA' ;   gnome-terminal -e 'vagrant ssh boxB'

When you inspect the virtual machines, the situation is very similar to what we have seen in lab3, when we connected two machines with a host-only network.

  • Each machine has two interfaces, enp0s3 (the NAT interface) and enp0s8 (the internal networking interface)
  • Each machine has a route for the network 192.168.50.0 pointing to enp0s8
  • The machines can see each other as 192.168.50.4 and 192.168.50.5
  • If you ping the machines and then inspect the ARP cache, you will again find that the MAC address of the respective other machine is stored in the cache, indicating that the machines appear to be on the same Ethernet network

There is, however, a difference on the host. There is no additional virtual networking device being created, and there is no additional routing table entry on the host (nor any local routing table entry). Thus, the new network to which the machines are attached is actually completely isolated from the host network.

VirtualBoxInternalNetworking

We have now considered host-only networking, NAT networking and internal networking in some detail. However, VirtualBox offers a couple of additional networking models. A model which is used similarly by other hypervisors like KVM is bridged networking. To get a feeling for this, we will first study Linux bridging in some detail before starting to see how VirtualBox applies this.

Lab 5: Linux bridging basics

In this lab, we will use a Linux bridge to connect two Ethernet networks and gain a basic understanding of bridges.

A Linux bridge is essentially the virtual equivalent of a classical, physical Ethernet bridge. Recall that a bridge connects Ethernet networks at the link layer. A bridge has several ports and is able to direct Ethernet frames arriving on one port to the correct outgoing port, so that the frame is forwarded into the part of the network where the target address is located. Most bridges are able to learn which MAC addresses are behind which port in order to operate efficiently.

Linux bridges are similar. They are virtual network devices to which you can attach other devices. They will then pick up traffic flowing into the bridge from one of these devices, evaluate the Ethernet address of the target and forward the packet to the respective target device (assuming that this is attached as well).

Let us see this in action. For this lab, I have created a configuration which has three virtual machines. Two of them are connected to a private network myNetworkA, two of them are connected to a private network myNetworkB, and they all have a NAT device for SSH access.

Lab5Setup

Now, in this configuration, there is no way for boxC to reach boxA, because the networks myNetworkA and myNetworkB are completely isolated. Let us now set up a bridge to change this. Before we do this, however, we need to change a setting within VirtualBox. VirtualBox allows us to specify, per network interface, whether switching this device into promiscuous mode should be allowed. For a bridge, we need this, because the Ethernet devices attached to the bridge should receive packets which are directed towards any other port on the bridge. If the VirtualBox setting is not changed, putting the devices into promiscuous mode on the OS level will silently fail, and the bridge will not work (I had a bit of a hard time figuring this out, until I found this post in the VirtualBox forum). To change this setting, run the following commands on the host machine.

vm=$(vboxmanage list vms | grep "boxB" | awk '{print $1}' | sed s/\"//g)
vboxmanage controlvm $vm nicpromisc2 allow-all
vboxmanage controlvm $vm nicpromisc3 allow-all

Now we set up the actual bridge on box B. Switch into boxB and enter the following commands

sudo apt-get update
sudo apt-get install bridge-utils
sudo brctl addbr myBridge
sudo ifconfig enp0s8 promisc 0.0.0.0
sudo ifconfig enp0s9 promisc 0.0.0.0
sudo brctl addif myBridge enp0s8
sudo brctl addif myBridge enp0s9
sudo ifconfig myBridge up
# check that interfaces are in promiscuous mode
ifconfig -a

On boxA, run

sudo ifconfig enp0s8 netmask 255.255.0.0 192.168.50.4

And finally, enter the following commands on boxC:

sudo ifconfig enp0s8 netmask 255.255.0.0 192.168.60.4
ping 192.168.50.4

Let us see the bridge in action by dumping the traffic on the bridge device on boxB. To do this, switch to boxB and enter

sudo tcpdump -e -vvv -i myBridge

Then, in either boxA or boxC, try to ping the other machine. You should see the ICMP packets moving back and forth across the bridge. When you run arp -n on boxA and boxC, you will also see that each host knows the other host on the Ethernet level, i.e. the bridge actually implements a connection on layer 2 (as opposed to an IP-based router which operates on layer 3). Thus, with the bridge in place, the network now looks as follows.
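
To see the learning behaviour of the bridge directly, you can also dump its forwarding database on boxB while the ping is running, using the bridge-utils package installed earlier (the MAC addresses will of course differ on your machines).

# on boxB: ports attached to the bridge and learned MAC addresses per port
brctl show myBridge
brctl showmacs myBridge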

Bridge

To summarize, a virtual Linux bridge does exactly what a traditional switch in hardware does – it connects two Ethernet networks transparently on the Ethernet layer. But there is more to it, and in the next post, we will dig a bit deeper into how this works and how it can be applied in the context of virtualization.

Virtual networking labs – NAT and host-only networking with VirtualBox

When you work with virtualized environments, you will sooner or later realize that a large part of the complexity of such environments originates in the networking part. Networking itself is a non-trivial endeavor, and in the context of cloud and virtualization technology, you often stack different virtualization layers on top of each other. To provide the basics to understand all this, this series aims at introducing some of the more commonly used techniques using hands-on exercises.

Setup

To follow this series, I highly recommend to run the examples yourself. For that purpose, you will need Vagrant and VirtualBox installed on your machine, which we use for most of the examples. We will also use Docker at times, so this should be installed as well.

The setup of most examples is automated, using tools like Vagrant or Ansible that you will know when you have followed some earlier posts on this blog. The labs are stored in a Github repository that you should clone by running

git clone https://github.com/christianb93/networking-samples

on your machine.

Lab 1: NAT networking with VirtualBox

When you take a look at networking options of Linux based virtual machines like KVM, Xen or VirtualBox, you will find that certain networking modes tend to be common to all these virtualization solutions. First, there is typically a networking mode based on network address translation (NAT) to allow access to the internet from within the virtual machine. Then, there are networking modes which allow you to connect one or more virtual machines using software-emulated Ethernet bridges. This can be combined with VLANs, the usage of routing tables or iptables firewall rules to realize advanced networking topologies. And finally, all these methods can be combined in a variety of different setups. Networking for VirtualBox is comparatively easy to understand, but still displays some of these ideas nicely. This is why I have chosen VirtualBox as an example hypervisor. The first networking mode that we will look at is called NAT networking and is actually the VirtualBox default.

To see this in action, switch to the lab1 directory and run Vagrant to bring up the example machine, then use Vagrant to SSH into the machine.

cd lab1
vagrant up
vagrant ssh

When you run this for the first time, Vagrant might have to download the used Ubuntu disk image, which might take a few minutes. Once you are logged into the machine, run ifconfig -a to get a list of all network devices.

You will find that there are two networking devices. First, there is of course the standard loopback device lo which is present on every Linux system. Then, there is an interface enp0s3 which looks like an ordinary Ethernet device (but is of course a virtual device). This device has a MAC address and an IP address assigned to it, usually 10.0.2.15.

When you run route -n to list the content of the kernel routing tables, you will find that this is the default interface for outgoing traffic, with gateway IP address being 10.0.2.2. We can try this out – run

ping leftasexercise.com

to verify that you can actually reach servers on the Internet via this device.

How does this work? When an application within the virtual machine sends a TCP/IP packet to the virtual device, VirtualBox picks up the packet and performs a network address translation on it. It then forwards the resulting packet to the network on the host system. When the answer comes back, the reverse process is applied and to the application, it looks like the reply came from a real network device. In this way, we can reach any host which is also reachable from the host – including the host itself and any other virtual networks reachable from the host.

Let us try this out. On the host, start an NGINX container and determine its IP address.

docker run -d --rm --name=nginx  nginx:latest
docker inspect nginx | jq -r ".[0].NetworkSettings.IPAddress"

Let us suppose that the result is 172.17.0.2. Now switch back into the virtual machine and run

curl 172.17.0.2

and you should see the NGINX welcome page. To see the NAT’ing in action run

sudo netstat -t  -c -p

on the host and then run

telnet 172.17.0.2 80

inside the virtual machine to establish a long running connection to the NGINX server. When you stop the output of netstat and browse through it, you should find a connection established by the VBoxHeadless process that connects to port 80 on 172.17.0.2. What happens is that when we run the telnet command inside the virtual machine, VirtualBox will open a socket on the host machine and use that to connect to the target, similar to a NAT’ing device which proxies outgoing connections. So if you wanted to represent the setup diagrammatically, the result would be something like this.

NATNetworking

By the way, if you are asking yourself how the configuration of the network within the virtual machine has worked, take a look at the file /etc/netplan/50-cloud-init.yaml inside the virtual machine – here we see that the configuration is done by cloud-init and that the IP address is obtained using a DHCP server, which again is emulated by VirtualBox.

But wait, there is still a problem. If we are conceptually behind a gateway, this implies that the virtual machine cannot be reached from the host network. But how can we then SSH into it? The answer is that VirtualBox (respectively Vagrant) has created a port mapping for us, similar to how you would configure an incoming forwarding rule in a classical gateway. Let us try to print out this rule using the VirtualBox machine manager. First, we retrieve the name of the machine that Vagrant has created for us, place it in an environment variable and then invoke the VMM again to list some details which we search for forwarding rules.

vm=$(vboxmanage list vms | grep "boxA" | awk '{print $1}' | sed s/\"//g)
vboxmanage showvminfo --machinereadable $vm  | grep "Forwarding"

In fact, we see that there is a forwarding rule that directs incoming traffic on port 2222 from the host to port 22 (SSH) in the virtual machine where the SSH daemon is listening. This makes it possible to reach the machine via SSH.
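
Incidentally, you can add such a forwarding rule yourself while the machine is running. As a purely hypothetical example, the following command would map port 8080 on the host to port 80 inside the guest on the NAT adapter (the rule name http is arbitrary).

vboxmanage controlvm $vm natpf1 "http,tcp,127.0.0.1,8080,,80"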

Lab 2: host-only networking

Next, we try a slightly different combination. We will bring up a virtual machine with two network devices, one using NAT as before, and one using host-only networking, or, in Vagrant terminology, private networking.

To run this example, first shut down your existing lab, then switch over to lab2 and restart Vagrant from there.

vagrant destroy
cd ../lab2/
vagrant up

The first thing that you will realize by running ifconfig -a on the host is that VirtualBox has actually created a new networking device vboxnet0 with IP address 192.168.50.1 on the host. When you run ethtool -i on this device, you will see that this device is managed by a custom driver which comes with VirtualBox (see source code here). On the host, VirtualBox has also added a new route, sending all traffic for the network destination 192.168.50.0 to this device.

When you log into the machine and run ifconfig -a, you will see that inside the machine, a new interface enp0s8 with IP address 192.168.50.4 is visible. This is the newly created host-only virtual networking device. Internally, VirtualBox captures traffic sent to this device and re-routes it to the vboxnet0 device and vice versa. Graphically, this looks as follows.

HostOnlyNetworking

Let us briefly discuss how packets travel across this interface. First, inside the virtual machine, a new route has been added, sending traffic for the network 192.168.50.0 to this device. To test this route, let us first get rid of the NAT interface to have a clearer picture. To do this, we again use the VirtualBox machine manager.

vm=$(vboxmanage list vms | grep "boxA" | awk '{print $1}' | sed s/\"//g)
vboxmanage controlvm $vm setlinkstate1 off

If you have used vagrant ssh to SSH into the machine, this will of course kill your connection, as that connection uses the port forwarding rule associated with the NAT device. But we can easily get it back, and in doing so also verify our first route, by using the IP address 192.168.50.4 to SSH into the machine from our host. This should work, as on the host, we have a route to this destination via vboxnet0. We first need the location of the private SSH key file that Vagrant has created as part of the provisioning process; vagrant ssh-config will show you that it is stored at .vagrant/machines/boxA/virtualbox/private_key. So we can run

ssh -o StrictHostKeyChecking=no -i .vagrant/machines/boxA/virtualbox/private_key vagrant@192.168.50.4

and should be back in our machine. Thus we can actually reach the machine from the host using vboxnet0. To verify that the reverse process also works, let us again bring up our Docker container for NGINX, but this time, we use port forwarding to bind it to a port on the host.

docker run -d --rm --name=nginx  -p 80:80 nginx:latest

This will of course only work if you do not already have a webserver listening on port 80 on the host; but even if there is one, the test that comes next should still work, as something will be answering on that port anyway. If you now switch back to the virtual machine and run

curl 192.168.50.1

you will again see the NGINX default page.

It is also instructive to look at the ARP caches on both machines. First, on the host, when running arp -n, we see an entry for the MAC address of the enp0s8 interface registered with the outgoing interface vboxnet0. So on layer 2, the traffic seems to flow transparently between enp0s8 on the virtual machine and the vboxnet0 device on the host. When you run arp on the virtual machine, the picture is reversed, and we see an entry showing us that the MAC address of vboxnet0 is reachable via enp0s8.

How does all this work? First, let us see what happens when we try to reach 192.168.50.4 from the host, and let us start our investigation by looking at the source code of the VirtualBox network driver.

Like every network driver, the VirtualBox network driver has a function hard_start_xmit which is responsible for the actual transmission of a frame. When you look at the source code of this driver, you will see that this function does nothing except updating the statistics. Logically, this means that the device points “into nowhere”. But how can the packet then reach the virtual machine?

This is where for me, things start to become a bit blurry, but I believe that the answer is hidden in the concept of a local route (ip_fib_local_table in the kernel). The local routing table is maintained by the Linux kernel, and when a network device comes up, an entry is added to it automatically. To inspect the table in our case, enter

ip route show table local

on the host. This should yield, among others, an entry for the destination 192.168.50.1 of type local. The presence of this entry means that when delivering IP packets to this destination, the hard_start_xmit function of the device is never actually invoked; instead, the packet is injected back into the kernel’s IP stack (see for instance chapter 35 of “Understanding Linux network internals” by C. Benvenuti), as if it came in via vboxnet0. Thus, effectively, the device acts as a loopback device.

When the packet is picked up again on the IP layer, one of the first things that happens is that the netfilter mechanism is invoked. VirtualBox comes with an additional kernel module VBoxNetFlt that attaches itself to the virtual device vboxnet0 (look at the output of dmesg) and seems to divert traffic to and from the virtual network device so that they are processed by VirtualBox. Understanding the details of this mechanism is beyond my own expertise, but conceptually, this seems to be what is happening.
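
If you want to at least see the module in action, you can check on the host that it is loaded and look for its messages (the exact output will vary, and older messages might already have been rotated out of the buffer).

# on the host
lsmod | grep -i vbox
sudo dmesg | grep -i vboxnetflt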

Combining host-only networking with LAN access

Before we close this post, let us try one more thing. We have seen that the virtual device vboxnet0 allows us to connect to the host network. As a Linux host can serve as a router, it should therefore also be possible to connect to the outside world. So let us pick some server on your LAN, for instance the router that you use to connect to the Internet. In my home network, the router is at 192.168.178.1, reachable from the host via the device eno1. The first thing that we have to do is to add a new default route inside the VM, as we have disconnected the NAT device to which the old default route was pointing. So in the VM, enter

sudo route add default gw 192.168.50.1 enp0s8

Next, we have to prepare the host to enable forwarding. First, we enable forwarding globally in the kernel. Then, we set up a set of forwarding rules. As my router is connected to the device eno1, we first allow all new connections from the virtual device to this device, using the conntrack matching extension.

sudo sh -c "echo 1 > /proc/sys/net/ipv4/ip_forward"
sudo iptables -A FORWARD -o eno1 -i vboxnet0 -s 192.168.50.0/24 -m conntrack --ctstate NEW -j ACCEPT

Next, we need to make sure that the reply is allowed back into the system, so we set up a rule that will enable forwarding for all established connections.

sudo iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

We also need to enable IP masquerading so that the reply is directed towards the host. The following two commands will first flush the POSTROUTING chain (this might not be needed and might interfere with existing rules, so you may want to try without this command first) and then add a rule that will enable masquerading (i.e. replacing the IP source address by the address of the outgoing interface) for all traffic going out via eno1.

sudo iptables -t nat -F POSTROUTING
sudo iptables -t nat -A POSTROUTING -o eno1 -j MASQUERADE
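
With these rules in place, you can already verify connectivity by IP from inside the VM, using my router address as an example (substitute an address from your own LAN).

# inside the VM
ping -c 3 192.168.178.1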

This is already sufficient to reach hosts in the LAN and globally using IP addresses. However, DNS lookups will be broken in the virtual machine. To fix this, edit the file /etc/systemd/resolved.conf in the virtual machine and change the line

#DNS=

into

DNS=192.168.178.1

or whatever your preferred IP address is. Then pick up the configuration by running

sudo systemctl restart systemd-resolved

and DNS resolution should work again.
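
As a quick check that everything is back to normal, you can resolve and ping a hostname from inside the VM (using a hostname from this post as an example).

# inside the VM
getent hosts leftasexercise.com
ping -c 3 leftasexercise.com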

In this post, we have covered the basics of host-only networking and played a bit with only one virtual machine involved. However, with host-only networking, we can do more – we can also connect more than one virtual machine to the same virtual network. We will look into this in detail in the next post.

Kubernetes on your PC: playing with minikube

In my previous posts on Kubernetes, I have used public cloud providers like AWS or DigitalOcean to spin up test clusters. This is nice and quite flexible – you can create clusters with an arbitrary number of nodes, can attach volumes, create load balancers and define networks. However, cloud providers will of course charge for that, and your freedom to adapt the configuration and play with the management nodes is limited. It would be nice to have a playground, maybe even on your own machine, which gives you a small environment to play with. This is exactly what the minikube project is about.

Basics and installation

Minikube is a set of tools that allows you to easily create a one-node Kubernetes cluster inside a virtual machine running on your PC. Thus there is only one node, which serves at the same time as a management node and a worker node. Minikube supports several virtualization toolsets, but the default (both on Linux and on Windows) is VirtualBox. So as a first step, let us install this.

$ sudo apt-get install virtualbox

Next, we can install minikube. We will use release 1.0, which was published at the end of March. Minikube is one single, statically linked binary. I keep third-party binaries in a directory ~/Local/bin, so I applied the following commands to download and install minikube.

$ curl -Lo minikube https://storage.googleapis.com/minikube/releases/v1.0.0/minikube-linux-amd64 
$ chmod 700 minikube
$ mv minikube ~/Local/bin

Running minikube

Running minikube is easy – just execute

$ minikube start

When you do this for the first time after installation, Minikube needs to download a couple of images. These images are cached in ~/.minikube/cache and require a bit more than 2 GB of disk space, so this will take some time.

Once the download is complete, minikube will bring up a virtual machine, install Kubernetes in it and adapt your kubectl configuration to point to this newly created cluster.

By default, minikube will create a virtual machine with two virtual CPUs (i.e. two hyperthreads) and 2 GB of RAM. This is the minimum for a reasonable setup. If you have a machine with sufficient memory, you can allocate more. To create a machine with 4 GB RAM and four CPUs, use

$ minikube start --memory 4096 --cpus 4

Let us see what this command does. If you print your kubectl config file using kubectl config view, you will see that minikube has added a new context to your configuration and set this context as the default context, while preserving any previous configuration that you had. Next, let us inspect our nodes.

$ kubectl get nodes
NAME       STATUS   ROLES    AGE     VERSION
minikube   Ready    master   3m24s   v1.14.0

We see that there is one node, as expected. This node is a virtual machine – if you run virtualbox, you will be able to see that machine and its configuration.

screenshot-from-2019-04-08-14-06-02.png

When you run minikube stop, the virtual machine will be shut down, but will survive. When you restart minikube, this machine will again be used.

There are several ways to actually log into this machine. First, minikube has a command that will do that – minikube ssh. This will log you in as user docker, and you can do a sudo -s to become root.

Alternatively, you can stop minikube, then start the machine manually from the virtualbox management console, log into it (user “docker”, password “tcuser” – it took me some time to figure this out; if you want to verify this, look at this file, read the minikube Makefile to confirm that the build uses buildroot, and take a look at the description in this file) and then start minikube. In this case, minikube will detect that the machine is already running.

Networking in Minikube

Let us now inspect the networking configuration of the virtualbox instance that minikube has started for us. When minikube comes up, it will print a message like the following

“minikube” IP address is 192.168.99.100

In case you missed this message, you can run minikube ip to obtain this IP address. How is that IP address reachable from the host?

If you run ifconfig and ip route on the host system, you will find that virtualbox has created an additional virtual network device vboxnet0 (use ls -l /sys/class/net to verify that this is a virtual device) and has added a route sending all the traffic to the CIDR range 192.168.99.0/24 to this device, using the source IP address 192.168.99.1 (the src field in the output of ip route). So this gives you yet another way to SSH into the virtual machine

ssh docker@$(minikube ip)

showing also that the connection works.

Inside the VM, however, the picture is a bit more complicated. As a starting point, let us print some details on the virtual machine that minikube has created.

$ vboxmanage showvminfo  minikube --details | grep "NIC" | grep -v "disabled"
NIC 1:           MAC: 080027AE1062, Attachment: NAT, Cable connected: on, Trace: off (file: none), Type: virtio, Reported speed: 0 Mbps, Boot priority: 0, Promisc Policy: deny, Bandwidth group: none
NIC 1 Settings:  MTU: 0, Socket (send: 64, receive: 64), TCP Window (send:64, receive: 64)
NIC 1 Rule(0):   name = ssh, protocol = tcp, host ip = 127.0.0.1, host port = 44359, guest ip = , guest port = 22
NIC 2:           MAC: 080027BDDBEC, Attachment: Host-only Interface 'vboxnet0', Cable connected: on, Trace: off (file: none), Type: virtio, Reported speed: 0 Mbps, Boot priority: 0, Promisc Policy: deny, Bandwidth group: none

So we find that virtualbox has equipped our machine with two virtual network interfaces, called NIC 1 and NIC 2. If you ssh into the machine, run ifconfig and compare the MAC address values, you will find that these two devices appear as eth0 and eth1.

Let us first take a closer look at the first interface. This is a so-called NAT device. Basically, this device acts like a router – when a TCP/IP packet is sent to this device, the virtualbox engine extracts the data, opens a port on the host machine and sends the data to the target host. When the answer is received, another address translation is performed and the packet is fed again into the virtual device.

Much like an actual router, this mechanism makes it impossible to reach the virtual machine from the host – unless a port forwarding rule is set up. If you look at the output above, you will see that there is one port forwarding rule already in place, mapping the SSH port of the guest system to a port on the host, in our case 44359. When you run netstat on the host, you will find that minikube itself actually connects to this port to reach the SSH daemon inside the virtual machine – and, incidentally, this gives us yet another way to SSH into our machine.

ssh -p 44359 docker@127.0.0.1

Now let us turn to the second interface – eth1. This is an interface type which the VirtualBox documentation refers to as host-only networking. In this mode, an additional virtual network device is created on the host system – this is the vboxnet0 device which we have already spotted. Traffic sent to the virtual device eth1 in the machine is forwarded to this device and vice versa (this is in fact handled by a special driver vboxnet as you can tell from the output of ethtool -i vboxnet0). In addition, VirtualBox has added routes on the host and the guest system to connect this device to the network 192.168.99.0/24. Note that this network is completely separated from the host network. So our picture looks as follows.
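
You can verify both parts of this picture on the host with two commands (the network range 192.168.99.0/24 is the minikube default and might differ in your setup).

# on the host: driver behind vboxnet0 and route for the host-only network
ethtool -i vboxnet0
ip route | grep 192.168.99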

VirtualBoxNetworking

What does this mean for Kubernetes networking in Minikube? Well, the first obvious consequence is that we can use node ports to access services from our host system. Let us try this out, using the examples from a previous post.

$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/pods/deployment.yaml
deployment.apps/alpine created
$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/network/nodePortService.yaml
service/alpine-service created
$ kubectl get svc
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
alpine-service   NodePort    10.99.112.157   <none>        8080:32197/TCP   26s
kubernetes       ClusterIP   10.96.0.1       <none>        443/TCP          4d17h

So our service has been created and is listening on the node port 32197. Let us see whether we can reach our service from the host. On the host, open a terminal window and enter

$ nodeIP=$(minikube ip)
$ curl $nodeIP:32197
<h1>It works!</h1>

So node port services work as expected. What about load balancer services? In a typical cloud environment, Kubernetes will create load balancers whenever we set up a load balancer service that is reachable from outside the cluster. Let us see what the corresponding behavior in a minikube environment is.

$ kubectl delete svc alpine-service
service "alpine-service" deleted
$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/network/loadBalancerService.yaml
service/alpine-service created
$ kubectl get svc
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
alpine-service   LoadBalancer   10.106.216.127   <pending>     8080:31282/TCP   3s
kubernetes       ClusterIP      10.96.0.1        <none>        443/TCP          4d18h
$ curl $nodeIP:31282
<h1>It works!</h1>

You will find that even after a few minutes, the external IP remains pending. Of course, we can still reach our service via the node port, but this is not the idea of a load balancer service. This is not awfully surprising, as there is no load balancer infrastructure on your local machine.

However, minikube does offer a tool that allows you to emulate a load balancer – minikube tunnel. To see this in action, open a second terminal on your host and enter

minikube tunnel

After a few seconds, you will be asked for your root password, as minikube tunnel requires root privileges. After providing this, you should see some status message on the screen. In our first terminal, we can now inspect our service again.

$ kubectl get svc alpine-service
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)          AGE
alpine-service   LoadBalancer   10.106.216.127   10.106.216.127   8080:31282/TCP   17m
$ curl 10.106.216.127:8080
<h1>It works!</h1>

Suddenly, the field external IP is populated, and we can reach our service under this IP address and the port number that we have configured in our service description. What is going on here?

To find the answer, we can use ip route on the host. If you run this, you will find that minikube has added an additional route which looks as follows.

10.96.0.0/12 via 192.168.99.100 dev vboxnet0 

Let us compare this with the CIDR range that minikube uses for services.

$ kubectl cluster-info dump | grep -m 1 range
                            "--service-cluster-ip-range=10.96.0.0/12",

So minikube has added a route that will forward all traffic directed towards the IP range used for Kubernetes services to the IP address of the VM in which minikube is running, using the virtual ethernet device created for this VM. Effectively, this sets up the VM as a gateway which makes it possible to reach this CIDR range (see also the minikube documentation for details). In addition, minikube will set the external IP of the service to the cluster IP address, so that the service can now be reached from the host (you can also verify the setup using ip route get 10.106.216.127 to display the result of the route resolution process for this destination).

Note that if you stop the tunnel process again, the additional route disappears and the external IP address of the service switches back to “pending”.

Persistent storage in Minikube

We have seen in my previous posts on persistent storage that cloud platforms typically define a default storage class and offer a way to automatically create persistent volumes for a PVC. The same is true for minikube – there is a default storage class.

$ kubectl get storageclass
NAME                 PROVISIONER                AGE
standard (default)   k8s.io/minikube-hostpath   5d1h

In fact, minikube is by default starting a custom storage controller (as you can check by running kubectl get pods -n kube-system). To understand how this storage controller is operating, let us construct a PVC and analyse the resulting volume.

$ kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 512Mi
EOF

If you use kubectl get pv, you will see that the storage controller has created a new persistent volume. Let us attach this volume to a container to play with it.

$ kubectl apply -f - << EOF
apiVersion: v1
kind: Pod
metadata:
  name: pv-test
  namespace: default
spec:
  containers:
  - name: pv-test-ctr
    image: httpd:alpine
    volumeMounts:
      - mountPath: /test
        name: test-volume
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: my-pvc
EOF

If you then once more SSH into the VM, you should see our new container running. Using docker inspect, you will find that Docker has again created a bind mount, binding the mount point /test to a directory on the host named /tmp/hostpath-provisioner/pvc-*, where * indicates some randomly generated number. When you attach to the container and create a file /test/myfile, and then display the contents of this directory in the VM, you will in fact see that the file has been created.

So at the end of the day, a persistent volume in minikube is simply a host-path volume, pointing to a directory on the one and only node used by minikube. Also note that this storage is really persistent in the sense that it survives a restart of minikube.
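
If you want to trace this yourself, a sequence along the following lines should work inside the minikube VM; the container name filter and the directory are taken from the description above, and the exact volume directory name will differ.

# inside the minikube VM (minikube ssh)
ID=$(docker ps --filter name=pv-test-ctr --format "{{.ID}}")
docker inspect $ID | grep hostpath-provisioner
ls /tmp/hostpath-provisioner/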

Additional features

There are a few additional features of minikube that are worth mentioning. First, it is very easy to install an NGINX ingress controller – the command

minikube addons enable ingress

will do this for you. Second, minikube also allows you to install and enable the Kubernetes dashboard. In fact, running

minikube dashboard

will install the dashboard and open a browser pointing to it.

KubernetesDashboard

And there are many more addons – you can get a full list with minikube addons list or in the minikube documentation. I highly recommend browsing that list and playing with some of them.

Managing traffic with Kubernetes ingress controllers

In one of the previous posts, we have learned how to expose arbitrary ports to the outside world using services and load balancers. However, we also found that this is not very efficient – in the worst case, the number of load balancers we need equals the number of services.

Specifically for HTTP/HTTPS traffic, there is a different option available – an ingress rule.

Ingress rules and ingress controllers

An ingress rule is a Kubernetes resource that defines a kind of routing on the HTTP(S) path level. To illustrate this, assume that you have two services running, call them svcA and svcB. Assume further that both services work with HTTP as the underlying protocol and are listening on port 80. If you expose these services naively as in the previous post, you will need two load balancers, call them LB1 and LB2. Then, to reach svcA, you would use the URL

http://LB1/

and to access svcB, you would use

http://LB2/

The idea of an ingress is to have only one load balancer, with one DNS entry, and to use the path name to route traffic to our services. So with an ingress, we would be able to reach svcA under the URL

http://LB/svcA

and the second service under the URL

http://LB/svcB

With this approach, you have several advantages. First, you only need one load balancer that clients can use for all services. Second, the path name is related to the service that you invoke, which follows best practices and makes coding against your services much easier. Finally, you can easily add new services and remove old services without the need to change DNS names.

In Kubernetes, two components are required to make this work. First, there are ingress rules. These are Kubernetes resources that contain rules which specify how to route incoming requests based on their path (or even hostname). Second, there are ingress controllers. These are controllers which are not part of the Kubernetes core software, but need to be installed on top. These controllers will work with the ingress rules to manage the actual routing. So before trying this out, we need to install an ingress controller in our cluster.

Installing an ingress controller

On EKS, there are several options for an ingress controller. One possible choice is the nginx ingress controller. This is the controller that we will use for our examples. In addition, AWS has created their own controller called AWS ALB controller that you could use as an alternative – I have not yet looked at this in detail, though.

So let us see how we can install the nginx ingress controller in our cluster. Fortunately, there is a very clear installation instruction which tells us that to run the install, we simply have to execute a number of manifest files, as you would expect for a Kubernetes application. If you have cloned my repository and brought up your cluster using the up.sh script, you are done – this script will set up the controller automatically. If not, here are the commands to do this manually.

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/mandatory.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/provider/aws/service-l4.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/provider/aws/patch-configmap-l4.yaml

Let us go through this and see what is going on. The first file (mandatory.yaml) will set up several config maps, a service account and a new role and connect the service account with the newly created role. It then starts a deployment to bring up one instance of the nginx controller itself. You can see this pod running if you do a

kubectl get pods --namespace ingress-nginx

The first AWS specific manifest file service-l4.yaml will establish a Kubernetes service of type LoadBalancer. This will create an elastic load balancer in your AWS VPC. Traffic on the ports 80 and 443 is directed to the nginx controller.

kubectl get svc --namespace ingress-nginx

Finally, the second AWS specific file will update the config map that stores the nginx configuration and set the parameter use-proxy-protocol to True.

To verify that the installation has worked, you can use aws elb describe-load-balancers to verify that a new load balancer has been created and curl the DNS name provided by

kubectl get svc ingress-nginx --namespace ingress-nginx

from your local machine. This will still give you an error, as we have not yet defined any ingress rules, but it shows that the ingress controller is up and running.

Setting up and testing an ingress rule

Having our ingress controller in place, we can now set up ingress rules. To have a toy example at hand, let us first apply a manifest file that will

  • Add a deployment of two instances of the Apache httpd
  • Install a service httpd-service accepting traffic for these two pods on port 8080
  • Add a deployment of two instances of Tomcat
  • Create a service tomcat-service listening on port 8080 and directing traffic to these two pods

You can either download the file here or directly use the URL with kubectl

$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/network/ingress-prep.yaml
deployment.apps/httpd created
deployment.apps/tomcat created
service/httpd-service created
service/tomcat-service created

When all pods are up and running, you can again spawn a shell in one of the pods and use the DNS name entries created by Kubernetes to verify that the services are available from within the pods.

$ pod=$(kubectl get pods --output \
  jsonpath="{.items[0].metadata.name}")
$ kubectl exec -it $pod "/bin/bash"
bash-4.4# apk add curl
bash-4.4# curl tomcat-service:8080
bash-4.4# curl httpd-service:8080

Let us now define an ingress rule which directs requests to the path /httpd to our httpd service and correspondingly requests to /tomcat to the Tomcat service. Here is the manifest file for this rule.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target : "/"
spec:
  rules:
  - http:
      paths:
      - path: /tomcat
        backend:
          serviceName: tomcat-service
          servicePort: 8080
      - path: /httpd
        backend:
          serviceName: httpd-service
          servicePort: 8080

The first few lines are familiar by now, specifying the API version, the type of resource that we want to create and a name. In the metadata section, we also add an annotation. The nginx ingress controller can be configured using this and similar annotations (see here for a list), and this annotation is required to make the example work.

In the specification section, we now define a set of rules. Our rule is a http rule, which, at the time of writing, is the only supported protocol. This is followed by a list of paths. Each path consists of a selector (“/httpd” and “/tomcat” in our case), followed by the specification of a backend, i.e. a combination of service name and service port, serving requests matching this path.

Next, let us set up the ingress rule. Assuming that you have saved the manifest file above as ingress.yaml, simply run

$ kubectl apply -f ingress.yaml
ingress.extensions/test-ingress created

Now let us try this. We already know that the ingress controller has created a load balancer for us which serves all ingress rules. So let us get the name of this load balancer from the service specification and then use curl to try out both paths

$ host=$(kubectl get svc ingress-nginx -n ingress-nginx --output\
  jsonpath="{.status.loadBalancer.ingress[0].hostname}")
$ curl -k https://$host/httpd
$ curl -k https://$host/tomcat

The first curl should give you the standard output of the httpd service, the second one the standard Tomcat welcome page. So our ingress rule works.

Let us try to understand what is happening. The load balancer is listening on the HTTPS port 443 and picking up the traffic coming from there. This traffic is then redirected to a host port that you can read off from the output of aws elb describe-load-balancers; in my case this was 31135. This node port belongs to the service ingress-nginx that our ingress controller has set up. So the traffic is forwarded to the ingress controller. The ingress controller interprets the rules, determines the target service and forwards the traffic to the target service. Ignoring the fact that the traffic goes through the node port, this gives us the following simplified picture.
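
If you want to determine the node port without digging through the ELB description, you can also read it off directly from the service object (a quick check; the port numbers will of course differ in your cluster).

kubectl get svc ingress-nginx -n ingress-nginx \
    --output jsonpath="{.spec.ports[*].nodePort}"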

Ingress

In fact, this diagram is a bit simplified as (see here) the controller does not actually send the traffic to the service cluster IP, but directly to the endpoints, thus bypassing the kube-proxy mechanism, so that advanced features like session affinity can be applied.

Ingress rules have many additional options that we have not yet discussed. You can define virtual hosts, i.e. routing based on host names, define a default backend for requests that do not match any of the path selectors, use regular expressions in your paths and use TLS secrets to secure your HTTPS entry points. This is described in the Kubernetes networking documentation and the API reference.

Creating ingress rules in Python

To close this post, let us again study how to implement Ingress rules in Python. Again, it is easiest to build up the objects from the bottom to the top. So we start with our backends and the corresponding path objects.

tomcat_backend=client.V1beta1IngressBackend(
          service_name="tomcat-service", 
          service_port=8080)
httpd_backend=client.V1beta1IngressBackend(
          service_name="httpd-service", 
          service_port=8080)
tomcat_path=client.V1beta1HTTPIngressPath(
          path="/tomcat", backend=tomcat_backend)
httpd_path=client.V1beta1HTTPIngressPath(
          path="/httpd", backend=httpd_backend)

Having this, we can now define our rule and our specification section.

rule=client.V1beta1IngressRule(
     http=client.V1beta1HTTPIngressRuleValue(
            paths=[tomcat_path, httpd_path]))
spec=client.V1beta1IngressSpec()
spec.rules=[rule]

Finally, we assemble our ingress object. Again, this consists of the metadata (including the annotation) and the specification section.

ingress=client.V1beta1Ingress()
ingress.api_version="extensions/v1beta1"
ingress.kind="Ingress"
metadata=client.V1ObjectMeta()
metadata.name="test-ingress"
metadata.annotations={"nginx.ingress.kubernetes.io/rewrite-target" : "/"}
ingress.metadata=metadata
ingress.spec=spec

We are now ready for our final steps. We again read the configuration, create an API endpoint and submit our creation request. You can find the full script, including comments and all imports, here.

config.load_kube_config()
apiv1beta1=client.ExtensionsV1beta1Api()
apiv1beta1.create_namespaced_ingress(
              namespace="default",
              body=ingress)

Watching Kubernetes networking in action

In this post, we will look in some more detail into networking in a Kubernetes cluster. Even though the Kubernetes networking model is independent of the underlying cloud provider, the actual implementation does of course depend on the cloud provider which communicates with Kubernetes through a CNI plugin.

I will continue to use EKS, so some of this will be EKS specific. To work with me through the example, you will first have to bring up your cluster, start two nodes and deploy a pod running a httpd on one of the nodes. I have written a script up.sh and a manifest file that automates all this. To download and apply all this, enter

$ git clone https://github.com/christianb93/Kubernetes.git
$ cd Kubernetes/cluster
$ chmod 700 up.sh
$ ./up.sh
$ kubectl apply -f ../pods/alpine.yaml

Node-to-Pod networking

Now let us log into the node on which the container is running and collect some information on the local network interface attached to the VM.

$ ifconfig eth0
eth0: flags=4163  mtu 9001
        inet 192.168.118.64  netmask 255.255.192.0  broadcast 192.168.127.255
        inet6 fe80::2b:dcff:fee7:448c  prefixlen 64  scopeid 0x20
        ether 02:2b:dc:e7:44:8c  txqueuelen 1000  (Ethernet)
        RX packets 197837  bytes 274587781 (261.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 25656  bytes 2389608 (2.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

So the local IP address of the node is 192.168.118.64. If we do a kubectl get pods -o wide, we get a different IP address – 192.168.99.199 – for the pod. Let us curl this from the node.

$ curl 192.168.99.199
<h1>It works!</h1>

So apparently we have reached our httpd. To understand why this works, let us investigate the network configuration in more detail. First, on the node on which the container is running, let us take a look at the network configuration inside the container.

$ ID=$(docker ps --filter name=alpine-ctr --format "{{.ID}}")
$ docker exec -it $ID "/bin/bash"
bash-4.4# ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth0@if6:  mtu 9001 qdisc noqueue state UP 
    link/ether 4a:8b:c9:bb:8c:8e brd ff:ff:ff:ff:ff:ff
bash-4.4# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 4A:8B:C9:BB:8C:8E  
          inet addr:192.168.99.199  Bcast:192.168.99.199  Mask:255.255.255.255
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:12 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:936 (936.0 B)  TX bytes:0 (0.0 B)
bash-4.4# ip route
default via 169.254.1.1 dev eth0 
169.254.1.1 dev eth0 scope link

What do we learn from this? First, we see that inside the container namespace, there is a virtual ethernet device eth0, with IP address 192.168.99.199. If you run kubectl get pods -o wide on your local workstation, you will find that this is the IP address of the Pod. We also see that there is a route in the container namespace that directs all traffic to this interface. The output of the ip link command also shows that this device is a virtual ethernet device that has a paired device (with index if6) in a different namespace. So let us exit the container, go back to the node and try to figure out what the network configuration on the node is.

$ ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0:  mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:2b:dc:e7:44:8c brd ff:ff:ff:ff:ff:ff
3: eni3f5399ec799:  mtu 9001 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 66:37:e5:82:b1:f6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
4: enie68014839ee@if3:  mtu 9001 qdisc noqueue state UP mode DEFAULT group default 
    link/ether f6:4f:62:dc:38:18 brd ff:ff:ff:ff:ff:ff link-netnsid 1
5: eth1:  mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:e8:f0:26:a7:3e brd ff:ff:ff:ff:ff:ff
6: eni97d5e6c4397@if3:  mtu 9001 qdisc noqueue state UP mode DEFAULT group default 
    link/ether b2:c9:58:c0:20:25 brd ff:ff:ff:ff:ff:ff link-netnsid 2
$ ifconfig eth0
eth0: flags=4163  mtu 9001
        inet 192.168.118.64  netmask 255.255.192.0  broadcast 192.168.127.255
        inet6 fe80::2b:dcff:fee7:448c  prefixlen 64  scopeid 0x20
        ether 02:2b:dc:e7:44:8c  txqueuelen 1000  (Ethernet)
        RX packets 197837  bytes 274587781 (261.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 25656  bytes 2389608 (2.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
$ ip route
default via 192.168.64.1 dev eth0 
169.254.169.254 dev eth0 
192.168.64.0/18 dev eth0 proto kernel scope link src 192.168.118.64 
192.168.89.17 dev eni3f5399ec799 scope link 
192.168.99.199 dev eni97d5e6c4397 scope link 
192.168.109.216 dev enie68014839ee scope link

Here we see that – in addition to a few other interfaces – there is a device eth0 to which all traffic is sent by default. However, there is also a device eni97d5e6c4397 which is the other end of the interface visible in the container. And there is a route that sends all traffic that is directed to the IP address of the pod to this interface. Overall, this gives a picture which seems familiar from our earlier analysis of Docker networking.

KubernetesNetworkingOne

If we try to establish a connection to the httpd running in the pod, the routing table entry on the node will send the traffic to the interface eni97d5e6c4397. This is one end of a veth pair, the other end of which appears inside the container as eth0. So from the container's point of view, this is incoming traffic received via eth0, which is happily accepted and processed by the httpd. The reply goes the other way – it is directed to eth0 inside the container, travels via the veth pair and ends up inside the host namespace, coming from eni97d5e6c4397.
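
To double-check that this device really is one end of a veth pair, you can look at its details on the node; the interface name below is taken from this particular run and will differ on your node.

ip -d link show eni97d5e6c4397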

Pod-to-Pod networking across nodes

Now let us try something else. Log into the second node – on which the container is not running – and try the curl from there. Surprisingly, this works as well! What we have seen so far does not explain this, so there is probably a piece of magic that we are still missing. To find this, let us use the aws cli to print out the network interfaces attached to the node on which the container is running (the following snippet assumes that you have the extremely helpful tool jq installed on your PC).

$ nodeName=$(kubectl get pods --output json | jq -r ".items[0].spec.nodeName")
$ aws ec2 describe-instances --output json --filters Name=private-dns-name,Values=$nodeName --query "Reservations[0].Instances[0].NetworkInterfaces"
---- SNIP -----
{
        "MacAddress": "02:e8:f0:26:a7:3e",
        "SubnetId": "subnet-06088e09ce07546b9",
        "PrivateDnsName": "ip-192-168-84-108.eu-central-1.compute.internal",
        "VpcId": "vpc-060469b2a294de8bd",
        "Status": "in-use",
        "Ipv6Addresses": [],
        "PrivateIpAddress": "192.168.84.108",
        "Groups": [
            {
                "GroupName": "eks-auto-scaling-group-myCluster-NodeSecurityGroup-1JMH4SX5VRWYS",
                "GroupId": "sg-08163e3b40afba712"
            }
        ],
        "NetworkInterfaceId": "eni-0ed2f1cf4b09cb8be",
        "OwnerId": "979256113747",
        "PrivateIpAddresses": [
            {
                "Primary": true,
                "PrivateDnsName": "ip-192-168-84-108.eu-central-1.compute.internal",
                "PrivateIpAddress": "192.168.84.108"
            },
            {
                "Primary": false,
                "PrivateDnsName": "ip-192-168-72-200.eu-central-1.compute.internal",
                "PrivateIpAddress": "192.168.72.200"
            },
            {
                "Primary": false,
                "PrivateDnsName": "ip-192-168-96-163.eu-central-1.compute.internal",
                "PrivateIpAddress": "192.168.96.163"
            },
            {
                "Primary": false,
                "PrivateDnsName": "ip-192-168-99-199.eu-central-1.compute.internal",
                "PrivateIpAddress": "192.168.99.199"
            }
        ],
---- SNIP ----

I have removed some of the output to keep it readable. We see that AWS has attached several elastic network interfaces (ENI) to our node. An ENI is a virtual network interface that AWS creates and manages for you. Each node can have more than one ENI attached, and each ENI can have a primary and multiple secondary IP addresses.

If you look at the last line of the output, you see that there is a network interface eni-0ed2f1cf4b09cb8be that has, as one of the secondary IP addresses, the IP address 192.168.99.199. This is the IP address of our Pod! Let us now go back to the node and inspect its network configuration once more. You will not find a network interface with this exact name, but you will find a network interface on the node on which the pod is running which has the same MAC address, namely eth1.

$ ifconfig eth1
eth1: flags=4163  mtu 9001
        inet6 fe80::e8:f0ff:fe26:a73e  prefixlen 64  scopeid 0x20
        ether 02:e8:f0:26:a7:3e  txqueuelen 1000  (Ethernet)
        RX packets 224  bytes 6970 (6.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 20  bytes 1730 (1.6 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

This is an ordinary VPC interface and visible in the entire VPC under all of its IP addresses. So if we curl our httpd from the second node, the traffic will leave that node via the default interface, be picked up by the VPC, routed to the node on which the pod is running and enter via eth1. As IP forwarding is enabled on this node, the traffic will be routed to the Pod and arrive at the httpd.

This is the missing piece of magic we have been looking for. In fact, for every pod running on a node, EKS will add an additional secondary IP address to an ENI attached to the node (and attach additional ENIs if needed) which will make the Pod IP addresses visible in the entire VPC. This mechanism is nicely described in the documentation of the CNI plugin which EKS uses. So we now have the following picture.
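
A nice way to see this correspondence is to compare the pod IP addresses with the secondary IP addresses attached to the node's ENIs, reusing the nodeName variable that we have set above.

kubectl get pods -o wide
aws ec2 describe-instances --output json \
    --filters Name=private-dns-name,Values=$nodeName \
    --query "Reservations[0].Instances[0].NetworkInterfaces[].PrivateIpAddresses[].PrivateIpAddress"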

KubernetesNetworkingTwo

So this allows us to run our httpd in such a way that it can be reached from the entire Pod network (and the entire VPC). Note, however, that it can of course not be reached from the outside world. It is interesting to repeat this experiment with a slightly adapted YAML file that uses the containerPort field:

apiVersion: v1
kind: Pod
metadata:
  name: alpine
  namespace: default
spec:
  containers:
  - name: alpine-ctr
    image: httpd:alpine
    ports: 
      - containerPort: 80

If we remove the old Pod and use this YAML file to create a new pod, we will find that the configuration does not change at all. In particular, running docker ps on the node on which the Pod is scheduled will teach you that this port specification is not the equivalent of the port specification of the docker run port mapping feature – as the Kubernetes API specification states, this field is informational.

Implementation of services

Let us now see how this picture changes if we add a service. First, we will use a service of type ClusterIP, i.e. a service that will make our httpd reachable from within the entire cluster under a common IP address. For that purpose – after deleting our existing pods – let us create a deployment that brings up two instances of the httpd.

$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/pods/deployment.yaml

Once the pods are up, you can again use curl to verify that you can talk to every pod from every node and every pod. Now let us create a service.

$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/network/service.yaml

Once that has been done, enter kubectl get svc to get a list of all services. You should see a newly created service alpine-service. Note down its cluster IP address – in my case this was 10.100.11.202.

Now log into one of the nodes again, attach to the container, install curl there and try to connect to 10.100.11.202:8080

$ ID=$(docker ps --filter name=alpine-ctr --format "{{.ID}}")
$ docker exec -it $ID "/bin/bash"
bash-4.4# apk add curl
OK: 124 MiB in 67 packages
bash-4.4# curl 10.100.11.202:8080
<h1>It works!</h1>

So, as promised by the definition of a service, the httpd is visible within the cluster under the cluster IP address of the service. The same works if we are on a node and not attached to a container.

To see how this works, let us log out of the container again and search the NAT tables for the cluster IP address of the service.

$ sudo iptables -S -t nat | grep 10.100.11.202
-A KUBE-SERVICES -d 10.100.11.202/32 -p tcp -m comment --comment "default/alpine-service: cluster IP" -m tcp --dport 8080 -j KUBE-SVC-SXWLG3AINIW24QJC

So we see that Kubernetes (more precisely the kube-proxy running on each node) has added a NAT rule that captures traffic directed towards the service IP address to a special chain. Let us dump this chain.

$ sudo iptables -S -t nat | grep KUBE-SVC-SXWLG3AINIW24QJC
-N KUBE-SVC-SXWLG3AINIW24QJC
-A KUBE-SERVICES -d 10.100.11.202/32 -p tcp -m comment --comment "default/alpine-service: cluster IP" -m tcp --dport 8080 -j KUBE-SVC-SXWLG3AINIW24QJC
-A KUBE-SVC-SXWLG3AINIW24QJC -m comment --comment "default/alpine-service:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-YRELRNVKKL7AIZL7
-A KUBE-SVC-SXWLG3AINIW24QJC -m comment --comment "default/alpine-service:" -j KUBE-SEP-BSEIKPIPEEZDAU6E

Now this is actually pretty interesting. The first line is simply the creation of the chain. The second line is the line that we already looked at above. The next two lines are the lines we are looking for. We see that, with a probability of 50%, we either jump to the chain KUBE-SEP-YRELRNVKKL7AIZL7 or to the chain KUBE-SEP-BSEIKPIPEEZDAU6E. Let us display one of them.

$ sudo iptables -S KUBE-SEP-BSEIKPIPEEZDAU6E -t nat 
-N KUBE-SEP-BSEIKPIPEEZDAU6E
-A KUBE-SEP-BSEIKPIPEEZDAU6E -s 192.168.191.152/32 -m comment --comment "default/alpine-service:" -j KUBE-MARK-MASQ
-A KUBE-SEP-BSEIKPIPEEZDAU6E -p tcp -m comment --comment "default/alpine-service:" -m tcp -j DNAT --to-destination 192.168.191.152:80

So we see that this chain has two rules. The first rule marks all packets that originate from the pod running on this node; this mark is later evaluated in the forwarding rules to make sure that the packet is accepted for forwarding. The second rule is where the magic happens: it performs a DNAT, i.e. a destination NAT, and sends our packets to one of the pods. The chain KUBE-SEP-YRELRNVKKL7AIZL7 is similar, with the only difference that it sends the packets to the other pod. So we see that two things are happening:

  • Traffic directed towards port 8080 of the cluster IP address is diverted to one of the pods
  • Which one of the pods is selected is determined randomly, with a probability of 50% for each pod. Thus these rules implement a simple load balancer.
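
To convince yourself that the two DNAT targets are really our two pods, you can compare the IP addresses that appear in the KUBE-SEP-* chains with the endpoints that Kubernetes maintains for the service:

$ kubectl get endpoints alpine-service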

Let us now see how things change when we use a service of type NodePort. So let us use a slightly different YAML file.

$ kubectl delete -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/network/service.yaml
$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/network/nodePortService.yaml

When we now run kubectl get svc, we see that our service appears as a NodePort service, and, as the second entry in the PORT(S) column, we find the port that Kubernetes has opened for us.

$ kubectl get services
NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
alpine-service   NodePort    10.100.165.3   <none>        8080:32755/TCP   13s
kubernetes       ClusterIP   10.100.0.1     <none>        443/TCP          7h

In my case, the port 32755 has been used. If we now go back to one of the nodes and search the iptables rules for this port, we find that Kubernetes has created two additional NAT rules.

$ sudo iptables -S  -t nat | grep 32755
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/alpine-service:" -m tcp --dport 32755 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/alpine-service:" -m tcp --dport 32755 -j KUBE-SVC-SXWLG3AINIW24QJC

So we find that for all traffic directed to this port, again a marker is set and the rule KUBE-SVC-SXWLG3AINIW24QJC applies. If you inspect this rule, you will find that it is similar to the rules above and again realizes a load balancer that sends traffic to port 80 of one of the pods.

Let us now verify that we can really reach this pod from the outside world. Of course, this only works once we have allowed incoming traffic on at least one of the nodes in the respective AWS security group. The following commands determine the node port, your IP address, the security group and the public IP address of the node, allow access and submit the curl command (note that I use the cluster name myCluster to select the worker nodes; if you are not using my scripts to run this example, you will have to adjust the filter to make this work).

$ nodePort=$(kubectl get svc alpine-service --output json | jq ".spec.ports[0].nodePort")
$ IP=$(aws ec2 describe-instances --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --output text --query Reservations[0].Instances[0].PublicIpAddress)
$ SEC_GROUP_ID=$(aws ec2 describe-instances --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --output text --query Reservations[0].Instances[0].SecurityGroups[0].GroupId)
$ myIP=$(wget -q -O- https://ipecho.net/plain)
$ aws ec2 authorize-security-group-ingress --group-id $SEC_GROUP_ID --port $nodePort --protocol tcp --cidr "$myIP/32"
$ curl $IP:$nodePort
<h1>It works!</h1>
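
Once you are done with this test, you might want to remove the security group rule again, which can be done with the corresponding revoke command (using the same variables as above):

$ aws ec2 revoke-security-group-ingress --group-id $SEC_GROUP_ID --port $nodePort --protocol tcp --cidr "$myIP/32"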

Summary

After all these nitty-gritty details, let us summarize what we have found. When you start a pod on a node, a pair of virtual ethernet devices is created, with one end being assigned to the namespace of the container and one end being assigned to the namespace of the host. Then IP routes are added so that traffic directed towards the pod is forwarded to the host end of this VETH pair. This allows access to the pod from the node on which it is running. To realize access from other nodes and pods, and thus the flat Kubernetes networking model, EKS uses the AWS CNI plugin which attaches the pods' IP addresses as secondary IP addresses to elastic network interfaces.

When you start a service, Kubernetes will in addition set up NAT rules that capture traffic destined for the cluster IP address of the service and perform a destination network address translation so that this traffic gets sent to one of the pods. The pod is selected at random, which implements a simple load balancer. For a service of type NodePort, additional rules will be created which make sure that the same NAT processing applies to traffic coming from the outside world.

This completes today's post. If you want to learn more, you might want to check out some of the links below.