Understanding TLS certificates with Ansible and NGINX – part II

In the first part of this short series, we saw how Ansible can be used to easily generate self-signed certificates. Today, we will turn to more complicated setups and learn how to act as a CA, build chains of certificates and create client certificates.

Creating CA and intermediate CA certificates

Having looked at the creation of a single, self-signed certificate for which issuer and subject are identical, let us now turn to a more realistic situation – using one certificate, a CA certificate, to sign another certificate. If this second certificate is directly used to authorize an entity, for instance by being deployed into a web server, it is usually called an end-entity certificate. If, however, this certificate is in turn used to sign a third certificate, it is called an intermediate CA certificate.

In the first post, we have looked at the example of the certificate presented by github.com, which is signed by a certificate with the CN “DigiCert SHA2 Extended Validation Server CA” (more precisely, of course, by the private key associated with the public key verified by this certificate), which in turn is issued by “DigiCert High Assurance EV Root CA”, the root CA. Here, the second certificate is the intermediate CA certificate, and the certificate presented by github.com is the end-entity certificate.

Let us now try to create a similar chain in Ansible. First, we need a root CA. This will again be a self-signed certificate (which is the case for all root CA certificates). In addition, root CA certificates typically contain a set of extensions. To understand these extensions, the easiest approach is to look at a few examples. You can either use openssl x509 to inspect some of the root certificates that come with your operating system, or use your browser's certificate management tab to look at some of the certificates there. Doing this, you will find that root CA certificates typically contain three extensions as specified by X509v3, which are also defined in RFC 3280.

  • Basic Constraints: CA: True – this marks the certificate as a CA certificate
  • Key Usage: Digital Signature, Certificate Sign, CRL Sign – this entitles the certificate to be used to sign other certificates, perform digital signatures and sign CRLs (certificate revocation lists)
  • Subject Key Identifier: this is an extension which needs to be present for a CA according to RFC 3280 and allows a hash of the public key to be used to easily identify certificates for a specific public key

All these requirements can easily be met using our Ansible modules. We essentially proceed as in the previous post and use the openssl_csr module to create a CSR, from which we then generate a certificate using the openssl_certificate module. The full playbook (also containing the code for the following sections) can be found here. A few points are worth noting; a sketch of the corresponding tasks follows the list below.

  • When creating the CSR, we need to add the fields key_usage and key_usage_critical to the parameters of the Ansible module. The same holds for basic_constraints and basic_constraints_critical
  • The module will by default put the common name into the subject alternative name extension (SAN). To turn this off, we need to set use_common_name_for_san to false.
  • When creating the certificate using openssl_certificate, we need the flag selfsigned_create_subject_key_identifier to instruct the module to add a subject key identifier extension to the certificate. This feature is only available since Ansible version 2.9. So in case you have an older version, you need to use pip3 install ansible to upgrade to the latest version (you might want to run this in a virtual environment)

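As an illustration, here is a sketch of what the two tasks for the root CA could look like (file names and the common name are illustrative and not necessarily those used in the repository):

- name: Create CSR for the root CA
  openssl_csr:
    common_name: "Example Root CA"
    path: "{{playbook_dir}}/ca.csr"
    privatekey_path: "{{playbook_dir}}/etc/certs/ca.rsa"
    basic_constraints:
      - "CA:TRUE"
    basic_constraints_critical: yes
    key_usage:
      - digitalSignature
      - keyCertSign
      - cRLSign
    key_usage_critical: yes
    use_common_name_for_san: no

- name: Create self-signed root CA certificate
  openssl_certificate:
    csr_path: "{{playbook_dir}}/ca.csr"
    path: "{{playbook_dir}}/etc/certs/ca.crt"
    provider: selfsigned
    selfsigned_create_subject_key_identifier: always_create
    privatekey_path: "{{playbook_dir}}/etc/certs/ca.rsa"
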
Having this CA in place, we can now repeat the procedure to create an intermediate CA certificate. This will again be a CA certificate, with the difference that its issuer will be the root certificate that we have just created. So we no longer use the selfsigned provider when calling the Ansible openssl_certificate module, but the ownca provider. This requires a few additional parameters, most notably of course the root CA certificate and the private key of the root CA. So the corresponding task in the playbook will look like this.

- name: Create certificate for intermediate CA
  openssl_certificate:
    csr_path: "{{playbook_dir}}/intermediate-ca.csr"
    path: "{{playbook_dir}}/etc/certs/intermediate-ca.crt"
    provider: ownca
    ownca_path: "{{playbook_dir}}/etc/certs/ca.crt"
    ownca_create_subject_key_identifier: always_create
    ownca_privatekey_path: "{{playbook_dir}}/etc/certs/ca.rsa" 

When creating the CSR, we also modify the basic constraints field a bit and add the second key/value pair pathlen:0. This specifies that the resulting certificate cannot be used to create any additional CA certificates, but only to create the final, end-entity certificate.
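
For the intermediate CA, the CSR task could therefore look roughly like this (again a sketch with illustrative names; note the additional pathlen:0 entry):

- name: Create CSR for intermediate CA
  openssl_csr:
    common_name: "Example Intermediate CA"
    path: "{{playbook_dir}}/intermediate-ca.csr"
    privatekey_path: "{{playbook_dir}}/etc/certs/intermediate-ca.rsa"
    basic_constraints:
      - "CA:TRUE"
      - "pathlen:0"
    basic_constraints_critical: yes
    key_usage:
      - digitalSignature
      - keyCertSign
      - cRLSign
    key_usage_critical: yes
    use_common_name_for_san: no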

This is what we will do next. The code for this is more or less the same as that for creating the intermediate CA, but this time, we use the intermediate CA instead of the root CA for signing and we also change the extensions again to create a classical service certificate.
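
The signing task for the server certificate is then a variation of the one shown above, with the intermediate CA taking the role of the issuer (a sketch; the extension handling for the service certificate is omitted here):

- name: Create server certificate signed by the intermediate CA
  openssl_certificate:
    csr_path: "{{playbook_dir}}/server.csr"
    path: "{{playbook_dir}}/etc/certs/server.crt"
    provider: ownca
    ownca_path: "{{playbook_dir}}/etc/certs/intermediate-ca.crt"
    ownca_privatekey_path: "{{playbook_dir}}/etc/certs/intermediate-ca.rsa"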

Let us now put all this together and verify that our setup works. To create all certificates, enter the following commands.

git clone https://github.com/christianb93/tls-certificates
cd tls-certificates/lab2
ansible-playbook site.yaml

When the playbook completes, you should see a couple of certificates created in etc/certs. We can use OpenSSL to inspect them.

for cert in server.crt intermediate-ca.crt ca.crt; do
  openssl x509 -in etc/certs/$cert -noout -text
done

This should display all three certificates in the order listed. Looking at the common names and e-mail addresses (all other attributes of the distinguished name are identical), you should now nicely see that these certificates really form a chain, with the issuer of one element in the chain being the subject of the next one, up to the last one, which is self-signed.
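
Instead of comparing issuer and subject manually, you can also ask OpenSSL to verify the chain for you; if everything is consistent, this should print etc/certs/server.crt: OK.

openssl verify \
  -CAfile etc/certs/ca.crt \
  -untrusted etc/certs/intermediate-ca.crt \
  etc/certs/server.crt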

Now let us see how we need to configure NGINX to use our new server certificate when establishing a TLS connection. At first glance, you might think that we simply replace the server certificate from the last lab with our new one. But there is an additional twist. A client will typically have a copy of the root CA, but it is not clear that a client will have a copy of the intermediate CA as well. Therefore, instead of using just the server certificate, we point NGINX to a file server-chain.crt which contains both the server certificate and the intermediate CA certificate, in this order. So run

cp etc/certs/server.crt etc/certs/server-chain.crt
cat etc/certs/intermediate-ca.crt >> etc/certs/server-chain.crt
docker run -d --rm \
       -p 443:443 \
       -v $(pwd)/etc/conf.d:/etc/nginx/conf.d \
       -v $(pwd)/etc/certs:/etc/nginx/certs \
       nginx

Once the NGINX server is running, we should now be able to build a connection for testing using OpenSSL. As the certificates that the server presents are not self-signed, we also need to tell OpenSSL where the root CA needed to verify the chain of certificates is stored.

openssl s_client \
  --connect localhost:443 \
  -CAfile etc/certs/ca.crt
GET /index.html HTTP/1.0

You should again see the NGINX welcome page. It is also instructive to look at the output that OpenSSL produces and which, right at the beginning, also contains a representation of the certificate chain as received and verified by OpenSSL.

Creating and using client certificates

So far, our certificates have been server certificates – a certificate presented by a server to prove that the public key the server presents to us is actually owned by the entity operating the server. Very often, however, for instance when securing REST APIs like that of Kubernetes, the TLS protocol is also used to authenticate a user.

Let us take the Kubernetes API as an example. The Kubernetes API is a REST API using HTTPS and listening (by default) on port 6443. When a user connects to this URL, a server certificate is used so that the user can verify that the server is really owned by whoever provides the cluster. When a user makes a request to the API server, then, in addition to that, the server would also like to know that the user is a trusted user, and will have to authenticate the user, i.e. associate a certain identity with the request.

For that purpose, Kubernetes can be configured to ask the user for a client certificate during the TLS handshake. The server will then try to verify this certificate against a configured CA certificate. If that verification is successful, i.e. if the server can build a chain of certificates from the certificate that the client presents – the so-called client certificate – then the server will extract the common name and the organization from that certificate and use it as user and group to process the API request.
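
If you want to see which user and group a given client certificate would map to, you can simply print its subject with OpenSSL (the file name is just an example):

openssl x509 -in client.crt -noout -subject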

Let us now see how these client certificates can be created. First, of course, we need to understand what properties of a certificate turn it into a client certificate. Finding a proper definition of the term “client certificate” is not as straightforward as you might expect. There are several recommendations describing a reasonable set of extensions for client certificates (RFC 3279, RFC 5246 and the man page of the OpenSSL x509 tool). Combining these recommendations, we use the following set of extensions:

  • keyUsage is present and contains the bits digitalSignature and keyEncipherment
  • extendedKeyUsage is present and contains clientAuth

The Ansible code to generate this certificate is almost identical to the code in the previous section, with the differences due to the different extensions that we request. Thus we again create a self-signed root CA certificate, use this certificate to sign a certificate for an intermediate CA, and then use the intermediate CA certificate to issue certificates for client and server.
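
A sketch of the CSR task for the client certificate could look like this (illustrative names again; the actual playbook in the repository may differ in the details):

- name: Create CSR for the client certificate
  openssl_csr:
    common_name: "Example client"
    path: "{{playbook_dir}}/client.csr"
    privatekey_path: "{{playbook_dir}}/etc/certs/client.rsa"
    key_usage:
      - digitalSignature
      - keyEncipherment
    key_usage_critical: yes
    extended_key_usage:
      - clientAuth
    use_common_name_for_san: no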

We also have to adjust our NGINX setup by adding the following two lines to the configuration of the virtual server.

ssl_verify_client       on;
ssl_client_certificate  /etc/nginx/certs/ca.crt;

With the first line, we instruct NGINX to ask a client for a TLS certificate during the handshake. With the second line, we specify the CA that NGINX will use to verify these client certificates. In fact, as you will see immediately when running our example, the server will even tell the client which CAs it will accept as issuers; this is part of the certificate request specified here.

Time to see all this in action again. To download, run and test the playbook enter the following commands (do not forget to stop the container created in the previous section).

git clone https://github.com/christianb93/tls-certificates
cd tls-certificates/lab3
ansible-playbook site.yaml
openssl s_client \
  --connect localhost:443 \
  -CAfile etc/certs/ca.crt \
  -cert etc/certs/client.crt \
  -cert_chain etc/certs/intermediate-ca.crt \
  -key etc/certs/client.rsa
GET /index.html HTTP/1.0


Note the additional switches to the OpenSSL client command. With the -cert switch, we tell OpenSSL to submit a client certificate when requested and point it to the file containing this certificate. With the -cert_chain parameter, we specify additional certificates (if any) that the client will send in order to complete the certificate chain between the client certificate and the root certificate. In our case, this is the intermediate CA certificate (this would not be needed if we had used the intermediate CA certificate in the server configuration). Finally, the last switch -key contains the location of the private RSA key matching the presented certificate.

This closes our post (and the two-part mini series) on TLS certificates. We have seen that Ansible can be used to automate the generation of self-signed certificates and to build entire chains of trust involving end-entity certificates, intermediate CAs and private root CAs. Of course, you could also reach out to a provider to do this for you, but that is (maybe) a topic for another post.

Understanding TLS certificates with NGINX and Ansible – part I

If you read technical posts like this one, chances are that you have already had some exposure to TLS certificates, for instance because you have deployed a service that uses TLS and needed to create and deploy certificates for the servers and potentially for clients. Dealing with certificates can be a challenge, and a sound understanding of what certificates actually do is more than helpful for this. In this and the next post, we will play with NGINX and Ansible to learn what certificates are, how they are generated and how they are used.

What is a certificate?

To understand the structure of a certificate, let us first try to understand the problem that certificates try to solve. Suppose you are communicating with some other party over an encrypted channel, using some type of asymmetric cryptosystem like RSA. To send an encrypted message to your peer, you will need the peer's public key as a prerequisite. Obviously, you could simply ask the peer to send you the public key before establishing a connection, but then you need to mitigate the risk that someone uses a technique like IP address spoofing to pretend to be the peer you want to connect with, and is sending you a fake public key. Thus you need a way to verify that the public key that is presented to you is actually the public key owned by the party to which you want to establish a connection.

One approach could be to establish a third, trusted and publicly known party and ask that trusted party to digitally sign the public key, using a digital signature algorithm like ECDSA. With that party in place, your peer would then present you the signed public key, you would retrieve the public key of the trusted party, use that key to verify the signature and proceed if this verification is successful.

[Figure: CertificatesI]

So what your peer will present you when you establish a secure connection is a signed public key – and this is, in essence, what a certificate really is. More precisely, a certificate according to the X509 v3 standard consists of the following components (see also RFC 5280).

  • A version number which refers to a version of the X509 specification; currently, version 3 is what is mostly used
  • A serial number which the third party (called the issuer) assigns to the certificate
  • A valid-from and a valid-to date
  • The public key that the certificate is supposed to certify, along with some information on the underlying algorithm, for instance RSA
  • The subject, i.e. the party owning the key
  • The issuer, i.e. the party – also called certificate authority (CA) – signing the certificate
  • Some extensions which are additional, optional pieces of data that a certificate can contain – more on this later
  • And finally, a digital signature signing all the data described above

Let us take a look at an example. Here is a certificate from github.com that I have extracted using OpenSSL (we will learn how to do this later), from which I have removed some details and added some line breaks to make the output a bit more readable.

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            0a:06:30:42:7f:5b:bc:ed:69:57:39:65:93:b6:45:1f
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: C = US, O = DigiCert Inc, OU = www.digicert.com, 
                CN = DigiCert SHA2 Extended Validation Server CA
        Validity
            Not Before: May  8 00:00:00 2018 GMT
            Not After : Jun  3 12:00:00 2020 GMT
        Subject: businessCategory = Private Organization, 
                jurisdictionC = US, 
                jurisdictionST = Delaware, 
                serialNumber = 5157550, 
                C = US, ST = California, L = San Francisco, 
                O = "GitHub, Inc.", CN = github.com
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
                Modulus:
                    SNIP --- SNIP
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            SNIP --- SNIP
    Signature Algorithm: sha256WithRSAEncryption
         70:0f:5a:96:a7:58:e5:bf:8a:9d:a8:27:98:2b:00:7f:26:a9:
         SNIP ----- SNIP
         af:ed:7a:29

We clearly recognize the components just discussed. At the top, there are the version number and the serial number (in hex). Then we see the signature algorithm and, at the bottom, the signature, the issuer (DigiCert), the validity, the subject (GitHub Inc.) and, last but not least, the full public key. Note that both issuer and subject are identified using distinguished names as you might know them from LDAP and similar directory services.

If we now wanted to verify this certificate, we would need to get the public key of the issuer, DigiCert. Of course, this is a bit of a chicken-and-egg problem, as we would need another certificate to verify the authenticity of this key as well. So we would need a certificate with subject DigiCert, signed by some other party, and then another certificate signed by yet another certificate authority, and so forth. This chain obviously has to end somewhere, and it does – the last certificate in such a chain (the root CA certificate) is typically a self-signed certificate. These are certificates for which issuer and subject are identical, i.e. certificates for which no further verification is possible and which we simply have to trust.

How, then, do we obtain these root certificates? The answer is that root certificates are either distributed inside an organization or are bundled with operating systems and browsers. In our example, the DigiCert certificate that we see here is itself signed by another DigiCert unit called “DigiCert High Assurance EV Root CA”, and a certificate for this CA is part of the Ubuntu distribution that I use and stored in /etc/ssl/certs/DigiCert_High_Assurance_EV_Root_CA.pem which is a self-signed root certificate.
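
If you are on a similar distribution, you can easily convince yourself that this is a self-signed certificate by comparing its subject and issuer:

openssl x509 \
  -in /etc/ssl/certs/DigiCert_High_Assurance_EV_Root_CA.pem \
  -noout -subject -issuer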

[Figure: CertificatesII]

In this situation, the last element of the chain is called the root CA, the first element the end-entity and any element in between an intermediate CA.

To obtain a certificate, the owner of the server github.com would turn to the intermediate CA and submit a file, a so-called certificate signing request (CSR), containing the public key to be signed. The format for CSRs is standardized in RFC 2986 which, among other things, specifies that a CSR is itself signed with the private key of the requestor, which also proves to the intermediate CA that the requestor possesses the private key corresponding to the public key to be signed. The intermediate CA will then issue a certificate. To establish the intermediate CA, the intermediate CA has, at some point in the past, filed a similar CSR with the root CA and that root CA has issued a corresponding certificate to the intermediate CA.
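
To get a feeling for what a CSR looks like, you can create and inspect one yourself with OpenSSL (a quick example, assuming an existing RSA key in mykey.rsa; the file names and the subject are made up):

openssl req -new \
  -key mykey.rsa \
  -subj "/CN=www.example.com/O=Example Inc" \
  -out my.csr
# Inspect the CSR and check its signature
openssl req -in my.csr -noout -text -verify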

The TLS handshake

Let us now see how certificates are applied in practice to secure a communication. Our example is the transport layer security protocol TLS, formerly known as SSL, which underlies the HTTPS protocol (which is nothing but HTTP sitting on top of TLS).

In a very basic scenario, a TLS communication roughly works as follows. First, the client sends a “hello” message to the server, containing information like the version of TLS supported and a list of supported ciphers. The server answers with a similar message, immediately followed by the server's certificate. This certificate contains the name of the server (either as a fully-qualified domain name, or including wildcards like *.domain.com, in which case the certificate is called a wildcard certificate) and, of course, the public key of the server. Client and server can now use this key to agree on a secret key which is then used to encrypt the further communication. This phase of the protocol which prepares the actual encrypted connection is known as the TLS handshake.

To successfully conclude this handshake, the server therefore needs a certificate called the server certificate which it will present to the client and, of course, the matching private key, called the server private key. The client needs to verify the server certificate and therefore needs access to the certificate of the (intermediate or root) CA that signed the server certificate. This CA certificate is known as the server CA certificate. Instead of just presenting a single certificate, a server can also present an entire chain of certificates which must end with the server CA certificate that the client knows. In practice, these certificates are often the root certificates distributed with operating systems and browsers to which the client will have access.

Now suppose that you are a system administrator aiming to set up a TLS secured service, say an HTTPS-based reverse proxy with NGINX. How would you obtain the required certificates? First, of course, you would create a key pair for the server. Once you have that, you need to obtain a certificate for the public key. Basically, you have three options to obtain a valid certificate.

First, you could turn to an independent CA and ask the CA to issue a certificate, based on a CSR that you provide. Most professional CAs will charge for this. There are, however, a few providers like Let's Encrypt or Cloudflare that offer free certificates.

Alternatively, you could create your own, self-signed CA certificate using OpenSSL or Ansible, this is what we will do today in this post. And finally, as we will see in the next post, you could even build your own “micro-CA” to issue intermediate CA certificates which you can then use to issue end-entity certificates within your organization.

Using NGINX with self-signed certificates

Let us now see how self-signed certificates can be created and used in practice. As an example, we will secure NGINX (running in a Docker container, of course) using self-signed certificates. We will first do this using OpenSSL and the command line, and then see how the entire process can be automated using Ansible.

The setup we are aiming at is NGINX acting as TLS server, i.e. we will ask NGINX to provide content via HTTPS which is based on TLS. We already know that in order to do this, the NGINX server will need an RSA key pair and a valid server certificate.

To create the key pair, we will use OpenSSL. OpenSSL is composed of a variety of different commands. The command that we will use first is the genrsa command that is responsible for creating RSA keys. The man page – available via man genrsa – is quite comprehensive, and we can easily figure out that we need the following command to create a 2048 bit RSA key, stored in the file server.rsa.

openssl genrsa \
  -out server.rsa 2048

As a side note, the created file contains not only the private key, but also the public key components (i.e. the modulus and the public exponent), as you can see by using openssl rsa -in server.rsa -noout -text to dump the generated key.
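
If you ever need the public key separately, for instance to hand it over to a CA or a peer, you can extract it from the same file:

openssl rsa -in server.rsa -pubout -out server.pub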

Now we need to create the server certificate. If we wanted to ask a CA to create a certificate for us, we would first create a CSR, and the CA would then create a matching certificate. When we use OpenSSL to create a self-signed certificate, we do this in one step – we use the req command of OpenSSL to create the CSR, and pass the additional switch -x509 which instructs OpenSSL to not create a CSR, but a self-signed certificate.

To be able to do this, OpenSSL will need a few pieces of information from us – the validity, the subject (which will also be the issuer), the public key to be signed, any extensions that we want to include and finally the output file name. Some of these options will be passed on the command line, but other options are usually kept in a configuration file.

OpenSSL configuration files are plain-text files in the INI-format. There is one section for each command, and there can be additional sections which are then referenced in the command-specific section. In addition, there is a default section with settings which apply for all commands. Again, the man page (run man config for the general structure of the configuration file and man req for the part specific to the req command) – is quite good and readable. Here is a minimal configuration file for our purposes.

[req]
prompt = no
distinguished_name = dn
x509_extensions = v3_ext

[dn]
CN = Leftasexercise
emailAddress = me@leftasexercise.com
O = Leftasexercise blog
L = Big city
C = DE

[v3_ext]
subjectAltName=DNS:*.leftasexercise.local,DNS:leftasexercise.local

We see that the file has three sections. The first section is specific for the req command. It contains a setting that instructs OpenSSL to not prompt us for information, and then two references to other sections. The first of these sections contains the distinguished name of the subject, the second section contains the extensions that we want to include.

There are many different extensions that were introduced with version 3 of the X509 format, and this is not the right place to discuss all of them. The one that we use for now is the subject alternative name extension which allows us to specify a couple of alias names for the subject. Often, these are DNS names for servers for which the certificate should be valid, and browsers will typically check these DNS names and try to match them with the name of the server. As shown here, we can either use a fully-qualified domain name, or we can use a wildcard – these certificates are often called wildcard certificates (which are disputed as they give rise to security concerns, see for instance this discussion). This extension is typical for server certificates.

Let us assume that we have saved this configuration file as server.cnf in the current working directory. We can now invoke OpenSSL to actually create a certificate for us. Here is the command to do this and to print out the resulting certificate.

openssl req \
  -new \
  -config server.cnf \
  -x509 \
  -days 365 \
  -key server.rsa \
  -out server.crt
# Take a look at the certificate
openssl x509 \
  -text \
  -in server.crt -noout

If you scroll through the output, you will be able to identify all components of a certificate discussed so far. You will also find that the subject and the issuer of the certificate are identical, as we expect it from a self-signed certificate.

Let us now turn to the configuration of NGINX needed to serve HTTPS requests presenting our newly created certificate as server certificate. Recall that an NGINX configuration file contains a context called server which contains the configuration for a specific virtual server. To instruct NGINX to use TLS for this server, we need to add a few lines to this section. Here is a full configuration file containing these lines.

server {
    listen               443 ssl;
    ssl_certificate      /etc/nginx/certs/server.crt;
    ssl_certificate_key  /etc/nginx/certs/server.rsa;

    location / {
        root   /usr/share/nginx/html;
        index  index.html index.htm;
    }
}

In the line starting with listen, specifically the ssl keyword, we ask NGINX to use TLS for port 443, which is the default HTTPS port. In the next line, we tell NGINX which file it should use as a server certificate, presented to a client during the TLS handshake. And finally, in the third line, we point NGINX to the location of the key matching this certificate.

To try this out, let us bring up an NGINX container with that configuration. We will mount two directories into this container – one directory containing our certificates, and one directory containing the configuration file. So create the following directories in your current working directory.

mkdir ./etc
mkdir ./etc/conf.d
mkdir ./etc/certs

Then place a configuration file default.conf with the content shown above in ./etc/conf.d and the server certificate and server private key that we have created in the directory ./etc/certs. Now we start the container and map these directories into the container.

docker run -d --rm \
       -p 443:443 \
       -v $(pwd)/etc/conf.d:/etc/nginx/conf.d \
       -v $(pwd)/etc/certs:/etc/nginx/certs \
       nginx

Note that we map port 443 inside the container to the same port number on the host, so this will only work if you do not yet have a server running on this port; if you do, pick a different port. Once the container is up, we can test our connection using the s_client command of the OpenSSL package.

openssl s_client --connect 127.0.0.1:443

This will produce a lengthy output that details the TLS handshake protocol and will then stop. Now enter an HTTP GET request like

GET /index.html HTTP/1.0

The HTML code for the standard NGINX welcome page should now be printed, demonstrating that the setup works.
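
Should the welcome page not appear, a quick way to rule out configuration problems is to let NGINX validate the configuration and certificate paths without actually serving traffic (using the same mounts as above):

docker run --rm \
       -v $(pwd)/etc/conf.d:/etc/nginx/conf.d \
       -v $(pwd)/etc/certs:/etc/nginx/certs \
       nginx nginx -t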

When you go through the output produced by OpenSSL, you will see that the client displays the full certificate chain from the certificate presented by the server up to the root CA. In our case, this chain has only one element, as we are using a self-signed certificate (which the client detects and reports as error – we will see how to get rid of this in the next post).

Automating certificate generation with Ansible

So far, we have created keys and certificates manually. Let us now see how this can be automated using Ansible. Fortunately, Ansible comes with modules to manage TLS certificates.

The first module that we will need is the openssl_csr module. With this module, we will create a CSR which we will then, in a second step, present to the module openssl_certificate to perform the actual signing process. A third module, openssl_privatekey, will be used to create a key pair.

Let us start with the key generation. Here, the only parameters that we need are the length of the key (we again use 2048 bits) and the path to the location of the generated key. The algorithm will be RSA, which is the default, and the key file will by default be created with the permissions 0600, i.e. only readable and writable by the owner.

- name: Create key pair for the server
  openssl_privatekey:
    path: "{{playbook_dir}}/etc/certs/server.rsa"
    size: 2048

Next, we create the certificate signing request. To use the openssl_csr module to do this, we need to specify the following parameters:

  • The components of the distinguished name of the subject, i.e. common name, organization, locality, e-mail address and country
  • Again the path of the file into which the generated CSR will be written
  • The parameters for the requested subject alternative name extension
  • And, of course, the path to the private key used to sign the request

- name: Create certificate signing request
  openssl_csr:
    common_name: "Leftasexercise"
    country_name: "DE"
    email_address: "me@leftasexercise.com"
    locality_name: "Big city"
    organization_name: "Leftasexercise blog"
    path: "{{playbook_dir}}/server.csr"
    subject_alt_name: 
      - "DNS:*.leftasexercise.local"
      - "DNS:leftasexercise.local"
    privatekey_path: "{{playbook_dir}}/etc/certs/server.rsa"

Finally, we can now invoke the openssl_certificate module to create a certificate from the CSR. This module is able to operate using different backends, the so-called provider. The provider that we will use for the time being is the self-signed provider which generates self-signed certificates. Apart from the path to the CSR and the path to the created certificate, we therefore need to specify this provider and the private key to use (which, of course, should be that of the server), and can otherwise rely on the default values.

- name: Create self-signed certificate
  openssl_certificate:
    csr_path: "{{playbook_dir}}/server.csr"
    path: "{{playbook_dir}}/etc/certs/server.crt"
    provider: selfsigned
    privatekey_path: "{{playbook_dir}}/etc/certs/server.rsa"

Once this task completes, we are now ready to start our Docker container. This can again be done using Ansible, of course, which has a Docker module for that purpose. To see and run the full code, you might want to clone my GitHub repository.

git clone https://github.com/christianb93/tls-certificates
cd tls-certificates/lab1
ansible-playbook site.yaml
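
For reference, starting the container from within the playbook could be done with a task along the following lines (a sketch using Ansible's docker_container module; the container name is arbitrary):

- name: Start NGINX container
  docker_container:
    name: nginx-tls
    image: nginx
    state: started
    published_ports:
      - "443:443"
    volumes:
      - "{{playbook_dir}}/etc/conf.d:/etc/nginx/conf.d"
      - "{{playbook_dir}}/etc/certs:/etc/nginx/certs"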

This completes our post for today. In the next post, we will look into more complex setups involving our own local certificate authority and learn how to generate and use client certificates.

Setting up our OpenStack playground

In this post, we will describe the setup of our Lab environment and install the basic infrastructure services that OpenStack uses.

Environment setup

In a real world setup, OpenStack runs on a collection of physical servers on which the virtual machines provided by the cloud run. Now most of us will probably not have a rack in the basement, so using four or five physical servers for our labs is not a realistic option. Instead, we will use virtual machines for that purpose.

To avoid confusion, let us first fix some terms. First, there is the actual physical machine on which the labs will run, most likely a desktop PC or a laptop, and most likely the PC you are using to read this post. Let us call this machine the lab host.

On this host, we will run Virtualbox to create virtual machines. These virtual machines will be called the nodes, and they will play the role that in a real world setup, the physical servers would play. We will be using one controller node on which most of the OpenStack components will run, and two compute nodes.

Inside the compute nodes, the Nova compute service will then provision virtual machines which we call VMs. So effectively, we use nested virtualization – the VM is itself running inside a virtual machine (the node).

To run the labs, your host will need to have a certain minimum amount of RAM. When I tested the setup, I found that the controller node and the compute nodes in total consume at least 7-8 GB of RAM, which will increase depending on the number of VMs you run. To still be able to work on the machine, you will therefore need at least 16 GB of RAM. If you have more – even better. If you have less, you might also want to use a cloud based setup. In this case, the host could itself be a virtual machine in the cloud, or you could use a bare-metal provider like Packet to get access to a physical host with the necessary memory.

Not every cloud will work, though, as it needs to support nested virtualization. I have tested the setup on DigitalOcean and found that it works, but other cloud providers might yield different results.

Networking

Let us now take a look at the network configuration that we will use for our hosts. If you run OpenStack, there will be different categories of traffic between the nodes. First, there is management traffic, i.e. communication between the different components of the platform, like messages exchanged via RabbitMQ or API calls. For security and availability reasons, this traffic is typically handled via a dedicated management network. The management network is configured by the administrator and used by the OpenStack components.

Then, there is traffic between the VMs, or, more precisely, between the guests running inside the VMs. The network which is supporting this traffic is called the guest network. Note that we are not yet talking about a virtual network here, but about the network connecting the various nodes which eventually will be used for this traffic.

Sometimes, additional network types need to be considered; there could, for instance, be a dedicated API network to allow end users and administrators access to the API without depending on any of the other networks, or a dedicated external network that connects the network node to a physical router to provide internet access for guests. For this setup, however, we will only use two networks – a management network and a guest network. Note that the guest network needs to be provided by an administrator, but is controlled by OpenStack (which, for instance, will add the interfaces that make up the network to virtual bridges so that they can no longer be used for other traffic).

In our case, both networks, the management network and the guest network, will be set up as Virtualbox host-only networks, connecting our nodes. Here is a diagram that summarizes the network topology we will use.

[Figure: OpenStackEnvironment]

Setting up your host and first steps

Let us now roll up our sleeves and dive right into the first lab. Today, we will bring up our environment, and, on each node, install the required infrastructure like MySQL, RabbitMQ and so forth.

First, however, we need to prepare our host. Obviously, we need some tools installed – Vagrant, Virtualbox and Ansible. We will also use pwgen to create credentials. How exactly these tools need to be installed depends on your Linux distribution; on Ubuntu, you would run

sudo apt-get install python3-pip
pip3 install 'ansible==v2.8.6' 
sudo apt-get install pwgen
sudo apt-get install virtualbox
sudo apt-get install vagrant

The Ansible version is important. I found that there is at least one issue which breaks network creation in OpenStack with some 2.9.x versions of Ansible.

When we set up our labs, we will sometimes have to throw away our environment and rebuild it. This will be fully automated, but it implies that we need to download packages into the nodes over and over again. To speed up this process, we install a local APT cache. I use APT-Cacher-NG for that purpose. Installing it is very easy, simply run

sudo apt-get install apt-cacher-ng

This will install a proxy, listening on port 3142, which will create local copies of packages that you install. Later, we will instruct the apt processes running in our virtual machines to use this cache.
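
Inside the nodes, using the cache essentially boils down to a one-line APT proxy configuration similar to the following (the playbook takes care of this; the IP address is a placeholder for the address under which the nodes can reach the lab host):

// /etc/apt/apt.conf.d/01proxy on each node
Acquire::http::Proxy "http://<host-ip>:3142";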

Now we are ready to start. First, you will have to clone my repository to get a copy of the scripts that we will use.

git clone https://github.com/christianb93/openstack-labs
cd openstack-labs/Lab1

Next, we will bring up our virtual machines. There is, however, a little twist when it comes to networking. As mentioned above, we will use Virtualbox host networking. As you might know if you have read my previous post on this topic, Virtualbox will create two virtual devices to support this, one for each network. These devices will be called vboxnet0 and vboxnet1. However, if these devices already exist, Virtualbox will use them and take over parts of the existing network configuration. This can lead to problems later: if, for instance, Virtualbox runs a DHCP server on one of these devices, it will conflict with the OpenStack DHCP agent, and your VMs will get incorrect IP addresses and will not be reachable. To avoid this, we will delete any existing interfaces (which of course requires that you stop all other virtual machines) and recreate them before we bring up our machines. The repository contains a shell script to do this. To run it and start the machines, enter

../scripts/createVBoxNetworks.sh
vagrant up

We are now ready to run our playbook. Before doing this, let us first discuss what the scripts will actually do.

First, we need a set of credentials. These credentials consist of a set of randomly generated passwords that we use to set up the various users that the installation needs (database users, RabbitMQ users, Keystone users and so forth) and an SSH key pair that we will use later to access our virtual machines. These credentials will be created automatically and stored in ~/.os_credentials.

Next, we need a basic setup within each of the nodes – we will need the Python OpenStack modules, we will need to bring up all network interfaces, and we will update the /etc/hosts configuration files in each of the nodes to be able to resolve all other nodes.

We will also change the configuration of the APT package manager. We will point APT to the APT cache running on the host and we will add the Ubuntu OpenStack Cloud Archive repository to the repository list from which we will pull the OpenStack packages.
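
The playbook automates this, but done manually on a node, enabling the Cloud Archive for the Stein release would amount to something like:

sudo add-apt-repository cloud-archive:stein
sudo apt-get update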

Next, we need to make sure that the time on all nodes is synchronized. To achieve this, we install a network of NTP daemons. We use Chrony and set up the controller as Chrony server and the compute nodes as clients. We then install MySQL, Memcached and RabbitMQ on the controller node and create the required users.
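
The resulting Chrony setup is conceptually very simple – roughly the following lines in /etc/chrony/chrony.conf (a sketch; the subnet is a placeholder for the management network you use):

# On the controller: allow the compute nodes to synchronize against it
allow 192.168.1.0/24

# On the compute nodes: use the controller as time source
server controller iburst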

All this is done by the playbook site.yaml, and you can simply run it by typing

ansible-playbook -i hosts.ini site.yaml

Once the script completes, we can run a few checks to see that everything worked. First, log into the controller node using vagrant ssh controller and verify that Chrony is running and that we have network connectivity to the other nodes.

sudo systemctl | grep "chrony"
ping compute1
ping compute2

Then, verify that you can log into MySQL locally and that the root user now has a non-empty password (we can still log in locally as root without providing a password) by running sudo mysql and then, at the SQL prompt, typing

select * from mysql.user;

Finally, let us verify that RabbitMQ is running and has a new user openstack.

sudo rabbitmqctl list_users
sudo rabbitmqctl node_health_check
sudo rabbitmqctl status

A final note on versions. This post and most upcoming posts in this series have been created with a lab PC running Python 3.6 and Ansible 2.8.9. After upgrading my lab PC to Ubuntu 20.04 today, I continued to use Ansible 2.8.9 because I had experienced problems with newer versions earlier on, but upgraded to Python 3.8. After doing this, I hit upon this bug that requires this fix which I reconciled manually into my local Ansible version.

We are now ready to install our first OpenStack services. In the next post, we will install Keystone and learn more about domains, users, projects and services in OpenStack.

WSGI, middleware, PasteDeploy and all that

When you are a Python programmer or study open source software written in Python, you will sooner or later be exposed to the WSGI standard and to related concepts like WSGI middleware. In this post, I will give you a short overview of this technology and point you to some additional references.

What is WSGI?

WSGI stands for “Web Server Gateway Interface” and is a standard that defines how Python applications can run inside a web container (“server”), quite similar to Java servlets running in a servlet container. The WSGI standard is defined in PEP 333 (and, for Python3, in PEP 3333) and describes the interface between the application and the server.

In essence, the standard is quite simple. First, an application needs to provide to the server a callable object (that can be a function, an instance of a class with a __call__ method, or a method of a class or object) which accepts two arguments. The first argument, traditionally called environ, is a dictionary that plays the role of a request context. The standard defines a set of fields in that object that a server needs to populate, including

  • REQUEST_METHOD – the HTTP request method (GET, POST, ..)
  • HTTP_* – variables corresponding to the various components of the HTTP request header
  • QUERY_STRING – the part of the request string after the ?
  • wsgi.input – a stream from which the request body can be read, using methods like read(), readline() or __iter__
  • wsgi.errors – a stream to which the application can write error logs

The second argument that is passed to the application is actually a function, with the signature

start_response(status, response_headers)

This function returns a write callable which the application can use to push response data directly to the server (this is, however, not the preferred way – in general, the application should simply return the response data). The argument status is an HTTP status code along with the respective string, like “200 OK”. The response_headers argument is a list of tuples of the form (name, value) which are added to the HTTP header of the response. The idea of this function is to give the server a chance to prepare the HTTP header of the response before the actual response body is written.

In fact, there is a third, optional argument to this function, which is exception information as returned by sys.exc_info; it can be used to ask the server to re-raise an exception caught by the application, and we will ignore it here.

The application function is supposed to return the response data, i.e. the data should go into the HTTP response body. Note that with Python3, this is supposed to be a bytes object, so text needs to be converted to bytes first.
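
Purely for illustration, here is what the (not recommended) write-callable style mentioned above would look like:

def application(environ, start_response):
    # start_response returns a write callable that pushes data to the server
    write = start_response('200 OK', [('Content-type', 'text/plain')])
    write(b'Hello from the write callable')
    return []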

Armed with this information, let us now write our first WSGI application. Of course, we need a WSGI server, and for our tests, we will use a very simple embedded WSGI server that comes as part of the wsgiref module. Here is the code.


from wsgiref.simple_server import make_server

def application(environ, start_response):
    start_response(
        '200 OK',
        [('Content-type', 'text/html')]
    )
    response = "<html><body><p><b>Environment data:</b></p>"
    response += "<table><tr><th>Key</th><th>Value</th></tr>"
    for key, value in environ.items():
        response += "<tr><td>%s</td><td>%s</td></tr>" % (key, value)
    response = response + "</table></body></html>"
    return [bytes(response, 'utf-8')]

print("Starting up")
httpd = make_server('', 8800, application)
httpd.serve_forever()

Let us see what this application does. First, there is the application function with the signature defined by the standard. We see that we call start_response and then create a response string. The response string contains an HTML table with one entry for each key/value pair in the environ dictionary. Finally, we convert this to a bytes object and return it to the server.

In the main processing, we create a wsgiref.simple_server that points to our application and start it.

To run the example, simply save the above code as wsgi.py (or whatever name you prefer) and run it with

python3 wsgi.py

When you now point your browser to 127.0.0.1:8800, you should see a table containing your environment values (the simple_server includes all currently defined OS level environment variables, so you will have to scroll down to see the WSGI specific parts).

Let us now try something else. Our application actually returns a sequence of byte objects. The server is supposed to iterate over this sequence and assemble the results to obtain the entire response. Thus the only thing that matters is that our application is something that can be called and returns something that has a method __iter__. Instead of using a function which returns a sequence, we can therefore as well use a class that has an __iter__ method as in the example below.


from wsgiref.simple_server import make_server

class Application:
    def __init__(self, environ, start_response):
        self.environ = environ
        self.start_response = start_response

    def __iter__(self):
        self.start_response(
            '200 OK',
            [('Content-type', 'text/html')]
        )
        yield b'Hello!'

httpd = make_server('', 8800, Application)
httpd.serve_forever()

When the server receives a request, it will call the “thing called application”, i.e. it will do something like Application(). This will create a new instance of the application object, i.e. call the __init__ method, which simply stores the parameters for later use. Then, the server will iterate over this object, i.e. call __iter__, where the actual result is assembled and returned.

Finally, we could also pass an instance of a class instead of a class to make_server. This instance then needs a __call__ method so that it can be invoked like a function.

WSGI middleware

As we have seen, the WSGI specification has two parts. First, it defines how an application should behave (call start_response and return response data) and it defines how a server should behave (call the application), as displayed below.

[Figure: WSGIInterface]

A WSGI middleware is simply a piece of Python code that implements both behaviours – it can act as a server and as an application. This allows middleware components to be chained: the server calls the middleware, the middleware performs whatever action it wishes, for instance manipulating the environment variable, and then invokes the application, and the application prepares the actual response.

[Figure: WSGIMiddleware]

Of course, instead of just passing through the start_response function to the application, a middleware could also pass in a different function and then call the original start_response function itself.

A nice feature of middleware is that it can be chained. You could for instance have a middleware which performs authorization, followed by a middleware to rewrite URLs and so forth, until finally the application is invoked. Here is a simple example.


from wsgiref.simple_server import make_server

class Middleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        environ['added_by_middleware'] = 1
        return self.app(environ, start_response)

def application(environ, start_response):
    start_response(
        '200 OK',
        [('Content-type', 'text/html')]
    )
    response = "<html><body><p><b>Environment data:</b></p>"
    response += "<table><tr><th>Key</th><th>Value</th></tr>"
    for key, value in environ.items():
        response += "<tr><td>%s</td><td>%s</td></tr>" % (key, value)
    response = response + "</table></body></html>"
    return [bytes(response, 'utf-8')]

httpd = make_server('', 8800, Middleware(application))
httpd.serve_forever()

If you run this example as before, you will see that in addition to the environment variables produced by our first example, there is the additional key added_by_middleware which has been added by the middleware. In this example, the full call chain is as follows.

  • When the server starts, it creates an instance of the class Middleware that points to the function application
  • This instance is passed as argument to make_server
  • The server gets the request from the browser
  • The server makes a call on the “thing” supplied with make_server, i.e. the middleware instance
  • The server calls the middleware instance, i.e. it invokes its __call__ function
  • The __call__ function adds the additional key to the environment and then delegates the request to the function application

Building middleware chains with PasteDeploy

So far, we have chained middleware programmatically, but in real life, it is often much more flexible to do this via a configuration. Enter PasteDeploy, a Python module that allows you to build chains of middleware components from a configuration. To make sure that you have this installed, run

pip3 install PasteDeploy

before proceeding.

PasteDeploy is able to parse configuration files and to dynamically pipe together WSGI applications and WSGI middleware. To understand how this works, let us first consider an example. Suppose that in our working directory, we have the following code, stored in a file wsgi.py


from paste.deploy import loadapp
from wsgiref.simple_server import make_server

#
# This is our application, as usual
#
def application(environ, start_response):
    start_response(
        '200 OK',
        [('Content-type', 'text/html')]
    )
    response = "<html><body><p><b>Environment data:</b></p>"
    response += "<table><tr><th>Key</th><th>Value</th></tr>"
    for key, value in environ.items():
        response += "<tr><td>%s</td><td>%s</td></tr>" % (key, value)
    response = response + "</table></body></html>"
    return [bytes(response, 'utf-8')]

#
# This is the factory which is invoked by PasteDeploy, passing
# additional configuration data from the INI file
#
def app_factory(global_config, **local_conf):
    return application

#
# This call evaluates the INI file and builds an application
#
wsgi_app = loadapp('config:paste.ini', relative_to=".")
httpd = make_server('', 8800, wsgi_app)
httpd.serve_forever()

In addition, let us create a configuration file paste.ini in the same directory, with the following content.

[app:main]
use = call:wsgi:app_factory

When we now run wsgi.py, we again get the same server as in our first, basic example. But what is happening behind the scenes?

First, we invoke the PasteDeploy API by calling loadapp. This function will evaluate the INI file passed as argument for the different types of objects PasteDeploy knows. In our case, the section name app:main implies that we want to define an application and that this is the main entry point for our WSGI server. The argument that PasteDeploy expects here is the full path to a factory function (i.e. in our case, the function app_factory in wsgi.py). PasteDeploy will then simply call this factory and return the result of this call as an application. We then start a server using this application as before. Note that PasteDeploy can also pass configuration data in the INI file to the factory.

A second type of basic object in PasteDeploy is the filter. Filters are used to create filtered versions of an application, i.e. the application behind a defined middleware (the filter). In the configuration file, filters are specified in a section starting with the keyword filter, and refer to a filter factory. A filter factory is a callable which is called with the configuration in the INI file as argument, and returns a filter. A filter, in turn, is a function which receives an application as an argument and returns a WSGI application wrapping this application. This sounds a bit confusing, so it might be a good idea to look at an example. Our new code looks as follows


from paste.deploy import loadapp
from wsgiref.simple_server import make_server

#
# A middleware that adds a key to the environment
#
class Middleware:
    def __init__(self, app, key="test", value=1):
        self._key = key
        self._value = value
        self._app = app

    def __call__(self, environ, status_response):
        environ[self._key] = self._value
        return self._app(environ, status_response)

#
# This is our application, as usual
#
def application(environ, start_response):
    start_response(
        '200 OK',
        [('Content-type', 'text/html')]
    )
    response = "<html><body><p><b>Environment data:</b></p>"
    response += "<table><tr><th>Key</th><th>Value</th></tr>"
    for key, value in environ.items():
        response += "<tr><td>%s</td><td>%s</td></tr>" % (key, value)
    response = response + "</table></body></html>"
    return [bytes(response, 'utf-8')]

#
# This is the factory which is invoked by PasteDeploy, passing
# additional configuration data from the INI file
#
def app_factory(global_config, **local_conf):
    return application

#
# A filter factory. A filter factory returns a filter function
#
def filter_factory(global_conf, key):
    # A filter function returns a middleware, wrapping the
    # provided app
    def filter(app):
        return Middleware(app, key)
    return filter

#
# This call evaluates the INI file and builds an application
#
wsgi_app = loadapp('config:paste.ini', relative_to=".")
httpd = make_server('', 8800, wsgi_app)
httpd.serve_forever()

with the following configuration

[app:main]
use = call:wsgi:app_factory
filter-with = filter1 

[filter:filter1]
use = call:wsgi:filter_factory 
key = "abc"

What happens if you run the example? First, PasteDeploy will create an application as before, by calling the app_factory function. Then, it will find the configuration option filter-with that tells the library that we wish to wrap the application. Here, we refer to a filter called filter1 which is defined in its own section of the INI file.

When evaluating this section, PasteDeploy will call the provided filter factory filter_factory, passing the additional configuration in the section as parameters. The filter factory returns a function, the filter function. PasteDeploy will now take the application and call the filter function with this application as argument. The return value of this call will then be used as the actual application that is returned by loadapp and started using the simple_server (in fact, PasteDeploy will first call the filter factory, then the app factory and then the filter itself).

Of course, you can apply more than one filter to an application. To make this as easy as possible, PasteDeploy offers a third type of objects called pipelines. A pipeline is just a sequence of filters which are applied to an application. The nice thing about pipelines is that they are piped together by PasteDeploy automatically, without any need to write additional factory objects. So our source code remains the same, we only have to change the configuration.

[pipeline:main]
pipeline = filter1 filter2 myapp

[app:myapp]
use = call:wsgi:app_factory

[filter:filter1]
use = call:wsgi:filter_factory 
key = "abc"

[filter:filter2]
use = call:wsgi:filter_factory 
key = "def"

Here, we define a pipeline which will first apply filter1, then filter2 and then finally pass control to our app. These three objects are created by the same calls to factory functions as before, and PasteDeploy will automatically load the pipeline and plumb the objects together. The result will be that once the application is reached, both keys (abc and def) will be present in the request context.

This is now what we want. We can, of course, have filters in different Python modules, and thus completely decoupled. PasteDeploy will then happily plumb together the final WSGI application according to the configuration, and we can easily add middleware components to the pipelines and remove them, without having to change our code.

Finally, there is another approach to configure a pipeline which is also the one described in the documentation. Here, we realize a pipeline as a composite object. This object again corresponds to a factory function with a specific signature. Part of this signature is a loader object which we can use to load the individual filters by name and apply them step by step to the application. A nice example where this is implemented is the configuration of the OpenStack Nova compute service, with the factory being implemented here. And yes, it was an effort to understand this example which eventually made me carry out some research and write this blog post – expect to see a bit more on OpenStack soon on this blog!

Building your own cloud with OpenStack – overview

Over time, I have worked with a couple of different commercial cloud platforms like AWS, DigitalOcean, GCP, Paperspace or Packet.net. Even though these platforms are rather well documented, there comes a point where you would like to have more insights into the inner workings of a cloud platform. Unfortunately, not too many of us have permission to walk into a Google data center and dive into their setup, but we can install and study one of the most relevant open source cloud platforms – OpenStack.

What is OpenStack?

OpenStack is an open source project (or, more precisely, a collection of projects) aiming at providing a state-of-the-art cloud platform. Essentially, OpenStack contains everything that you need to convert a set of physical servers into a cloud. There are components that interact with a hypervisor like KVM to build and run virtual machines, components to define and operate virtual networks and virtual storage, components to maintain images, a set of APIs to operate your cloud and a web-based graphical user interface.

OpenStack was launched by Rackspace and NASA in 2010, but is currently supported by a large number of organisations. Some commercially supported OpenStack distributions are available, like RedHat OpenStack, Lenovo Thinkcloud or VMWare Integrated OpenStack. The software is written in Python, which for me was one of the reasons why I have decided to dive into OpenStack instead of one of the other open source cloud platforms like OpenNebula or Apache Cloudstack.

New releases of OpenStack are published every six months. This and the following posts use the Stein release from April 2019 and Ubuntu 18.04 Bionic as the underlying operating system.

OpenStack architecture basics

OpenStack is composed of a large number of components and services which initially can be a bit confusing. However, a minimal OpenStack installation only requires a handful of services, which are displayed in the diagram below.

[Diagram: OpenStack components]

At the lowest layer, there are a couple of components that are used by OpenStack but provided by other open source projects. First, as OpenStack is written in Python, it uses the WSGI specification to expose its APIs. Some services come with their own WSGI container, others use Apache2 for that purpose.

Then, of course, OpenStack needs to persist the state of instances, networks and so forth somewhere and therefore relies on a relational database which by default is MariaDB (but could also be PostgreSQL, and in fact, every database that works with SQLAlchemy should do). Next, the different components of an OpenStack service communicate with each other via message queues provided by RabbitMQ and store data temporarily in Memcached. And finally, there is of course the hypervisor which by default is KVM.

On top of these infrastructure components, there are OpenStack services that lay the foundations for the compute, storage and network components. The first of these services is Keystone which provides identity management and a service catalog. All end users and all other services are registered as users with Keystone, and Keystone hands out tokens so that these users can access the APIs of the various OpenStack services.

Then, there is the Glance image service. Glance allows an administrator to import OS images for use with virtual machines in the cloud, similar to a Docker registry for Docker images. The third of these intermediate services is the placement service which used to be a part of Nova and is providing information on available and used resources so that OpenStack can decide where a virtual machine should be scheduled.

On the upper layer, we have the services that make up the heart of OpenStack. Nova is the compute service, responsible for interacting with the hypervisor to bring up and maintain virtual machines. Neutron is creating virtual networks so that these virtual machines can talk to each other and the outside world. And finally, Cinder (which is not absolutely needed in a minimum installation) is providing block storage.

There are a couple of services that we have not represented in this picture, like the GUI Horizon or the bare-metal service Ironic. We will not discuss Ironic in this series and we will set up Horizon, but mostly use the API.

OpenStack offers quite a bit of flexibility as to how these services are distributed among physical nodes. We can not only distribute these services, but can even split individual services and distribute them across several physical nodes. Neutron, for instance, consists of a server and several agents, and typically these agents are installed on each compute node. Over time, we will look into more complex setups, but for our first steps, we will use a setup where there is a single controller node holding most of the Nova services and one or more compute nodes on which parts of the Nova service and the Neutron service are running.

[Diagram: OpenStack deployment across controller and compute nodes]

In a later lab, we will build up an additional network host that runs a part of the Neutron network services, to demonstrate how this works.

Organisation of the upcoming series

In the remainder of this series, I will walk you through the installation of OpenStack in a virtual environment. But the main purpose of this exercise is not to simply have a working installation of OpenStack – if you want this, you could as well use one of the available installation methods like DevStack. Instead, the idea is to understand a bit what is going on behind the scenes – the architecture, the main configuration options, and here and then a little deep-dive into the source code.

To achieve this, we will discuss each service, its overall architecture, some use cases and the configuration steps, starting with the basic setup and ending with the Neutron networking service (on which I will put a certain focus out of interest). To turn this into a hands-on experience, I will guide you through a sequence of labs. In each lab, we will do some exercises and see OpenStack in action. Here is my current plan for how the series will be organized.

As always, the code for this series is available on GitHub. Most of the actual setup will be fully automated using Vagrant and Ansible. We will simulate the individual nodes as virtual machines using VirtualBox, but it should not be difficult to adapt this to the hypervisor of your choice. And finally, the setup is flexible enough to work on a sufficiently well equipped desktop PC as well as in the cloud.

After this general overview, let us now get started. In the next post, we will dive right into our first lab and install the base services that OpenStack needs.

Virtual networking labs – building a virtual router with iptables and Linux namespaces

When you are trying to understand virtual networking, container networks, micro segmentation and all this, sooner or later the day will come where you will have to deal with iptables, the built-in Linux firewall mechanism. After evading the confrontation with the full complexity of this remarkable beast for many years, I have recently decided to dive a little deeper into the internals of the Linux networking stack. Today, I will give you an overview of the inner workings of the machinery behind iptables and show you how to use this to build a virtual firewall in a Linux networking namespace.

Netfilter hooks in the Linux kernel

In order to understand how iptables work, we will have to take a short look at a mechanism called netfilter hooks in the Linux networking stack.

Netfilter hooks are points in the Linux networking code at which modules can add their own custom processing. When a packet is travelling up or down through the networking stack and reaches one of these points, it is handed over to the registered modules which can manipulate the packet and, by their return value, can either instruct the core networking code to continue with the processing of the packet or to drop it.

Let us take a closer look at where these netfilter hooks are placed in the kernel. The following diagram is a (simplified) summary of the path that packets take through the Linux IPv4 stack (for those readers who actually want to see this in the Linux kernel code, I have added some of the most relevant Linux kernel functions, referring to v4.2 of the kernel).

[Diagram: netfilter hooks in the Linux IPv4 stack]

A packet coming in from a network device will first reach the pre-routing hook. As the name indicates, this happens before a routing decision is taken. After passing this hook, the kernel will consult its routing tables. If the target IP address is the IP address of a local device, it will flag the packet for local delivery. These packets will now be processed by the input hook before they are handed over to the higher layers, e.g. a socket listening on a port.

If the routing algorithm determines that the packet is not targeted towards a local interface but needs to be forwarded, the path through the kernel is different. These packets will be handled by the forwarding code and pass the forward netfilter hook, followed by the post-routing hook. Then, the packet is sent to the outgoing network interface and leaves the kernel.

Finally, for packets that are locally generated by an application, the kernel first determines the route to the destination. Then, the modules registered for the output hook are invoked, before we also reach the post-routing hook as in the case of forwarding.

Having discussed netfilter hooks in general, let us now turn to iptables. Essentially, iptables is a framework sitting on top of the netfilter hooks which allows you to define rules that are evaluated at each of the hooks and determine the fate of the packet. For each netfilter hook, a set of rules called a chain is processed. Consequently, there is an input chain, an output chain, a pre-routing chain, a post-routing chain and a forward chain. It is also possible to define custom chains to which you can jump from one of the pre-built chains.

Iptables rules are further organized into tables and wired up with the kernel code using netfilter hooks, but not every table registers for every hook, i.e. not every table is represented in every chain. The following diagram shows which chain is present in which table.

[Diagram: which iptables chain is present in which table]

It is sometimes stated that iptables chains are contained in tables, but given the discussion of netfilter hooks above, I prefer to think of this as a matrix – there are chains and tables, and rules are sitting at the intersections of chains and tables, so that every rule belongs to a table and a chain. To illustrate this, let us look at the processing steps taken by iptables for a packet for a local destination.

  • Process the rules in the raw table in the pre-routing chain
  • Process the rules in the mangle table in the pre-routing chain
  • Process the rules in the nat table in the pre-routing chain
  • Process the rules in the mangle table in the input chain
  • Process the rules in the nat table in the input chain
  • Process the rules in the filter table in the input chain

Thus, rules are evaluated at every point in the above diagram where a white box indicates a non-empty intersection of tables and chains.

Iptables rules

Let us now see how the actual iptables rules are defined. Each rule consists of a match which determines to which packets the rule applies, and a target which determines the action taken on the packet. Some targets are terminating, meaning that the processing of the packet stops at this point, other targets are non-terminating, meaning that a certain action will be taken and processing continues. Here are a few examples of available targets, see the documentation listed in the last section for the full specification.

  • ACCEPT – accept the packet, i.e. do not apply any further rules within this combination of chain and table and instruct the kernel to let the packet pass
  • DROP – drop the packet, i.e. tell the kernel to stop processing of the packet without any further action
  • REJECT – tell the kernel to stop processing of the packet and send an ICMP reject message back to the origin of the packet
  • SNAT – perform source NATing on the packet, i.e. change the source IP address of the packet, more on this below
  • DNAT – perform destination NATing, i.e. change the destination IP address of the packet, again we will discuss this in a bit more detail below
  • LOG – log the packet and continue processing
  • MARK – mark the packet, i.e. attach a number which can again be used for matching in a subsequent rule

Note, however, that not every action can be used in every chain; certain actions are restricted to specific tables or chains.

Of course, it might happen that no rule matches. In this case, the default target is chosen, which is also known as the policy for a given table and chain.

As already mentioned above, it is also possible to define custom chains. These chains can be used as a target, which implies that processing will continue with the rules in this chain. From this chain, one can either return explicitly to the calling chain using the RETURN target, or, otherwise, the processing continues in the calling chain once all rules in the custom chain have been processed, so this is very similar to a function or subroutine in a high-level language. An example of such a custom chain is sketched below.
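
Here is a small, hedged sketch of a custom chain (the interface name eth1 is just a placeholder and not part of our lab setup): we create a chain LOGDROP that first logs and then drops a packet, and jump into it from the INPUT chain.

# Create the custom chain
sudo iptables -N LOGDROP
# Log every packet that reaches this chain ...
sudo iptables -A LOGDROP -j LOG --log-prefix "LOGDROP:" --log-level info
# ... and then drop it (DROP is terminating, so we never return to INPUT)
sudo iptables -A LOGDROP -j DROP
# Send all traffic arriving on the placeholder interface eth1 through the custom chain
sudo iptables -A INPUT -i eth1 -j LOGDROP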

Setting up our test lab

After all this theory, let us now see iptables in action and add some simple rules. First, we need to set up our lab. We will simulate a situation where two hosts, called boxA and boxB, are connected via a router, as indicated in the following diagram.

[Diagram: virtual routing lab with boxA, router and boxB]

We could of course do this using virtual machines, but as a lightweight alternative, we can also use network namespaces (it is worth mentioning that, similar to routing tables, iptables rules are per namespace). Here is a script (setupLab13.sh) that will set up this lab on your local machine.


# Create all namespaces
sudo ip netns add boxA
sudo ip netns add router
sudo ip netns add boxB
# Create veth pairs and move them into their respective namespaces
sudo ip link add veth0 type veth peer name veth1
sudo ip link set veth0 netns boxA
sudo ip link set veth1 netns router
sudo ip link add veth2 type veth peer name veth3
sudo ip link set veth3 netns boxB
sudo ip link set veth2 netns router
# Assign IP addresses
sudo ip netns exec boxA ip addr add 172.16.100.5/24 dev veth0
sudo ip netns exec router ip addr add 172.16.100.1/24 dev veth1
sudo ip netns exec boxB ip addr add 172.16.200.5/24 dev veth3
sudo ip netns exec router ip addr add 172.16.200.1/24 dev veth2
# Bring up devices
sudo ip netns exec boxA ip link set dev veth0 up
sudo ip netns exec router ip link set dev veth1 up
sudo ip netns exec router ip link set dev veth2 up
sudo ip netns exec boxB ip link set dev veth3 up
# Enable forwarding globally
echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
# Enable logging from within a namespace
echo 1 | sudo tee /proc/sys/net/netfilter/nf_log_all_netns

Let us now start playing with this setup a bit. First, let us see what default policies our setup defines. To do this, we need to run the iptables command within one of the namespaces representing the different virtual hosts. Fortunately, ip netns exec offers a very convenient way to do this – you simply pass a network namespace and an arbitrary command, and this command will be executed within the respective namespace. To list the current content of the mangle table in namespace boxA, for instance, you would run

sudo ip netns exec boxA \
   iptables -t mangle -L

Here, the switch -t selects the table we want to inspect, and -L is the command to list all rules in this table. The output will probably depend on the Linux distribution that you use. Hopefully, the tables are empty, and the default target (i.e. the policy) for all chains is ACCEPT (no worries if this is not the case, we will fix this further below). Also note that the output of this command will not contain every possible combination of tables and chains, but only those which actually are allowed by the diagram above.
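
If you want to get an overview of all tables in all of our namespaces at once, a small loop does the trick (this is just a convenience sketch, not part of the lab scripts).

for ns in boxA router boxB; do
  for table in raw mangle nat filter; do
    echo "=== $ns / $table ==="
    sudo ip netns exec $ns iptables -t $table -L -n
  done
done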

To be able to monitor the incoming and outgoing traffic, we now create our first iptables rule. This rule uses a special target LOG which simply logs the packet so that we can trace the flow through the involved hosts. To add such a rule to the filter table in the OUTPUT chain of boxA, enter

sudo ip netns exec boxA \
   iptables -t filter -A OUTPUT \
   -j LOG \
   --log-prefix "boxA:OUTPUT:filter:" \
   --log-level info

Let us briefly walk through this command to see how it works. First, we use the ip netns exec command to run a command (iptables in our case) inside a network namespace. Within the iptables command, we use the switch -A to add a new rule in the output chain, and the switch -t to indicate that this rule belongs to the filter table (which, actually, is the default if -t is omitted).

The switch -j indicates the target (“jump”). Here, we specify the LOG target. The remaining switches are specific parameters for the LOG target – we define a log prefix which will be added to every log message and the log level with which the messages will appear in the kernel log and the output of dmesg.

Again, I have created a script that you can run (using sudo) to add logging rules to all relevant combinations of chains and tables. In addition, this script will also add logging rules to detect established connections, more on this below, and will make sure that all default policies are ACCEPT and that no other rules are present.

Let us now try our first ping. We will try to reach boxB from boxA.

sudo ip netns exec boxA \
   ping -c 1 172.16.200.5

This will fail with the error message “Network unreachable”, as expected – we do have a route to the network 172.16.100.0/24 on boxA (which the Linux kernel creates automatically when we bring up the interface) but not for the network 172.16.200.0/24 that we try to reach. To fix this, let us now add a route pointing to our router.

sudo ip netns exec boxA \
   ip route add default via 172.16.100.1

When we now try a ping, we do not get an error message any more, but the ping still does not succeed. Let us use our logs to see why. When you run dmesg, you should see an output similar to the sample output below.

[ 5216.449403] boxA:OUTPUT:raw:IN= OUT=veth0 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449409] boxA:OUTPUT:mangle:IN= OUT=veth0 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449412] boxA:OUTPUT:nat:IN= OUT=veth0 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449415] boxA:OUTPUT:filter:IN= OUT=veth0 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449416] boxA:POSTROUTING:mangle:IN= OUT=veth0 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449418] boxA:POSTROUTING:nat:IN= OUT=veth0 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449437] router:PREROUTING:raw:IN=veth1 OUT= MAC=c6:76:ef:89:cb:ec:96:ad:71:e1:0a:28:08:00 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449441] router:PREROUTING:mangle:IN=veth1 OUT= MAC=c6:76:ef:89:cb:ec:96:ad:71:e1:0a:28:08:00 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449443] router:PREROUTING:nat:IN=veth1 OUT= MAC=c6:76:ef:89:cb:ec:96:ad:71:e1:0a:28:08:00 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449447] router:FORWARD:mangle:IN=veth1 OUT=veth2 MAC=c6:76:ef:89:cb:ec:96:ad:71:e1:0a:28:08:00 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449449] router:FORWARD:filter:IN=veth1 OUT=veth2 MAC=c6:76:ef:89:cb:ec:96:ad:71:e1:0a:28:08:00 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449451] router:POSTROUTING:mangle:IN= OUT=veth2 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449452] router:POSTROUTING:nat:IN= OUT=veth2 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449474] boxB:PREROUTING:raw:IN=veth3 OUT= MAC=2a:12:10:db:37:49:a6:cd:a5:c0:7d:56:08:00 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449477] boxB:PREROUTING:mangle:IN=veth3 OUT= MAC=2a:12:10:db:37:49:a6:cd:a5:c0:7d:56:08:00 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 
[ 5216.449479] boxB:PREROUTING:nat:IN=veth3 OUT= MAC=2a:12:10:db:37:49:a6:cd:a5:c0:7d:56:08:00 SRC=172.16.100.5 DST=172.16.200.5 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=15263 DF PROTO=ICMP TYPE=8 CODE=0 ID=20237 SEQ=1 

We see nicely how the various tables are traversed, starting with the four tables in the output chain of boxA. We also see the packet traversing the POSTROUTING chain of the router, leaving it towards boxB, and being picked up by boxB. However, no reply reaches boxA.

To understand why this happens, let us look at the last logging entry that we have from boxB. Here, we see that the request (ICMP type 8) is entering with the source IP address of boxA, i.e. 172.16.100.5. However, there is no route to this host on boxB, as boxB only has one network interface which is connected to 172.16.200.0/24. So boxB cannot generate a reply message, as it does not know how to route this message to boxA.

By the way, you might ask yourself why there are no log entries for the INPUT chain on boxB. The answer is that the Linux kernel has a feature called reverse path filtering. When this filter is enabled (which it seems to be on most Linux distributions by default), then the kernel will silently drop messages coming in from an IP address to which it has no outgoing route, as defined in RFC 3704. For documentation on how to turn this off, see this link; a sketch of the relevant settings is shown below.
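
If you want to experiment with this yourself, the filter can be relaxed inside the boxB namespace using sysctl. A quick sketch (setting the value to zero disables the check; the kernel consults both the all key and the per-device key):

sudo ip netns exec boxB sysctl -w net.ipv4.conf.all.rp_filter=0
sudo ip netns exec boxB sysctl -w net.ipv4.conf.veth3.rp_filter=0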

So how can we fix this problem and enable boxB to send an ICMP reply back to boxA? The first idea you might have is to simply add a route on boxB to the network 172.16.100.0/24 with the router as the next hop. This would work in our lab, but there is a problem with this approach in real life.

In a realistic scenario, boxA would typically be a machine in the private network of an organization, using a private IP address from a private address range which is far from being unique, whereas boxB would be a host with a public IP address somewhere on the Internet. Therefore we cannot simply add a route on boxB for the IP address of boxA, which is private and should never appear in a public network like the Internet.

What we can do, however, is to make the reply traffic target the public interface of our router instead, as the IP address of this interface typically is a public IP address that boxB can reach. But how do we get boxB to address its replies to the router rather than to boxA?

Somehow we would have to divert reply traffic directed towards boxA to the public interface of our router. In fact, this is possible, and this is where SNAT comes into play.

SNAT (source network address translation) simply means that the router will replace the source IP address of boxA by the IP address of its own outgoing interface (i.e. 172.16.200.1 in our case) before putting the packet on the network. When the packet (for instance an ICMP echo request) reaches boxB, boxB will try to send the answer back to this address which is reachable. So boxB will be able to create a reply, which will be directed towards the router. The router, being smart enough to remember that it has manipulated the IP address, will then apply the reverse mapping and forward the packet to boxA.

To establish this mechanism, we will have to add a corresponding rule with the target SNAT to an appropriate chain of the router. We use the postrouting chain, which is traversed immediately before the packet leaves the router, and put the rule into the NAT table which exists for exactly this purpose.

sudo ip netns exec router \
   iptables -t nat \
   -A POSTROUTING \
   -o veth2 \
   -j SNAT --to 172.16.200.1

Here, we also use our first match – in this case, we apply this rule to all packets leaving the router via veth2, i.e. the public interface of our router.

When we now repeat the ping, this should work, i.e. we should receive a reply on boxA. It is also instructive to again inspect the logging output created by iptables using dmesg where we can observe nicely that the IP destination address of the reply changes to the IP address of boxA after traversing the mangle table of the PREROUTING chain of the router (this change is done before the routing decision is taken, to make sure that the route which is determined is correct). We also see that there are no logging messages from our NAT tables anymore on the router for the reply, because the NAT table is only traversed for the first packet in each stream and the same action is applied to all subsequent packets of this stream.
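
To see the connection tracking at work, you can also display the conntrack table of the router. A quick sketch (the conntrack command line tool typically comes in a separate package, for instance the conntrack package on Ubuntu; alternatively, the kernel exposes similar data under /proc):

# Using the conntrack CLI (if installed)
sudo ip netns exec router conntrack -L
# Or directly from the proc filesystem
sudo ip netns exec router cat /proc/net/nf_conntrack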

Adding firewall functionality

All this is nice, but there is still an important feature that we need in a real world scenario. So far, our router acts as a router in both directions – the default policies are ACCEPT, and traffic coming in from the “public” interface veth2 will happily be forwarded to boxA. In real life, of course, this is exactly what you do not want – you want to protect boxA against unwanted incoming traffic to decrease the attack surface.

So let us now try to block unwanted incoming traffic on the public device veth2 of our router. Our first idea could be to simply change the default policy for the filter table on each of the chains INPUT and FORWARD to DROP. As one of these chains is traversed by incoming packets, this should do the trick. So let us try this.

sudo ip netns exec router \
   iptables -t filter \
   -P INPUT DROP
sudo ip netns exec router \
   iptables -t filter \
   -P FORWARD DROP

Of course this was not a really good idea, as we immediately learn when we execute our next ping on boxA. As we have changed the default for the FORWARD chain to drop, our ICMP echo request is dropped before being able to leave the router. To fix this, let us now add an additional rule to the FORWARD chain of the filter table which ACCEPTs all traffic coming from the private network, i.e. veth1.

sudo ip netns exec router \
   iptables -t filter \
   -A FORWARD \
   -i veth1 -j ACCEPT

When we now repeat the ping, we will see that the ICMP request again reaches boxB and a reply is generated. However, there is still a problem – the reply will reach the router via the public interface, and hence will be dropped.

To solve this problem, we would need a mechanism which would allow the router to identify incoming packets as replies to a previously sent outgoing packet and to let them pass. Again, iptables has a good answer to this – connection tracking.

Connection tracking

Iptables is a stateful firewall, meaning that it is able to maintain the state of a connection. During its life, a connection undergoes state transitions between several states, and an iptables rule can refer to this state and match a packet only if the underlying connection is in a certain state.

  • When a connection is not yet established, i.e. when a packet is observed that does not seem to relate to an existing connection, the connection is created in the state NEW
  • Once the kernel has seen packets in both directions, the connection is moved into the state ESTABLISHED
  • There are connections which could be RELATED to an existing connection, for instance for FTP data connections
  • Finally, a connection can be INVALID which means that the iptables connection tracking algorithm is not able to handle the connection

To use connection tracking, we have to add the -m conntrack switch to our iptables rule, which instructs iptables to load the connection tracking module, and then the --ctstate switch to refer to one or more states. The following rule will accept incoming traffic which belongs to an established connection, i.e. reply traffic.

sudo ip netns exec router \
   iptables -t filter \
   -A FORWARD \
   -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

After adding this rule, a ping from boxA to boxB should work again, and the log messages should show that the request travels from boxA to boxB across the router and that the reply travels the same way back without being blocked.
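
For reference, here is the complete rule set of the router that we have built up step by step in this section, condensed into a short sketch (assuming you start from empty chains):

# Drop everything that is not explicitly allowed
sudo ip netns exec router iptables -t filter -P INPUT DROP
sudo ip netns exec router iptables -t filter -P FORWARD DROP
# Allow all traffic coming in from the private interface
sudo ip netns exec router iptables -t filter -A FORWARD -i veth1 -j ACCEPT
# Allow reply traffic belonging to established connections
sudo ip netns exec router iptables -t filter -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Source NATing for outgoing traffic on the public interface
sudo ip netns exec router iptables -t nat -A POSTROUTING -o veth2 -j SNAT --to 172.16.200.1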

Destination NATing

Let us summarize what we have done so far. At this point, our router and firewall is able to

  • Allow traffic from the internal network, i.e. boxA, to pass through the router and reach the public network, i.e. boxB
  • Conceal the private IP address of boxA by applying source NATing
  • Allow reply traffic to pass through the router from the public network back into the private network
  • Block all other traffic from the public network from reaching the private network

However, in some cases, there might actually be a good reason to allow incoming traffic to reach boxA on our internal network. Suppose, for instance, we had a web server (which, as far as this lab is concerned, will be a simple Python script) running on boxA which we want to make available from the public network. We would then want to allow incoming traffic to a dedicated port, say 8800.

Of course, we could add a rule that ACCEPTs incoming traffic (even if it is not a reply) when the target port is 8800. But we need a bit more than this. Recall that the IP address of boxA is not visible on the public network, but the IP address of the router (the IP address of the veth2 interface) is. To make our web server port reachable from the public network, we would need to divert traffic targeting port 8800 of the router to port 8800 of boxA, as indicated in the diagram below.

[Diagram: destination NATing from the public interface of the router to boxA]

Again, there is a form of NATing that can help – destination NATing. Here, we leave the source IP address of the incoming packet as it is, but instead change the destination IP address. Thus, when a request comes in for port 8800 of the router, we change the target IP address to the IP address of boxA. When we do this in the PREROUTING chain, before a routing decision has been taken, the kernel will recognize that the new IP destination address is not a local address and will forward the packet to boxA.

To try this out, we first need a web server. I have put together a simple WSGI based web server, which will be present in the directory lab13 if you have cloned the corresponding repository. In a separate window, start the web server, making it run in the namespace of boxA.

cd lab13
sudo ip netns exec boxA python3 server.py
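
Before touching the firewall, it does not hurt to verify that the web server is actually up. Assuming that server.py listens on port 8800 (the port we are about to forward), a quick local test from within boxA could look like this:

sudo ip netns exec boxA curl -w "\n" 172.16.100.5:8800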

Now let us add a destination NATing rule to our router. As mentioned before, the change of the destination address needs to take place before the routing decision is taken, i.e. in the PREROUTING chain.

sudo ip netns exec router \
  iptables -t nat -A PREROUTING \
  -p tcp \
  -i veth2 \
  --destination-port 8800 \
  -j DNAT \
  --to-destination 172.16.100.5:8800

In addition, we need to ACCEPT traffic to this new destination in the FORWARD chain.

sudo ip netns exec router \
  iptables -t filter -A FORWARD \
  -p tcp \
  -i veth2 \
  --destination-port 8800 \
  -d 172.16.100.5 \
  -j ACCEPT

Let us now try to reach our web server from boxB.

sudo ip netns exec boxB \
  curl -w "\n" 172.16.200.1:8800

You should now see a short output (an HTML document with “Hello!” in it) from our web server, indicating that the connection worked. Effectively, we have “poked a hole” into our firewall, connecting port 8800 on the public side of our router to port 8800 of boxA. Of course, we could also use any other combination of ports, i.e. instead of mapping 8800 to itself, we could as well map port 80 to 8800 so that we could reach our web server on the public IP address of the router on the standard port.
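
To convince yourself that the DNAT rule is really being hit, it is instructive to display the nat table with packet counters. A short sketch:

sudo ip netns exec router iptables -t nat -L PREROUTING -n -v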

Of course there is much more that we could say about iptables, but this discussion of the core features should put you in a position to read and interpret most iptables rule sets that you are likely to encounter when working with virtual networks, cloud technology and containers. I highly recommend browsing the references below to learn more, and looking at those chains on your local machine that Docker and libvirt install to get an idea how this is used in practice.

References

Virtual networking labs – using OpenFlow

In the last few posts, we have already touched on the OpenFlow protocol that plays a central role in most SDNs. Today, we will learn more on OpenFlow and again use Open vSwitch to see the protocol in action.

OpenFlow – the basics

Recall from our previous post that OpenFlow is a protocol that the control plane and the data plane of an SDN use to exchange rules that determine the flow of packets through the (virtual or physical) infrastructure. The standard is maintained by the Open Networking Foundation and is available here.

Let us first try to understand some basic terms. First, the specification describes the behavior of a switch. Logically, the switch is decomposed into two components. There is the control channel which is the part of the switch that communicates with the OpenFlow controller, and there is the datapath which consists of tables defining the flow of packets through the switch and the ports.

The switch maintains two sets of tables that are specified by OpenFlow. First, there are flow tables that contain – surprise – flows, and then, there are group tables. Let us discuss flow tables first.

An entry in a flow table (a flow entry or flow for short) is very similar to a firewall rule in e.g. iptables. It consists of the following parts.

  • A set of match fields that are used to see which flow entry applies to which Ethernet packet
  • An action that is executed on a match
  • A set of counters
  • Priorities that apply if a packet matches more than one flow entry
  • Additional control fields like timeouts, flags or cookies that are passed through

[Diagram: structure of an OpenFlow flow table entry]

Flow tables can be chained in a pipeline. When a packet comes in, the flow tables in the pipeline are processed in sequence. Depending on the action of a matching flow table entry, a packet can then be sent directly to an outgoing port, or be forwarded to the next table in the pipeline for further processing (using the goto-table action). Optionally, a pipeline can be divided into three parts – ingress tables, group tables (more on this later) and egress tables.

A table typically contains one entry with priority zero (i.e. the lowest priority) and no match fields. As non-existing match fields are considered as wildcards, this flow matches all packets that are not consumed by other, more specific flows. Therefore, this entry is called the table-miss entry and determines how packets with no other matching rule are handled. Often, the action associated with this entry is to forward the packet to a controller to handle it. If not even a table-miss entry exists in the first table, the packet is dropped.

While the packet traverses the pipeline, an action set is maintained. Each flow entry can add an action to the set or remove an action or run a specific action immediately. If a flow entry does not forward the packet to the next table, all actions which are present in the action set will be executed.

The exact set of actions depends on the implementation as there are many optional actions in the specification. Typical actions are forwarding a packet to another table, sending a packet to an output port, adding or removing VLAN tags, or setting specific fields in the packet headers.
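
To make this a bit more concrete, here are a few hedged examples of flow entries using typical actions, written in the syntax of the ovs-ofctl tool that we will use in the lab below (the bridge name myBridge and the port numbers are assumptions, not part of a specific setup):

# Forward all traffic arriving on OpenFlow port 1 to port 2
sudo ovs-ofctl -O OpenFlow14 add-flow myBridge 'in_port=1,actions=output:2'
# Drop all ARP traffic (the drop action simply discards the packet)
sudo ovs-ofctl -O OpenFlow14 add-flow myBridge 'eth_type=0x0806,actions=drop'
# An explicit table-miss entry that punts unmatched packets to the controller
sudo ovs-ofctl -O OpenFlow14 add-flow myBridge 'priority=0,actions=controller'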

In addition to flow tables, an OpenFlow compliant switch also maintains a group table. An entry in the group table is a bit like a subroutine, it does not contain any matching criteria, but packets can be forwarded to a group by a flow entry. A group contains one or more buckets each of which in turn contains a set of actions. When a packet is processed by a group table entry, a copy will be created for each bucket, and to each copy the actions in the respective bucket will be applied. Group tables have been added with version 1.1 of the specification.
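
Again, a small hedged sketch in ovs-ofctl syntax might help (group support requires OpenFlow 1.1 or higher, and the port numbers are again assumptions): we define a group of type all with two buckets, so that every packet sent to the group is duplicated and sent out on both ports, and then add a flow that forwards traffic from port 1 to this group.

sudo ovs-ofctl -O OpenFlow14 add-group myBridge 'group_id=1,type=all,bucket=output:2,bucket=output:3'
sudo ovs-ofctl -O OpenFlow14 add-flow myBridge 'in_port=1,actions=group:1'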

Lab13: seeing OpenFlow in action

After all this theory, it is time to see OpenFlow in action. For that purpose, we will use the setup in lab11, as shown below.

[Diagram: OVS overlay network from lab 11]

Let us bring up this scenario again.

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab11
vagrant up

OVS comes with an OpenFlow client ovs-ofctl that we can use to inspect and change the flows. Let us use this to display the initial content of the flow tables. On boxA, run

sudo ovs-vsctl set bridge myBridge protocols=OpenFlow14
sudo ovs-ofctl -O OpenFlow14 show myBridge

The first command instructs the bridge to use version 1.4 of the OpenFlow protocol (by default, an OVS bridge still uses the rather outdated version 1.0). The second command asks the CLI to provide some information on the bridge itself and the ports. Note that the bridge has a datapath id (dpid) which is identical to the datapath ID stored in the Bridge OVSDB table (use ovs-vsctl list Bridge to verify this). For this and for all commands, we use the switch -O to make sure that the client uses version 1.4 of the protocol as well. As an example, here is the command to display the flow table.

sudo ovs-ofctl -O OpenFlow14 dump-flows myBridge

This should create only one line of output, describing a flow in table 0 (the first table in the pipeline) with priority 0 and no match fields. This is the table-miss rule mentioned above. The associated action NORMAL specifies that the normal switch processing should take place, i.e. the device should operate like an ordinary switch without involving OpenFlow.

Now let us generate some traffic. Open an SSH connection to boxB and execute

sudo docker exec -it web3 /bin/bash
ping -i 1 -c 10 172.16.0.1 

to create 10 ICMP packets. If we now take another look at the OpenFlow table, we should see that the counter n_packets has changed and should now be 24 (10 ICMP requests, 10 ICMP replies, 2 ARP requests and 2 ARP replies).

Next, we will add a flow which will drop all traffic with TCP target port 80 coming in from the container web3. Here is the command to do this.

sudo ovs-ofctl \
     -O OpenFlow14 \
     add-flow \
     myBridge \
     'table=0,priority=1000,eth_type=0x0800,ip_proto=6,tcp_dst=80,actions='

The syntax of the match fields and the rules is a bit involved and described in more detail in the man-pages for ovs-actions and ovs-fields. Note that we do not specify an action, which implies that the packet will be dropped. When you display the flow tables again, you will see the additional rule being added.

Now head over into the terminal connected to the container web3 and try to curl web1. You should see an error message from curl telling you that the destination could not be reached. If you dump the flows once more, you will find that the statistics of our newly added rule have changed and the counter n_packets is now 2. If we delete the flow again using

sudo ovs-ofctl \
     -O OpenFlow14 \
     del-flows \
     myBridge \
     'table=0,eth_type=0x0800,ip_proto=6,tcp_dst=80'

and repeat the curl, it should work again.

This was of course a simple setup, with only one table and no group tables. Much more complex processing can be realized with OpenFlow, and we refer to the OVS tutorials to see some examples in action.

One final note which might be useful when it comes to debugging. OVS is – as many switches – not a pure OpenFlow switch, but in addition to the OpenFlow tables maintains another set of rules called the datapath flow. Sometimes, this causes confusion when the observed results do not seem to match the existing OpenFlow table entries. These additional flows can be dumped using the ovs-appctl tool, see the Open vSwitch FAQs for details.
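
As an illustration (a sketch; the exact output will depend on your installation), the kernel datapath flows and the datapaths themselves can be displayed as follows.

# Dump the flows that are currently installed in the kernel datapath
sudo ovs-appctl dpctl/dump-flows
# Show the datapaths, their ports and hit/miss statistics
sudo ovs-appctl dpctl/show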

Virtual networking labs – Open vSwitch in practice

In the last post, we have discussed the architecture of Open vSwitch and how it can be combined with a control plane to realize an SDN. Today, we will make this a bit more tangible by running two hands-on labs with OVS.

The labs in this post are modelled after some of the How-to documents that are part of the Open vSwitch documentation, but use a combination of virtual machines and Docker to avoid the need for more than one physical machine. In both labs, we bring up two virtual machines which are connected via a VirtualBox virtual network, and inside each machine, we bring up two Docker containers that will eventually interact via OVS bridges.

Lab 11: setting up an overlay network with Open vSwitch

In the first lab, we will establish interaction between the OVS bridges on the two involved virtual machines using an overlay network. Specifically, the Docker containers on each VM will be connected to an OVS bridge, and the OVS bridges will use VXLAN to talk to each other, so that effectively, all Docker containers appear to be connected to an Ethernet network spanning the two virtual machines.

[Diagram: OVS overlay network spanning two virtual machines]

Instead of going through all steps required to set this up, we will again bring up the machines automatically using a combination of Vagrant and Ansible, and then discuss the major steps and the resulting setups. To run the lab, you will again have to download the code from my repository and start Vagrant.

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab11
vagrant up

While this is running, let us quickly discuss what the scripts are doing. First, of course, we create two virtual machines, each running Ubuntu Bionic. In each machine, we install Open vSwitch and Docker. We then install the docker Python3 module to make Ansible happy.

Next, we bring up two Docker containers, each running an image which is based on NGINX but has some networking tools installed on top. For each container, we set up a VETH device pair. One of the devices is then moved into the networking namespace of the container, and the other one will later be added to our bridge, so that these VETH device pairs effectively operate like an Ethernet cable connecting the containers to the bridge. A manual version of this step is sketched below.
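
If you prefer to do this by hand instead of relying on the Ansible scripts, the steps for one container would look roughly like this (a sketch for the container web1; the device names are chosen to match those used in the lab).

# Create a VETH pair for web1
sudo ip link add web1_veth0 type veth peer name web1_veth1
# Determine the PID of the container and move one end into its network namespace
web1PID=$(sudo docker inspect --format='{{.State.Pid}}' web1)
sudo ip link set web1_veth0 netns $web1PID
# The other end stays in the root namespace and will later be attached to the bridge
sudo ip link set web1_veth1 up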

We then create the OVS bridge. In the Ansible script, we use the Ansible OVS module to do this, but if you wanted to create the bridge manually, you would use a command like

ovs-vsctl add-br myBridge \
           -- add-port myBridge web1_veth1 \
           -- add-port myBridge web2_veth1

This is actually a combination of three commands (i.e. updates on the OVSDB database) which will be run in one single transaction (the OVS CLI uses the double dashes to combine commands into one transaction). With the first part of the command, we create a virtual OVS bridge called myBridge. With the second and third line, we then add two ports, connected to the two VETH pairs that we have created earlier.

Once the bridge exists and is connected to the containers, we add a third port, which is a VXLAN port. Using a manual setup, this would be the result of the following commands.

ovs-vsctl add-port myBridge vxlan0 \
          -- set interface vxlan0 type=vxlan options:remote_ip=192.168.50.5 options:dst_port=4789 options:ttl=5

Again, we atomically add the port to the bridge and pass the VXLAN options. We set up the VTEP as a point-to-point connection to the second virtual machine, using the standard UDP port and a TTL of five to prevent the encapsulated UDP packets from being dropped prematurely.

Finally, we configure the various devices and assign IP addresses. To configure the devices in the container namespaces, we could attach to the containers, but it is easier to use nsenter to run the required commands within the container namespaces. A sketch for the first container follows below.
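
Here is a hedged sketch of these commands (the IP address 172.16.0.1 is an assumption based on the address range used in this lab):

# Get the PID of the container and enter its network namespace with nsenter
web1PID=$(sudo docker inspect --format='{{.State.Pid}}' web1)
sudo nsenter -t $web1PID -n ip addr add 172.16.0.1/24 dev web1_veth0
sudo nsenter -t $web1PID -n ip link set web1_veth0 up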

Once the setup is complete, we are ready to explore the newly created machines. First, use vagrant ssh boxA to log into boxA. From there, use Docker exec to attach to the first container.

sudo docker exec -it web1 "/bin/bash"

You should now be able to ping all other containers, using the IP addresses 172.16.0.2 – 172.16.0.4. If you run arp -n inside the container, you will also find that all three IP addresses are directly resolved into MAC addresses and are actually present on the same Ethernet segment.

To inspect the bridges that OVS has created, exit the container again so that we are now back in the SSH session on boxA and use the command line utility ovs-vsctl to list all bridges.

sudo ovs-vsctl list-br

This will show us one bridge called myBridge, as expected. To get more information, run

sudo ovs-vsctl show

This will print out the full configuration of the current OVS node. The output should look similar to the following snippet.

af3a2230-3d2a-4364-a0b9-1da4e32211e4
    Bridge myBridge
        Port "web2_veth1"
            Interface "web2_veth1"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {dst_port="4789", remote_ip="192.168.50.5", ttl="5"}
        Port "web1_veth1"
            Interface "web1_veth1"
        Port myBridge
            Interface myBridge
                type: internal
    ovs_version: "2.9.2"

We can see that the output nicely reflects the structure of our network. There is one bridge, with three ports – the two VETH ports and the VXLAN port. We also see the parameters of the VXLAN ports that we have specified during creation. It is also possible to obtain the content of the OVSDB tables that correspond to the various objects in JSON format.

sudo ovs-vsctl list bridge
sudo ovs-vsctl list port
sudo ovs-vsctl list interface

Lab 12: VLAN separation with Open vSwitch

In this lab, we will use a setup which is very similar to the previous one, but with the difference that we use layer 2 technology to span our network across the two virtual machines. Specifically, we establish two VLANs with ids 100 (containing web1 and web3) and 200 (containing the other two containers). On those two logical Ethernet networks, we establish two different layer 3 networks – 192.168.50.0/24 and 192.168.60.0/24.

[Diagram: VLAN separation lab with two VLANs across two virtual machines]

The first part of the setup – bringing up the containers and creating the VETH pairs – is very similar to the previous labs. Once this is done, we again set up the two bridges. On boxA, this would be done with the following sequence of commands.

sudo ovs-vsctl add-br myBridge
sudo ovs-vsctl add-port myBridge enp0s8
sudo ovs-vsctl add-port myBridge web1_veth1 tag=100
sudo ovs-vsctl add-port myBridge web2_veth1 tag=200

This will create a new bridge and first add the VM interface enp0s8 to it. Note that by default, every port added to OVS is a trunk port, i.e. the traffic will carry VLAN tags. We then add the two VETH ports with the additional parameter tag which will mark the port as an access port and define the corresponding VLAN ID.
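
The tag of a port is just another column in the Port table of the OVSDB, so you can inspect it and even change it after the fact using ovs-vsctl. A short sketch (the VLAN id 300 is just an example):

# Display the full Port record, including the tag column
sudo ovs-vsctl list port web1_veth1
# Move the port into another VLAN
sudo ovs-vsctl set port web1_veth1 tag=300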

Next, we need to fix our IP setup. We need to remove the IP address from enp0s8, as this device is now part of our bridge, and set the IP addresses for the two VETH devices inside the containers.

sudo ip addr del 192.168.50.4 dev enp0s8
web1PID=$(sudo docker inspect --format='{{.State.Pid}}' web1)
sudo nsenter -t $web1PID -n ip addr add  192.168.50.1/24 dev web1_veth0
web2PID=$(sudo docker inspect --format='{{.State.Pid}}' web2)
sudo nsenter -t $web2PID -n ip addr add  192.168.60.2/24 dev web2_veth0

Finally, we need to bring up the devices.

sudo nsenter -t $web1PID -n ip link set  web1_veth0 up
sudo nsenter -t $web2PID -n ip link set  web2_veth0 up
sudo ip link set web1_veth1 up
sudo ip link set web2_veth1 up

The setup of boxB proceeds along the following lines. In the lab, we again use Ansible scripts to do all this, but if you wanted to do it manually, you would have to run the following on boxB.

sudo ovs-vsctl add-br myBridge
sudo ovs-vsctl add-port myBridge enp0s8
sudo ovs-vsctl add-port myBridge web3_veth1 tag=100
sudo ovs-vsctl add-port myBridge web4_veth1 tag=200
sudo ip addr del 192.168.50.5 dev enp0s8
web3PID=$(sudo docker inspect --format='{{.State.Pid}}' web3)
sudo nsenter -t $web3PID -n ip addr add  192.168.50.3/24 dev web3_veth0
web4PID=$(sudo docker inspect --format='{{.State.Pid}}' web4)
sudo nsenter -t $web4PID -n ip addr add  192.168.60.4/24 dev web4_veth0
sudo nsenter -t $web3PID -n ip link set  web3_veth0 up
sudo nsenter -t $web4PID -n ip link set  web4_veth0 up
sudo ip link set web3_veth1 up
sudo ip link set web4_veth1 up

Instead of manually setting up the machines, I have of course again composed a couple of Ansible scripts to do all this. To try this out, run

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab12
vagrant up 

Now log into one of the boxes, say boxA, attach to the web1 container and try to ping web3 and web4.

vagrant ssh boxA
sudo docker exec -it web1 /bin/bash
ping 192.168.50.3
ping 192.168.60.4

You should see that you can get a connection to web3, but not to web4. This is of course what we expect, as the VLAN tagging is supposed to separate the two networks. To see the VLAN tags, open a second session on boxA and enter

sudo tcpdump -e -i enp0s8

When you now repeat the ping, you should see that the traffic generated from within the container web1 carries the VLAN tag 100. This is because the port to which enp0s8 is attached has been set up as a trunk port. If you stop the dump and start it again, but this time listening on the device web1_veth1 which we have added to the bridge as an access port, you should see that no VLAN tag is present. Thus the bridge operates as expected by adding the VLAN tag according to the tag of the access port on which the traffic comes in.
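
If the trunk carries a lot of traffic, it can be useful to restrict the capture to one VLAN. tcpdump supports this with the vlan keyword (a quick sketch):

sudo tcpdump -e -i enp0s8 vlan 100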

In the next post, we will start to explore another important feature of OVS – controlling traffic using flows.

Virtual networking labs – a short introduction to Open vSwitch

In the previous posts, we have used standard Linux tools to establish and configure our network interfaces. This is nice, but becomes very difficult to manage if you need to run environments with hundreds or even thousands of machines. Open vSwitch (OVS) is an Open source software switch which can be integrated with SDN control planes and cloud management software. In this post, we will look a bit at the theoretical background of OVS, leaving the practical implementation of some examples to the next post.

Some terms from the world of software defined networks

It is likely that you have heard the magical word SDN before, and it is also quite likely that you have already found that giving a precise meaning to this term is hard. Still, there is a certain agreement that one of the core ideas of SDN is to separate the flow of data through your networking devices from the networking configuration.

In a traditional data center, your network would be implemented by a large number of devices like switches and routers. Each of these devices holds some configuration and typically has a way to change that configuration remotely. Thus, the configuration is tightly integrated with the networking infrastructure, and making sure that the entire configuration is consistent and matches the desired state of your network is hard.

With software defined networking, you separate the configuration from the networking equipment and manage it centrally. Thus, the networking equipment handles the flow of data – and is referred to as the data plane or flow plane – while a central component called the control plane is responsible for controlling the flow of data.

This is still a bit vague, but becomes a bit more tangible when we look at an example. Enter Open vSwitch (OVS). OVS is a software switch that turns a Linux server (which we will call a node) into a switch. Technically, OVS is a set of server processes that are installed on each node and that handle the network flow between the interfaces of the node. These nodes together make up the data plane. On top of that, there is a control plane or controller. This controller talks to the individual nodes to make sure the rules that they use to manage traffic (called the flows) are set up accordingly.

To allow controllers and switch nodes to interact, an open standard called OpenFlow has been created which defines a common way to describe flows and to exchange data between the controller and the switches. OVS supports several versions of OpenFlow (by default, a bridge speaks the rather old version 1.0, but newer versions such as 1.4 can be enabled per bridge) and thus can be combined with OpenFlow based controllers like Faucet or Open Daylight, creating a layered architecture as follows. Additionally, a switch can be configured to ask the controller how to handle a packet for which no matching flow can be found.

[Diagram: layered architecture with OpenFlow controller and OVS nodes]

Here, OVS uses OpenFlow to exchange flows with the controller. To exchange information on the underlying configuration of the virtual bridge (which ports are connected, how are these ports set up, …) OVS provides a second protocol called OVSDB (see below) which can also be used by the control plane to change the configuration of the virtual switch (some people would probably prefer to call the part of the control logic which handles this the management plane in contrast to the control plane, which really handles the data flow only).

Components of Open vSwitch

Let us now dig a little bit into the architecture of OVS itself. Essentially, OVS consists of three components plus a set of command-line interfaces to operate the OVS infrastructure.

First, there is the OVS virtual switch daemon ovs-vswitchd. This is a server process running on the virtual switch and is connected to a socket (usually a Unix socket, unless it needs to communicate with controllers not on the same machine). This component is responsible for actually operating the software defined switch.

Then, there is a state store, in the form of the ovsdb-server process. This process is maintaining the state that is managed by OVS, i.e. the objects like bridges, ports and interfaces that make up the virtual switch, and tables like the flow tables used by OVS. This state is usually kept in a file in JSON format in /etc/openvswitch. The switch daemon connects to the OVSDB server over a Unix domain socket and uses this connection to exchange information with it (in the database world, the switch daemon is a client of the OVSDB database server). Other clients can connect to the OVSDB using a JSON based protocol called the OVSDB protocol (which is described in RFC 7047) to retrieve and update information.

The third main component of OVS is a Linux kernel module openvswitch. This module is now part of the official Linux kernel tree and therefore is typically pre-installed. This kernel module handles one part of the OVS data path, sometimes called the fast path. Known flows are handled entirely in kernel space. New flows are handled once in the user space part of the datapath (slow path) and then, once the flow is known, subsequently in the kernel data path.
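
If you want to verify this on your machine, you can check that the module is loaded and display its metadata (a quick sketch; the output will vary by kernel version):

lsmod | grep openvswitch
modinfo openvswitch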

Finally, there are various command-line interfaces, the most important one being ovs-vsctl. This utility can be used to add, modify and delete the switch components managed by OVS like bridges, ports and so forth – more on this below. In fact, this utility operates by making updates to the OVSDB, which are then detected and realized by the OVS switch daemon. So the OVSDB is the authoritative source of the target state of the system.

The OVS data model

To understand how OVS operates, it is instructive to look at the data model that describes the virtual switches deployed by OVS. This model is verbally described in the man pages. If you have access to a server on which OVS is installed, you can also get a JSON representation of the data model by running

sudo ovsdb-client get-schema Open_vSwitch

At the top level of the hierarchy, there is a table called Open_vSwitch. This table contains a set of configuration items, like the supported interface types or the version of the database being used.
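
Instead of scanning the full JSON schema by hand, you can also let ovsdb-client list the tables and their columns for you. A quick sketch:

# list all tables defined in the Open_vSwitch database
sudo ovsdb-client list-tables Open_vSwitch
# show the columns of an individual table, here the Bridge table
sudo ovsdb-client list-columns Open_vSwitch Bridge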

Next, there are bridges. A bridge has one or more ports and is associated with a set of tables, each table representing a protocol that OVS supports to obtain flow information (for instance NetFlow or OpenFlow). Note that the Flow_Table does not contain the actual OpenFlow flow table entries, but just additional configuration items for a flow table. In addition, there are mirror ports which are used to trace and monitor the network traffic (which we ignore in the diagram below).

Each port refers to one or more interfaces. In most situations, each port has exactly one interface, but in the case of bonding, for instance, one port is backed by two interfaces, as sketched below. In addition, a port can be associated with QoS settings and queues for traffic control.
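
To make the bonding case a bit more concrete, here is a sketch of how such a port could be created with ovs-vsctl – br0, eth1 and eth2 are placeholders for an existing bridge and two physical interfaces on your system, so this is for illustration only.

# create a port bond0 on bridge br0 that is backed by the two interfaces eth1 and eth2
sudo ovs-vsctl add-bond br0 bond0 eth1 eth2
# inspect the resulting Port record - it references two Interface records
sudo ovs-vsctl list Port bond0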

(Figure: OVSDB data model – Open_vSwitch, bridges, ports, interfaces and related tables)

Finally, there are controllers and managers. A controller, in OVS terminology, is some external system which talks to OVS via OpenFlow to control the flow of packets through a bridge (and is thus associated with a bridge). A manager, on the other hand, is an external system that uses the OVSDB protocol to read and update the OVSDB. As the OVS switch daemon watches this database for changes, a manager can change the setup, i.e. add or remove bridges, add or remove ports and so on – like a remote version of the ovs-vsctl utility. Managers are therefore associated with the overall OVS instance.
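
Both associations can be established with ovs-vsctl. The following sketch uses a hypothetical bridge br0 and a hypothetical controller address; 6653 is the port registered for OpenFlow and 6640 the port registered for the OVSDB protocol.

# attach an OpenFlow controller to the bridge br0
sudo ovs-vsctl set-controller br0 tcp:10.0.0.1:6653
# let ovsdb-server accept OVSDB connections from managers on TCP port 6640
sudo ovs-vsctl set-manager ptcp:6640
# verify the configuration
sudo ovs-vsctl get-controller br0
sudo ovs-vsctl get-manager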

Installation and first steps with OVS

Before we get into the actual labs in the next post, let us see how OVS can be installed, and let us use OVS to create a simple bridge in order to get used to the command line utilities.

On an Ubuntu distribution, OVS is available as a collection of APT packages. Usually, it should be sufficient to install openvswitch-switch, which will pull in a few additional dependencies. There are similar packages for other Linux distributions.
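
On Ubuntu, the installation therefore boils down to something like the following (package names may differ slightly between releases and distributions).

sudo apt-get update
sudo apt-get install -y openvswitch-switch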

Once the installation is complete, you should see that two new server processes are running, called (as you might expect from the previous sections) ovsdb-server and ovs-vswitchd. To verify that everything worked, you can now run the ovs-vsctl utility to display the current configuration.

$ ovs-vsctl show
518a2ed6-56d0-433f-98f2-a575daf20f72
    ovs_version: "2.9.2"

The output is still very short, as we have not yet defined any objects. What it shows you is, in fact, an abbreviated version of the one and only entry in the Open_vSwitch table, which shows the unique row identifier (UUID) and the OVS version installed.

Now let us populate the database by creating a bridge, currently without any ports attached to it. Run

sudo ovs-vsctl add-br myBridge

When we now inspect the current state again using ovs-vsctl show, the output should look like this.

518a2ed6-56d0-433f-98f2-a575daf20f72
    Bridge myBridge
        Port myBridge
            Interface myBridge
                type: internal
    ovs_version: "2.9.2"

Note how the output reflects the hierarchical structure of the database. There is one bridge, and attached to this bridge one port (this is the default port which is added to every bridge, similarly to a Linux bridge where creating a bridge also creates a device that has the same name as the bridge). This port has only one interface of type “internal”. If you run ifconfig -a, you will see that OVS has in fact created a Linux networking device myBridge as well. If, however, you run ethtool -i myBridge, you will find that this is not an ordinary bridge, but simply a virtual device managed by the openvswitch driver.

It is interesting to directly inspect the content of the OVSDB. You could either do this by browsing the file /etc/openvswitch/conf.db, or, a bit more conveniently, using the ovsdb-client tool.

sudo ovsdb-client dump Open_vSwitch

This will provide a nicely formatted dump of the current database content. You will see one entry in the Bridge table, representing the new bridge, and corresponding entries in the Port and Interface table.
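
If you want to see the Port and Interface tables grow, you could, as an optional experiment, attach an additional internal port to the bridge and treat it like an ordinary Linux interface – the port name vif1 and the IP range used here are made up for this example.

# add an internal port vif1 to our bridge
sudo ovs-vsctl add-port myBridge vif1 -- set interface vif1 type=internal
# assign an IP address and bring the interface up
sudo ip addr add 10.0.100.1/24 dev vif1
sudo ip link set vif1 up
# the new port and interface should now show up here
sudo ovs-vsctl show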

This closes our post for today. In the next post, we will set up an example (again using Vagrant and Ansible to do all the heavy lifting) in which we connect containers on different virtual machines using OVS bridges and a VXLAN tunnel. In the meantime, you might want to take a look at the following references, which I found helpful.

  • OpenVSwitch.pdf
  • https://github.com/openvswitch/ovs
  • https://tools.ietf.org/html/rfc7047

Virtual networking labs – overlay networks

In the last post, we have looked at virtual networking on the Ethernet level. In modern cloud environments, a second class of virtual networks has gained importance, which uses higher level protocols to tunnel Ethernet frames. These networks are called overlay networks, and we will start to look at them in this post.

VXLAN – the basics

The VLAN technology that we have looked at in the last post is useful, but has some limitations. First, there is the maximum number of possible VLANs (4096). In practice, certain VLAN ranges need to be reserved for internal purposes, further limiting the number of available VLANs. In cloud environments with a large number of tenants, this limit can easily be reached if we try to implement all virtual networks via VLAN. In addition, VLAN tags inserted by the tenants could conflict with the VLAN tags inserted by the host operating systems.

To solve these problems, a new standard called VXLAN was developed a couple of years back, which is described (though not defined, as this is an informational RFC) in RFC 7348. The basic idea of VXLAN is actually quite simple. On each host involved, we create a virtual network device. When an Ethernet frame needs to be transmitted via this device, the host creates a UDP packet, puts the Ethernet frame as payload into this packet and sends it to the target host. The target host receives the packet, strips off the headers, and re-injects the payload (i.e. the original Ethernet frame) into the networking stack of the target system. Thus Ethernet frames travel on top of UDP, and the virtual Ethernet network logically sits on top of the layer 3 IP network used to exchange the UDP packets, leading to the name overlay network.

To be able to isolate different VXLANs from each other, a 24 bit VXLAN network identifier (VNI) is used. The implementation needs to make sure that Ethernet frames are only delivered within the same VNI, thus isolating the different VXLAN networks from each other. A host that is able to provide VXLAN devices and to participate in the exchange of UDP packets is called a VXLAN tunnel endpoint (VTEP). Thus to send an Ethernet frame over VXLAN, a VTEP needs to

  • Add a VXLAN header that contains the VNI, so that the receiving VTEP can make sure that the frame is only delivered within the correct VNI
  • Pass the resulting data as payload to its own IP stack, which will add a UDP, IP and Ethernet header so that the frame can be transmitted over the underlying network

(Figure: VXLAN frame format – outer Ethernet, IP and UDP headers, VXLAN header, inner Ethernet frame)

To be able to locate the UDP target address to which an encapsulated Ethernet frame has to be sent, each VTEP needs to maintain a table containing the mapping between MAC addresses in the overlay and the IP addresses of the VTEPs behind which they live. A VTEP typically learns the entries of this table over time and uses IP multicast to ask other VTEPs to resolve unknown MAC addresses, similar to the ARP protocol.

When VXLAN is used, there are a few points that should be kept in mind. First, we do of course add quite a bit of overhead. For every Ethernet frame that is exchanged, we add a second Ethernet header, an IP header, a UDP header and the VXLAN header, plus the processing time it takes on the host to traverse the networking stack once more. In addition, there is a point to watch regarding the MTU (maximum transmission unit) configured for the VXLAN endpoints. As the Ethernet frames on the physical network are longer than the Ethernet frames on the overlay network (because of the additional headers), we have to adjust the MTU on the physical network accordingly in order to avoid unnecessary fragmentation. Also, using VXLAN implies that your Ethernet frames flow in clear text over the IP connection, so if you want to use VXLAN across insecure network areas, you should add some form of encryption such as IPsec.
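
For IPv4 without VLAN tags, the additional headers add up to 50 bytes, so there are essentially two ways to deal with the MTU issue, sketched here for the interface names used in the labs below (enp0s8 as the physical and vxlan0 as the overlay device).

# option 1: raise the MTU of the physical interface so that a full 1500 byte
# overlay frame still fits after encapsulation (the physical path must support this)
sudo ip link set enp0s8 mtu 1550
# option 2: reduce the MTU of the overlay device instead
sudo ip link set vxlan0 mtu 1450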

Lab 9: setting up a point-to-point VXLAN connection

To see this in action, let us first implement a very basic scenario. Assume that we have two hosts (virtual machines provided by VirtualBox in our case) that are part of the same layer 3 network. On each host, we ask the Linux kernel to create a virtual device of type VXLAN. To this virtual device, we can assign IP addresses as usual. Any Ethernet frames sent to the device will be encapsulated using the VXLAN protocol and will be sent to the peer, where the Linux kernel will strip off the outer header and re-inject the Ethernet frame. So the Linux kernel acts as a VTEP on both sides.

(Figure: lab 9 – point-to-point VXLAN connection between two hosts acting as VTEPs)

Again, I have automated the setup using Vagrant and Ansible. To run the example, simply enter the following commands

git clone https://github.com/christianb93/networking-samples
cd networking-samples/lab9
vagrant up

To inspect the setup, let us first SSH into boxA. If you run ifconfig -a, you will in fact see a new device called vxlan0. This device has been created and configured by our Ansible script using the following commands.

ip link add type vxlan id 100 remote 192.168.50.5 dstport 4789 dev enp0s8
ip addr add 192.168.60.4/24 dev vxlan0
ip link set vxlan0 up

The first command creates the device, specifying the VNI 100, the IP address of the peer, the port number to use for the UDP connection (we use the port number defined in RFC 7348) and the physical device to be used for the transmission; as we do not specify a device name, the kernel picks the default name vxlan0. The second and third command then assign an IP address and bring the device up.
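
To double-check these settings on a running system, you can ask the ip utility for the details of the device; with the -d flag, it prints the VXLAN specific attributes like the VNI, the remote peer and the UDP destination port.

ip -d link show vxlan0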

When you run netstat -a on boxA, you will also find that a UDP socket has been created on port 4789; this socket is ready to accept UDP packets from the peer carrying encapsulated Ethernet frames. The setup on boxB is similar, using of course a different IP address.

Let us now try to exchange traffic and to display the packets that go forth and back. For that purpose, open an SSH session on boxB as well and start a tcpdump session listening on vxlan0.

sudo tcpdump -e -i vxlan0

When you now ping the IP address of the vxlan0 device on boxB (192.168.60.5) from boxA, you will see a sequence of ARP and IPv4 packets, with the source and target MAC addresses matching the MAC addresses of the vxlan0 devices on the respective hosts. Thus the device acts like an ordinary Ethernet device, as expected.

Now let us change the setup and start to dump traffic on the underlying physical interface.

sudo tcpdump -e -i enp0s8

When you now repeat the ping, you will see that the packets arriving at the physical interface are UDP packets. In fact, tcpdump properly recognizes these frames as VXLAN frames and also prints the inner headers. We see that the outer Ethernet headers contain the MAC addresses of the underlying network interfaces of boxA and boxB, whereas the inner headers contain the MAC addresses of the vxlan0 devices.
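
If there is other traffic on the physical interface, it can help to restrict the capture to the VXLAN UDP port; a possible invocation would be

sudo tcpdump -e -n -i enp0s8 udp port 4789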

Lab 10: VXLAN and IP multicasting

So far, we have used a direct point-to-point connection between the two hosts involved in the VXLAN network. In reality, of course, things are more complicated. Suppose, for instance, that we have three hosts acting as VTEPs. If an Ethernet frame on one of the hosts reaches the VXLAN interface, the kernel needs to determine to which of the other hosts the resulting UDP packet should be sent.

Of course, we could simply send the packet to all hosts on the IP network using a broadcast, but this would be terribly inefficient. Instead, VXLAN uses IP multicast. To this end, the administrator setting up VXLAN needs to associate an IP multicast group with each VNI. A VTEP then joins this group and uses its multicast address for all traffic that needs to reach more than one Ethernet destination or a destination whose VTEP is not yet known. In a local network, you will want to use one of the administratively scoped ("private") IP multicast groups in the range 239.0.0.0 – 239.255.255.255 reserved by RFC 2365, for instance from the local scope 239.255.0.0/16.

To study this, I have created lab 10 which establishes a scenario in which three hosts serve as VTEP to span a VXLAN with VNI 100. As always, grab the code from GitHub, cd into the directory lab10 and run vagrant up to start the example.

(Figure: lab 10 – three VTEPs spanning a VXLAN with VNI 100 via IP multicast)

The setup is very similar to the setup for the point-to-point connection above, with the difference that when bringing up the VXLAN device, we have removed the remote parameter and replaced it by the group parameter to tie the VNI to the multicast group.

ip link \
   add type vxlan \
   id 100 \
   group 239.255.0.1 \
   ttl 5 \
   dstport 4789 \
   dev enp0s8

Note the ttl parameter, which defines the TTL (time-to-live) that will be set on the UDP packets sent out by the VTEP. When I tried this setup first, I did not set the TTL, resulting in the default of one. With this setup, however, ARP requests were not answered by the target host, and I had to increase the TTL by adding this additional parameter.

Let us test this setup. Open SSH connections to boxA, boxB and boxC. First, we can use ip maddr show enp0s8 to verify that on each machine, the interface enp0s8 has joined the multicast group 239.255.0.1 that we specified when bringing up the VXLAN. Then, start a tcpdump session on enp0s8 on boxC and ping the VXLAN IP address 192.168.60.5 of boxB from boxA. As this is the first time we establish this connection, the vxlan0 device on boxA should emit an ARP request. This ARP request is encapsulated and sent out as an IP multicast with IP target address 239.255.0.1. In tcpdump, the corresponding output (again displaying the outer and inner headers) looks as follows.

06:30:44.255166 08:00:27:fe:3b:d0 (oui Unknown) > 01:00:5e:7f:00:01 (oui Unknown), ethertype IPv4 (0x0800), length 92: 192.168.50.4.57732 > 239.255.0.1.4789: VXLAN, flags [I] (0x08), vni 100
b6:02:f0:c8:15:85 (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 42: Request who-has 192.168.60.5 tell 192.168.60.4, length 28

We can clearly see that the outer IP header has the multicast IP address as target address, and that the inner frame is an ARP request, looking to resolve the IP address of the VXLAN device on boxB.

The multicast mechanism is used to initially discover the mapping of IP addresses to Ethernet addresses. However, this is typically only required once, because the VTEP is able to learn this mapping by storing it in a forwarding database (FDB). To see this mapping, switch to boxA and run

bridge fdb show dev vxlan0

In the output, you should be able to locate the Ethernet address of the VXLAN device on boxC, being mapped to the IP address of boxC on the underlying network, i.e. 192.168.50.6.
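
As an aside, the forwarding database can also be populated manually, which is the basis for running VXLAN without any multicast support in the underlay (in that case, the VXLAN device would typically be created without the group parameter). The sketch below reuses the underlay addresses of boxB and boxC for illustration; the all-zero MAC address serves as a default entry to which broadcast and unknown unicast traffic is flooded.

# statically flood broadcast and unknown unicast traffic to two known VTEPs
sudo bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 192.168.50.5
sudo bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 192.168.50.6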

Other overlay solutions

In this post, we have studied overlay networks based on VXLAN in some detail. However, VXLAN is not the only available overlay protocol. We just mention two alternative solutions without going into details.

First, there is GRE (Generic Routing Encapsulation), which is defined in RFC 2784. GRE is a generic protocol to encapsulate packets within other packets. It defines a GRE header which is placed between the headers of the outer protocol and the payload, similar to the VXLAN header. Unlike VXLAN, GRE allows different protocols both as payload protocols and as delivery (outer) protocols. Linux supports GRE via the device type gre for IP-over-IP tunnels and the device type gretap for Ethernet-over-IP tunneling.
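
To get a feeling for how this looks in practice, here is a sketch of a gretap tunnel between our two lab machines, reusing the underlay addresses from lab 9 – the overlay subnet 192.168.70.0/24 is made up for this example, and the peer would of course need the mirrored configuration.

# create an Ethernet-over-IP (gretap) tunnel towards the peer
sudo ip link add gretap1 type gretap local 192.168.50.4 remote 192.168.50.5
sudo ip addr add 192.168.70.4/24 dev gretap1
sudo ip link set gretap1 up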

Then, there is GENEVE, which is an attempt to standardize encapsulation protocols. It is very similar to VXLAN, tunneling Ethernet frames over UDP, but defines a header with optional fields to allow for future extensions.
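
Linux also ships with a geneve device type, so a minimal point-to-point setup looks very similar to the VXLAN case – again, the device name gen0 and the overlay subnet are made up, and the peer needs the mirrored configuration.

# create a GENEVE device with VNI 100 towards the peer
sudo ip link add gen0 type geneve id 100 remote 192.168.50.5
sudo ip addr add 192.168.80.4/24 dev gen0
sudo ip link set gen0 up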

And finally, Linux offers a few additional tunneling protocols, like the IPIP module for tunneling IP over IP traffic or SIT to tunnel IPv6 over IPv4, which have been present in the kernel for some time and predate some of the standards just discussed.

In this and the previous posts, we have mainly used Linux kernel technology to realize network virtualization. However, there are other options available. In the next post, I will start to explore Open vSwitch (OVS), which is an open-source software defined switching solution.