Kubernetes storage under the hood part II – persistent storage

The storage types that we have discussed so far realize ephemeral storage, i.e. storage tied to the lifecycle of the Pod on a specific node. Of course, there are many use cases like databases or other stateful applications that require storage that is persistent and has a lifecycle independent of the Pod. In this post, we look at different ways to realize this in Kubernetes.

Using cloud provider specific storage directly

The easiest – and historically first – way to do this is to directly refer to an existing piece of persistent storage provided by the underlying cloud platform. Recall that when you define and use volumes, you define the volume as part of Pod specification, assign a name and tell Kubernetes which type of volume it should allocate – like emptyDir or hostPath. If you check the list of supported volumes, you will find some volumes that are specific to a cloud provider, for instance awsElasticBlockStore. This allows you to mount a pre-existing AWS Elastic Block Store volume into your Pod. As EBS volumes can only be attached to one instance at a time, you can only connect to this volume from one Pod.

To try this out, we of course have to generate an EBS volume first (as always, be aware of the charges and make sure to delete the volume if it is no longer needed). EBS volumes need to be created within an availability zone (which you can figure out using aws ec2 describe-availability-zones). In my example, I will use the availability zone eu-central-1a. The following command creates a 16 GiB GP2 (SSD) drive in this availability zone.

$ aws ec2 create-volume --availability-zone=eu-central-1a\
           --size=16 --volume-type=gp2

If you now run aws ec2 describe-volumes --output json to list all volumes, you will see several volumes that have the status “in-use” (these are the root volumes of your nodes) and one new volume that has the status “available”. We will need the volume ID of this volume, in my case this is vol-0eb2505d4b7d035cb. Let us try to attach this to a Pod using the following manifest file.

apiVersion: v1
kind: Pod
  name: ebs-demo
  namespace: default
  - name: ebs-demo-ctr
    image: httpd:alpine
      - mountPath: /test
        name: test-volume
  - name: test-volume
      volumeID: vol-0eb2505d4b7d035cb

Sometimes you learn most from your mistakes. If you apply this manifest file chances are that your Pod will never be fully established, but will remain in the state “ContainerCreating” forever. What is going wrong?

To find the answer, use the AWS CLI or the AWS Web console to look at the availability zones of the instance on which the Pod is scheduled and the volume. In my example, Kubernetes did schedule the Pod on a node running in eu-central-1c, whereas the volume was created in eu-central-1a. Unfortunately, EBS volumes cannot be attached across availability zones, and the creation of the Pod fails.

Fortunately, there is a way out. Note that EKS will attach a label to the nodes which capture there availability zone. This label is called failure-domain.beta.kubernetes.io/zone. Now, Kubernetes has a general mechanism called node selector. This allows you to instruct the Kubernetes scheduler to move Pods only on specific nodes, matching certain selection criteria. These criteria are provided in the section nodeSelector of the Pod specification. So the following updated manifest file makes sure that the node will be scheduled in the availability zone in which we did create the volume.

apiVersion: v1
kind: Pod
  name: ebs-demo
  namespace: default
  - name: ebs-demo-ctr
    image: httpd:alpine
      - mountPath: /test
        name: test-volume
  - name: test-volume
      volumeID: vol-0eb2505d4b7d035cb
    failure-domain.beta.kubernetes.io/zone: eu-central-1a

If you apply this manifest file and wait for some time, you will see that the Pod comes up. If you again run aws ec2 describe-volumes, you will find that the volumes status has changed to “in-use” and that EKS has automatically attached it to the node on which the Pod is scheduled (provided, of course, that you have a node running in eu-central-1a, which is not a given if you only use two nodes as we do it – in this case you will have to create the volume in one of the availability zones in which you have a node). You can also attach to the Pod and run mount to verify that a device has been mounted on /test. Combining the output of docker inspect and mount on the node will tell you that the following has happened.

  • Kubernetes has asked AWS to attach our volume as /dev/xvdba to the node
  • Then Kubernetes did create a directory specific to the Pod
  • The volume was mounted into this directory
  • Finally, Kubernetes did create a Docker bind mount to hook up this directory on the node with the directory /test in the container file system

This example nicely demonstrates that using existing persistent storage in the underlying cloud platform is possible, but comes with some drawbacks. An administrator will have to manage these volumes manually, using whatever tools your cloud provider makes available. There are limitations if your cluster is spanning multiple availability zones, and of course you tie all your manifest files directly to a specific cloud provider. In addition, Kubernetes tries to follow the idea of “run everywhere”, which does not combine well with an approach where each individual cloud provider needs to be hardcoded into the core Kubernetes code base. As so often in computer science, this situation almost cries for an additional abstraction layer between Pod volumes and the underlying storage. This abstraction layer exists and is the topic of the next section.

Persistent volume claims

In the previous section, we have defined volumes as part of a Pod specification. These volumes – which we should actually call Pod volumes – are linked to the lifecycle of an individual Pod. We have seen that these volumes can refer to existing storage within your cloud layer, but that this comes with some drawbacks and requires manual provisioning outside of the Kubernetes world.

To change this, Kubernetes defines an additional abstraction layer between the storage as provided by the cloud platform and the Pod volumes. The most important objects we need to discuss in order to understand this are persistent volumes (PV) and persistent volume claims (PVC).

Let us start with persistent volumes. Essentially, a persistent volume is a Kubernetes object that represents a piece of storage in the underlying cloud platform. Typically, volumes are created dynamically, but we will see in the next post in this series that they can also be managed manually. The important thing is that volumes are first-class Kubernetes citizens and have a lifecycle independent of that of Pods.

Volumes represent the supply side of storage on your cluster. The demand side is represented by volume claims. A persistent volume claim is an object that a user creates to let Kubernetes know that a certain amount of storage is required. If the cluster is set up for it, Kubernetes will then automatically try to fulfill the claim, i.e. to either find an existing, unused volume that matches the claim or to provision a volume dynamically, using a so-called provisioner. A persistent volume claim can then be referenced in a Pod volume which in turn can be mounted into a container.


To get an idea what this means, let us again consider an example. The following manifest file describes a persistent volume claim.

apiVersion: v1
kind: PersistentVolumeClaim
  name: ebs-pvc
  namespace: default
    - ReadWriteOnce
      storage: 4Gi

In addition to the standard fields, we have a specification that consists of two sections. The first section specifies the access mode. Here we use ReadWriteOnce, which states that we want a volume which can be accessed by one Pod at a time, reading and writing. In the second section, we specify the resources, i.e. the amount of storage that we need, in our case 4 Gigabytes of storage. This example already demonstrates one nice feature of a PVC – it does not refer at all to a specific type of volume or a specific cloud platform (it does so indirectly, via the mechanism of storage classes, which we will investigate in the next post).

When we apply this manifest file and wait for a few seconds, we can look at the objects that have been created. First, run kubectl get pvc to get a list of all persistent volume claims.

$ kubectl get pvc
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ebs-pvc   Bound    pvc-63521bd7-4e51-11e9-8e51-0af6f9d0ca50   4Gi        RWO            gp2            7m

So as expected, we have a new persistent volume claim that has been generated. The status of this PVC is “bound”, telling us that the provisioner working behind the scenes was able to find a matching volume. We also find a reference to the volume in the output. So let us list this volume.

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS   REASON   AGE
pvc-63521bd7-4e51-11e9-8e51-0af6f9d0ca50   4Gi        RWO            Delete           Bound    default/ebs-pvc   gp2   

As stated above, this volume exists as an independent entity that we can refer to by its name and that has some properties. When we append --output json to get all the details, we see a few interesting things.

First, the volume has a field claimRef which tells us to which claim this volume is bound. This is only one field, not an array. Similarly, the PVC has a field volumeName referring to the underlying volume, which is again not an array. So the relation between a PV and a PVC is one-to-one. As mentioned in the documentation, this can lead to overprovisioning. If, for instance, you request 4 Gi, and there is a free volume with 8 Gi, the provisioner might decide to bind the PVC to this volume, even though only 4 Gi were requested.

Another interesting information that we get from the JSON output is that a volume has a label that tells us in which availability zone the volume (or, more precisely, the underlying EBS volume) is located. And, finally, the spec section contains a field that lets us locate the underlying EBS storage. You can use the following statements to extract this information and use the AWS CLI to print out some details of this volume.

$ fullVolumeID=$(kubectl get pv\
$ volumeID=$(echo $fullVolumeID | sed 's/aws:\/\/.*\///')
$ aws ec2 describe-volumes --volume-id=$volumeID --output json

Let us now actually use this volume, i.e. attach it to a Pod. Here is a manifest file which will bring up a Pod mounting this volume.

apiVersion: v1
kind: Pod
  name: pv-demo
  namespace: default
  - name: pv-demo-ctr
    image: httpd:alpine
      - mountPath: /test
        name: test-volume
  - name: test-volume
      claimName: ebs-pvc

We see that at the point in the file were we did previously refer directly to an EBS volume, we now refer to the PVC. So again, there is no EBS or EKS specific data in this manifest file, and we can theoretically use the same manifest file on any cloud platform.

When you apply this manifest file, you will notice that it takes significantly longer for the containers to come up, this is because behind the scenes, Kubernetes needs to attach the EBS volume to the node on which the Pod is scheduled, which takes some time.

We can now again analyze the structure of the file systems in the container and on the node. If we do this, we will find a very similar picture as above. The container has a bind mount into a directory managed by Kubernetes. This directory in turn is a mount point for the device /dev/xvdbu. If we look at the output of aws ec2 describe-volumes, we find that this is the device to which AWS has attached the EBS volume behind the PVC. So the outcome of the entire exercise is the same as before, with the difference that no manual provisioning of the volume as necessary.

In addition, Kubernetes is smart enough to place our Pod in a the same availability zone in which the volume is located. In fact, as explained in the documentation on multi-zone capabilities, the Kubernetes scheduler will make sure that a Pod that requires a PV located in a given availability zone will only be placed on a node running in that zone. It can, however, happen that a volume is created in an availability zone where there is no node at all, rendering it unusable (see this GitHub issue for a discussion).

Let us now investigat the lifecycle of this storage. To do this, let us create a file in our newly mounted directory, kill the pod, bring it up again and list the contents of the volume (this assumes that you have saved the manifest file for the Pod in a file called pvUsage.yaml).

$ kubectl exec -it pv-demo touch /test/hello
$ kubectl delete pod pv-demo
pod "pv-demo" deleted
$ kubectl apply -f pvUsage.yaml
pod/pv-demo created
$ # Wait for container to be ready
$ kubectl exec -it pv-demo ls /test
hello       lost+found

So, as expected, the volume and its content have survived the restart of the Pod. If a persistent volume claim is deleted, however, Kubernetes will automatically delete the underlying volume and delete the corresponding storage in the cloud platform layer. When you shutdown a cluster, make sure to delete all PVCs first, otherwise orphaned block storage volumes not controlled by Kubernetes any more could result.

We have now covered the basics of persistent storage in Kubernetes – but of course there is much more that we have still left open. How is storage actually provisioned? How does the platform know which type of storage we need when we issue a PVC? And how can we manually create storage? These are the topics that we will discuss in the next post in this series.

Kubernetes storage under the hood part I – ephemeral storage

So far, we have mainly discussed how compute and network resources are used and managed with Kubernetes. We will now turn to the third fundamental element of a container platform – storage.

Docker storage concepts

Before we talk about Kubernetes storage concepts, let us first recall how storage is managed in Docker. The following tests assume that you have a local installation of Docker on a Linux workstation (or virtual machine, of course). As a refresher, you might want to take a look at my introduction into Docker internals before reading on.

First, let us start a Docker container and spawn a shell. The easiest way to do this is the busybox image. So let us spin up a busybox container, attached to the terminal, and run mount to inspect its file system.

$ docker run -it busybox
/ # mount
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/BNUAN2FS4K75ZLQFSZVQWXRLUU:/var/lib/docker/overlay2/l/UKEGOK4TLTD2T4XN5KIEPN7JXF,upperdir=/var/lib/docker/overlay2/e8f16f0d705fb8e4677b605796b7319ef3f0226e2ad173b506e13b479afa515f/diff,workdir=/var/lib/docker/overlay2/e8f16f0d705fb8e4677b605796b7319ef3f0226e2ad173b506e13b479afa515f/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)

Your output will most likely look differently, but the general pattern will be the same. The first mount point that we see is the mount for the root directory. On newer Docker versions, this will be an overlay2 file system, on older Docker versions, it will be of type aufs. This entry points to one or more actual files on the host system that are located in the directory /var/lib/docker/ which is managed by Docker. These are the files in which the actual content of the root filesystem is stored.

The data on that file system is volatile and linked to the lifecycle of the container. To see this, let us create a file in the containers root file system and exit the container.

/ # echo "dancing in the rain" > /franky
/ # exit

This will stop the container, but not remove it – it will still exist and be visible in the output of docker ps -a. If you restart the container and attach to the running container, you will find that the file is still there and its content has been preserved. If, however, you remove the container using docker rm, the corresponding files on the host file system will be removed and the content of the file system of our container is lost. In that sense, these volumes are ephemeral – they survive across restarts, but die if the container is removed.

But Docker can do more – we can also use persistent storage. One option to do this are bind mounts. A bind mount maps a directory or a file from the host file system into the namespace of the container and attaches it to a mount point. To see an example, create a temporary directory on your host system and put some data into it. We can then mount this directory into a new Docker container using the -v option.

$ mkdir /tmp/ctr-test/
$ echo "Hello World" > /tmp/ctr-test/hello
$ docker run -v /tmp/ctr-test:/ctr-test/ -it busybox 
/ # cat /ctr-test/hello
Hello World
/# exit

So the content of the directory /tmp/ctr-test on the host becomes accessible within the container as /ctr-test (of course I could have chosen any other name as well). We can also see this mount point in the output of docker inspect. Use docker ps -a to find out the ID of the busybox container, in my case fd8ef21ba685, and then run

$ docker inspect fd8ef21ba685 --format="{{json .Mounts}}"

So the mount point shows up as a mount point of type bind in the list of mounts of our container.

We remark that Docker also has a more advanced way to mount storage referred to as volumes. In contrast to a bind mount, a volume is an object managed by Docker, backed by files in the Docker controlled directories. Volumes can be created manually or dynamically, can be given a name and can be mounted into a container. As they are objects with an independent lifecycle, they survive container eviction and can be mounted to more than one container. However, we will not look deeper into this as (at least to my knowledge) this feature is not used by Kubernetes.

Ephemeral storage in Kubernetes

Now let us try out how things change if we use Kubernetes to spin up our containers.

$ kubectl run -i --tty busybox --image=busybox --restart=Never
/ # / # mount
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/57X3FQLOLATAMPLAASUMBKHD5W:/var/lib/docker/overlay2/l/Q6HSQOG4NX7UHGVZEPQIZI43KQ,upperdir=/var/lib/docker/overlay2/f183dfd193c86bf43ebca4ae7d08cbfd6e268229f3bd59c70be2f48d9dc0937f/diff,workdir=/var/lib/docker/overlay2/f183dfd193c86bf43ebca4ae7d08cbfd6e268229f3bd59c70be2f48d9dc0937f/work)

So we see pretty much the same picture. The root volume of the container is again an overlay file system that is backed by directories managed by Docker on the host on which the pod is running.

If, however, you ssh into the node on which the container is running and do a docker inspect as before, you will find that there are actually a couple of bind mounts that we have not explicitly specified. These bind mounts are added by Kubernetes to give the container access to some configuration data like the token that can be used to connect to a Kubernetes service account, or host name resolutions (the own IP address of the container, for instance). Up to this point, our picture is actually quite simple.


Now things get a bit more complicated if you want to mount additional volumes. Kubernetes does in fact offer a large number of different volume types. Some of these volume types are ephemeral, i.e. the data is potentially lost if the pod dies or is rescheduled to a different node, other types of volumes are persistent. In this post, we focus on ephemeral storage and discuss different strategies to attach persistent storage in a later post.

Mount ephemeral storage of type emptyDir

In Kubernetes, we can define volumes on the level of an individual pod and attach these volumes to one or more containers that are running in this pod. Kubernetes offers several types of volumes. The type we are going to look at first is called an emptyDir because, from the point of view of a container, it is exactly that – a directory which is initially empty.

To see this in action, let us look at the following manifest file.

apiVersion: v1
kind: Pod
  name: empty-dir-demo
  namespace: default
  - name: empty-dir-demo-ctr
    image: httpd:alpine
      - mountPath: /test
        name: test-volume
  - name: test-volume
    emptyDir: {}

This manifest file defines an individual Pod, as we have seen it before. However, there are a few new elements which are populated in this manifest. The Pod specification contains a new field volumes, which is an array of volume objects. This volume has a name and an additional field which indicates the type of the volume. The documentation lists many of them, here we are working with a volume of type emptyDir.

In the container specification, we now refer to this volume. This instructs Kubernetes to create the volume and to mount it into this container at the defined mount point. To see this in action, let us apply this manifest file, spawn a shell in the pod that is created and inspect its file system.

$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/emptyDir.yaml 
pod/empty-dir-demo created
$ kubectl exec -it empty-dir-demo "/bin/bash"
bash-4.4# mount | grep "test"
/dev/xvda1 on /test type xfs (rw,noatime,attr2,inode64,noquota)

So we see that Kubernetes has actually mounted a new file system onto the mount point /test. To figure out how this is realized, let us take a closer look at the Docker container that has been created. So ssh into the node on which the Pod is running and run the following commands (this assumes that jq is installed on the node, which is the default when using the standard AWS AMI).

$ containerId=$(docker ps | grep "httpd" | awk '{print $1}')
$ docker inspect $containerId | jq -r '.[0].Mounts[]'
  "Type": "bind",
  "Source": "/var/lib/kubelet/pods/0914a859-4da2-11e9-931c-06a2d10ef1fe/volumes/kubernetes.io~empty-dir/test-volume",
  "Destination": "/test",
  "Mode": "",
  "RW": true,
  "Propagation": "rprivate"
[ ... more output ... ]

So we find that Kubernetes realizes an emptyDir volume as a bind mount, i.e. Kubernetes will create a directory on the nodes local file system and use a Docker bind mount to mount this into the container. Of course, this directory will initially be empty (as the name strongly suggests). Let us see what happens if we actually write something onto this file system. The following commands (to be run again on the node on which the Pod is running) extract the directory which is used for the bind mount from the output of docker inspect and list the contents of this directory.

$ dir=$(docker inspect $containerId | jq -r '.[0].Mounts[] | select(.Destination=="/test") | .Source')
$ sudo ls $dir

If you run this now, you will find that the directory is empty. Now switch back to the terminal attached to the Pod and create a file in the /test directory.

bash-4.4# echo "hello" > /test/hello

If you now list the directories content again, you will find that a file hello has been created.

Knowing how an emptyDir is implemented now makes it easy to understand the statements on the lifecycle in the Kubernetes documentation. It is stored in a directory specific for the Pod, i.e. it is initially created when the Pod is created and removed when the Pod is removed. It survices container restarts, but when the Pod is migrated to a different node, the content will be lost. In that sense, it is ephemeral storage.


Accessing host-local file systems

We have found that a volume of type emptyDir is nothing but a Docker bind mount to a Pod specific directory managed by Kubernetes. Of course, Kubernetes also offers a way to set up bind mounts to existing directories in the host file system (needless to say that this might be a security risk). This done using a volume of type hostPath as in the example below.

apiVersion: v1
kind: Pod
  name: host-path-demo
  namespace: default
  - name: host-path-demo-ctr
    image: httpd:alpine
      - mountPath: /test
        name: test-volume
  - name: test-volume
      path: /etc

When you run this and attach to the resulting Pod, you will find that the content of the directory /test now match the content of the directory /etc on the host. Using again docker inspect on the node on which the Pod is running, you will find that Kubernetes has created an additional bind mount for the container which links the containers /test directory to the directory /etc on the host. Consequently, the contents of a hostPath volume will survive container restarts but will not be accessible anymore once the Pod is migrated to a different host.

Managing traffic with Kubernetes ingress controllers

In one of the previous posts, we have learned how to expose arbitrary ports to the outside world using services and load balancers. However, we also found that this is not very efficient – in the worst case, the number of load balancers we need equals the number of services.

Specifically for HTTP/HTTPS traffic, there is a different option available – an ingress rule.

Ingress rules and ingress controllers

An ingress rule is a Kubernetes resource that defines a kind of routing on the HTTP(S) path level. To illustrate this, assume that you have two services running, call them svcA and svcB. Assume further that both services work with HTTP as the underlying protocol and are listening on port 80. If you expose these services naively as in the previous post, you will need two load balancers, call them LB1 and LB2. Then, to reach svcA, you would use the URL


and to access svcB, you would use


The idea of an ingress is to have only one load balancer, with one DNS entry, and to use the path name to route traffic to our services. So with an ingress, we would be able to reach svcA under the URL


and the second service under the URL


With this approach, you have several advantages. First, you only need one load balancer that clients can use for all services. Second, the path name is related to the service that you invoke, which follows best practises and makes coding against your services much easier. Finally, you can easily add new services and remove old services without a need to change DNS names.

In Kubernetes, two components are required to make this work. First, there are ingress rules. These are Kubernetes resources that contain rules which specify how to route incoming requests based on their path (or even hostname). Second, there are ingress controllers. These are controllers which are not part of the Kubernetes core software, but need to be installed on top. These controllers will work with the ingress rules to manage the actual routing. So before trying this out, we need to install an ingress controller in our cluster.

Installing an ingress controller

On EKS, there are several options for an ingress controller. One possible choice is the nginx ingress controller. This is the controller that we will use for our examples. In addition, AWS has created their own controller called AWS ALB controller that you could use as an alternative – I have not yet looked at this in detail, though.

So let us see how we can install the nginx ingress controller in our cluster. Fortunately, there is a very clear installation instruction which tells us that to run the install, we simply have to execute a number of manifest files, as you would expect for a Kubernetes application. If you have cloned my repository and brought up your cluster using the up.sh script, you are done – this script will set up the controller automatically. If not, here are the commands to do this manually.

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/mandatory.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/provider/aws/service-l4.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/provider/aws/patch-configmap-l4.yaml

Let us go through this and see what is going on. The first file (mandatory.yaml) will set up several config maps, a service account and a new role and connect the service account with the newly created role. It then starts a deployment to bring up one instance of the nginx controller itself. You can see this pod running if you do a

kubectl get pods --namespace ingress-nginx

The first AWS specific manifest file service-l4.yaml will establish a Kubernetes service of type LoadBalancer. This will create an elastic load balancer in your AWS VPC. Traffic on the ports 80 and 443 is directed to the nginx controller.

kubectl get svc --namespace ingress-nginx

Finally, the second AWS specific file will update the config map that stores the nginx configuration and set the parameter use-proxy-protocol to True.

To verify that the installation has worked, you can use aws elb describe-load-balancers to verify that a new load balancer has been created and curl the DNS name provided by

kubectl get svc ingress-nginx --namespace ingress-nginx

from your local machine. This will still give you an error as we have not yet defined ingress rule, but show that the ingress controller is up and running.

Setting up and testing an ingress rule

Having our ingress controller in place, we can now set up ingress rules. To have a toy example at hand, let us first apply a manifest file that will

  • Add a deployment of two instances of the Apache httpd
  • Install a service httpd-service accepting traffic for these two pods on port 8080
  • Add a deployment of two instances of Tomcat
  • Create a service tomcat-service listening on port 8080 and directing traffic to these two pods

You can either download the file here or directly use the URL with kubectl

$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/network/ingress-prep.yaml
deployment.apps/httpd created
deployment.apps/tomcat created
service/httpd-service created
service/tomcat-service created

When all pods are up and running, you can again spawn a shell in one of the pods and use the DNS name entries created by Kubernetes to verify that the services are available from within the pods.

$ pod=$(kubectl get pods --output \
$ kubectl exec -it $pod "/bin/bash"
bash-4.4# apk add curl
bash-4.4# curl tomcat-service:8080
bash-4.4# curl httpd-service:8080

Let us now define an ingress rule which directs requests to the path /httpd to our httpd service and correspondingly requests to /tomcat to the Tomcat service. Here is the manifest file for this rule.

apiVersion: extensions/v1beta1
kind: Ingress
  name: test-ingress
    nginx.ingress.kubernetes.io/rewrite-target : "/"
  - http:
      - path: /tomcat
          serviceName: tomcat-service
          servicePort: 8080
      - path: /alpine
          serviceName: alpine-service
          servicePort: 8080

The first few lines are familiar by now, specifying the API version, the type of resource that we want to create and a name. In the metadata section, we also add an annotation. The nginx ingress controller can be configured using this and similar annotations (see here for a list), and this annotation is required to make the example work.

In the specification section, we now define a set of rules. Our rule is a http rule, which, at the time of writing, is the only supported protocol. This is followed by a list of paths. Each path consists of a selector (“/httpd” and “/tomcat” in our case), followed by the specification of a backend, i.e. a combination of service name and service port, serving requests matching this path.

Next set up the ingress rule. Assuming that you have saved the manifest file above as ingress.yaml, simply run

$ kubectl apply -f ingress.yaml
ingress.extensions/test-ingress created

Now let us try this. We already know that the ingress controller has created a load balancer for us which serves all ingress rules. So let us get the name of this load balancer from the service specification and then use curl to try out both paths

$ host=$(kubectl get svc ingress-nginx -n ingress-nginx --output\
$ curl -k https://$host/httpd
$ curl -k https://$host/tomcat

The first curl should give you the standard output of the httpd service, the second one the standard Tomcat welcome page. So our ingress rule works.

Let us try to understand what is happening. The load balancer is listening on the HTTPS port 443 and picking up the traffic coming from there. This traffic is then redirected to a host port that you can read off from the output of aws elb describe-load-balancers, in my case this was 31135. This node port belongs to the service ingress-nginx that our ingress controller has set up. So the traffic is forwarded to the ingress controller. The ingress controller interprets the rules, determines the target service and forwards the traffic to the target service. Ignoring the fact that the traffic goes through the node port, this gives us the following simplified picture.


In fact, this diagram is a bit simplified as (see here) the controller does not actually send the traffic to the service cluster IP, but directly to the endpoints, thus bypassing the kube-proxy mechanism, so that advanced features like session affinity can be applied.

Ingress rules have many additional options that we have not yet discussed. You can define virtual hosts, i.e. routing based on host names, define a default backend for requests that do not match any of the path selectors, use regular expressions in your paths and use TLS secrets to secure your HTTPS entry points. This is described in the Kubernetes networking documentation and the API reference.

Creating ingress rules in Python

To close this post, let us again study how to implement Ingress rules in Python. Again, it is easiest to build up the objects from the bottom to the top. So we start with our backends and the corresponding path objects.


Having this, we can now define our rule and our specification section.

            paths=[tomcat_path, httpd_path]))

Finally, we assemble our ingress object. Again, this consists of the metadata (including the annotation) and the specification section.

metadata.annotations={"nginx.ingress.kubernetes.io/rewrite-target" : "/"}

We are now ready for our final steps. We again read the configuration, create an API endpoint and submit our creation request. You can find the full script including comments and all imports here


Watching Kubernetes networking in action

In this post, we will look in some more detail into networking in a Kubernetes cluster. Even though the Kubernetes networking model is independent of the underlying cloud provider, the actual implementation does of course depend on the cloud provider which communicates with Kubernetes through a CNI plugin.

I will continue to use EKS, so some of this will be EKS specific. To work with me through the example, you will first have to bring up your cluster, start two nodes and deploy a pod running a httpd on one of the nodes. I have written a script up.sh and a manifest file that automates all this. To download and apply all this, enter

$ git clone https://github.com/christianb93/Kubernetes.git
$ cd Kubernetes/cluster
$ chmod 700 up.sh
$ ./up.sh
$ kubectl apply -f ../pods/alpine.yaml

Node-to-Pod networking

Now let us log into the node on which the container is running and collect some information on the local network interface attached to the VM.

$ ifconfig eth0
eth0: flags=4163  mtu 9001
        inet  netmask  broadcast
        inet6 fe80::2b:dcff:fee7:448c  prefixlen 64  scopeid 0x20
        ether 02:2b:dc:e7:44:8c  txqueuelen 1000  (Ethernet)
        RX packets 197837  bytes 274587781 (261.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 25656  bytes 2389608 (2.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

So the local IP address of the node is If we do a kubectl get pods -o wide, we get a different IP address – – for the pod. Let us curl this from the node.

$ curl
<h1>It works!</h1>

So apparently we have reached our httpd. To understand why this works, let us investigate the network configuration in more detail. First, on the node on which the container is running, let us take a look at the network configuration inside the container.

$ ID=$(docker ps --filter name=alpine-ctr --format "{{.ID}}")
$ docker exec -it $ID "/bin/bash"
bash-4.4# ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth0@if6:  mtu 9001 qdisc noqueue state UP 
    link/ether 4a:8b:c9:bb:8c:8e brd ff:ff:ff:ff:ff:ff
bash-4.4# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 4A:8B:C9:BB:8C:8E  
          inet addr:  Bcast:  Mask:
          RX packets:12 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:936 (936.0 B)  TX bytes:0 (0.0 B)
bash-4.4# ip route
default via dev eth0 dev eth0 scope link

What do we learn from this? First, we see that inside the container namespace, there is a virtual ethernet device eth0, with IP address If you run kubectl get pods -o wide on your local workstation, you will find that this is the IP address of the Pod. We also see that there is a route in the container namespace that direct all traffic to this interface. The output of the ip link command also shows that this device is a virtual ethernet device that has a paired device (with index if6) in a different namespace. So let us exit the container, go back to the node and try to figure out what the network configuration on the node is.

$ ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0:  mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:2b:dc:e7:44:8c brd ff:ff:ff:ff:ff:ff
3: eni3f5399ec799:  mtu 9001 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 66:37:e5:82:b1:f6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
4: enie68014839ee@if3:  mtu 9001 qdisc noqueue state UP mode DEFAULT group default 
    link/ether f6:4f:62:dc:38:18 brd ff:ff:ff:ff:ff:ff link-netnsid 1
5: eth1:  mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:e8:f0:26:a7:3e brd ff:ff:ff:ff:ff:ff
6: eni97d5e6c4397@if3:  mtu 9001 qdisc noqueue state UP mode DEFAULT group default 
    link/ether b2:c9:58:c0:20:25 brd ff:ff:ff:ff:ff:ff link-netnsid 2
$ ifconfig eth0
eth0: flags=4163  mtu 9001
        inet  netmask  broadcast
        inet6 fe80::2b:dcff:fee7:448c  prefixlen 64  scopeid 0x20
        ether 02:2b:dc:e7:44:8c  txqueuelen 1000  (Ethernet)
        RX packets 197837  bytes 274587781 (261.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 25656  bytes 2389608 (2.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
$ ip route
default via dev eth0 dev eth0 dev eth0 proto kernel scope link src dev eni3f5399ec799 scope link dev eni97d5e6c4397 scope link dev enie68014839ee scope link

Here we see that – in addition to a few other interfaces – there is a device eth0 to which all traffic is sent by default. However, there is also a device eni97d5e6c4397 which is the other end of the interface visible in the container. And there is a route that sends all traffic that is directed to the IP address of the pod to this interface. Overall, this gives a picture which seems familiar from our earlier analysis of docker networking


If we try to establish a connection to the httpd running in the pod, the routing table entry on the node will send the traffic to the interface eni97d5e6c4397. This is one end of a veth-pair, the other end appears inside the container as eth0. So from the containers point of view, this is incoming traffic received via eth0, which is happily accepted and processed by the httpd. The reply goes the other way – it is directed to eth0 inside the container, travels via the veth pair and ends up inside the host namespace, coming from eni97d5e6c4397.

Pod-to-Pod networking across nodes

Now let us try something else. Log into the second node – on which the container is not running – and try the curl from there. Surprisingly, this works as well! What we have seen so far does not explain this, so there is probably a piece of magic that we are still missing. To find this, let us use the aws cli to print out the network interfaces attached to the node on which the container is running (the following snippet assumes that you have the extremely helpful tool jq installed on your PC).

$ nodeName=$(kubectl get pods --output json | jq -r ".items[0].spec.nodeName")
$ aws ec2 describe-instances --output json --filters Name=private-dns-name,Values=$nodeName --query "Reservations[0].Instances[0].NetworkInterfaces"
---- SNIP -----
        "MacAddress": "02:e8:f0:26:a7:3e",
        "SubnetId": "subnet-06088e09ce07546b9",
        "PrivateDnsName": "ip-192-168-84-108.eu-central-1.compute.internal",
        "VpcId": "vpc-060469b2a294de8bd",
        "Status": "in-use",
        "Ipv6Addresses": [],
        "PrivateIpAddress": "",
        "Groups": [
                "GroupName": "eks-auto-scaling-group-myCluster-NodeSecurityGroup-1JMH4SX5VRWYS",
                "GroupId": "sg-08163e3b40afba712"
        "NetworkInterfaceId": "eni-0ed2f1cf4b09cb8be",
        "OwnerId": "979256113747",
        "PrivateIpAddresses": [
                "Primary": true,
                "PrivateDnsName": "ip-192-168-84-108.eu-central-1.compute.internal",
                "PrivateIpAddress": ""
                "Primary": false,
                "PrivateDnsName": "ip-192-168-72-200.eu-central-1.compute.internal",
                "PrivateIpAddress": ""
                "Primary": false,
                "PrivateDnsName": "ip-192-168-96-163.eu-central-1.compute.internal",
                "PrivateIpAddress": ""
                "Primary": false,
                "PrivateDnsName": "ip-192-168-99-199.eu-central-1.compute.internal",
                "PrivateIpAddress": ""
---- SNIP ----

I have removed some of the output to keep it readable. We see that AWS has attached several elastic network interfaces (ENI) to our node. An ENI is a virtual network interface that AWS creates and manages for you. Each node can have more than one ENI attached, and each ENI can have a primary and multiple secondary IP addresses.

If you look at the last line of the output, you see that there is a network interface eni-0ed2f1cf4b09cb8be that has, as one of the secondary IP addresses, the IP address This is the IP address of our Pod! Let us now go back to the node and inspect its network configuration once more. You will not find a network interface with this exact name, but you will find a network interface on the node on which the pod is running which has the same MAC address, namely eth1.

$ ifconfig eth1
eth1: flags=4163  mtu 9001
        inet6 fe80::e8:f0ff:fe26:a73e  prefixlen 64  scopeid 0x20
        ether 02:e8:f0:26:a7:3e  txqueuelen 1000  (Ethernet)
        RX packets 224  bytes 6970 (6.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 20  bytes 1730 (1.6 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

This is an ordinary VPC interface and visible in the entire VPC under all of its IP addresses. So if we curl our httpd from the second node, the traffic will leave that node via the default interface, be picked up by the VPC, routed to the node on which the pod is running and enter via eth1. As IP forwarding is enabled on this node, the traffic will be routed to the Pod and arrive at the httpd.

This is the missing piece of magic we have been looking for. In fact, for every pod running on a node, EKS will add an additional secondary IP address to an ENI attached to the node (and attach additional ENIs if needed) which will make the Pod IP addressses visible in the entire VPC. This mechanism is nicely described in the documentation of the CNI plugin which EKS uses. So we now have the following picture.


So this allows us to run our httpd in such a way that it can be reached from the entire Pod network (and the entire VPC). Note, however, that it can of course not be reached from the outside world. It is interesting to repeat this experiment with a slighly adapted YAML file that uses the containerPort field:

apiVersion: v1
kind: Pod
  name: alpine
  namespace: default
  - name: alpine-ctr
    image: httpd:alpine
      - containerPort: 80

If we remove the old Pod and use this YAML file to create a new pod, we will find that the configuration does not change at all. In particular, running docker ps on the node on which the Pod is scheduled will teach you that this port specification is not the equivalent of the port specification of the docker run port mapping feature – as the Kubernetes API specification states, this field is informational.

Implementation of services

Let us now see how this picture changes if we add a service. First, we will use a service of type ClusterIP, i.e. a service that will make our httpd reachable from within the entire cluster under a common IP address. For that purpose – after deleting our existing pods – let us create a deployment that brings up two instances of the httpd.

$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/pods/deployment.yaml

Once the pods are up, you can again use curl to verify that you can talk to every pod from every node and every pod. Now let us create a service.

$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/network/service.yaml

Once that has been done, enter kubectl get svc to get a list all services. You should see a newly created service alpine-service. Note down its cluster IP address – in my case this was

Now log into one of the nodes again, attach to the container, install curl there and try to connect to port

$ ID=$(docker ps --filter name=alpine-ctr --format "{{.ID}}")
$ docker exec -it $ID "/bin/bash"
bash-4.4# apk add curl
OK: 124 MiB in 67 packages
bash-4.4# curl
<h1>It works!</h1>

So, as promised by the definition of s service, the httpd is visible within the cluster under the cluster IP address of the service. The same works if we are on a node and not attached to a container.

To see how this works, let us log out of the container again and search the NAT tables for the cluster IP address of the service.

$ sudo iptables -S -t nat | grep
-A KUBE-SERVICES -d -p tcp -m comment --comment "default/alpine-service: cluster IP" -m tcp --dport 8080 -j KUBE-SVC-SXWLG3AINIW24QJC

So we see that Kubernetes (more precisely the kube-proxy running on each node) has added a NAT rule that captures traffic directed towards the service IP address to a special chain. Let us dump this chain.

$ sudo iptables -S -t nat | grep KUBE-SVC-SXWLG3AINIW24QJC
-A KUBE-SERVICES -d -p tcp -m comment --comment "default/alpine-service: cluster IP" -m tcp --dport 8080 -j KUBE-SVC-SXWLG3AINIW24QJC
-A KUBE-SVC-SXWLG3AINIW24QJC -m comment --comment "default/alpine-service:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-YRELRNVKKL7AIZL7
-A KUBE-SVC-SXWLG3AINIW24QJC -m comment --comment "default/alpine-service:" -j KUBE-SEP-BSEIKPIPEEZDAU6E

Now this is actually pretty interesting. The first line is simply the creation of the chain. The second line is the line that we already looked at above. The next two lines are the lines we are looking for. We see that, with a probability of 50%, we either jump to the chain KUBE-SEP-YRELRNVKKL7AIZL7 or to the chain KUBE-SEP-BSEIKPIPEEZDAU6E. Let us display one of them.

$ sudo iptables -S KUBE-SEP-BSEIKPIPEEZDAU6E -t nat 
-A KUBE-SEP-BSEIKPIPEEZDAU6E -s -m comment --comment "default/alpine-service:" -j KUBE-MARK-MASQ
-A KUBE-SEP-BSEIKPIPEEZDAU6E -p tcp -m comment --comment "default/alpine-service:" -m tcp -j DNAT --to-destination

So we see the that this chain has two rules. The first rule marks all packages that are originating from the pod running on this node, this mark is later evaluated in the forwarding rules to make sure that the packet is accepted for forwarding. The second rule is where the magic happens – it performs a DNAT, i.e. a destination NAT, and sends our packets to one of the pods. The rule KUBE-SEP-YRELRNVKKL7AIZL7 is similar, with the only difference that it sends the packets to the other pod. So we see that two things are happening

  • Traffic directed towards port 8080 of the cluster IP address is diverted to one of the pods
  • Which one of the pods is selected is determined randomly, with a probability of 50% for both pods. Thus these rules implement a simple load balancer.

Let us now see how things change when we use a service of type NodePort. So let us use a slightly different YAML file.

$ kubectl delete -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/network/service.yaml
$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/network/nodePortService.yaml

When we now run kubectl get svc, we see that our service appears as a NodePort service, and, as the second entry in the columns PORTS, we find the port that Kubernetes opens for us.

$ kubectl get services
NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
alpine-service   NodePort           8080:32755/TCP   13s
kubernetes       ClusterIP             443/TCP          7h

In my case, the port 32755 has been used. If we now go back to one of the nodes and search the iptables rules for this port, we find that Kubernetes has created two additional NAT rules.

$ sudo iptables -S  -t nat | grep 32755
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/alpine-service:" -m tcp --dport 32755 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/alpine-service:" -m tcp --dport 32755 -j KUBE-SVC-SXWLG3AINIW24QJC

So we find that for all traffic directed to this port, again a marker is set and the rule KUBE-SVC-SXWLG3AINIW24QJC applies. If you inspect this rule, you will find that it is similar to the rules above and again realizes a load balancer that sends traffic to port 80 of one of the pods.

Let us now verify that we can really reach this pod from the outside world. Of course, this only works once we have allowed incoming traffic on at least one of the nodes in the respective AWS security group. The following commands determine the node Port, your IP address, the security group and the IP address of the node, allow access and submit the curl command (note that I use the cluster name myCluster to select the worker nodes, in case you are not using my scripts to run this example, you will have to change the filter to make this work).

$ nodePort=$(kubectl get svc alpine-service --output json | jq ".spec.ports[0].nodePort")
$ IP=$(aws ec2 describe-instances --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --output text --query Reservations[0].Instances[0].PublicIpAddress)
$ SEC_GROUP_ID=$(aws ec2 describe-instances --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --output text --query Reservations[0].Instances[0].SecurityGroups[0].GroupId)
$ myIP=$(wget -q -O- https://ipecho.net/plain)
$ aws ec2 authorize-security-group-ingress --group-id $SEC_GROUP_ID --port $nodePort --protocol tcp --cidr "$myIP/32"
$ curl $IP:$nodePort
<h1>It works!</h1>


After all these nitty-gritty details, let us summarize what we have found. When you start a pod on a node, a pair of virtual ethernet devices is created, with one end being assigned to the namespace of the container and one end being assigned to the namespace of the host. Then IP routes are added so that traffic directed towards the pod is forwarded to this bridge. This allows access to the container from the node on which they are running. To realize access from other nodes and pods and thus the flat Kubernetes networking model, EKS uses the AWS CNI plugin which attaches the pods IP addresses as secondary IP addresses to elastic network interfaces.

When you start a service, Kubernetes will in addition set up NAT rules that will capture traffic determined for the cluster IP address of the service and perform a destination network address translation so that this traffic gets send to one of the pods. The pod is selected at random, which implements a simple load balancer. For a service of type NodePort, additional rules will be created which make sure that the same NAT processing applies to traffic coming from the outside world.

This completes today post. If you want to learn more, you might want to check out some of the links below.

Kubernetes services and load balancers

In my previous post, we have seen how we can use Kubernetes deployment objects to bring up a given number of pods running our Docker images in a cluster. However, most of the time, a pod by itself will not be able to operate – we need to connect it with other pods and the rest of the world, in other words we need to think about networking in Kubernetes.

To try this out, let us assume that you have a cluster up and running and that you have submitted a deployment of two httpd instances in the cluster. To easily get to this point, you can use the scripts in my GitHub repository as follows. These scripts will also open two ssh connections, one to each of the EC2 instances which are part of the cluster (this assumes that you are using a PEM file called eksNodeKey.pem as I have done it in my examples, if not you will have to adjust the script up.sh accordingly).

# Clone repository
$ git clone https://github.com/christianb93/Kubernetes.git
$ cd Kubernetes/cluster
# Bring up cluster and two EC2 nodes
$ chmod 700 up.sh
$ ./up.sh
# Bring down nginx controller
$ kubectl delete svc ingress-nginx -n ingress-nginx
# Deploy two instances of the httpd
$ kubectl apply -f ../pods/deployment.yaml

Be patient, the creation of the cluster will take roughly 15 minutes, so time to get a cup of coffee. Note that we delete an object – the nginx ingress controller service – that my scripts generate and that we will use in a later post, but which blur the picture for today.

Now let us inspect our cluster and try out a few things. First, let us get the running pods.

$ kubectl get pods -o wide

This should give you two pods, both running an instance of the httpd. Typically, Kubernetes will distribute these two pods over two different nodes in the cluster. Each pod has an IP address called the pod IP address which is displayed in the column IP of the output. This is the IP address under which the pod is reachable from other pods in the cluster.

To verify this, let us attach to one of the pods and run curl to access the httpd running in the other pod. In my case, the first pod has IP address We will attach to the second pod and verify that this address is reachable from there. To get a shell in the pod, we can use the kubectl exec command which executes code in a pod. The following commands extract the id of the second pod, opens a shell in this pod, installs curl and talks to the httpd in the first pod. Of course, you will have to replace the IP address of the first pod – – with whatever your output gives you.

$ name=$(kubectl get pods --output json | \
             jq -r ".items[1].metadata.name")
$ kubectl exec -it $name "/bin/bash"
bash-4.4# apk add curl
bash-4.4# curl
<h1>It works!</h1>

Nice. So we can reach a port on pod A from any other pod in the cluster. This is what is called the flat networking model in Kubernetes. In this model, each pod has a separate IP address. All containers in the pod share this IP address and one IP namespace. Every pod IP address is reachable from any other pod in the cluster without a need to set up a gateway or NAT. Coming back to the comparison of a pod with a logical host, this corresponds to a topology where all logical hosts are connected to the same IP network.

In addition, Kubernetes assumes that every node can reach every pod as well. You can easily confirm this – if you log into the node (not the pod!) on which the second pod is running and use curl from there, directed to the IP address of the first pod, you will get the same result.

Now Kubernetes is designed to run on a variety of platforms – locally, on a bare metal cluster, on GCP, AWS or Azure and so forth. Therefore, Kubernetes itself does not implement those rules, but leaves that to the underlying platform. To make this work, Kubernetes uses an interface called CNI (container networking interface) to talk to a plugin that is responsible for executing the platform specific part of the network configuration. On EKS, the AWS CNI plugin is used. We will get into the details in a later post, but for the time being simply assume that this works (and it does).

So now we can reach every pod from every other pod. This is nice, but there is an issue – a pod is volatile. An application which is composed of several microservices needs to be prepared for the event that a pod goes down and is brought up again, be it due to a failure or simply due to the fact that an auto-scaler tries to empty a node. If, however, a pod is restarted, it will typically receive a different IP address.

Suppose, for instance, you had a REST service that you want to expose within your cluster. You use a deployment to start three pods running the REST service, but which IP address should another service in the cluster use to access it? You cannot rely on the IP address of individual pods to be stable. What we need is a stable IP address which is reachable by all pods and which routes traffic to one instance of this REST service – a bit like a cluster-internal load balancer.

Fortunately, Kubernetes services come to the rescue. A service is a Kubernetes object that has a stable IP address and is associated with a set of pods. When traffic is received which is directed to the service IP address, the service will select one of the pods (at random) and forward the traffic to it. Thus a service is a decoupling layer between the instable pod IP addresses and the rst of the cluster or the outer world. We will later see that behind the scenes, a service is not a process running somewhere, but essentially a set of smart NAT and routing rules.

This sounds a bit abstract, so let us bring up a service. As usual, a service object is described in a manifest file.

apiVersion: v1
kind: Service
  name: alpine-service
    app: alpine
  - protocol: TCP
    port: 8080
    targetPort: 80

Let us call this file service.yaml (if you have cloned my repository, you already have a copy of this in the directory network). As usual, this manifest file has a header part and a specification section. The specification section contains a selector and a list of ports. Let us look at each of those in turn.

The selector plays a role similar to the selector in a deployment. It defines a set of pods that are assumed to be reachable via the service. In our case, we use the same selector as in the deployment, so that our service will send traffic to all pods brought up by this deployment.

The ports section defines the ports on which the service is listening and the ports to which traffic is routed. In our case, the service will listen for TCP traffic on port 8080, and will forward this traffic to port 80 on the associated pods (as specified by the selector). We could omit the targetPort field, in this case the target port would be equal to the port. We could also specify more than one combination of port and target port and we could use names instead of numbers for the ports – refer to the documentation for a full description of all configuration options.

Let us try this. Let us apply this manifest file and use kubectl get svc to list all known services.

$ kubectl apply -f service.yaml
$ kubectl get svc

You should now see a new service in the output, called alpine-service. Similar to a pod, this service has a cluster IP address assigned to it, and a port number (8080). In my case, this cluster IP address is We can now again get a shell in one of the pods and try to curl that address

$ name=$(kubectl get pods --output json | \
             jq -r ".items[1].metadata.name")
$ kubectl exec -it $name "/bin/bash"
bash-4.4# apk add curl # might not be needed 
bash-4.4# curl
<h1>It works!</h1>

If you are lucky and your container did not go down in the meantime, curl will already be installed and you can skip the apk add curl. So this works, we can actually reach the service from within our cluster. Note that we now have to use port 8080, as our service is listening on this port, not on port 80.

You might ask how we can get the IP address of the service in real world? Well, fortunately Kubernetes does a bit more – it adds a DNS record for the service! So within the pod, the following will work

bash-4.4# curl alpine-service:8080
<h1>It works!</h1>

So once you have the service name, you can reach the httpd from every pod within the cluster.

Connecting to services using port forwarding

Having a service which is reachable within a cluster is nice, but what options do we have to reach a cluster from the outside world? For a quick and dirty test, maybe the easiest way is using the kubectl port forwarding feature. This command allows you to forward traffic from a local port on your development machine to a port in the cluster which can be a service, but also a pod. In our case, let us forward traffic from the local port 5000 to port 8080 of our service (which is hooked up to port 80 on our pods).

$ kubectl port-forward service/alpine-service 5000:8080 

This will start an instance of kubectl which will bind to port 5000 on your local machine ( You can now connect to this port, and behind the scenes, kubectl will tunnel the traffic through the Kubernetes master node into the cluster (you need to run this in a second terminal, as the kubectl process just started is still blocking your terminal).

$ curl localhost:5000
<h1>It works!</h1>

A similar forwarding can be realized using kubectl proxy, which is designed to give you access to the Kubernetes API from your local machine, but can also be used to access services.

Connect to a service using host ports

Forwarding is easy and a quick solution, but most likely not what you want to do in a production setup. What other options do we have?

One approach is to use host ports. Essentially, a host port is a port on a node that Kubernetes will wire up with the cluster IP address of a service. Assuming that you can reach the host from the outside world, you can then use the public IP address of the host to connect to a service.

To create a host port, we have to modify our manifest file slightly by adding a host port specification.

apiVersion: v1
kind: Service
  name: alpine-service
    app: alpine
  - protocol: TCP
    port: 8080
    targetPort: 80
  type: NodePort

Note the additional line at the bottom of the file which instructs Kubernetes to open a node port. Assuming that this file is called nodePortService.yaml, we can again use kubectl to bring down our existing service and add the node port service.

$ kubectl delete svc alpine-service
$ kubectl apply -f nodePortService.yaml
$ kubectl get svc

We see that Kubernetes has brought up our service, but this time, we see two ports in the line describing our service. The second port (32226 in my case) is the port that Kubernetes has opened on each node. Traffic to this port will be forwarded to the service IP address and port. To try this out, you can use the following commands to get the external IP address of the first node, adapt the AWS security group such that traffic to this node is allowed from your workstation, determine the node port and curl it. If your cluster is not called myCluster, replace every occurrence of myCluster with the name of your cluster.

$ nodePort=$(kubectl get svc alpine-service --output json | jq ".spec.ports[0].nodePort")
$ IP=$(aws ec2 describe-instances --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --output text --query Reservations[0].Instances[0].PublicIpAddress)
$ SEC_GROUP_ID=$(aws ec2 describe-instances --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --output text --query Reservations[0].Instances[0].SecurityGroups[0].GroupId)
$ myIP=$(wget -q -O- https://ipecho.net/plain)
$ aws ec2 authorize-security-group-ingress --group-id $SEC_GROUP_ID --port $nodePort --protocol tcp --cidr "$myIP/32"
$ curl $IP:$nodePort
<h1>It works!</h1>

Connecting to a service using a load balancer

A node port will allow you to connect to a service using the public IP of a node. However, if you do this, this node will be a single point of failure. For a HA setup, you would typically choose a different options – load balancers.

Load balancers are not managed directly by Kubernetes. Instead, Kubernetes will ask the underlying cloud provider to create a load balancer for you which is then connected to the service – so there might be additional charges. Creating a service exposed via a load balancer is easy – just change the type field in the manifest file to LoadBalancer

apiVersion: v1
kind: Service
  name: alpine-service
    app: alpine
  - protocol: TCP
    port: 8080
    targetPort: 80
  type: LoadBalancer

After applying this manifest file, it takes a few seconds for the load balancer to be created. Once this has been done, you can find the external DNS name of the load balancer (which AWS will create for you) in the column EXTERNAL-IP of the output of kubectl get svc. Let us extract this name and curl it. This time, we use the jsonpath option of the kubectl command instead of jq.

$ host=$(kubectl get svc alpine-service --output  \
$ curl $host:8080
<h1>It works!</h1>

If you get a “couldn not resolve hostname” error, it might be that the DNS entry has not yet propagated through the infrastructure, this might take a few minutes.

What has happened? Behind the scenes, AWS has created an elastic load balancer (ELB) for you. Let us describe this load balancer.

$ aws elb describe-load-balancers --output json
    "LoadBalancerDescriptions": [
            "Policies": {
                "OtherPolicies": [],
                "LBCookieStickinessPolicies": [],
                "AppCookieStickinessPolicies": []
            "AvailabilityZones": [
            "CanonicalHostedZoneName": "ad76890d448a411e99b2e06fc74c8c6e-2099085292.eu-central-1.elb.amazonaws.com",
            "Subnets": [
            "CreatedTime": "2019-03-17T11:07:41.580Z",
            "SecurityGroups": [
            "Scheme": "internet-facing",
            "VPCId": "vpc-060469b2a294de8bd",
            "LoadBalancerName": "ad76890d448a411e99b2e06fc74c8c6e",
            "HealthCheck": {
                "UnhealthyThreshold": 6,
                "Interval": 10,
                "Target": "TCP:30829",
                "HealthyThreshold": 2,
                "Timeout": 5
            "BackendServerDescriptions": [],
            "Instances": [
                    "InstanceId": "i-0cf7439fd8eb65858"
                    "InstanceId": "i-0fda48856428b9a24"
            "SourceSecurityGroup": {
                "GroupName": "k8s-elb-ad76890d448a411e99b2e06fc74c8c6e",
                "OwnerAlias": "979256113747"
            "DNSName": "ad76890d448a411e99b2e06fc74c8c6e-2099085292.eu-central-1.elb.amazonaws.com",
            "ListenerDescriptions": [
                    "PolicyNames": [],
                    "Listener": {
                        "InstancePort": 30829,
                        "LoadBalancerPort": 8080,
                        "Protocol": "TCP",
                        "InstanceProtocol": "TCP"
            "CanonicalHostedZoneNameID": "Z215JYRZR1TBD5"

This is a long output, let us see what this tells us. First, there is a list of instances, which are the instances of the nodes in your cluster. Then, there is the block ListenerDescriptions. This block specificies, among other things, the load balancer port (8080 in our case, this is the port that the load balancer exposes) and the instance port (30829). You will note that these are also the ports that kubectl get svc will give you. So the load balancer will send incoming traffic on port 8080 to port 30829 of one of the instances. This in turn is a host port as discussed before, and therefore will be connected to our service. Thus, even though technically not fully correct, the following picture emerges (technically, a service is not a process, but a collection of iptables rules on each node, which we will look at in more detail in a later post).


Using load balancers, however, has a couple of disadvantages, the most obvious one being that each load balancer comes with a cost. If you have an application that exposes tens or even hundreds of services, you clearly do not want to fire up a load balancer for each of them. This is where an ingress comes into play, which can distribute incoming HTTP(S) traffic across various services and which we will study in one of the next posts.

There is one important point when working with load balancer services – do not forget to delete the service when you are done! Otherwise, the load balancer will continue to run and create charges, even if it is not used. So delete all services before shutting down your cluster and if in doubt, use aws elb describe-load-balancers to check for orphaned load balancers.

Creating services in Python

Let us close this post by looking into how services can be provisioned in Python. First, we need to create a service object and populate its metadata. This is done using the following code snippet.

service = client.V1Service()
service.api_version = "v1"
service.kind = "Service"
metadata = client.V1ObjectMeta(name="alpine-service")
service.metadata = metadata

Now we assemble the service specification and attach it to the service object.

spec = client.V1ServiceSpec()
selector = {"app": "alpine"}
spec.selector = selector
port = client.V1ServicePort(
              port = 8080, 
              protocol = "TCP", 
              target_port = 80 )
spec.ports = [port]
service.spec = spec

Finally, we authenticate, create an API endpoint and submit the creation request.

api = client.CoreV1Api()

If you have cloned my GitHub repository, you will find a script network/create_service.py that contains the full code for this and that you can run to try this out (do not forget to delete the existing service before running this).

Kubernetes 101 – creating pods and deployments

In the last posts, we have seen how we can set up a Kubernetes cluster on Amazons EKS platform and spin up our first nodes. Today, we will create our first workloads and see pods and deployments in action.

Creating pods

We have already introduces pods in an earlier post as the smallest units that Kubernetes manages. To create pods, there are several options. First, we can use the kubectl command line tool and simply pass the description of the pod as arguments. Second, we can use a so-called manifest file which contains the specification of the pod, which has the obvious advantage that this file can be reused, put under version control, developed and tested, according to the ideas of the “Infrastructure as a code” approach. A manifest file can either be provided in JSON format or using the YAML markup language. And of course, we can again use the Kubernetes API directly, for instance by programming against the Python API.

In this post, we will demonstrate how to create pods using manifest file in YAML format and Python scripts. We start with a YAML file which looks as follows.

apiVersion: v1
kind: Pod
  name: naked-pod-demo
  namespace: default
  - name: naked-pod-demo-ctr
    image: nginx

Let us go through this file step by step to understand what it means. First, this is a YAML file, and YAML is essentially a format for specifying key-value pairs. Our first key is apiVersion, with value v1. This is the first line in all manifest files and simply specifies the API version that we will use.

The next key – kind – specifies the type of Kubernetes resource that we want to access, in this case a pod. The next key is metadata. The value of this key is a dictionary that again has two keys – name and namespace. The name is the name that our pod will have. The namespace key specifies the namespace in which the pod will start (namespaces are a way to segment Kubernetes resources and are a topic for a future post, for the time being we will simply use the so-called default namespace).

The next key – spec now contains the actual specification of the pod. This is where we tell Kubernetes what our pod should be running. Remember that a pod can contain one or more application containers. Therefore there is a key containers, which, as a value, has a list of containers. Each container has a name and further attributes.

In our case, we create one container and specify that the image should be nginx. At this point, we can specify every image that we could also specify if we did a docker run locally. In this case, Kubernetes will pull the nginx image from the default docker registry and run it.

So let us now trigger the creation of a pod by passing this specification to kubectl. To do this, we use the following command (assuming that you have saved the file as naked_pod.yaml, the reason for this naming will become clear later).

kubectl apply -f naked_pod.yaml

After a few seconds, Kubernetes should have had enough time to start up the container, so let us use kubectl to check whether the node has been created.

$ kubectl get pods 
naked-pod-demo   1/1     Running   0          38s

Nice. Let us take a closer look at this pod. If you run kubectl get pods -o wide, kubectl will give you a bit more information on the pod, including the name of the node on which the pod is running. So let us SSH into this node. To easily log into one of your nodes (the first node in this case, change Node to 1 to log into the second node), you can use the following sequence of commands. This will open the SSH port for all instances in the cluster for incoming connections from the your current IP address. Note that we use the cluster name as a filter, so you might need to replace myCluster in the commands below by the name of your cluster.

$ Node=0
$ IP=$(aws ec2 describe-instances --output text --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --query Reservations[$Node].Instances[0].PublicIpAddress)
$ SEC_GROUP_ID=$(aws ec2 describe-instances  --output text --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --query Reservations[0].Instances[0].SecurityGroups[0].GroupId)
$ myIP=$(wget -q -O- https://ipecho.net/plain)
$ aws ec2 authorize-security-group-ingress --group-id $SEC_GROUP_ID --port 22 --protocol tcp --cidr $myIP/32
$ ssh -i ~/eksNodeKey.pem ec2-user@$IP

Use this – or your favorite method (if you use my script up.sh to start the cluster, it will already open ssh connections to both nodes for you) – to log into the node on which the pod has been scheduled by Kubernetes and execute a docker ps there. You should see that, among some management containers that Kubernetes brings up, a container running nginx appears.

Now let us try a few things. First, apply the YAML file once more. This will not bring up a second pod. Instead, Kubernetes realizes that a pod with this name is already running and tells you that no change has been applied. This makes sense – in general, a manifest file specifies a target state, and we are already in the target state, so there is no need for action.

Now, on the EKS worker node on which the pod is running, use docker kill to actually stop the container. Then wait for a few seconds and do a docker ps again. Surprisingly, you will see the container again, but with a different container ID. What happened?

The answer is hidden in the logfiles of the component of Kubernetes that controls a single node – the kubelet. On the node, the kubelet is started using systemctl (you might want to verify this in the AWS provided bootstrap.sh script). So to access its logs, we need

$ journalctl -u kubelet

Close to the end of the logfile, you should see a line stating that … container … is dead, but RestartPolicy says that we should restart it. In fact, the kubelet constantly monitors the running containers and looks for deviations from the target state. The restart policy is a piece of the specification of a pod that tells the kubelet how to handle the case that a container has died. This might be perfectly fine (for batch jobs), but the default restart policy is Always, so the the kubelet will restart our container.

This is nice, but now let us simulate a more drastic event – the node dies. So on the AWS console, locate the node on which the pod is running and terminate the pod.

After a few seconds, you will see that a new node is created, thanks to the AWS auto-scaling group that controls our nodes. However, if you check the status of the pods using kubectl get pods, you will see that Kubernetes did not reschedule the pod on a different node, nor does it restart the pod once the replacement node is up and running again. So Kubernetes takes care (via the kubelet) of the containers that belong to a pod on a per-node level, but not on a per-cluster level.

In productions, this is of course typically not what you want – instead, it would be nice if Kubernetes could monitor the pods for you and restart them automatically in case of failure. This is what a deployment does.

Replica sets and deployments

A deployment is an example for a more general concept in Kubernetes – controllers. Basically, this is a component that constantly monitors the cluster state and makes changes if needed to get back into the target state. One thing that a deployment does is to automatically bring up a certain number of instances called replicas of a Docker image and distribute the resulting pods across the cluster. Once the pods are running, the deployment controller will monitor them and take action of the number of pods running deviates from the desired number of pods. As an example, let us consider the following manifest file.

apiVersion: apps/v1
kind: Deployment
  name: alpine
      app: alpine
  replicas: 2
        app: alpine
        - name: alpine-ctr
          image: httpd:alpine

The first line is specifying the API version as usual (though we need to use a different value for the API version, the additional apps is due to the fact that the Deployment API is part of the API group apps). The second line designates the object that we want to create as a deployment. We then again have a name which is part of the metadata, followed by the actual specification of the deployment.

The first part of this specification is the selector. To understands its role, recall that a deployment is supposed to make sure that a specified number of pods of a given type are running. Now, the term “of a given type” has to be made precise, and this is what the selector is being used for. All pods which are created by our deployment will have a label with key “app” and value “alpine”. The selector field specifies that all pods with that label are to be considered as pods controlled by the deployment, and our controller will make sure that there are always exactly two pods with this label.

The second part of the specification is the template that the deployment uses to create the pods controlled by this deployment. This looks very much like the definition of a pod as we have seen it earlier. However, the label that is specified here of course needs to match the label in the selector field (kubectl will actually warn you if this is not the case).

Let us now delete our existing pod, run this deployment and see what is happening. Save the manifest file as deployment.yaml, apply it and then list the running nodes.

$ kubectl delete -f naked_pod.yaml
$ kubectl apply -f deployment.yaml
deployment.apps/alpine created
$ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
alpine-8469fc798f-rq92t   1/1     Running   0          21s
alpine-8469fc798f-wmsqc   1/1     Running   0          21s

So two pods have been created and scheduled to nodes of our cluster. To inspect the container further, let us ask kubectl to provide all details as JSON and use the wonderful jq to process the output and extract the container information from it (you might need to install jq to run this).

$ kubectl get pods --output json | jq ".items[0].spec.containers[0].image"

So we see that the pods and the containers have been created according to our specification and run the httpd:alpine docker image.

Let us now repeat our experiment from before and stop one of the node. To do this, we first extract the instance ID of the first instance using the AWS CLI and then use the AWS CLI once more to stop this instances.

$ instanceId=$(aws ec2 describe-instances --output text --filters Name=tag-key,Values=kubernetes.io/cluster/myCluster Name=instance-state-name,Values=running --query Reservations[0].Instances[0].InstanceId)
$ aws ec2 terminate-instances --instance-ids $instanceId

After some time, Kubernetes will find that the instance has died, and if you run a kubectl get nodes, you will find that the node has disappeared. If you now run kubectl get nodes -o wide, however, you will still find two pods, but now both are running on the remaining node. So the deployment controller has replaced the pod that went down with our node by a new pod on the remaining node. This behaviour is in fact not realized by the deployment, but by the underlying ReplicaSet. We could also create a replica set directly, but a deployment offers some additional features, like the possibility to run a rolling upgrade automatically.

Creating deployments in Python

Having discussed how to create a deployment from a YAML file, let us again do the same thing in Python. The easiest approach is to walk our way upwards through the YAML file. After the usual preparations (imports, loading the configuration), we therefore start with the specification of the container.

import client
container = client.V1Container(

This is similar to our YAML file – in each pod, we want to run a container with the httpd:alpine image. Having the container object, we can now create the template section.

template = client.V1PodTemplateSpec(
                   labels={"app": "alpine"}),

Again, note the label that we will use later to select the pods that our deployment controller is supposed to watch. Now we put the template into the specification part of our controller.

selector = client.V1LabelSelector(
              match_labels={"app" : "alpine"})
spec = client.V1DeploymentSpec(

And finally, we can create the actual deployment object and ask Kubernetes to apply it – the complete Python script is available for download here).

deployment = client.V1Deployment(

              namespace="default", body=deployment)

This completes our post for today. We now know how to create pods and deployments to run Docker images in our cluster. In the next few posts, we will look at networking in Kubernetes – services, load balancers, ingress and all that.

Python up an EKS cluster – part II

In the last post, we have seen how Python can be used to control the generation of an EKS cluster. However, an EKS cluster without any worker nodes – and hence without the ability to start pods and services – is of very limited use. Today, we therefore take a look at the process of adding worker nodes.

Again, we roughly follow the choreography of Amazons Getting started with EKS guide, but instead of going through the individual steps using the console first, we start using the API and the Python SDK right away.

This description assumes that you have started an EKS cluster as described in my last post. If not, you can do this by running the create_cluster Python script that executes all necessary steps.

Essentially, we will do two things. First, we will create an AWS Auto-Scaling group that we will use to manage our worker nodes. This can be done using the AWS API and the Python SDK. Part of this setup will be a new role, that, in a second step, we need to make known to Kubernetes so that our nodes can join the cluster. This will be done using the Kubernetes API. Let us now look at each of these steps in turn.

Creating an auto-scaling group

An auto-scaling group is essentially a group of EC2 instances that are combined into one management unit. When you set up an auto-scaling group, you specify a scaling policy and AWS will apply that policy to make sure that a certain number of instances is automatically running in your group. If the number of instances drops below a certain value, or if the load increases (depending on the policy), then AWS will automatically spin up new instances for you.

We will use AWS CloudFormation to set up the auto-scaling group, using the template referenced in the “Getting started” guide. This is not an introduction into CloudFormation, but it is instructive to look at this template, stored at this URL.

If you look at this file, you will recognize a few key elements. First, there are parameters whose values we will have to provide when we use the template. These parameters are

  • A unique name for the node group
  • The name of an SSH key pair that we will later use to log into our nodes. I recommend to create a separate key pair for this, and I will assume that you have done this and that this key pair is called eksNodeKey
  • An instance type that AWS will use to spin up nodes
  • The ID of an AMI image that will be used to spin up the nodes. In this tutorial, we will use Kubernetes v1.11 and the specific AWS provided AMIs which are listed in the Getting started guide. I work in the region eu-frankfurt-1, and will therefore use the AMI ID ami-032ed5525d4df2de3
  • The minimum, maximum and desired size of the auto-scaling group. We will use one for the minimum size, two for the desired size and three for the maximum size, so that our cluster will come up with two nodes initially
  • The size of the volume attached to each node – we will not provide a value for this and stick to the default (20 GB at the time of writing)
  • In addition, we will have to supply the VPC, the subnets and the security group that we used before

In addition to the parameters, the CloudFormation template contains a list of resources that will be created. We see that in addition to several security groups and the auto scaling group, a IAM role called the node instance role and a corresponding instance profile is created. This is the role that the nodes will assume to interact with the Kubernetes master, and behind the scenes, the AWS IAM authenticator will be used to map this IAM role to a Kubernetes user, so that the nodes can authenticate themselves against the Kubernetes API. This requires some configuration, more precisely a mapping between IAM roles and Kubernetes users, which we will set up in our second steps in the next section.

So how do we apply this cloud formation template? Again, the AWS API comes to the rescue. It provides an API endpoint (client) for CloudFormation which has a method create_stack. This method expects a Python data structure that contains the parameters that we want to use, plus the URL of the cloud formation template. It will then take the template, apply the parameters, and create the resources defined in the template.

Building the parameter structure is easy, assuming that you have already stored all configuration data in Python variables (we can use the same method as in the last post to retrieve cluster data like the VPC ID). The only tricky part is the list of subnets. In CloudFormation, a parameter value that is a list needs to be passed as a comma separated list, not as a JSON list (it took me some failed attempts to figure out this one). Hence we need to convert our list of subnets into CSV format. The code snippet to set up the parameter map then looks as follows.

params = [
  {'ParameterKey' : 'KeyName' , 
   'ParameterValue' : sshKeyPair },
  {'ParameterKey' : 'NodeImageId' , 
   'ParameterValue' : amiId },
  {'ParameterKey' : 'NodeInstanceType' , 
   'ParameterValue' : instanceType },
  {'ParameterKey' : 'NodeAutoScalingGroupMinSize' , 
   'ParameterValue' : '1' },
  {'ParameterKey' : 'NodeAutoScalingGroupMaxSize' , 
   'ParameterValue' : '3' },
  {'ParameterKey' : 'NodeAutoScalingGroupDesiredCapacity' , 
   'ParameterValue' : '2' },
  {'ParameterKey' : 'ClusterName' , 
   'ParameterValue' : clusterName },
  {'ParameterKey' : 'NodeGroupName' , 
   'ParameterValue' : nodeGroupName },
  {'ParameterKey' : 'ClusterControlPlaneSecurityGroup' , 
   'ParameterValue' : secGroupId },
  {'ParameterKey' : 'VpcId' , 
   'ParameterValue' : vpcId },
  {'ParameterKey' : 'Subnets' , 
   'ParameterValue' : ",".join(subnets) },

Some of the resources created by the template are IAM resources, and therefore we need to confirm explicitly that we allow for this. To do this, we need to specify a specific capability when calling the API. With that, our call now looks as follows.

cloudFormation = boto3.client('cloudformation')
  StackName = stackName, 
  TemplateURL = templateURL,
  Parameters = params,
  Capabilities = ['CAPABILITY_IAM'])

Here the template URL is the location of the CloudFormation template provided above, and stack name is some unique name that we use for our stack (it is advisable to include the cluster name, so that you can create and run more than one cluster in parallel).

When you run this command, several things will happen, and it is instructive to take a look at the EC2 console and the CloudFormation console while the setup is in progress. First, you will find that a new stack appears on the CloudFormation console, being in status “In Creation”.

Second, the EC2 console will display – after a few seconds – a newly created auto scaling group, with the parameters specified in the template, and a few additional security groups.


And when you list your instances, you will find that AWS will spin up two instances of the specified type. These instances will come up as usual and are ordinary EC2 instances.


Remember that we specified a key pair when we created the auto-scaling group? Hopefully, you did save the PEM file – if yes, you can now SSH into your nodes. For that purpose, you will have to modify the security group of the nodes and open the SSH port for inbound traffic from your own network. Once this has been done, you can simply use an ordinary SSH command to connect to your machines.

$ ssh -i eksNodeKey.pem ec2-user@

where of course you need to replace with whatever public IP address the instance has. You can now inspect the node, for instance doing docker ps to see which docker containers are already running on the node, and you can use ps ax to see which processes are running.

Attaching the auto-scaling group to Kubernetes

At this point, the nodes are up and running, but in order to register with Kubernetes, they need to talk to the Kubernetes API. More specifically, the kubelet running on the node will try to connect to the cluster API and therefore act as a Kubernetes client. As every Kubernetes client, this requires some authorization. This field is a bit tricky, but let me try to give a short overview (good references for what follows are the documentation of the AWS IAM authenticator, this excellent blog post, and the Kubernetes documentation on authorization).

Kubernetes comes with several possible authentication methods. The method used by EKS is called bearer token. With this authentication method, a client request contains a token in its HTTPS header which is used by Kubernetes to authorize the request. To create and verify this token, a special piece of software is used – the AWS IAM authenticator. This utility can operate in client mode – it then generates a token – and in server mode, where it verifies a token.

The basic chain of events when a client like kubectl wants to talk to the API is as follows.

  • The client consults the configuration file ~/.kube/config to determine the authorization method to be used
  • In our case, this configuration refers to the AWS IAM authenticator, so the client calls this authenticator
  • The authenticator looks up the current AWS credentials, uses them to generate a token and returns that token to the client
  • The client sends the token to the API endpoint along with the request
  • The Kubernetes API server extracts the token and invokes the authenticator running on the Kubernetes management nodes (as a server)
  • The authenticator uses the token to determine the IAM role used by the client and to verify it
  • It then maps the IAM role to a Kubernetes user which is then used for authorization

You can try this yourself. Log into one of the worker nodes and run the command

$ aws sts get-caller-identity

This will return the IAM role ARN used by the node. If you compare this with the output of the CloudFormation stack that we have used to create our auto-scaling group, you will find that the node assumes the IAM role created when our auto-scaling group was generated. Let us now verify that this identity is also used when the AWS IAM authenticator creates a token. For that purpose, let us manually create a token by running – still on the node – the command

$ aws-iam-authenticator token -i myCluster

This will return a dictionary in JSON format, containing a key called token. Extract the value of this key, this will be a base64 encoded string starting with “k8s-aws-v1”. Now, fortunately the AWS authenticator also offers an operation called verify that we can use to extract the IAM role again from the token. So run

$ aws-iam-authenticator verify -i myCluster -d ""k8s-aws-v1..."

where the dots represent the full token extracted above. As a result, the authenticator will print out the IAM role, which we again recognize as the IAM role created along with the auto-scaling group.

Essentially, when the kubelet connects to the server and the AWS IAM authenticator running on the server receives the token, it will use the same operation to extract the IAM role. It then needs to map the IAM role to a Kubernetes role. But where does this mapping information come from?

The answer is that we have to provide it using the standard way to store configuration data in a Kubernetes cluster, i.e. by setting up a config map.

Again, Amazon provides a template for this config map in YAML format, which looks as follows.

apiVersion: v1
kind: ConfigMap
  name: aws-auth
  namespace: kube-system
  mapRoles: |
    - rolearn: 
      username: system:node:{{EC2PrivateDNSName}}
        - system:bootstrappers
        - system:nodes

Thus our config map contains exactly one key-value pair, with key mapRoles. The value of this key is again a string in YAML format (note that the pipe is an escape character in YAML which instructs us to treat the next lines as a multi-line string, not as a list). Of course, we have to replace the placeholder by the ARN of the role that we want to map, i.e. the role created along with the auto-scaling group.

To create this config map, we will have to use the Python SDK for Kubernetes that will allow us to talk to the Kubernetes API. To install it, run

$ pip3 install kubernetes

To be able to use the API from within our Python script, we of course have to import some components from this library and read our kubectl configuration file, which is done using the following code.

from kubernetes import client, config
v1 = client.CoreV1Api() 

The object that we call v1 is the client that we will use to create and submit API requests. To create our config map, we first need to instantiante an object of the type V1ConfigMap and populate it. Assuming that the role ARN we want to map is stored in the Python variable nodeInstanceRole, this can be done as follows.

body = client.V1ConfigMap()
body.metadata = {}
body.metadata['name'] = "aws-auth"
body.metadata['namespace'] = "kube-system"
body.data = {}
body.data['mapRoles'] = "- rolearn: " + nodeInstanceRole +  "\n  username: system:node:{{EC2PrivateDNSName}}\n  groups:\n    - system:bootstrappers\n    - system:nodes\n" 

Finally, we call the respective method to actually create the config map.

v1.create_namespaced_config_map("kube-system", body) 

Once this method completes, our config map should be in place. You can verify this by running kubectl on your local machine to describe the config map.

$ kubectl describe configMaps aws-auth -n kube-system
Name:         aws-auth
Namespace:    kube-system

- rolearn: arn:aws:iam::979256113747:role/eks-auto-scaling-group-myCluster-NodeInstanceRole-HL95F9DBQAMD
  username: system:node:{{EC2PrivateDNSName}}
    - system:bootstrappers
    - system:nodes


If you see this output, we are done. The EC2 instances should now be able to connect to the Kubernetes master nodes, and when running

$ kubectl get nodes

you should get a list displaying two nodes. Congratulations, you have just set up your first Kubernetes cluster!

The full Python script which automates these steps can as usual be retrieved from GitHub. Once you are done playing with your cluster, do not forget to delete everything again, i.e. delete the auto-scaling group (which will also bring down the nodes) and the EKS cluster to avoid high charges. The best approach to clean up is to delete the CloudFormation stack that we have used to bring up the auto-scaling group (in the AWS CloudFormation console), which will also terminate all nodes, and then use the AWS EKS console to delete the EKS cluster as well.

This completes our post for today. In the next post in this series, we will learn how we can actually deploy containers into our cluster.

Python up an EKS cluster – part I

When you want to try out Kubernetes, you have several choices. You can install Kubernetes in a cluster,  install it locally using Minikube, or use one of the Kubernets offerings of the major cloud providers like AWS, GCP or Azure. In this post, we set up a Kubernetes cluster on Amazons EKS platform. As I love Python, we will do this in two steps. First, we will go through the steps manually, as described in the AWS documentation, using the AWS console. Then, we will learn how to use the AWS Python SDK to automate the recurring steps using Python.

A word of caution before we continue: when you follow these instructions, Amazon will charge you for the cluster as well as for the nodes that the cluster uses! So make sure that you stop all nodes and delete the cluster if you are done to avoid high charges! At the time of writing, Amazon charges 20 cents per hour per cluster plus the EC2 instances underlying the cluster, so if you play with this for one our two hours it will not cost you a fortune, but if you forget to shut everything down and let it run, it can easily add up!

What is EKS?

Before we get our hands dirty, let us quickly discuss what EKS actually is. Essentially, EKS offers you the option to set up a Kubernetes cluster in the Amazon cloud. You can connect to this cluster using the standard Kubernetes API and the standard Kubernetes tools. The cluster will manage worker nodes that are running on Amazons EC2 platform, using your EC2 account. The management nodes on which Kubernetes itself is running are not running on your EC2 instances, but are provided by Amazon.

When you set up an EKS cluster, the cluster and your worker nodes will be running in a virtual private network (VPC). You can add load balancers to make your services reachable from the internet. In addition, EKS is integrated with Amazons IAM.

This tutorial assumes that you own an AWS account and that you have followed the instructions in the IAM getting started guide to set up an IAM user with administrator privileges. Please make sure that you are logged into the AWS console with this IAM user, not your root account.

To set up an EKS cluster, several preparational steps are required. First, you will need to set up an IAM role that EKS will use to control EC2 nodes on your behalf. Next, you will need a VPC in which your cluster will run. After these one time preparation steps have been completed, you can create a Kubernetes cluster, add worker nodes and start deployments. In the following, we will go through most of these steps one by one.

One time preparational steps

Before you can use EKS, you will need to add an IAM role. To do this, navigate to the EKS console (note that this link will automatically take you to your home region). As described on the Getting started with EKS page, open the IAM management console. In the menu on the left, chose “Roles” and then hit the button “Create role”. Choose “EKS” from the list of services and hit “Next: permissions” and then immediately “Next: tags” and “Next: Review”. As the role name, enter “EKSEC2UserRole” (you could of course choose any name you want, but to make the scripts that we will establish further below work, choose this name for this tutorial). At this point, your screen should look as follows.

Screenshot from 2019-03-04 20-49-29

Now confirm the setup of the new role with “Create role” and observe how the new role appears in the list in your IAM console.

The next ingredient that we need is the VPC in which the cluster will be running. To set this up, navigate to the CloudFormation page. Hit the button “Create new stack”. Select “Specify an Amazon S3 template URL” and enter https://amazon-eks.s3-us-west-2.amazonaws.com/cloudformation/2019-02-11/amazon-eks-vpc-sample.yaml
. Then hit “Next”.

On the next page, you will have to choose a stack name. Again, this can be any name, but let us follow the standard suggestion and call it “eks-vpc”. Then hit “Next” twice and then “Create”. AWS will now create the VPC for you, which might take a few minutes, you should see a screen as below while this is in progress.

Screenshot from 2019-03-04 20-57-24

Once the stack has been created, open the “Outputs” tab and record the value of SecurityGroups, VpcID and SubnetIds. . In my case, the values are

Key Value
SecurityGroups sg-005cf103994878f27
VpcId vpc-060469b2a294de8bd
SubnetIds subnet-06088e09ce07546b9,

Next, there is some software that you need on your PC. Please follows the instructions on the Getting started with EKS page to download and install kubectl, the aws-iam-authenticator and the aws CLI your your platform. On my Ubuntu Linux PC, this was done using the steps below.

$ sudo snap install kubectl --classic
$ curl -o aws-iam-authenticator https://amazon-eks.s3-us-west-2.amazonaws.com/1.11.5/2018-12-06/bin/linux/amd64/aws-iam-authenticator
$ curl -o aws-iam-authenticator.sha256 https://amazon-eks.s3-us-west-2.amazonaws.com/1.11.5/2018-12-06/bin/linux/amd64/aws-iam-authenticator.sha256
$ openssl sha1 -sha256 aws-iam-authenticator
$ cat aws-iam-authenticator.sha256 # compare checksums!
$ chmod +x ./aws-iam-authenticator
$ mv aws-iam-authenticator $HOME/Local/bin # replace by your local bin directory
$ pip3 install awscli --upgrade --user

Note that pip will install the AWS CLI in $HOME/.local/bin, so make sure to add this directory to your path. We can now test that all tools have been installed by entering

$ kubectl help
$ aws --version

At this point, the aws tool is not yet hooked up with your credentials. To do this, make sure that you have access to the IAM access key ID and access secret key (that you can generate at the IAM console) for your IAM admin user. Then run

$ aws configure

Enter the keys and the required configuration information. When done, enter

$ aws sts get-caller-identity

to confirm that you now access the AWS API with the admin user.

Creating a cluster and adding worker nodes using the AWS console

After all these preparations – which we of course have to go through only once – we are now in a position to create our first EKS cluster. For that purpose, go to the EKS console and look for the box called “Create EKS cluster”. Enter a cluster name – I have chosen “myCluster” (yes, I am very creative) – and hit the button “Next step”.

This will take you to a configuration page. On this page, you will have to select an IAM role that the cluster will use to operate EC2, a VPC and a security group. Make sure to select those entities that we have set up above! Then hit “Create”.

Your cluster will now be created, and you should be taken to the EKS cluster overview page that lists all your clusters along with their status.


Creation of a cluster can take a few minutes, be patient. At some point, the status of your cluster should switch to active. Congratulations, you are now proud owner of a running Kubernetes cluster!

However, this Kubernetes cluster is pretty useless – there are no worker nodes yet, and you are not yet able to communicate with your cluster using kubectl. We will create worker nodes and run services in the next post. To fix the second problem, we will have to hook up the IAM credential chain used by the AWS CLI tool with kubectl. To do this, enter

$ aws eks --region eu-central-1 update-kubeconfig --name myCluster

Once this has completed, you should now be able to user kubectl to connect to your cluster. As an example, run

$ kubectl get svc
kubernetes   ClusterIP           443/TCP   6m

Creating a cluster automatically using the Python SDK

Creating a cluster using the EKS console is a bit time consuming, and given the fact that Amazon will charge you for a running cluster, you will want to delete your cluster when you are done and recreate it easily when you need it again. For that purpose, it is very useful to be able to create a cluster using Python.

To do this, let us first delete the cluster again (on the cluster overview page in the EKS console) that we have just created, and compose a Python script that recreates the cluster for us.

First, of course, we need to install the Python SDK Boto3. This is as easy as

$ pip3 install boto3

Note that the SDK uses the same credentials as the AWS CLI, so you still need to go through the steps above to install and configure the AWS CLI.

Our script will accept the name of a cluster (which was myCluster in the example above). It will then need to determine the IAM role, the VPC and subnets and the security group as we did it for the manual setup.

To determine the IAM role, we need an IAM client and use its method get_role to get role details and obtain the role ARN.

iam = boto3.client('iam')
r = iam.get_role(RoleName=eksRoleName)
roleArn = r['Role']['Arn']

Here we assume that the Python variable eksRoleName contains the name of the role that we have created as part of our one time setup.

Getting the VPC ID, the subnet IDs and the security group is a bit more tricky. For that purpose, we will refer back to the CloudFormation stack that we have created as part of our one-time setup. To get the VPC ID and the security group, we can use the following code snippet.

cloudFormation = boto3.client('cloudformation')
r = cloudFormation.describe_stack_resources(
       StackName = "eks-vpc",
vpcId = r['StackResources'][0]['PhysicalResourceId']
r = cloudFormation.describe_stack_resources(
       StackName = "eks-vpc",
secGroupId = r['StackResources'][0]['PhysicalResourceId']

Once we have this, we can easily obtain the subnet IDs using the EC2 interface.

ec2 = boto3.resource('ec2')
vpc = ec2.Vpc(vpcId)
subnets = [subnet.id for subnet in vpc.subnets.all()]

We now have all the configuration data that we need to trigger the actual cluster setup. For that purpose, we will create an EKS client and call its method create_cluster, passing the information that we have collected so far as arguments. We then create a waiter object that is polling until the cluster status changes to “Active”.

eks = boto3.client("eks")
response = eks.create_cluster(
               roleArn = roleArn,
               resourcesVpcConfig = {
                 'subnetIds' : subnets,
                 'securityGroupIds' : [secGroupId]})
waiter = eks.get_waiter("cluster_active")

When this call completes, our cluster is created. However, we still have to update our Kubernetes configuration file. Of course, we could simply call the AWS CLI, but doing this in Python is more fun. The configuration file that the AWS CLI creates is stored in your home directory, in a subdirectory called kube, and formatted in the simplified markup language known as YAML. Inspecting this file, it is not difficult to translate this into a structure of Python dictionaries and lists, which contains four cluster-specific pieces of data.

  • The cluster endpoint – this is the URL used by kubectl to submit an API call
  • The Amazon resource name of your cluster (ARN)
  • the cluster name
  • a certificate used by kubectl to authorize requests

This information can be retrieved using the method describe_cluster of the EKS client and assembled into a Python data structure. We can then use the Python module yaml to turn this into a string in YAML format, which we can then write to disk. I will not go into detail, but you might want to take a look at the complete script to create a cluster on my GitHub page.

We are now able to automatically create a Kubernetes cluster and connect to it using kubectl. In the next post, we will add the meat – we will learn how to spin up worker nodes and deploy our first pods.


Kubernetes – an overview

Docker containers are nice and offer a very lean and structured way to package and deploy applications. However, if you really want to run large scale containerized applications, you need more – you need to deploy, orchestrate and manager all your containers in a highly customizable and automated way. In other words, you need a container management platform like Kubernetes.

What is a container management platform?

Imagine you have a nice, modern application based on microservices, and you have decided to deploy each microservice into a separate container. Thus your entire application deployment will consist of a relatively large number, maybe more than hundred, individual containers. Each of these containers needs to be deployed, configured and monitored. You need to make sure that if a container goes down, it is restarted and traffic is diverted to a second instance. You need to distribute your load evenly. You need to figure out which container should run on which virtual or physical machine. When you upgrade or deploy, you need to take dependencies into account. If traffic reaches a certain threshold, you would like to scale, either vertically or horizontally. And so forth…

Thus managing a large number of containers in a production-grade environment requires quite some effort. It would be nice to have a platform that is able to

  • Offer loadbalancing, so that incoming traffic is automatically routed to several instances of a service
  • Easily set up container-level networks
  • Scale out automatically if load increases
  • Monitor running services and restart them if needed
  • Distribute containers evenly across all nodes in your network
  • Spin up new nodes or shut down nodes depending on the traffic
  • Manage your volumes and storage
  • Offer mechanisms to upgrade your application and the platform and / or rollback upgrades

In other words, you need a container management platform. In this post – and in a few following posts – I will focus on Kubernetes, which is emerging as a de facto standard for container management platforms. Kubernetes is now the underlying platform for industry grade products like RedHat OpenShift or Rancher, but can of course be used stand-alone as well. In addition, all major cloud providers like Amazon (EKS), Google (GKE), Azure (AKS) and IBM ( IBM Cloud Kubernetes Service) offer Kubernetes as a service on their respective platforms.

Basic Kubernetes concepts

When you first get in touch with the world of Kubernetes, the terminology can be a bit daunting – there are pods, clusters, services, containers, namespaces, deployments, and many other terms that will be thrown at you. Time to explain some of the basic concepts very high level – we will get in closer contact with most of them over the next few posts.

The first object we need to understand is a Kubernetes node. Essentially, a node is a compute ressource, i.e. some sort of physical or virtual host on which Kubernetes can run things. In a Kubernetes cluster, there are two types of nodes. A master node is a node on which the Kubernetes components itself are running. A worker node is a node on which application workloads are running.

On each worker node, Kubernetes will start several agents, for instance the kubelet which is the primary agent responsible for controlling a node. In addition to the agent, there are of course the application workloads. One might suspect that these workloads are managed on a container basis, however, this is not quite the case. Instead, Kubernetes groups containers into objects called pods. A pod is the smallest unit handled by Kubernetes. It encompasses one or more containers that share resources like storage volumes, network namespaces and the same lifecycle. It is quite common to see pods with only one application container, but one might also decide to put, for instance, a microservice and a database into one pod. In this case, when the pod is started, both containers (the container containing the microservice and the container containing the database) are brought up on the same node, can see each other in the network under “localhost” and can access the same volumes. In a certain sense, a pod is therefore a sort of logical host on which our application containers run. Kubernets assigns IP addresses to pods, and applications running in different pods need to use these IP addresses to communicate via the network.

When a user asks Kubernetes to run a pod, a Kubernetes component called the scheduler identifies a node on which the pod is then brought up. If the pod dies and needs to be restarted, it might be that the pod is restarted on a different node. The same applies if the node goes down, in this case Kubernetes will restart pods running on this container on a different node. There are also other situations in which a pod can be migrated from one node to a different node, for instance for the purpose of scaling. Thus, a pod is a volatile entity, and an application needs to be prepared for this.


Pods themselves are relatively dumb objects. A large part of the logic to handle pods is located in componentes called controllers. As the name suggests, these are components that control and manage pods. There are, for instance, special controllers that are able to autoscale the number of pods or to run a certain number of instances of the same pod – called replicas. In general, pods are not started directly by a Kubernetes user, but controllers are defined that manage the pods life cycle.

In Kubernetes, a volume is also defined in the context of a pod. When the pod ceases to exist, so does the volume. However, a volume can of course be backed by persistent storage, in which case the data on the volume will outlive the pod. On AWS, we could for instance mount an EBS volume as a Kubernetes volume, in which case the data will persist even if we delete the pod, or we could use local storage, in which case the data will be lost when either the pod or the host on which the pod is running (an AWS virtual machine) is terminated.

There are many other objects that are important in Kubernetes that we have not yet discussed – networks, services, policies, secrets and so forth. This post is the first in a series which will cover most of these topics in detail. My current plans are to discuss

Thus, in the next post, we will learn how to set up and configure AWS EKS to see some pods in action (I will start with EKS for the time being, but also cover other platforms like DigitalOcean later on).