Kubernetes storage under the hood part I – ephemeral storage

So far, we have mainly discussed how compute and network resources are used and managed with Kubernetes. We will now turn to the third fundamental element of a container platform – storage.

Docker storage concepts

Before we talk about Kubernetes storage concepts, let us first recall how storage is managed in Docker. The following tests assume that you have a local installation of Docker on a Linux workstation (or virtual machine, of course). As a refresher, you might want to take a look at my introduction to Docker internals before reading on.

First, let us start a Docker container and spawn a shell. The easiest way to do this is to use the busybox image. So let us spin up a busybox container, attached to the terminal, and run mount to inspect its file system.

$ docker run -it busybox
/ # mount
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/BNUAN2FS4K75ZLQFSZVQWXRLUU:/var/lib/docker/overlay2/l/UKEGOK4TLTD2T4XN5KIEPN7JXF,upperdir=/var/lib/docker/overlay2/e8f16f0d705fb8e4677b605796b7319ef3f0226e2ad173b506e13b479afa515f/diff,workdir=/var/lib/docker/overlay2/e8f16f0d705fb8e4677b605796b7319ef3f0226e2ad173b506e13b479afa515f/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
[REDACTED - SOME LINES REMOVED]

Your output will most likely look different, but the general pattern will be the same. The first mount point that we see is the mount for the root directory. On newer Docker versions, this will be an overlay file system managed by the overlay2 storage driver; on older Docker versions, it will be of type aufs. The lowerdir, upperdir and workdir options of this entry point to directories on the host system that are located below /var/lib/docker/, which is managed by Docker. These directories hold the actual content of the root file system: the lower directories contain the read-only image layers, and the upper directory receives everything that the container writes.

The data on that file system is volatile and linked to the lifecycle of the container. To see this, let us create a file in the container's root file system and exit the container.

/ # echo "dancing in the rain" > /franky
/ # exit

This will stop the container, but not remove it. It will still exist and be visible in the output of docker ps -a. If you restart the container and attach to it, you will find that the file is still there and its content has been preserved. If, however, you remove the container using docker rm, the corresponding files on the host file system will be removed and the content of the file system of our container is lost. In that sense, this storage is ephemeral: it survives restarts, but is gone once the container is removed.
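The same lifecycle can be reproduced without an interactive shell. The following is only a minimal sketch, assuming a working Docker installation with access to the busybox image; the container name franky-demo is made up for this example, and the script skips all Docker calls if no docker binary is found.

```shell
#!/bin/sh
# Hypothetical container name, used only in this sketch.
name=franky-demo

lifecycle_demo() {
    # Start a long-running container in the background.
    docker run -d --name "$name" busybox sleep 3600
    # Create a file in the container's root file system.
    docker exec "$name" sh -c 'echo "dancing in the rain" > /franky'
    # Stop and restart the container; the file should still be there.
    docker stop -t 1 "$name"
    docker start "$name"
    docker exec "$name" cat /franky
    # Remove the container; its writable layer is deleted on the host.
    docker rm -f "$name"
    # A fresh container from the same image starts without the file.
    docker run --rm busybox ls /franky || echo "/franky is gone"
}

if command -v docker >/dev/null 2>&1; then
    lifecycle_demo
else
    echo "docker not found, skipping the demo" >&2
fi
```

The sketch uses docker exec instead of an attached terminal so that it can run unattended; the observations are the same as in the interactive session above.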

But Docker can do more, as it also supports persistent storage. One way to get it is to use bind mounts. A bind mount maps a directory or a file from the host file system into the mount namespace of the container and attaches it to a mount point. To see an example, create a temporary directory on your host system and put some data into it. We can then mount this directory into a new Docker container using the -v option.

$ mkdir /tmp/ctr-test/
$ echo "Hello World" > /tmp/ctr-test/hello
$ docker run -v /tmp/ctr-test:/ctr-test/ -it busybox 
/ # cat /ctr-test/hello
Hello World
/ # exit

So the content of the directory /tmp/ctr-test on the host becomes accessible within the container as /ctr-test (of course I could have chosen any other name as well). We can also see this mount point in the output of docker inspect. Use docker ps -a to find out the ID of the busybox container, in my case fd8ef21ba685, and then run

$ docker inspect fd8ef21ba685 --format="{{json .Mounts}}"
[{"Type":"bind","Source":"/tmp/ctr-test","Destination":"/ctr-test","Mode":"","RW":true,"Propagation":"rprivate"}]

So the mount point shows up as a mount point of type bind in the list of mounts of our container.

We remark that Docker also has a more advanced way to mount storage referred to as volumes. In contrast to a bind mount, a volume is an object managed by Docker, backed by files in the Docker controlled directories. Volumes can be created manually or dynamically, can be given a name and can be mounted into a container. As they are objects with an independent lifecycle, they survive the removal of a container and can be mounted into more than one container. However, we will not look deeper into this as (at least to my knowledge) this feature is not used by Kubernetes.

Ephemeral storage in Kubernetes

Now let us try out how things change if we use Kubernetes to spin up our containers.

$ kubectl run -i --tty busybox --image=busybox --restart=Never
/ # mount
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/57X3FQLOLATAMPLAASUMBKHD5W:/var/lib/docker/overlay2/l/Q6HSQOG4NX7UHGVZEPQIZI43KQ,upperdir=/var/lib/docker/overlay2/f183dfd193c86bf43ebca4ae7d08cbfd6e268229f3bd59c70be2f48d9dc0937f/diff,workdir=/var/lib/docker/overlay2/f183dfd193c86bf43ebca4ae7d08cbfd6e268229f3bd59c70be2f48d9dc0937f/work)
[REDACTED]

So we see pretty much the same picture. The root file system of the container is again an overlay file system that is backed by directories managed by Docker on the host on which the pod is running.

If, however, you ssh into the node on which the container is running and do a docker inspect as before, you will find that there are actually a couple of bind mounts that we have not explicitly specified. These bind mounts are added by Kubernetes to give the container access to some configuration data, like the token that can be used to authenticate as a Kubernetes service account, or host name resolution data (an entry for the container's own IP address, for instance). Up to this point, our picture is actually quite simple.

[Figure: KubernetesVolumesI]

Things get a bit more complicated if you want to mount additional volumes. Kubernetes offers a large number of different volume types. Some of these volume types are ephemeral, i.e. the data is potentially lost if the pod dies or is rescheduled to a different node, whereas other types of volumes are persistent. In this post, we focus on ephemeral storage; different strategies to attach persistent storage are discussed in a later post.

Mount ephemeral storage of type emptyDir

In Kubernetes, we can define volumes on the level of an individual pod and attach these volumes to one or more containers that are running in this pod. The volume type we are going to look at first is called emptyDir because, from the point of view of a container, it is exactly that: a directory which is initially empty.

To see this in action, let us look at the following manifest file.

apiVersion: v1
kind: Pod
metadata:
  name: empty-dir-demo
  namespace: default
spec:
  containers:
  - name: empty-dir-demo-ctr
    image: httpd:alpine
    volumeMounts:
      - mountPath: /test
        name: test-volume
  volumes:
  - name: test-volume
    emptyDir: {}

This manifest file defines an individual Pod, as we have seen it before. However, there are a few new elements in this manifest. The Pod specification contains a new field volumes, which is an array of volume objects. Each volume has a name and an additional field which indicates the type of the volume. The documentation lists many volume types; here we are working with a volume of type emptyDir.

In the container specification, we now refer to this volume. This instructs Kubernetes to create the volume and to mount it into this container at the defined mount point. To see this in action, let us apply this manifest file, spawn a shell in the pod that is created and inspect its file system.

$ kubectl apply -f https://raw.githubusercontent.com/christianb93/Kubernetes/master/emptyDir.yaml 
pod/empty-dir-demo created
$ kubectl exec -it empty-dir-demo -- /bin/bash
bash-4.4# mount | grep "test"
/dev/xvda1 on /test type xfs (rw,noatime,attr2,inode64,noquota)

So we see that Kubernetes has actually mounted a new file system onto the mount point /test. To figure out how this is realized, let us take a closer look at the Docker container that has been created. So ssh into the node on which the Pod is running and run the following commands (this assumes that jq is installed on the node, which is the default when using the standard AWS AMI).

$ containerId=$(docker ps | grep "httpd" | awk '{print $1}')
$ docker inspect $containerId | jq -r '.[0].Mounts[]'
{
  "Type": "bind",
  "Source": "/var/lib/kubelet/pods/0914a859-4da2-11e9-931c-06a2d10ef1fe/volumes/kubernetes.io~empty-dir/test-volume",
  "Destination": "/test",
  "Mode": "",
  "RW": true,
  "Propagation": "rprivate"
}
[ ... more output ... ]

So we find that Kubernetes realizes an emptyDir volume as a bind mount, i.e. Kubernetes will create a directory on the node's local file system and use a Docker bind mount to mount this into the container. Of course, this directory will initially be empty (as the name strongly suggests). Let us see what happens if we actually write something onto this file system. The following commands (to be run again on the node on which the Pod is running) extract the directory which is used for the bind mount from the output of docker inspect and list the contents of this directory.

$ dir=$(docker inspect $containerId | jq -r '.[0].Mounts[] | select(.Destination=="/test") | .Source')
$ sudo ls $dir

If you run this now, you will find that the directory is empty. Now switch back to the terminal attached to the Pod and create a file in the /test directory.

bash-4.4# echo "hello" > /test/hello

If you now list the directory's contents again, you will find that a file hello has been created.

Knowing how an emptyDir is implemented now makes it easy to understand the statements on the lifecycle in the Kubernetes documentation. The volume is stored in a directory specific to the Pod, i.e. it is initially created when the Pod is created and removed when the Pod is removed. It survives container restarts, but when the Pod is migrated to a different node, the content will be lost. In that sense, it is ephemeral storage.
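Since the volume is defined on the level of the Pod, it can also be mounted into more than one container of the same Pod. The following sketch writes a manifest for such a Pod to a file. The file name, the Pod name empty-dir-shared-demo, the container names writer and reader and the second mount path /data are all made up for this illustration and are not part of the examples above.

```shell
#!/bin/sh
# Hypothetical file name, chosen for this sketch only.
manifest=/tmp/empty-dir-shared.yaml

# Two containers mount the same emptyDir volume at different paths.
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: empty-dir-shared-demo
  namespace: default
spec:
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "echo hello > /test/hello && sleep 3600"]
    volumeMounts:
      - mountPath: /test
        name: test-volume
  - name: reader
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
      - mountPath: /data
        name: test-volume
  volumes:
  - name: test-volume
    emptyDir: {}
EOF

echo "Manifest written to $manifest"
```

After kubectl apply -f /tmp/empty-dir-shared.yaml, running kubectl exec empty-dir-shared-demo -c reader -- cat /data/hello should show the file written by the other container, as both mount points refer to the same Pod specific directory on the node.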

[Figure: KubernetesVolumesII]

Accessing host-local file systems

We have found that a volume of type emptyDir is nothing but a Docker bind mount to a Pod specific directory managed by Kubernetes. Of course, Kubernetes also offers a way to set up bind mounts to existing directories in the host file system (needless to say that this might be a security risk). This is done using a volume of type hostPath as in the example below.

apiVersion: v1
kind: Pod
metadata:
  name: host-path-demo
  namespace: default
spec:
  containers:
  - name: host-path-demo-ctr
    image: httpd:alpine
    volumeMounts:
      - mountPath: /test
        name: test-volume
  volumes:
  - name: test-volume
    hostPath: 
      path: /etc

When you run this and attach to the resulting Pod, you will find that the content of the directory /test now matches the content of the directory /etc on the host. Using docker inspect again on the node on which the Pod is running, you will find that Kubernetes has created an additional bind mount for the container which links the container's /test directory to the directory /etc on the host. Consequently, the contents of a hostPath volume will survive container restarts but will not be accessible anymore once the Pod is migrated to a different host.
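One way to reduce the security risk mentioned above is to mount the host directory read-only and to let Kubernetes check that the path really is an existing directory. The sketch below writes a variation of the manifest to a file; it uses the readOnly field of a volume mount and the type field of a hostPath volume. The file name and the Pod name host-path-ro-demo are made up for this example.

```shell
#!/bin/sh
# Hypothetical file name, chosen for this sketch only.
manifest=/tmp/host-path-ro.yaml

# Same hostPath volume as before, but mounted read-only and with an
# explicit type, so that the Pod fails to start if /etc is not a directory.
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: host-path-ro-demo
  namespace: default
spec:
  containers:
  - name: host-path-ro-demo-ctr
    image: httpd:alpine
    volumeMounts:
      - mountPath: /test
        name: test-volume
        readOnly: true
  volumes:
  - name: test-volume
    hostPath:
      path: /etc
      type: Directory
EOF

echo "Manifest written to $manifest"
```

With this variation, an attempt to create a file below /test from inside the container should be rejected, while reading the files still works. This limits what a container can do to the host, but it does not remove the risk of exposing sensitive host files.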
