Using Terraform and Ansible to manage your cloud environments

Terraform is one of the most popular open source tools to quickly provision complex cloud based infrastructures, and Ansible is a powerful configuration management tool. Time to see how we can combine them to easily create cloud based infrastructures in a defined state.

Terraform – a quick introduction

This post is not meant to be a full-fledged Terraform tutorial, and there are many sources out there on the web that you can use to get familiar with it quickly (the excellent documentation itself, for instance, which is quite readable and quickly gets to the point). In this section, we will only review a few basic facts about Terraform that we will need.

When using Terraform, you declare the to-be state of your infrastructure in a set of resource definitions using a declarative language known as HCL (HashiCorp Configuration Language). Each resource that you declare has a type, like aws_instance or packet_device, which needs to match a type supported by a Terraform provider, and a name that you will use to reference your resource. Here is an example that defines a Packet.net server.

resource "packet_device" "web1" {
  hostname         = "tf.coreos2"
  plan             = "t1.small.x86"
  facilities       = ["ewr1"]
  operating_system = "coreos_stable"
  billing_cycle    = "hourly"
  project_id       = "${local.project_id}"
}

In addition to resources, there are a few other things that a typical configuration will contain. First, there are providers, which are an abstraction for the actual cloud platform. Each provider enables a certain set of resource types and needs to be declared so that Terraform is able to download the respective plugin. Then, there are variables that you can declare and use in your resource definitions. In addition, Terraform allows you to define data sources, which represent queries against the cloud provider's API to gather certain facts and store them in variables for later use. And there are objects called provisioners that allow you to run certain actions on the local machine or on the machine just provisioned when a resource is created or destroyed.
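
As an example for the last point, here is a minimal sketch (not taken from any of the configurations used later in this post) of a local-exec provisioner attached to a droplet resource. When the droplet is created, Terraform runs the given command on the local machine; the file name provisioned_ips.txt is just an illustration.

resource "digitalocean_droplet" "web" {
  image  = "ubuntu-18-04-x64"
  name   = "web"
  region = "fra1"
  size   = "s-1vcpu-1gb"

  # This command is executed on the machine running Terraform
  # once the droplet has been created
  provisioner "local-exec" {
    command = "echo ${self.ipv4_address} >> provisioned_ips.txt"
  }
}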

An important difference between Terraform and Ansible is that Terraform also maintains a state. Essentially, the state keeps track of the infrastructure and maps the Terraform resources to actual resources on the platform. If, for instance, you define an EC2 instance my-machine as a Terraform resource and ask Terraform to provision that machine, Terraform will capture the EC2 instance ID of this machine and store it as part of its state. This allows Terraform to link the abstract resource to the actual EC2 instance that has been created.

State can be stored locally, but this is not only dangerous, as a locally stored state is not protected against loss, but also makes working in teams difficult, as everybody who uses Terraform to maintain a certain infrastructure needs access to the state. Therefore, Terraform offers backends that allow you to store state remotely, including PostgreSQL, S3, Artifactory, GCS, etcd or a HashiCorp managed service called Terraform Cloud.


A first example – Terraform and DigitalOcean

In this section, we will see how Terraform can be used to provision a droplet on DigitalOcean. First, obviously, we need to install Terraform. Terraform comes as a single binary in a zipped file. Thus installation is very easy. Just navigate to the download page, get the URL for your OS, run the download and unzip the file. For Linux, for instance, this would be

wget https://releases.hashicorp.com/terraform/0.12.10/terraform_0.12.10_linux_amd64.zip
unzip terraform_0.12.10_linux_amd64.zip

Then move the resulting executable somewhere into your path. Next, we will prepare a simple Terraform resource definition. By convention, Terraform expects resource definitions in files with the extension .tf. Thus, let us create a file droplet.tf with the following content.

# The DigitalOcean oauth token. We will set this with the
#  -var="do_token=..." command line option
variable "do_token" {}

# The DigitalOcean provider
provider "digitalocean" {
  token = "${var.do_token}"
}

# Create a droplet
resource "digitalocean_droplet" "droplet0" {
  image  = "ubuntu-18-04-x64"
  name   = "droplet0"
  region = "fra1"
  size   = "s-1vcpu-1gb"
  ssh_keys = []
}

When you run Terraform in a directory, it will pick up all files with that extension that are present in this directory (Terraform treats directories as modules, with a hierarchy starting with the root module). In our example, we reference the DigitalOcean provider, passing the DigitalOcean token as an argument, and then declare one resource of type digitalocean_droplet called droplet0.

Let us try this out. When we use a provider for the first time, we need to initialize this provider, which is done by running (in the directory where your resource definitions are located)

terraform init

This will detect all required providers and download the respective plugins. Next, we can ask Terraform to create a plan of the changes necessary to realize the to-be state described in the resource definition. To see this in action, run

terraform plan -var="do_token=$DO_TOKEN"

Note that we use the command line switch -var to define the value of the variable do_token which we reference in our resource definition to allow Terraform to authorize against the DigitalOcean API. Here, we assume that you have stored the DigitalOcean token in an environment variable called DO_TOKEN.

When we are happy with the output, we can now finally ask Terraform to actually apply the changes. Again, we need to provide the DigitalOcean token. In addition, we also use the switch -auto-approve, otherwise Terraform would ask us for a confirmation before the changes are actually applied.

terraform apply -auto-approve -var="do_token=$DO_TOKEN"

If you have a DigitalOcean console open in parallel, you can see that a droplet is actually being created, and after less than a minute, Terraform will complete and inform you that a new resource has been created.

We have mentioned above that Terraform maintains a state, and it is instructive to inspect this state after a resource has been created. As we have not specified a backend, Terraform will keep the state locally in a file called terraform.tfstate that you can look at manually. Alternatively, you can also use

terraform state pull

You will see an array resources with – in our case – only one entry, representing our droplet. We see the name and the type of the resource as specified in the droplet.tf file, followed by a list of instances that contains the (provider specific) details of the actual instances that Terraform has stored.

When we are done, we can also use Terraform to destroy our resources again. Simply run

terraform destroy -auto-approve -var="do_token=$DO_TOKEN"

and watch how Terraform brings down the droplet again (and removes it from the state).

Now let us improve our simple resource definition a bit. Our setup so far did not provision any SSH keys, so to log into the machine, we would have to use the SSH password that DigitalOcean will have mailed you. To avoid this, we need to specify an SSH key name and to get the corresponding SSH key ID. First, we define a Terraform variable that will contain the key name and add it to our droplet.tf file.

variable "ssh_key_name" {
  type = string
  default = "do-default-key"
}

This definition has three parts. Following the keyword variable, we first specify the variable name. We then tell Terraform that this is a string, and we provide a default. As before, we could override this default using the switch -var on the command line.

Next, we need to retrieve the ID of the key. The DigitalOcean plugin provides the required functionality as a data source. To use it, add the following to our resource definition file.

data "digitalocean_ssh_key" "ssh_key_data" {
  name = "${var.ssh_key_name}"
}

Here, we define a data source called ssh_key_data. The only argument to that data source is the name of the SSH key. To provide it, we use the template mechanism that Terraform provides to expand variables inside a string to refer to our previously defined variable.

The data source will then use the DigitalOcean API to retrieve the key details and store them in memory. We can now access the SSH key to add it to the droplet resource definition as follows.

# Create a droplet
resource "digitalocean_droplet" "droplet0" {
  ...
  ssh_keys = [ data.digitalocean_ssh_key.ssh_key_data.id ]
}

Note that the variable reference consists of the keyword data to indicate that it has been populated by a data source, followed by the type and the name of the data source and finally the attribute that we refer to. If you now run the example again, we will get a machine with an SSH key so that we can access it via SSH as usual (run terraform state pull to get the IP address).

Having to extract the IP manually from the state file is not nice; it would be much more convenient if we could ask Terraform to somehow provide this as an output. So let us add the following section to our file.

output "instance_ip_addr" {
  value = digitalocean_droplet.droplet0.ipv4_address
}

Here, we define an output called instance_ip_addr and populate it by referring to a data item which is exported by the DigitalOcean plugin. Each resource type has its own list of exported attributes, which you can find in the documentation, and you can only refer to attributes on that list. If we now run Terraform again, it will print the output upon completion.

We can also create more than one instance at a time by adding the keyword count to the resource definition. When defining the output, we will now have to refer to the instances as an array, using the splat expression syntax to refer to an entire array. It is also advisable to move the entire instance configuration into variables and to split out variable definitions into a separate file.
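
As a sketch of what this could look like (the variable name and its default are assumptions and not taken from the repository):

variable "machine_count" {
  type    = number
  default = 2
}

resource "digitalocean_droplet" "droplet" {
  count  = var.machine_count
  image  = "ubuntu-18-04-x64"
  name   = "droplet${count.index}"
  region = "fra1"
  size   = "s-1vcpu-1gb"
}

# Splat expression: collect the IP addresses of all instances
output "instance_ip_addrs" {
  value = digitalocean_droplet.droplet[*].ipv4_address
}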

Using Terraform with a non-local state

Using Terraform with a local state can be problematic – it could easily get lost or overwritten and makes working in a team or at different locations difficult. Therefore, let us quickly look at an example to set up Terraform with non-local state. Among the many available backends, we will use PostgreSQL.

Thanks to Docker, it is very easy to bring up a local PostgreSQL instance. Simply run

docker run --name tf-backend -e POSTGRES_PASSWORD=my-secret-password -d postgres

Note that we do not map the PostgreSQL port here for security reasons, so we will need to figure out the IP of the resulting Docker container and use it in the Terraform configuration. The IP can be retrieved with the following command.

docker inspect tf-backend | jq -r '.[0].NetworkSettings.IPAddress'

In my example, the Docker container is running at 172.17.0.2. Next, we will have to create a database for the Terraform backend. Assuming that you have the required client packages installed, run

createdb --host=172.17.0.2 --username=postgres terraform_backend

and enter the password my-secret-password specified above when starting the Docker container. We will use the PostgreSQL superuser for our example; in a real-life setup, you would of course first create a dedicated user for Terraform and use it going forward.

Next, we add the backend definition to our Terraform configuration using the following section.

# The backend - we use PostgreSQL.
terraform {
  backend "pg" {
  }
}

You might expect that we also need to provide the location of the database (i.e. a connection string) and credentials. In fact, we could do this, but this would imply that the credentials which are part of the connection string would be present in the file in clear text. We therefore use a partial configuration and later supply the missing data on the command line.

Finally, we have to re-run terraform init to make the new backend known to Terraform and create a new workspace. Terraform will detect the change and even offer to automatically migrate the state into the new backend.

terraform init -backend-config="conn_str=postgres://postgres:my-secret-password@172.17.0.2/terraform_backend?sslmode=disable"
terraform workspace new default

Be careful – as our database is running in a container with ephemeral storage, the state will be lost when we destroy the container! In our case, this would not be a disaster, as we would still be able to control our small playground environment manually. Still, it is a good idea to tag all your servers so that you can easily identify the servers managed by Terraform. If you intend to use Terraform more often, you might also want to spin up a local PostgreSQL server outside of a container (here is a short Ansible script that will do this for you and create a default user terraform with a password equal to the user name). Note that with a non-local state, Terraform will still store a reference to your state locally (look at the directory .terraform), so when you work from a different directory, you will have to run the init command again. Also note that this will store your database credentials locally on disk in clear text.

Combining Terraform and Ansible

After this very short introduction into Terraform, let us now discuss how we can combine Terraform to manage our cloud environment with Ansible to manage the machines. Of course there are many different options, and I have not even tried to create an exhaustive list. Here are the options that I did consider.

First, you could operate Terraform and Ansible independently and use a dynamic inventory. Thus, you would have some Terraform templates and some Ansible playbooks and, when you need to verify or update your infrastructure, first run Terraform and then Ansible. The Ansible playbooks would use provider-specific inventory scripts to build a dynamic inventory and operate on this. This setup is simple and allows you to manage the Terraform configuration and your playbooks without any direct dependencies.

However, there is a potential loss of information when working with inventory scripts. Suppose, for instance, that you want to bring up a configuration with two web servers and two database servers. In a Terraform template, you would then typically have a resource “web” with two instances and a resource “db” with two instances. Correspondingly, you would want groups “web” and “db” in your inventory. If you use inventory scripts, there is no direct link between Terraform resources and cloud resources, and you would have to use tags to provide that link. Also, things easily get a bit more complicated if your configuration uses more than one cloud provider. Still, this is a robust method that seems to have some practical uses.

A second potential approach is to use Terraform provisioners to run Ansible scripts on each provisioned resource. If you decide to go down this road, keep in mind that provisioners only run when a resource is created or destroyed, not when it changes. If, for instance, you change your playbooks and simply run Terraform again, it will not trigger any provisioners and the changed playbooks will not apply. There are approaches to deal with this, for instance null resources, but this is difficult to control and the Terraform documentation itself does not advocate the use of provisioners.

Next, you could think about parsing the Terraform state. Your primary workflow would be coded into an Ansible playbook. When you execute this playbook, there is a play or task which reads out the Terraform state and uses this information to build a dynamic inventory with the add_host module. There are a couple of Python scripts out there that do exactly this. Unfortunately, the structure of the state is still provider specific: a DigitalOcean droplet is stored in a structure that is different from the structure used for an AWS EC2 instance. Thus, at least the script that you use for parsing the state is still provider specific. And of course there are potential race conditions when you parse the state while someone else might modify it, so you have to think about locking the state while working with it.

Finally, you could combine Terraform and Ansible with a tool like Packer to create immutable infrastructures. With this approach, you would use Packer to create images, supported maybe by Ansible as a provisioner. You would then use Terraform to bring up an infrastructure using this image, and would try to restrict the in-place configurations to an absolute minimum.

The approach that I have explored for this post is not to parse the state, but to trigger Terraform from Ansible and to parse the Terraform output. Thus our playbook will work as follows.

  • Use the Terraform module that comes with Ansible to run a Terraform template that describes the infrastructure
  • Within the Terraform template, create an output that contains the inventory data in a provider independent JSON format
  • Back in Ansible, parse that output and use add_host to build a corresponding dynamic inventory
  • Continue with the actual provisioning tasks per server in the playbook

Here, the provider specific part is hidden in the Terraform template (which is already provider specific by design), where the data passed back to Ansible is assembled, and the Ansible playbook at least has a chance to be provider agnostic.

Similar to a dynamic inventory script, we have to reach out to the cloud provider once to get the current state. In fact, behind the scenes, the Terraform module will always run a terraform plan and then check its output to see whether there is any change. If there is a change, it will run terraform apply and return its output, otherwise it will fall back to the output from the last run stored in the state by running terraform output. This also implies that there is again a certain danger of running into race conditions if someone else runs Terraform in parallel, as we release the lock on the Terraform state once this phase has completed.

Note that this approach has a few consequences. First, there is a structural coupling between Terraform and Ansible. Whoever is coding the Terraform part needs to prepare an output in a defined structure so that Ansible is happy. Also, the user running Ansible needs to have the necessary credentials to run Terraform and access the Terraform state. In addition, Terraform will be invoked every time you run the Ansible playbook, which, especially for larger infrastructures, slows down the processing a bit (looking at the source code of the Terraform module, it would probably be easy to add a switch so that only terraform output is run, which should provide a significant speedup).
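
To make this a bit more concrete, here is a minimal sketch of how the Terraform module can be invoked from a play (the project path and the name of the registered variable are assumptions; the playbook in the repository will differ in the details):

- name: Apply Terraform configuration and capture its output
  terraform:
    project_path: "./terraform"
    state: present
  register: tf_result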

Getting and running the sample code

Let us now take a look at a sample implementation of this idea which I have uploaded into my GitHub repository. The code is organized into a collection of Terraform modules and Ansible roles. The example will bring up two web servers on DigitalOcean and two database servers on AWS EC2.

Let us first discuss the Terraform part. There are two modules involved, one module that will create a droplet on DigitalOcean and one module that will create an EC2 instance. Each module returns, as an output, a data structure that contains one entry for each server, and each entry contains the inventory data for that server, for instance

inventory = [
  {
    "ansible_ssh_user" = "root"
    "groups" = "['web']"
    "ip" = "46.101.162.72"
    "name" = "web0"
    "private_key_file" = "~/.ssh/do-default-key"
  },
...

This structure can easily be assembled using Terraform's for-expressions. Assuming that you have defined a DigitalOcean resource for your droplets, for instance, the corresponding code for DigitalOcean is

output "inventory" {
  value = [for s in digitalocean_droplet.droplet[*] : {
    # the Ansible groups to which we will assign the server
    "groups"           : var.ansibleGroups,
    "name"             : "${s.name}",
    "ip"               : "${s.ipv4_address}",
    "ansible_ssh_user" : "root",
    "private_key_file" : "${var.do_ssh_private_key_file}"
  } ]
}

The EC2 module is structured similarly and delivers output in the same format (note that there is no name exported by the AWS provider, but we can use a tag to capture and access the name).

In the main.tf file, we then simply invoke each of the modules, concatenate their outputs and return this as Terraform output.
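
A sketch of what this could look like (the module names are assumptions and will differ from the actual repository):

output "inventory" {
  value = concat(module.droplet.inventory, module.ec2.inventory)
}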

This output is then processed by the Ansible role terraform. Here, we simply iterate over the output (which is in JSON format and therefore recognized by Ansible as a data structure, not just a string), and for each item, we create a corresponding entry in the dynamic inventory using add_host.
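
In essence, the role does something along the following lines (again a sketch, assuming the Terraform run was registered as tf_result as in the snippet above; the actual role in the repository may differ):

- name: Add provisioned hosts to the dynamic inventory
  add_host:
    name: "{{ item.name }}"
    ansible_ssh_host: "{{ item.ip }}"
    ansible_ssh_user: "{{ item.ansible_ssh_user }}"
    ansible_ssh_private_key_file: "{{ item.private_key_file }}"
    groups: "{{ item.groups }}"
  loop: "{{ tf_result.outputs.inventory.value }}"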

The other roles in the repository are straightforward, maybe with one exception – the role sshConfig, which will add entries for all provisioned hosts to an SSH configuration file ~/.ssh/ansible_created_config and include this file in the user's main SSH configuration file. This is not necessary for things to work, but will allow you to simply type something like

ssh db0

to access the first database server, without having to specify IP address, user and private key file.

A note on SSH. When I played with this, I hit upon this problem with the Gnome keyring daemon. The daemon adds identities to your SSH agent, which are attempted every time Ansible tries to log into one of the machines. If you work with more than a few SSH identities, you will therefore exceed the number of failed SSH login attempts configured in the Ubuntu image and will not be able to connect any more. The solution for me was to disable the Gnome keyring at startup.

In order to run the sample, you will first have to prepare a few credentials. We assume that

  • You have AWS credentials set up to access the AWS API
  • The DigitalOcean API key is stored in the environment variable DO_TOKEN
  • There is a private key file ~/.ssh/ec2-default-key.pem matching an SSH key on EC2 called ec2-default-key
  • Similarly, there is a private key file ~/.ssh/do-default-key that belongs to a public key uploaded to DigitalOcean as do-default-key
  • There is a private / public SSH key pair ~/.ssh/default-user-key[.pub] which we will use to set up an SSH enabled default user on each machine

We also assume that you have a Terraform backend up and running and have run terraform init to connect to this backend.

To run the examples, you can then use the following steps.

git clone https://github.com/christianb93/ansible-samples
cd ansible-samples/terraform
terraform init
ansible-playbook site.yaml

The extra invocation of terraform init is required because we introduce two new modules that Terraform needs to pre-load, and because you might not have the provider plugins for DigitalOcean and AWS on your machine. If everything works, you will have a fully provisioned environment up and running in a few minutes, with two servers on DigitalOcean running Apache2 and two servers on EC2 running PostgreSQL. If you are done, do not forget to shut down everything again by running

terraform destroy -var="do_token=$DO_TOKEN"

Also, I highly recommend using the DigitalOcean and AWS consoles to manually check that no servers are left running, to avoid unnecessary cost and to be prepared for the case that your state is out of sync.

Automating provisioning with Ansible – building cloud environments

When you browse the module list of Ansible, you will find that by far the largest section is the list of cloud modules, i.e. modules to interact with cloud environments to define, provision and maintain objects like virtual machines, networks, firewalls or storage. These modules make it easy to access the APIs of virtually every existing cloud provider. In this post, we will look at an example that demonstrates the usage of these modules for two cloud platforms – DigitalOcean and AWS.

Using Ansible with DigitalOcean

Ansible comes with several modules that are able to connect to the DigitalOcean API to provision machines, manage keys, load balancers, snapshots and so forth. Here, we will need two modules – the module digital_ocean_droplet to deal with individual droplets, and digital_ocean_sshkey_facts to retrieve information on SSH keys.

Before getting our hands dirty, we need to talk about credentials first. There are three types of credentials involved in what follows.

  • To connect to the DigitalOcean API, we need to identify ourselves with a token. Such a token can be created on the DigitalOcean cloud console and needs to be stored safely. We do not want to put this into a file, but instead assume that you have set up an environment variable DO_TOKEN containing its value and access that variable.
  • The DigitalOcean SSH key. This is the key that we create once and upload (its public part, of course) to the DigitalOcean website. When we provision our machines later, we pass the name of the key as a parameter and DigitalOcean will automatically add this public key to the authorized_keys file of the root user so that we can access the machine via SSH. We will also use this key to allow Ansible to connect to our machines.
  • The user SSH key. As in previous examples, we will, in addition, create a default user on each machine. We need to create a key pair and also distribute the public key to all machines.

Let us first create these keys, starting with the DigitalOcean SSH key. We will use the key file name do-default-key and store the key in ~/.ssh on the control machine. So run

ssh-keygen -b 2048 -t rsa -P "" -f ~/.ssh/do-default-key
cat ~/.ssh/do-default-key.pub

to create the key and print out the public key.

Then navigate to the DigitalOcean security console, hit “Add SSH key”, enter the public key, enter the name “do-default-key” and save the key.

Next, we create the user key. Again, we use ssh-keygen as above, but this time use a different file name. I use ~/.ssh/default-user-key. There is no need to upload this key to DigitalOcean, but we will use it later in our playbook for the default user that we create on each machine.

Finally, make sure that you have a valid DigitalOcean API token (if not, go to this page to get one) and put the value of this key into an environment variable DO_TOKEN.

We are now ready to write the playbook to create droplets. Of course, we place this into a role to enable reuse. I have called this role droplet and you can find its code here.

I will not go through the entire code (which is documented), but just mention some of the more interesting points. First, when we want to use the DigitalOcean module to create a droplet, we have to pass the SSH key ID. This is a number, which is different from the key name that we specified while doing the upload. Therefore, we first have to retrieve the key ID. For that purpose, we retrieve a list of all keys by calling the module digital_ocean_sshkey_facts which will write its results into the ansible_facts dictionary.

- name: Get available SSH keys
  digital_ocean_sshkey_facts:
    oauth_token: "{{ apiToken }}"

We then need to navigate this structure to get the ID of the key we want to use. We do this by preparing a JSON query, which we put together in several steps to avoid issues with quoting. First, we apply the query string [?name=='{{ doKeyName }}'], where doKeyName is the variable that holds the name of the DigitalOcean SSH key, to navigate to the entry in the key list representing this key. The result will still be a list from which we extract the first (and typically only) element and finally access its attribute id holding the key ID.

- name: Prepare JSON query string
  set_fact:
    jsonQueryString: "[?name=='{{ doKeyName }}']"
- name: Apply query string to extract matching key data
  set_fact:
    keyData: "{{ ansible_facts['ssh_keys']|json_query(jsonQueryString) }}"
- name: Get keyId
  set_fact:
    keyId: "{{ keyData[0]['id'] }}"

Once we have that, we can use the DigitalOcean Droplet module to actually bring up the droplet. This is rather straightforward, using the following code snippet.

- name: Bring up or stop machines
  digital_ocean_droplet:
    oauth_token: "{{ apiToken }}"
    image: "{{osImage}}"
    region_id: "{{regionID}}"
    size_id: "{{sizeID}}"
    ssh_keys: [ "{{ keyId }}" ]
    state: "{{ targetState }}"
    unique_name: yes
    name: "droplet{{ item }}"
    wait: yes
    wait_timeout: 240
    tags:
    - "createdByAnsible"
  loop:  "{{ range(0, machineCount|int )|list }}"
  register: machineInfo

Here, we go through a loop, with the number of iterations being governed by the value of the variable machineCount, and, in each iteration, bring up one machine. The name of this machine is a fixed part plus the loop variable, i.e. droplet0, droplet1 and so forth. Note the usage of the parameter unique_name. This parameter instructs the module to check that the supplied name is unique. So if there is already a machine with that name, we skip its creation – this is what makes the task idempotent!

Once the machines are up, we finally loop over all provisioned machines once more (using the variable machineInfo) and complete two additional steps for each machine. First, we use add_host to add the host to the inventory – this makes it possible to use the hosts later in the playbook. For convenience, we also add an entry for each host to the local SSH configuration, which allows us to log into the machine using a shorthand like ssh droplet0.

Once the role droplet has completed, our main playbook will start a second and a third play. These plays now use the inventory that we have just built dynamically. The first of them is quite simple and just waits until all machines are reachable. This is necessary, as DigitalOcean reports a machine as up and running at a point in time when the SSH daemon might not yet accept connections. The next and last play simply executes two roles. The first role is the role that we have already put together in an earlier post which simply adds a standard non-root user, and the second role simply uses apt and pip to install a few basic packages.
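
The waiting play can be as simple as the following sketch (the group name droplets is an assumption and depends on how the hosts were added to the inventory):

- name: Wait until all machines are reachable
  hosts: droplets
  gather_facts: no
  tasks:
    - name: Wait for the SSH connection to become available
      wait_for_connection:
        timeout: 240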

A few words on security are in order. Here, we use the DigitalOcean standard setup, which gives you a machine with a public IP address directly connected to the Internet. No firewall is in place, and all ports on the machine are reachable from everywhere. This is obviously not a setup that I would recommend for anything else than a playground. As a starting point, we could bring up a local firewall on the machine using the ufw module for Ansible, and then you would probably continue with some basic hardening measures before installing anything else (which, of course, could be automated using Ansible as well).
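
As an illustration of the first step, a sketch using the ufw module (not part of the example playbook) could look like this:

- name: Allow incoming SSH
  ufw:
    rule: allow
    port: '22'
    proto: tcp
- name: Enable the firewall with a default deny policy for incoming traffic
  ufw:
    state: enabled
    policy: deny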

Using Ansible with AWS EC2

Let us now see what we need to change when trying the same thing with AWS EC2 instead of DigitalOcean. First, of course, the authorization changes a bit.

  • The Ansible EC2 module uses the Python Boto library behind the scenes which is able to reuse the credentials of an existing AWS CLI configuration. So if you follow my setup instructions in an earlier post on using Python with AWS, no further setup is required
  • Again, there is an SSH key that AWS will install on each machine. To create a key, navigate to the AWS EC2 console, select “Key pairs” under “Network and Security” in the navigation bar on the left, create a key called ec2-default-key, copy the resulting PEM file to ~/.ssh/ and set the access rights 0700.
  • The user SSH key – we can use the same key as before

As the EC2 module uses the Boto3 Python library, you might have to install it first using pip or pip3.

Let us now go through the steps that we have to complete to spin up our servers and see how we realize idempotency. First, to create a server on EC2, we need to specify an AMI ID. AMI-IDs change with every new version of an AMI, and typically, you want to install the latest version. Thus we first have to figure out the latest AMI ID. Here is a snippet that will do this.

- name: Locate latest AMI
  ec2_ami_facts:
    region: "eu-central-1"
    owners: 099720109477
    filters:
      name: "ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-????????"
      state: "available"
  register: allAMIs
- name: Sort by creation date to find the latest AMI
  set_fact:
    latestAMI: "{{ allAMIs.images | sort(attribute='creation_date') | last }}"

Here, we first use the module ec2_ami_facts to get a list of all AMIs matching a certain expression – in this case, we look for available AMIs for Ubuntu 18.04 for EBS-backed machines with the owner 099720109477 (Ubuntu). We then use a Jinja2 template to sort by creation date and pick the latest entry. The result will be a dictionary, which, among other things, contains the AMI ID as image_id.

Now, as in the DigitalOcean example, we can start the actual provisioning step using the ec2 module. Here is the respective task.

- name: Provision machine
  ec2:
    exact_count: "{{ machineCount }}"
    count_tag: "managedByAnsible"
    image: "{{ latestAMI.image_id }}"
    instance_tags:
      managedByAnsible: true
    instance_type: "{{instanceType}}"
    key_name: "{{ ec2KeyName }}"
    region: "{{ ec2Region }}"
    wait: yes
  register: ec2Result

Idempotency is realized differently on EC2. Instead of using the server name (which Amazon will assign randomly), the module uses tags. When a machine is brought up, the tags specified in the attribute instance_tags are attached to the new instance. The attribute exact_count instructs the module to make sure that exactly this number of instances with this tag is running. Thus if you run the playbook twice, the module will first query all machines with the specified tag, figure out that there is already an instance running and will not create any additional machines.

This simple setup will provision all new machines into the default VPC and use the default security group. This has the disadvantage that in order for Ansible to be able to reach these machines, you will have to open port 22 for incoming connections in the attached security group. The playbook does not do this, but expects that you do this manually, using for instance the EC2 console. Again, this is of course not a production-like setup.
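
If you wanted to automate this step as well, a task along the following lines could do it (a sketch only; it modifies the default security group of the default VPC, and purge_rules is set to no so that existing rules are not removed):

- name: Allow incoming SSH in the default security group
  ec2_group:
    name: default
    description: default VPC security group
    region: "{{ ec2Region }}"
    purge_rules: no
    rules:
      - proto: tcp
        from_port: 22
        to_port: 22
        cidr_ip: 0.0.0.0/0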

Running the examples

I have put the full playbooks with all the required roles into my GitHub repository. To run the examples, carry out the following steps.

First, prepare the SSH keys and API tokens for EC2 and DigitalOcean as explained in the respective sections. Then, use the EC2 console to allow incoming traffic on port 22 from your local workstation in the default VPC, into which our playbook will launch the machines created on EC2. Next, run the following commands to clone the repository and execute the playbooks (if you have not used the SSH key names and locations as above, you will first have to edit the default variables in the roles ec2Instance and droplet accordingly).

git clone https://github.com/christianb93/ansible-samples
cd ansible-samples/partVII
# Turn off strict host key checking
export ANSIBLE_HOST_KEY_CHECKING=False
# Run the EC2 example
ansible-playbook awsSite.yaml
# Run the DigitalOcean example
ansible-playbook doSite.yaml

Once the playbooks complete, you should now be able to ssh into the machine created on DigitalOcean using ssh droplet0 and into the machine created on EC2 using ssh aws0. Do not forget to delete all machines again when you are done to avoid unnecessary cost!

Limitations of our approach

The approach we have followed so far to manage our cloud environments is simple, but has a few drawbacks and limitations. First, the provisioning is done on the localhost in a loop, and therefore sequentially. Even if the provisioning of one machine only takes one or two minutes, this implies that the approach does not scale, as bringing up a large number of machines would take hours. One approach to fix this that I have seen on Stack Exchange (unfortunately I cannot find the post any more, so I cannot give proper credit) is to use a static inventory file with a dummy group containing only the host names, in combination with connection: local, to execute the provisioning locally, but in parallel.

The second problem is more severe. Suppose, for instance, you have already created a droplet and now decide that you want 2GB of memory instead. So you might be tempted to go back, edit (or override) the variable sizeID and re-run the playbook. This will not give you any error message, but in fact, it will not do anything. The reason is that the module will only use the host name to verify that the target state has been reached, not any other attributes of the droplet (take a look at the module source code here to verify this). Thus, we have a successful playbook execution, but a deviation between the target state coded into the playbook and the real state.

Of course this could be fixed by a more sophisticated check – either by extending the module, or by coding this check into the playbook. Alternatively, we could try to leverage an existing solution that has this logic already implemented. For AWS, for instance, you could define your target state using CloudFormation templates and then leverage the CloudFormation Ansible module to apply this from within Ansible. Or, if you want to be less provider specific, you might want to give Terraform a try. This is what I decided to do, and we will dive into more detail on how to integrate Terraform and Ansible in the next post.

Automating provisioning with Ansible – control structures and roles

When building more sophisticated Ansible playbooks, you will find that a linear execution is not always sufficient, and the need for more advanced control structures and ways to organize your playbooks arises. In this post, we will cover the corresponding mechanisms that Ansible has up its sleeves.

Loops

Ansible allows you to build loops that execute a task more than once with different parameters. As a simple example, consider the following playbook.

---
  - hosts: localhost
    connection: local
    tasks:
      - name: Loop with a static list of items
        debug:
          msg: "{{ item }}"
        loop:
        - A
        - B

This playbook contains only one play. In the play, we first limit the execution to localhost and then use the connection keyword to instruct Ansible not to use ssh but execute commands directly to speed up execution (which of course only works for localhost). Note that more generally, the connection keyword specifies a so-called connection plugin of which Ansible offers a few more than just ssh and local.

The next task has an additional keyword loop. The value of this attribute is a list in YAML format, having the elements A and B. The loop keyword instructs Ansible to run this task twice per host. For each execution, it will assign the corresponding value of the loop variable to the built-in variable item so that it can be evaluated within the loop.

If you run this playbook, you will get the following output.

PLAY [localhost] *************************************
TASK [Gathering Facts] *******************************
ok: [localhost]

TASK [Loop with a static list of items] **************
ok: [localhost] => (item=A) => {
    "msg": "A"
}
ok: [localhost] => (item=B) => {
    "msg": "B"
}

So we see that even though there is only one host, the loop body has been executed twice, and within each iteration, the expression {{ item }} evaluates to the corresponding item in the list over which we loop.

Loops are most useful in combination with Jinja2 expressions. In fact, you can iterate over everything that evaluates to a list. The following example demonstrates this. Here, we loop over all environment variables. To do this, we use Jinja2 and Ansible facts to get a dictionary with all environment variables, then convert this to a list using the items() method and use this list as the argument to loop.

---
  - hosts: localhost
    connection: local
    tasks:
      - name: Loop over all environment variables 
        debug:
          msg: "{{ item.0 }}"
        loop:
         "{{ ansible_facts.env.items() | list }}"
        loop_control:
          label:
            "{{ item.0 }}"

This task also demonstrates loop controls. This is just a set of attributes that we can specify to control the behaviour of the loop. In this case, we set the label attribute which specifies the output that Ansible prints out for each loop iteration. The default is to print the entire item, but this can quickly get messy if your item is a complex data structure.

You might ask yourself why we use the list filter in the Jinja2 expression – after all, the items() method should return a list. In fact, this is only true for Python 2. I found that in Python 3, this method returns something called a dictionary view, which we then need to convert into a list using the filter. This version will work with both Python 2 and Python 3.

Also note that when you use loops in combination with register, the registered variable will be populated with the results of all loop iterations. To this end, Ansible will populate the variable to which the register refers with a list results. This list will contain one entry for each loop iteration, and each entry will contain the item and the module-specific output. To see this in action, run my example playbook and have a look at its output.
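
If you do not want to run the example from the repository right away, here is a minimal sketch along the same lines that shows the effect:

---
  - hosts: localhost
    connection: local
    tasks:
      - name: Loop and register the results
        debug:
          msg: "{{ item }}"
        loop:
        - A
        - B
        register: loopOutput
      - name: Inspect the results array populated by the loop
        debug:
          var: loopOutput.results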

Conditions

In addition to loops, Ansible allows you to execute specific tasks only if a certain condition is met. You could, for instance, have a task that populates a variable, and then execute a subsequent task only if this variable has a certain value.

The below example is a bit artificial, but demonstrates this nicely. In a first task, we create a random value, either 0 or 1, using Jinja2 templating. Then, we execute a second task only if this value is equal to one.

---
  - hosts: localhost
    connection: local
    tasks:
      - name: Populate variable with random value
        set_fact:
          randomVar: "{{ range(0, 2) | random }}"
      - name: Print value
        debug:
          var: randomVar
      - name: Execute additional statement if randomVar is 1
        debug:
          msg: "Variable is equal to one"
        when:
          randomVar == "1"

To create our random value, we combine the Jinja2 range method with the random filter which picks a random element from a list. Note that the result will be a string, either “0” or “1”. In the last task within this play, we then use the when keyword to restrict execution based on a condition. This condition is in Jinja2 syntax (without the curly braces, which Ansible will add automatically) and can be any valid expression which evaluates to either true or false. If the condition evaluates to false, then the task will be skipped (for this host).

Care needs to be taken when combining loops with conditional execution. Let us take a look at the following example.

---
  - hosts: localhost
    connection: local
    tasks:
      - name: Combine loop with when
        debug:
          msg: "{{ item }}"
        loop:
          "{{ range (0, 3)|list }}"
        when:
          item == 1
        register:
          loopOutput
      - debug:
          var: loopOutput

Here, the condition specified with when is evaluated once for each loop iteration, and if it evaluates to false, the iteration is skipped. However, the results array in the output still contains an item for this iteration. When working with the output, you might want to evaluate the skipped attribute for each item, which will tell you whether the loop iteration was actually executed (be careful when accessing this, as it is not present for those items that were not skipped).
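
Continuing the playbook above, a sketch of how to work with this attribute safely would be to loop over the results and only process items that were not skipped:

      - name: Show only the iterations that were actually executed
        debug:
          msg: "{{ item.item }}"
        loop: "{{ loopOutput.results }}"
        when: item.skipped is not defined
        loop_control:
          label: "{{ item.item }}"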

Handlers

Handlers provide a different approach to conditional execution. Suppose, for instance, you have a sequence of tasks that each update the configuration file of a web server. When this actually resulted in a change, you want to restart the web server. To achieve this, you can define a handler.

Handlers are similar to tasks, with the difference that in order to be executed, they need to be notified. The actual execution is queued until the end of the playbook, and even if the handler is notified more than once, it is only executed once (per host). To instruct Ansible to notify a handler, we can use the notify attribute for the respective task. Ansible will, however, only notify the handler if the task results in a change. To illustrate this, let us take a look at the following playbook.

---
  - hosts: localhost
    connection: local
    tasks:
    - name: Create empty file
      copy:
        content: ""
        dest: test
        force: no
      notify: handle new file
    handlers:
    - name: handle new file
      debug:
        msg: "Handler invoked"

This playbook defines one task and one handler. Handlers are defined in a separate section of the playbook which contains a list of handlers, similarly to tasks. They are structured as tasks, having a name and an action. To notify a handler, we add the notify attribute to a task, using exactly the name of the handler as argument.

In our example, the first (and only) task of the play will create an empty file test, but only if the file is not yet present. Thus if you run this playbook for the first time (or manually remove the file), this task will result in a change. If you run the playbook once more, it will not result in a change.

Correspondingly, upon the first execution, the handler will be notified and execute at the end of the play. For subsequent executions, the handler will not run.

Handler names should be unique, so you cannot use handler names to make a handler listen for different events. There is, however, a way to decouple handlers and notifications using topics. To make a handler listen for a topic, we can add the listen keyword to the handler, and use the name of the topic instead of the handler name when defining the notification, see the documentation for details.
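
A minimal sketch of this mechanism (the file name and the topic name config changed are made up for this example) could look as follows. Both handlers listen for the same topic and will be notified when the copy task results in a change.

---
  - hosts: localhost
    connection: local
    tasks:
    - name: Update a configuration file
      copy:
        content: "setting=1"
        dest: /tmp/example.conf
      notify: config changed
    handlers:
    - name: Restart first service
      debug:
        msg: "First service restarted"
      listen: config changed
    - name: Restart second service
      debug:
        msg: "Second service restarted"
      listen: config changed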

Handlers can be problematic, as they are change triggered, not state triggered. If, for example, a task results in a change and notifies a handler, but the playbook fails in a later task, the trigger is lost. When you run the playbook again, the task will most likely not result in a change any more, and no additional notification is created. Effectively, the handler never gets executed. There is, however, a flag --force-handlers to force execution of handlers even if the playbook fails.

Using roles and includes

Playbooks are a great way to automate provisioning steps, but for larger projects, they easily get a bit messy and difficult to maintain. To better organize larger projects, Ansible has the concept of a role.

Essentially, roles are snippets of tasks, variables and handlers that are organized as reusable units. Suppose, for instance, that in all your playbooks, you do a similar initial setup for all machines – create a user, distribute SSH keys and so forth. Instead of repeating the same tasks and the variables to which they refer in all your playbooks, you can organize them as a role.

What makes roles a bit more complex to use initially is that roles are backed by a defined directory structure that we need to understand first. Specifically, Ansible will look for roles in a subdirectory called (surprise) roles in the directory in which the playbook is located (it is also possible to maintain system-wide roles in /etc/ansible/roles, but it is obviously much harder to put these roles under version control). In this subdirectory, Ansible expects a directory for each role, named after the role.

Within this subdirectory, each type of object defined by the role is placed in a separate subdirectory. Each of these subdirectories contains a file main.yml that defines these objects. Not all of these directories need to be present. The most important ones, which most roles will include, are (see the Ansible documentation for a full list):

  • tasks, holding the tasks that make up the role
  • defaults that define default values for the variables referenced by these tasks
  • vars containing additional variable definitions
  • meta containing information on the role itself, see below

To initialize this structure, you can either create these directories manually, or you run

cd roles
ansible-galaxy init <role name>

to let Ansible create a skeleton for you. The tool that we use here – ansible-galaxy – is actually part of an infrastructure that is used by the community to maintain a set of reusable roles hosted centrally. We will not use Ansible Galaxy here, but you might want to take a look at the Galaxy website to get an idea.

So suppose that we want to restructure the example playbook that we used in one of my last posts to bring up a Docker container as an Ansible test environment using roles. This playbook essentially has two parts. In the first part, we create the Docker container and add it to the inventory. In the second part, we create a default user within the container with a given SSH key.

Thus it makes sense to re-structure the playbook using two roles. The first role would be called createContainer, the second defaultUser. Our directory structure would then look as follows.

(Screenshot of the resulting directory layout: docker.yaml at the top level, plus roles/createContainer and roles/defaultUser, each with their main.yml files)

Here, docker.yaml is the main playbook that will use the roles, and we have skipped the vars directory for the second role as it is not needed. Let us go through these files one by one.

The first file, docker.yaml, is now much shorter as it only refers to the two roles.

---
  - name: Bring up a Docker container 
    hosts: localhost
    roles:
    - createContainer
  - name: Provision hosts
    hosts: docker_nodes
    become: yes
    roles:
    - defaultUser

Again, we define two plays in this playbook. However, there are no tasks defined in any of these plays. Instead, we reference roles. When running such a play, Ansible will go through the roles and for each role:

  • Load the tasks defined in the main.yml file in the tasks-folder of the role and add them to the play
  • Load the default variables defined in the main.yml file in the defaults-folder of the role
  • Merge these variable definitions with those defined in the vars-folder, so that these variables will overwrite the defaults (see also the precedence rules for variables in the documentation)

Thus all tasks imported from roles will be run before any tasks defined in the playbook directly are executed.

Let us now take a look at the files for each of the roles. The tasks, for instance, are simply a list of tasks that you would also put into a play directly, for instance

---
- name: Run Docker container
  docker_container:
    auto_remove: yes
    detach: yes
    name: myTestNode
    image: node:latest
    network_mode: bridge
    state: started
  register: dockerData

Note that this is a perfectly valid YAML list, as you would expect it. Similarly, the variables you define either as default or as override are simple dictionaries.

---
# defaults file for defaultUser
userName: chr
userPrivateKeyFile: ~/.ssh/ansible-default-user-key_rsa

The only exception is the role metadata file main.yml in the meta folder. This file uses a specific YAML-based syntax that defines some attributes of the role itself. Most of the information within this file is meant to be used when you decide to publish your role on Galaxy and is not needed if you work locally. There is one important exception, though – you can define dependencies between roles, which will be evaluated to make sure that roles required by another role are automatically pulled into the playbook when needed.
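
As a sketch, a meta/main.yml that declares a dependency on another role (the role name is just an example) could be as simple as

---
dependencies:
  - role: createContainer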

The full example, including the definitions of these two roles and the related variables, can be found in my GitHub repository. To run the example, use the following steps.

# Clone my repository and cd into partVI
git clone https://github.com/christianb93/ansible-samples
cd ansible-samples/partVI
# Create two new SSH keys. The first key is used for the initial 
# container setup, the public key is baked into the container
ssh-keygen -f ansible -b 2048 -t rsa -P ""
# the second key is the key for the default user that our 
# playbook will create within the container
ssh-keygen -f default-user-key -b 2048 -t rsa -P ""
# Get the Dockerfile from my earlier post and build the container, using
# the new key ansible just created
cp ../partV/Dockerfile .
docker build -t node .
# Put location of user SSH key into vars/main.yml for 
# the defaultUser role
mkdir roles/defaultUser/vars
echo "---" > roles/defaultUser/vars/main.yml
echo "userPrivateKeyFile: $(pwd)/default-user-key" >> roles/defaultUser/vars/main.yml
# Run playbook
export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook docker.yaml

If everything worked, you can now log into your newly provisioned container with the new default user using

ssh -i default-user-key chr@172.17.0.2

where of course you need to replace the IP address by the IP address assigned to the Docker container (which the playbook will print for you). Note that we have demonstrated variable overriding in action – we have created a variable file for the role defaultUser and put the specific value, namely the location of the default user's SSH key, into it, overriding the default coming with the role. This ability to bundle all required variables as part of a role, while at the same time allowing a user to override them, is what makes roles suitable for actual code reuse. We could also have overridden the variable in the playbook using it, making the use of roles a bit similar to a function call in a procedural programming language.

This completes our post on roles. In the next few posts, we will now tackle some more complex projects – provisioning infrastructure in a cloud environment with Ansible.

Automating provisioning with Ansible – working with inventories

So far, we have used Ansible inventories more or less as a simple list of nodes. But there is much more you can do with inventories – you can assign hosts to groups, build hierarchies of groups, use dynamic inventories and assign variables. In this post, we will look at some of these options.

Groups in inventories

Recall that our simple inventory file (when using the Vagrant setup demonstrated in an earlier post) looks as follows (say this is saved as hosts.ini in the current working directory)

[servers]
192.168.33.10
192.168.33.11

We have two servers and one group called servers. Of course, we could have more than one group. Suppose, for instance, that we change our file as follows.

[web]
192.168.33.10
[db]
192.168.33.11

When running Ansible, we can now specify either the group web or db, for instance

export ANSIBLE_HOST_KEY_CHECKING=False
ansible -u vagrant \
        -i hosts.ini \
        --private-key ~/vagrant/vagrant_key \
        -m ping web

then Ansible will only operate on the hosts in the group web. If you want to run Ansible for all hosts, you can still use the group “all”.

Hosts can be in more than one group. For instance, when we change the file as follows

[web]
192.168.33.10
[db]
192.168.33.10
192.168.33.11

then the host 192.168.33.10 will be in both groups, db and web. If you run Ansible for the db group, it will operate on both hosts; if you run it for the web group, it will operate only on 192.168.33.10. Of course, if you use the pseudo-group all, then Ansible will touch this host only once, even though it appears twice in the configuration file.

It is also possible to define groups as the union of other groups. To create a group that contains all servers, we could for instance do something like

[web]
192.168.33.10
[db]
192.168.33.11
192.168.33.10
[servers:children]
db
web

Finally, as we have seen in the last post, we can define variables directly in the inventory file. We have already seen this on the level of individual hosts, but it is also possible on the level of groups. Let us take a look at the following example.

[web]
192.168.33.10
[db]
192.168.33.11
192.168.33.10
[servers:children]
db
web
[db:vars]
a=5
[servers:vars]
b=10

Here we define two variables, a and b. The first variable is defined for all servers in the db group. The second variable is defined for all servers in the servers group. We can print out the values of these variables using the debug Ansible module

ansible -i hosts.ini \
        --private-key ~/vagrant/vagrant_key  \
        -u vagrant -m debug -a "var=a,b" all

which will give us the following output

192.168.33.11 | SUCCESS => {
    "a,b": "(5, 10)"
}
192.168.33.10 | SUCCESS => {
    "a,b": "(5, 10)"
}

To better understand the structure of an inventory file, it is often useful to create a YAML representation of the file, which is better suited to visualize the hierarchical structure. For that purpose, we can use the ansible-inventory command line tool. The command

ansible-inventory -i hosts.ini --list -y

will create the following YAML representation of our inventory.

all:
  children:
    servers:
      children:
        db:
          hosts:
            192.168.33.10:
              a: 5
              b: 10
            192.168.33.11:
              a: 5
              b: 10
        web:
          hosts:
            192.168.33.10: {}
    ungrouped: {}

which nicely demonstrates that we have effectively built a tree, with the implicit group all at the top, the group servers as the only descendant and the groups db and web as children of this group. In addition, there is a group ungrouped which lists all hosts which are not explicitly assigned to a group. If you omit the -y flag, you will get a JSON representation which lists the variables in a separate dictionary _meta.

Group and host variables in separate files

Ansible also offers you the option to maintain group and host variables in separate files, which can again be useful if you need to deal with different environments. By convention, Ansible will look for variables defined on group level in a directory group_vars that needs to be a subdirectory of the directory in which the inventory file is located. Thus, if your inventory file is called /home/user/somedir/hosts.ini, you would have to create a directory /home/user/somedir/group_vars. Inside this directory, place a YAML file whose name matches the name of the group and which contains a dictionary of key-value pairs; these pairs will then be used as variable definitions for the group. In our case, if we wanted to define a variable c for all hosts in the db group, we would create a file group_vars/db.yaml with the following content

---
  c: 15

We can again check that this has worked by printing the variables a, b and c.

ansible -i hosts.ini \
        --private-key ~/vagrant/vagrant_key  \
        -u vagrant -m debug -a "var=a,b,c" db

Similarly, you could create a directory host_vars and place a file there to define variables for a specific host – again, the file name should match the host name. It is also possible to merge several files – see the corresponding page in the documentation for all the details.
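
As a quick (made-up) example, to define a variable d only for the host 192.168.33.10 from our inventory, you could create a file host_vars/192.168.33.10.yaml – with host_vars again being a subdirectory of the directory holding the inventory file – and give it the following content.

---
  d: 20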

Dynamic inventories

So far, our inventories have been static, i.e. we prepare them before we run Ansible, and they remain unchanged while the playbook is running. This is fine in a classical setup where you have a fixed set of machines to manage and changes only happen when a machine is physically added to or removed from the data center. In a cloud environment, however, the setup tends to be much more dynamic. To deal with this situation, Ansible offers different approaches to manage dynamic inventories that change at runtime.

The first approach is to use inventory scripts. An inventory script is simply a script (an executable, which can be written in any programming language) that creates an inventory in JSON format, similarly to what ansible-inventory does. When you provide such an executable using the -i switch, Ansible will invoke the inventory script and use the output as inventory.

Inventory scripts are invoked by Ansible in two modes. When Ansible needs a full list of all hosts and groups, it will add the switch --list. When Ansible needs details on a specific host, it will pass the switch --host and, in addition, the hostname as an argument.

Let us take a look at an example to see how this works. Ansible comes bundled with a large number of inventory scripts. Let us play with the script for Vagrant. After downloading the script to your working directory and installing the required module paramiko using pip, you can run the script as follows.

python3 vagrant.py --list

This will give you an output similar to the following (I have piped this through jq to improve readability)

{
  "vagrant": [
    "boxA",
    "boxB"
  ],
  "_meta": {
    "hostvars": {
      "boxA": {
        "ansible_user": "vagrant",
        "ansible_host": "127.0.0.1",
        "ansible_ssh_private_key_file": "/home/chr/vagrant/vagrant_key",
        "ansible_port": "2222"
      },
      "boxB": {
        "ansible_user": "vagrant",
        "ansible_host": "127.0.0.1",
        "ansible_ssh_private_key_file": "/home/chr/vagrant/vagrant_key",
        "ansible_port": "2200"
      }
    }
  }
}

We see that the script has created a group vagrant with two hosts, using the names in your Vagrantfile. For each host, it has, in addition, declared some variables, like the private key file, the SSH user and the IP address and port to use for SSH.

To use this dynamically created inventory with ansible, we first have to make the Python script executable, using chmod 700 vagrant.py. Then, we can simply invoke Ansible pointing to the script with the -i switch.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible -i vagrant.py -m ping all

Note that we do not have to use the switches --private-key and -u as this information is already present in the inventory.

It is instructive to look at the (rather short) script. Doing this, you will see that behind the scenes, the script simply invokes vagrant status and vagrant ssh-config. This implies, however, that the script will only detect your running instances properly if you execute it – and thus Ansible – in the directory in which your Vagrantfile lives and in which you issued vagrant up to bring up the machines!

In practice, dynamic inventories are mostly used with public cloud providers. Ansible comes with inventory scripts for virtually every cloud provider you can imagine. As an example, let us try out the script for EC2.

First, there is some setup required. Download the inventory script ec2.py and, placing it in the same directory, the configuration file ec2.ini. Then, make the script executable.

A short look at the script will show you that it uses the Boto library to access the AWS API. So you need Boto installed, and you need to make sure that Boto has access to your AWS credentials, for instance because you have a working AWS CLI configuration (see my previous post on using Python with AWS for more details on this). As explained in the comments in the script, you can also use environment variables to provide your AWS credentials (which are stored in ~/.aws/credentials when you have set up the AWS CLI).

Next, bring up some instances on EC2 and then run the script using

./ec2.py --list

Note that the script uses a cache to avoid having to repeat time-consuming API calls. The expiration time is provided as a parameter in the ec2.ini file. The default is 5 minutes, which can be a bit too long when playing around, so I recommend changing this to e.g. 30 seconds.

Even if you have only one or two machines running, the output that the script produces is significantly more complex than the output of the Vagrant dynamic inventory script. The reason for this is that, instead of just listing the hosts, the EC2 script will group the hosts according to certain criteria (that again can be selected in ec2.ini), for instance availability zone, region, AMI, platform, VPC and so forth. This allows you to target for instance all Linux boxes, all machines running in a specific data center and so forth. If you tag your machines, you will also find that the script groups the hosts by tags. This is very useful and allows you to differentiate between different types of machines (e.g. database servers, web servers), or different stages. The script also attaches certain characteristics as variables to each host.

In addition to inventory scripts, there is a second mechanism to dynamically change an inventory – the add_host module. This module modifies the copy of the inventory that Ansible builds in memory at startup, so the changes it makes are only valid for this run of the playbook. It is mostly used when we use Ansible to actually bring up hosts.

To demonstrate this, we will use a slightly different setup than the one we have used so far. Recall that Ansible can work with any host on which Python is installed and which can be accessed via SSH. Thus, instead of spinning up a virtual machine, we can just as well bring up a container and use that to simulate a host. To spin up a container, we use the Ansible module docker_container that we execute on the control machine, i.e. with the pseudo-group localhost, which is present even if the inventory is empty. After we have created the container, we add it dynamically to the inventory and can then use it for the remainder of the playbook.

To realize this setup, the first thing which we need is a container image with a running SSH daemon. As base image, we can use the latest version of the official Ubuntu image for Ubuntu bionic. I have created a Dockerfile which, based on the Ubuntu image, installs the OpenSSH server and sudo, creates a user ansible, adds the user to the sudoer group and imports a key pair which is assumed to exist in the directory.

Once this image has been built, we can use the following playbook to bring up a container, dynamically add it to the inventory and run ping on it to test that the setup has worked. Note that this requires that the docker Python module is installed on the control host.

---
  # This is our first play - it will bring up a new Docker container and register it with
  # the inventory
  - name: Bring up a Docker container that we will use as our host and build a dynamic inventory
    hosts: localhost
    tasks:
    # We first use the docker_container module to start our container. This of course assumes that you
    # have built the image node:latest according to the Dockerfile which is distributed along with
    # this script. We use bridge networking, but do not expose port 22
    - name: Run Docker container
      docker_container:
        auto_remove: yes
        detach: yes
        name: myTestNode
        image: node:latest
        network_mode: bridge
        state: started
      register: dockerData
    # As we have chosen not to export the SSH port, we will have to figure out the IP of the container just created. We can
    # extract this from the return value of the docker run command
    - name: Extract IP address from docker dockerData
      set_fact:
        ipAddress:  "{{ dockerData['ansible_facts']['docker_container']['NetworkSettings']['IPAddress'] }}"
    # Now we add the new host to the inventory, as part of a new group docker_nodes
    # This inventory is then valid for the remainder of the playbook execution
    - name: Add new host to inventory
      add_host:
        hostname: myTestNode
        ansible_ssh_host: "{{ ipAddress }}"
        ansible_ssh_user: ansible
        ansible_ssh_private_key_file: "./ansible"
        groups: docker_nodes
  #
  # Our second play. We now ping our host using the newly created inventory
  #
  - name: Ping host
    hosts: docker_nodes
    become: yes
    tasks:
    - name: Ping  host
      ping:

If you want to run this example, you can download all the required files from my GitHub repository, build the required Docker image, generate the key and run the example as follows.

pip install docker
git clone https://www.github.com/christianb93/ansible-samples
cd ansible-samples/partV
./buildContainer
export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook docker.yaml

This is nice, and you might want to play with this to create additional containers for more advanced test setups. When you do this, however, you will soon realize that it would be very beneficial to be able to execute one task several times, i.e. to use loops. Time to look at control structures in Ansible, which we do in the next post.

Automating provisioning with Ansible – variables and facts

In the playbooks that we have considered so far, we have used tasks, links to the inventory and modules. In this post, we will add another important feature of Ansible to our toolbox – variables.

Declaring variables

Using variables in Ansible is slightly more complex than you might expect at first glance, mainly due to the fact that variables can be defined at many different points, and the precedence rules are a bit complicated, making errors likely. Ignoring some of the details, here are the essential options that you have to define variables and assign values:

  • You can assign variables in a playbook on the level of a play, which are then valid for all tasks and all hosts within that play
  • Similarly, variables can be defined on the level of an individual task in a playbook
  • You can define variables on the level of hosts or groups of hosts in the inventory file
  • There is a module called set_fact that allows you to define variables and assign values which are then scoped per host and for the remainder of the playbook execution
  • Variables can be defined on the command line when executing a playbook
  • You can register a variable with a task, so that the return value of the module executed by that task is assigned to the variable within the scope of the respective host
  • Variable definitions can be moved into separate files and be referenced from within the playbook
  • Finally, Ansible will provide some variables and facts

Let us go through these various options using the following playbook as an example.

---
- hosts: all
  become: yes
  # We can define a variable on the level of a play, it is
  # then valid for all hosts to which the play applies
  vars:
    myVar1: "Hello"
  vars_files:
  - vars.yaml
  tasks:
    # We can also set a variable using the set_fact module
    # This will be valid for the respective host until completion
    # of the playbook
  - name: Set variable
    set_fact:
      myVar2: "World"
      myVar5: "{{ ansible_facts['machine_id'] }}"
    # We can register variables with tasks, so that the output of the
    # task will be captured in the variable
  - name: Register variables
    command: "date"
    register:
      myVar3
  - name: Print variables
    # We can also set variables on the task level
    vars:
      myVar4: 123
    debug:
      var: myVar1, myVar2, myVar3['stdout'], myVar4, myVar5, myVar6, myVar7

At the top of this playbook, we see an additional attribute vars on the level of the play. This attribute is a dictionary of key-value pairs that define variables which are valid across all tasks and hosts within that play. In our example, this is the variable myVar1.

The same syntax is used for the variable myVar4 on the task level. This variable is then only valid for that specific task.

Directly below the declaration of myVar1, we instruct Ansible to pick up variable definitions from an external file. This file is again in YAML syntax and can define arbitrary key-value pairs. In our example, this file could be as simple as

---
  myVar7: abcd

Separating variable definitions from the rest of the playbook is very useful if you deal with several environments. You could then move all environment-specific variables into separate files so that you can use the same playbook for all environments. You could even turn the name of the file holding the variables into a variable that is then set using a command line switch (see below), which allows you to use different sets of variables for each execution without having to change the playbook.

The variable myVar3 is registered with the module command, meaning that it will capture the output of this module. Note that this output will usually be a complex data structure, i.e. a dictionary. One of the keys in this dictionary, which depends on the module, is typically stdout and captures the output of the command.

For myVar2, we use the module set_fact to define it and assign a value to it. Note that this value will only be valid per host, as myVar5 demonstrates (here we use a fact and Jinja2 syntax – we will discuss this further below).

In the last task of the playbook, we print out the value of all variables using the debug module. If you look at this statement, you will see that we print out a variable – myVar6 – which is not defined anywhere in the playbook. This variable is in fact defined in the inventory. Recall that the inventory for our test setup with Vagrant introduced in the last post looked as follows.

[servers]
192.168.33.10
192.168.33.11

To define the variable myVar6, change this file as follows.

[servers]
192.168.33.10 myVar6=10
192.168.33.11 myVar6=11

Note that behind each host, we have added the variable name along with its value which is specific per host. If you now run this playbook with a command like

export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook  \
        -u vagrant \
        --private-key ~/vagrant/vagrant_key \
        -i ~/vagrant/hosts.ini \
         definingVariables.yaml

then the last task will produce an output that contains a list of all variables along with their values. You will see that myVar6 has the value defined in the inventory, that myVar5 is in fact different for each host and that all other variables have the values defined in the playbook.

As mentioned before, it is also possible to define variables using an argument to the ansible-playbook executable. If, for instance, you use the following command to run the playbook

export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook  \
        -u vagrant \
        --private-key ~/vagrant/vagrant_key \
        -i ~/vagrant/hosts.ini \
        -e myVar4=789 \
         definingVariables.yaml

then the output will change and the variable myVar4 has the value 789. This is an example of the precedence rule mentioned above – an assignment specified on the command line overwrites all other definitions.

Using facts and special variables

So far we have been using variables that we instantiate and to which we assign values. In addition to these custom variables, Ansible will create and populate a few variables for us.

First, for every machine in the inventory to which Ansible connects, it will create a complex data structure called ansible_facts which represents data that Ansible collects on the machine. To see an example, run the command (assuming again that you use the setup from my last post)

export ANSIBLE_HOST_KEY_CHECKING=False
ansible \
      -u vagrant \
      -i ~/vagrant/hosts.ini \
      --private-key ~/vagrant/vagrant_key  \
      -m setup all 

This will print a JSON representation of the facts that Ansible has gathered. We see that facts include information on the machine like the number of cores and hyperthreads per core, the available memory, the IP addresses, the devices, the machine ID (which we have used in our example above), environment variables and so forth. In addition, we find some information on the user that Ansible is using to connect, the used Python interpreter and the operating system installed.

It is also possible to add custom facts by placing a file with key-value pairs in a special directory on the host. Confusingly, this is called local facts, even though these facts are not defined on the control machine on which Ansible is running but on the host that is provisioned. Specifically, a file in /etc/ansible/facts.d ending with .fact can contain key-value pairs that are interpreted as facts and added to the dictionary ansible_local.

Suppose, for instance, that on one of your hosts, you have created a file called /etc/ansible/facts.d/myfacts.fact with the following content

[test]
testvar=1

If you then run the above command again to gather all facts, then, for that specific host, the output will contain a variable

"ansible_local": {
            "myfacts": {
                "test": {
                    "testvar": "1"
                }
            }

So we see that ansible_local is a dictionary, with the keys being the names of the files in which the facts are stored (without the extension). The value for each of the files is again a dictionary, where the key is the section in the facts file (the one in brackets) and the value is a dictionary with one entry for each of the variables defined in this section (you might want to consult the Wikipedia page on the INI file format).
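
To use this custom fact in a playbook or a template, you simply navigate through this nested dictionary. A task like the following (a small sketch, assuming the file myfacts.fact from above) would print the value of our fact.

- name: Print our custom local fact
  debug:
    msg: "The value of testvar is {{ ansible_local['myfacts']['test']['testvar'] }}"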

In addition to facts, Ansible will populate some special variables, like inventory_hostname, which holds the name of the current host as it appears in the inventory, or groups, which is a dictionary containing all groups defined in the inventory along with their members.
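
As a quick (made-up) illustration, a task like the following would print the name of the host it is currently running on as well as the group structure of the inventory.

- name: Print some special variables
  debug:
    var: inventory_hostname, groups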

Using variables – Jinja2 templates

In the example above, we have used variables in the debug statement to print their value. This, of course, is not a typical usage. In general, you will want to expand a variable to use it at some other point in your playbook. To this end, Ansible uses Jinja2 templates.

Jinja2 is a very powerful templating language and Python library, and we will only be able to touch on some of its features in this post. Essentially, Jinja2 accepts a template and a set of Python variables and then renders the template, substituting special expressions according to the values of the variables. Jinja2 differentiates between expressions which are evaluated and replaced by the values of the variables they refer to, and tags which control the flow of the template processing, so that you can realize things like loops and if-then statements.

Let us start with a very simple and basic example. In the above playbook, we have hard-coded the name of our variable file vars.yaml. As mentioned above, it is sometimes useful to use different variable files, depending on the environment. To see how this can be done, change the start of our playbook as follows.

---
- hosts: all
  become: yes
  # We can define a variable on the level of a play, it is
  # then valid for all hosts to which the play applies
  vars:
    myVar1: "Hello"
  vars_files:
  - "{{ myfile }}"

When you now run the playbook again, the execution will fail and Ansible will complain about an undefined variable. To fix this, we need to define the variable myfile somewhere, say on the command line.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook  \
        -u vagrant \
        --private-key ~/vagrant/vagrant_key \
        -i ~/vagrant/hosts.ini \
        -e myVar4=789 \
        -e myfile=vars.yaml \
         definingVariables.yaml

What happens is that before executing the playbook, Ansible will run the playbook through the Jinja2 templating engine. The expression {{myfile}} is the most basic example of a Jinja2 expression and evaluates to the value of the variable myfile. So the entire expression gets replaced by vars.yaml and Ansible will read the variables defined there.

Simple variable substitution is probably the most commonly used feature of Jinja2. But Jinja2 can do much more. As an example, let us modify our playbook so that we use a certain value for myVar7 in DEV and a different value in PROD. The beginning of our playbook now looks as follows (everything else is unchanged):

---
- hosts: all
  become: yes
  # We can define a variable on the level of a play, it is
  # then valid for all hosts to which the play applies
  vars:
    myVar1: "Hello"
    myVar7: "{% if myEnv == 'DEV' %} A {% else %} B {% endif %}"

Let us run this again. On the command line, we set the variable myEnv to DEV.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook  \
        -u vagrant \
        --private-key ~/vagrant/vagrant_key \
        -i ~/vagrant/hosts.ini \
        -e myVar4=789 \
        -e myEnv=DEV \
         definingVariables.yaml

In the output, you will see that the value of the variable is ” A “, as expected. If you use a different value for myEnv, you get ” B “. The characters “{%” instruct Jinja2 to treat everything that follows (until “%}”) as a tag. Tags are comparable to statements in a programming language. Here, we use the if-then-else tag which evaluates to a value depending on a condition.

Jinja2 comes with many tags, and I advise you to browse the documentation of all available control structures. In addition to control structures, Jinja2 also uses filters that can be applied to variables and can be chained.
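
As a first, simple (and made-up) illustration, the following snippet could be added to the vars section of our playbook – it applies the join filter to turn a list into a comma-separated string and then chains the upper filter, so that myVar8 would evaluate to "APPLE,ORANGE".

vars:
  fruits:
  - "apple"
  - "orange"
  myVar8: "{{ fruits | join(',') | upper }}"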

To see this in action, we turn to an example which demonstrates a second common use of Jinja2 templates with Ansible apart from using them in playbooks – the template module. This module is very similar to the copy module, but instead of simply copying a file to the remote machine, it takes a Jinja2 template on the control machine, evaluates it and copies the rendered result over.

Suppose, for instance, you wanted to dynamically create a web page on the remote machine that reflects some of the machine's characteristics, as captured by the Ansible facts. Then, you could use a template that refers to facts to produce some HTML output. I have created a playbook that demonstrates how this works – this playbook will install NGINX in a Docker container and dynamically create a webpage containing machine details. If you run this playbook with our Vagrant based setup and point your browser to http://192.168.33.10/, you will see a screen similar to the one below, displaying things like the number of CPU cores in the virtual machine or the network interfaces attached to it.

Screenshot from 2019-10-05 18-01-59

I will not go through this in detail, but I advise you to try out the playbook and take a look at the Jinja2 template that it uses. I have added a few comments which, along with the Jinja2 documentation, should give you a good idea how the evaluation works.

To close this post, let us see how we can test Jinja2 templates. Of course, you could simply run Ansible, but this is a bit slow and creates an overhead that you might want to avoid. As Jinja2 is a Python library, there is a much easier approach – you can simply create a small Python script that imports your template, runs the Jinja2 engine and prints the result. First, of course, you need to install the Jinja2 Python module.

pip3 install jinja2

Here is an example of how this might work. We import the template index.html.j2 that we also use for our dynamic webpage displayed above, define some test data, run the engine and print the result.

import jinja2
#
# Create a Jinja2 environment, using the file system loader
# to be able to load templates from the local file system
#
env = jinja2.Environment(
    loader=jinja2.FileSystemLoader('.')
)
#
# Load our template
#
template = env.get_template('index.html.j2')
#
# Prepare the input variables, as Ansible would do it (use ansible -m setup to see
# what this structure looks like, you can even copy the JSON output)
#
groups = {'all': ['127.0.0.1', '192.168.33.44']}
ansible_facts={
        "all_ipv4_addresses": [
            "10.0.2.15",
            "192.168.33.10"
        ],
        "env": {
            "HOME": "/home/vagrant",
        },
        "interfaces": [
            "enp0s8"
        ],
        "enp0s8": {
            "ipv4": {
                "address": "192.168.33.11",
                "broadcast": "192.168.33.255",
                "netmask": "255.255.255.0",
                "network": "192.168.33.0"
            },
            "macaddress": "08:00:27:77:1a:9c",
            "type": "ether"
        },
}
#
# Render the template and print the output
#
print(template.render(groups=groups, ansible_facts=ansible_facts))

An additional feature that Ansible offers on top of the standard Jinja2 templating language, and that is sometimes useful, is lookups. Lookups allow you to query data from external sources on the control machine (where they are evaluated), like environment variables, the content of a file, and many more. For example, the expression

"{{ lookup('env', 'HOME') }}"

in a playbook or a template will evaluate to the value of the environment variable HOME on the control machine. Lookups are enabled by plugins (the name of the plugin is the first argument to the lookup statement), and Ansible comes with a large number of pre-installed lookup plugins.
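
As a second (made-up) example, an expression like

"{{ lookup('file', '/etc/hostname') }}"

would evaluate to the content of the file /etc/hostname on the control machine.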

We have now discussed Ansible variables in some depth. You might want to read through the corresponding section of the Ansible documentation which contains some more details and links to additional information. In the next post, we will turn our attention back from playbooks to inventories and how to structure and manage them.

Automating provisioning with Ansible – playbooks

So far, we have used Ansible to execute individual commands on all hosts in the inventory, one by one. Today, we will learn how to use playbooks to orchestrate the command execution. As a side note, we will also learn how to set up a local test environment using Vagrant.

Setting up a local test environment

To develop and debug Ansible playbooks, it can be very useful to have a local test environment. If you have a reasonably powerful machine (RAM will be the bottleneck), this can be easily done by spinning up virtual machines using Vagrant. Essentially, Vagrant is a tool to orchestrate virtual machines, based on configuration files which can be put under version control. We will not go into details on Vagrant in this post, but only describe briefly how the setup works.

First, we need to install Vagrant and VirtualBox. On Ubuntu, this is done using

sudo apt-get install virtualbox vagrant

Next, create a folder vagrant in your home directory, switch to it and create an SSH key pair that we will later use to access our virtual machines.

ssh-keygen -b 2048 -t rsa -f vagrant_key -P ""

This will create two files, the private key vagrant_key and the corresponding public key vagrant_key.pub.

Next, we need a configuration file for the virtual machines we want to bring up. This file is traditionally called Vagrantfile. In our case, the file looks as follows.

Vagrant.configure("2") do |config|

  config.ssh.private_key_path = ['~/.vagrant.d/insecure_private_key', '~/vagrant/vagrant_key']
  config.vm.provision "file", source: "~/vagrant/vagrant_key.pub", destination: "~/.ssh/authorized_keys" 
  
  config.vm.define "boxA" do |boxA|
    boxA.vm.box = "ubuntu/bionic64"
    boxA.vm.network "private_network", ip: "192.168.33.10"
  end

  config.vm.define "boxB" do |boxB|
    boxB.vm.box = "ubuntu/bionic64"
    boxB.vm.network "private_network", ip: "192.168.33.11"
  end
end

We will not go into the details, but essentially this file instructs Vagrant to provision two machines, called boxA and boxB (the file, including some comments, can also be downloaded here). Both machines will be connected to a virtual network device and will be reachable from the host and from each other using the IP addresses 192.168.33.10 and 192.168.33.11 (using the VirtualBox networking machinery in the background, which I have described in a bit more detail in this post). On both machines, we will install the public key just created, and we ask Vagrant to use the corresponding private key.

Now place this file in the newly created directory ~/vagrant and, from there, run

vagrant up

Vagrant will now start to spin up the two virtual machines. When you run this command for the first time, it needs to download the machine image which will take some time, but once the image is present locally, this usually only takes one or two minutes.

Our first playbook

Without further ado, here is our first playbook that we will use to explain the basic structure of an Ansible playbook.

---
# Our first play. We create an Ansible user on the host
- name: Initial setup
  hosts: all
  become: yes
  tasks:
  - name: Create a user ansible_user on the host
    user:
      name: ansible_user
      state: present
  - name: Create a directory /home/ansible_user/.ssh
    file:
      path: /home/ansible_user/.ssh
      group: ansible_user
      owner: ansible_user
      mode: 0700
      state: directory
  - name: Distribute ssh key
    copy:
      src: ~/.keys/ansible_key.pub
      dest: /home/ansible_user/.ssh/authorized_keys
      mode: 0700
      owner: ansible_user
      group: ansible_user
  - name: Add newly created user to sudoers file
    lineinfile:
      path: /etc/sudoers
      state: present
      line: "ansible_user      ALL = NOPASSWD: ALL"


Let us go through this line by line. First, recall from our previous posts that an Ansible playbook is a sequence of plays. Each play in turn refers to a group of hosts in the inventory.

PlaybookStructure

A playbook is written in YAML-syntax that you probably know well if you have followed my series on Kubernetes (if not, you might want to take a look at the summary page on Wikipedia). Being a sequence of plays, it is a list. Our example contains only one play.

The first attribute of our play is called name. This is simply a human-readable name that ideally briefly summarizes what the play is doing. The next attribute hosts refers to a set of hosts in our inventory. In our case, we want to run the play for all nodes in the inventory. More precisely, this is a pattern which can specify individual hosts, groups of hosts or various combinations.

The next attribute is the become flag that we have already seen several times. We need this, as our vagrant user is not root and we need to use sudo to execute most of our commands.

The next attribute, tasks, is itself a list and contains all tasks that make up the playbook. When Ansible executes a playbook, it will execute the plays in the order in which they are present in the playbook. For each play, it will go through the tasks and, for each task, it will loop through the corresponding hosts and execute this task on each host. So a task will have to be completed for each host before the next task starts. If an error occurs, this host will be removed from the rotation and the execution of the play will continue.

We have learned in an earlier post that a task is basically the execution of a module. Each task usually starts with a name. The next line specifies the module to execute, followed by a list of parameters for this module. Our first task executes the user module, passing the name of the user to be created as an argument, similar to what we did when executing commands ad-hoc. Our second task executes the file module, the third task the copy module and the last task the lineinfile module. If you have followed my previous posts, you will easily recognize what these tasks are doing – we create a new user, distribute the SSH key in ~/.keys/ansible_key.pub and add the user to the sudoers file on each host (of course, we could as well continue to use the vagrant user, but we perform the switch to a different user for the sake of demonstration). To run this example, you will have to create an SSH key pair called ansible_key and store it in the subdirectory .keys of your home directory, as in my last post.

Executing our playbook

How do we actually execute a playbook? The community edition of Ansible contains a command-line tool called ansible-playbook that accepts one or more playbooks as arguments and executes them. So assuming that you have saved the above playbook in a file called myFirstPlaybook.yaml, you can run it as follows.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook -i hosts.ini  -u vagrant --private-key ~/vagrant/vagrant_key myFirstPlaybook.yaml

The parameters are quite similar to the parameters of the ansible command. We use -i to specify the location of an inventory file, -u to specify the user to use for the SSH connections and --private-key to specify the location of the private key file of this user.

When we run this playbook, Ansible will log a summary of the actions to the console. We will see that it starts to work on our play, and then, for each task, executes the respective action once for each host. When the script completes, we can verify that everything worked by using SSH to connect to the machines using our newly created user, for instance

ssh -i ~/.keys/ansible_key -o StrictHostKeyChecking=no ansible_user@192.168.33.10

Instead of executing our playbook manually after Vagrant has completed the setup of the machines, it is also possible to integrate Ansible with Vagrant. In fact, Ansible can be used in Vagrant as a Provisioner, meaning that we can specify an Ansible playbook that Vagrant will execute for each host immediately after it has been provisioned. To see this in action, copy the playbook to the Vagrant directory ~/vagrant and, in the Vagrantfile, add the three lines

  config.vm.provision "ansible" do |ansible|
    ansible.playbook = "myFirstPlaybook.yaml"
  end

immediately after the line starting with config.vm.provision "file". When you now shut down the machines again using

vagrant destroy

from within the vagrant directory and then re-start using

vagrant up

you will see that the playbook gets executed once for each machine that is being provisioned. Behind the scenes, Vagrant will create an inventory in ~/vagrant/.vagrant/provisioners/ansible/inventory and will add the hosts there (when inspecting this file, you will find that Vagrant incorrectly adds the insecure default SSH key as a parameter to each host, but apparently this is overridden by the settings in our Vagrantfile and does not cause any problems).

Of course there is much more that you can do with playbooks. As a first more complex example, I have created a playbook that

  • Creates a new user ansible_user and adds it to the sudoers file
  • Distributes a corresponding public SSH key
  • Installs Docker
  • Installs pip3 and the Python Docker library
  • Pulls an NGINX container from the Docker hub
  • Creates an HTML file and dynamically adds the primary IP address of the host
  • Starts the container and maps the HTTP port (a short sketch of these last two steps follows below)
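
For a rough idea of what the last two steps could look like, here is a much simplified sketch – the file names, the mount point and the port mapping are my assumptions for the purpose of illustration, the actual playbook in the repository might handle this differently.

- name: Create HTML file from a Jinja2 template
  template:
    src: index.html.j2
    dest: /tmp/index.html
- name: Start NGINX container and map the HTTP port
  docker_container:
    name: nginx
    image: nginx
    state: started
    ports:
    - "80:80"
    volumes:
    - "/tmp/index.html:/usr/share/nginx/html/index.html"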

To run this playbook, copy it to the vagrant directory as well along with the HTML file template, change the name of the playbook in the Vagrantfile to nginx.yaml and run vagrant as before.

Once this script completes, you can point your browser to 192.168.33.10 or 192.168.33.11 and see a custom NGINX welcome screen. This playbook uses some additional features that we will discuss in the next post, most notably variables and facts.

Automating provisioning with Ansible – using modules

In the previous post, we have learned the basics of Ansible and how to use Ansible to execute a command – represented by a module – on a group of remote hosts. In this post, we will look into some useful modules in a bit more detail and learn a bit more on idempotency and state.

Installing software with the apt module

Maybe one of the most common scenarios that you will encounter when using Ansible is that you want to install a certain set of packages on a set of remote hosts. Thus, you want to use Ansible to invoke a package manager. Ansible comes with modules for all major package managers – yum for RedHat systems, apt for Debian derived systems like Ubuntu, or even the Solaris package managers. In our example, we assume that you have a set of Ubuntu hosts on which you need to install a certain package, say Docker. Assuming that you have these hosts in your inventory in the group servers, the command to do this is as follows.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible servers \
  -i hosts.ini \
  -u root \
  --private-key ~/.ssh/do_k8s \
  -m apt \
  -a 'name=docker.io update_cache=yes state=present'

Most options of this command are as in my previous post. We use a defined private key and a defined inventory file and instruct Ansible to use the user root when logging into the machines. With the switch -m, we ask Ansible to run the apt module. With the switch -a, we pass a set of parameters to this module, i.e. a set of key-value pairs. Let us look at these parameters in detail.

The first parameter, name, is simply the package that we want to install. The parameter update_cache instructs apt to update the package information before installing the package, which is the equivalent of apt-get update. The last parameter is the parameter state. This defines the target state that we want to achieve. In our case, we want the package to be present.

When we run this command first for a newly provisioned host, chances are that the host does not yet have Docker installed, and the apt module will install it. We will receive a rather lengthy JSON formatted response, starting as follows.

206.189.56.178 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python3"
    },
    "cache_update_time": 1569866674,
    "cache_updated": true,
    "changed": true,
    "stderr": "",
    "stderr_lines": [],
    ...

In the output, we see the line "changed" : true, which indicates that Ansible has actually changed the state of the target system. This is what we expect – the package was not yet installed, we want the latest version to be installed, and this requires a change.

Now let us run this command a second time. This time, the output will be considerably shorter.

206.189.56.178 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python3"
    },
    "cache_update_time": 1569866855,
    "cache_updated": true,
    "changed": false
}

Now the attribute changed in the JSON output is set to false. In fact, the module has first determined the current state and found that the package is already installed. It has then looked at the target state and found that current state and target state are equal. Therefore, no action is necessary and the module does not perform any actual change.

This example highlights two very important and very general (and related) properties of Ansible modules. First, they are idempotent, i.e. executing them more than once results in the same state as executing them only once. This is very useful if you manage groups of hosts. If the command fails for one host, you can simply execute it again for all hosts, and it will not do any harm on the hosts where the first run worked.

Second, an Ansible task in general does not specify an action, but a target state (there are exceptions like the command module). The module is then supposed to determine the current state, compare it to the target state and only take those actions necessary to move the system from the current state to the target state.

These are very general principles, and the exact way how they are implemented is specific to the module in question. Let us look at a second example to see how this works in practice.

Working with files

Next, we will look at some modules that allow us to work with files. First, there is a module called copy that can be used to copy a file from the control host (i.e. the host on which Ansible is running) to all nodes. The following commands will create a local file called test.asc and distribute it to all nodes in the servers group in the inventory, using /tmp/test.asc as target location.

export ANSIBLE_HOST_KEY_CHECKING=False
echo "Hello World!" > test.asc
ansible servers \
  -i hosts.ini \
  -u root \
  --private-key ~/.ssh/do_k8s \
  -m copy \
  -a 'src=test.asc dest=/tmp/test.asc'

After we have executed this command, let us look at the first host in the inventory to see that this has worked. We first use grep to strip off the first line of the inventory file, then awk to get the first IP address and finally scp to manually copy the file back from the target machine to a local file test.asc.check. We can then look at this file to see that it contains what we expect.

ip=$(cat hosts.ini  | grep -v "\[" | awk 'NR == 1')
scp -i ~/.ssh/do_k8s \
    root@$ip:/tmp/test.asc test.asc.check
cat test.asc.check

Again, it is instructive to run the copy command twice. If we do this, we see that as long as the local version of the file is unchanged, Ansible will not repeat the operation and leave the destination file alone. Again, this follows the state principle – as the node is already in the target state (file present), no action is necessary. If, however, we change the local file and resubmit the command, Ansible will again detect a deviation from the target state and copy the changed file again.

It is also possible to fetch files from the nodes. This is done with the fetch module. As this will in general produce more than one file (one per node), the destination that we specify is a directory and Ansible will create one subdirectory for each node on the local machine and place the files there.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible servers \
  -i hosts.ini \
  -u root \
  --private-key ~/.ssh/do_k8s \
  -m fetch \
  -a 'src=/tmp/test.asc dest=.'

Creating users and ssh keys

When provisioning a new host, maybe the first thing that you want to do is to create a new user on the host and provide SSH keys to be able to use this user going forward. This is especially important when using a platform like DigitalOcean which by default only creates a root account.

The first step is to actually create the user on the hosts, and of course there is an Ansible module for this – the user module. The following command will create a user called ansible_user on all hosts in the servers group.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible servers \
  -i hosts.ini \
  -u root \
  --private-key ~/.ssh/do_k8s \
  -m user \
  -a 'name=ansible_user'

To use this user via Ansible, we need to create an SSH key pair and distribute it. Recall that when SSH keys are used to log in to host A (“server”) from host B (“client”), the public key needs to be present on host A, whereas the private key is on host B. On host A, the key needs to be appended to the file authorized_keys in the .ssh directory of the user's home directory.

So let us first create our key pair. Assuming that you have OpenSSH installed on your local machine, the following command will create a private / public RSA key pair with 2048 bit key length and no passphrase and store it in ~/.keys in two files – the private key will be in ansible_key, the public key in ansible_key.pub.

ssh-keygen -b 2048 -f ~/.keys/ansible_key -t rsa -P ""

Next, we want to distribute the public key to all hosts in our inventory, i.e. for each host, we want to append the key to the file authorized_keys in the user's SSH directory. In our case, instead of appending the key, we can simply overwrite the authorized_keys file with our newly created public key. Of course, we can use Ansible for this task. First, we use the file module to create a directory .ssh in the new user's home directory. Note that we still use our initial user root and the corresponding private key.

ansible servers \
  -i hosts.ini \
  -u root \
  --private-key ~/.ssh/do_k8s \
  -m file \
  -a 'path=/home/ansible_user/.ssh owner=ansible_user mode=0700 state=directory group=ansible_user'

Next, we use the copy module to copy the public key into the newly created directory with target file name authorized_keys (there is also a module called authorized_key that we could use for that purpose).

ansible servers \
  -i hosts.ini \
  -u root \
  --private-key ~/.ssh/do_k8s \
  -m copy \
  -a 'src=~/.keys/ansible_key.pub dest=/home/ansible_user/.ssh/authorized_keys group=ansible_user owner=ansible_user mode=0700'

We can verify that this worked using again the ping module. This time, we use the newly created Ansible user and the corresponding private key.

ansible servers \
  -i hosts.ini \
  -u ansible_user \
  --private-key ~/.keys/ansible_key \
  -m ping

However, there is still a problem. For many purposes, our user is only useful if it has the right to use sudo without a password. There are several ways to achieve this, the easiest being to add the following line to the file /etc/sudoers which will allow the user ansible_user to run any command using sudo without a password.

ansible_user ALL = NOPASSWD: ALL

How to add this line? Again, there is an Ansible module that comes to our rescue – lineinfile. This module can be used to enforce the presence of defined lines in a file, at specified locations. The documentation has all the options, but for our purpose, the usage is rather simple.

ansible servers \
  -i hosts.ini \
  -u root \
  --private-key ~/.ssh/do_k8s \
  -m lineinfile \
  -a 'path=/etc/sudoers state=present line="ansible_user      ALL = NOPASSWD: ALL"'

Again, this module follows the principles of state and idempotency – if you execute this command twice, the output will indicate that the second execution does not result in any changes on the target system.

So far, we have used the ansible command line client to execute commands ad-hoc. This, however, is not the usual way of using Ansible. Instead, you would typically compose a playbook that contains the necessary commands and use Ansible to run this playbook. In the next post, we will learn how to create and deal with playbooks.

Automating provisioning with Ansible – the basics

For my projects, I often need a clean Linux box with a defined state which I can use to play around and, if I make a mistake, simply dump it again. Of course, I use a cloud environment for this purpose. However, I often find myself logging into one of these newly created machines to carry out installations manually. Time to learn how to automate this.

Ansible – the basics

Ansible is a platform designed to automate the provisioning and installation of virtual machines (or, more general, any type of servers, including physical machines and even containers). Ansible is agent-less, meaning that in order to manage a virtual machine, it does not require any special agent installed on that machine. Instead, it uses ssh to access the machine (in the Linux / Unix world, whereas with Windows, you would use something like WinRM).

When you use Ansible, you run the Ansible engine on a dedicated machine, on which Ansible itself needs to be installed. This machine is called the control machine. From this control machine, you manage one or more remote machines, also called hosts. Ansible will connect to these nodes via SSH and execute the required commands there. Obviously, Ansible needs a list of all hosts that it needs to manage, which is called the inventory. Inventories can be static, i.e. an inventory is simply a file listing all nodes, or can be dynamic, implemented either as a script that supplies a JSON-formatted list of nodes and is invoked by Ansible, or a plugin written in Python. This is especially useful when using Ansible with cloud environments where the list of nodes changes over time as nodes are being brought up or are shut down.

The “thing that is actually executed” on a node is, at the end of the day, a Python script. These scripts are called Ansible modules. You might ask yourself how the module can be invoked on the node. The answer is simple – Ansible copies the module to the node, executes it and then removes it again. When you use Ansible, then, essentially, you execute a list of tasks for each node or group of nodes, and a task is a call to a module with specific arguments.

A sequence of tasks to be applied to a specific group of nodes is called a play, and the file that specifies a set of plays is called a playbook. Thus a playbook allows you to apply a defined sequence of module calls to a set of nodes in a repeatable fashion. Playbooks are just plain files in YAML-format, and as such can be kept under version control and can be treated according to the principles of infrastructure-as-a-code.

PlaybookStructure

Installation and first steps

Being agentless, Ansible is rather simple to install. In fact, Ansible is written in Python (with the source code of the community version being available on GitHub), and can be installed with

pip3 install --user ansible

Next, we need a host. For this post, I did simply spin up a host on DigitalOcean. In my example, the host has been assigned the IP address 134.209.240.213. We need to put this IP address into an inventory file, to make it visible to Ansible. A simple inventory file (in INI format, YAML is also supported) looks as follows.

[servers]
134.209.240.213

Here, servers is a label that we can use to group nodes if we need to manage nodes with different profiles (for instance webservers, application servers or database servers).

Of course, you would in general not create this file manually. We will look at dynamic inventories in a later post, but for the time being, you can use a version of the script that I used in an earlier post on DigitalOcean to automate the provisioning of droplets. This script brings up a machine and adds it to an inventory file hosts.ini.

With this inventory file in place, we can already use Ansible to ad-hoc execute commands on this node. Here is a simple example.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible -i hosts.ini \
        servers \
        -u root \
        --private-key ~/.ssh/do_k8s \
        -m ping

Let us go through this example step by step. First, we set an environment variable which prevents the SSH host key check when first connecting to an unknown host. Next, we invoke the ansible command line client. The first parameter is the name of the inventory file, hosts.ini in our case. The second argument (servers) instructs Ansible only to run our command for those hosts that carry the label servers in our inventory file. This allows us to use different node profiles and apply different commands to them. Next, the -u switch asks Ansible to use the root user to connect to our hosts (which is the default ssh user on DigitalOcean). The next switch specifies the file in which the private key that Ansible will use to connect to the node is stored.

Finally, the switch -m specifies the module to use. For this example, we use the ping module which tries to establish a connection to the host (you can look at the source code of this module here).

Modules can also take arguments. A very versatile module is the command module (which is actually the default, if no module is specified). This module accepts an argument which it will simply execute as a command on the node. For instance, the following code snippet executes uname -a on all nodes.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible -i hosts.ini \
        servers \
        -u root \
        --private-key ~/.ssh/do_k8s \
        -m command \
        -a "uname -a"

Privilege escalation

Our examples so far have assumed that the user that Ansible uses to SSH into the machines has all the required privileges, for instance to run apt. On DigitalOcean, the standard user is root, so this simple approach works, but in other cloud environments, like EC2, the situation is different.

Suppose, for instance, that we have created an EC2 instance using the Amazon Linux AMI (which has a default user called ec2-user) and added its IP address to our hosts.ini file (of course, I have written a script to automate this). Let us also assume that the SSH key used is called ansibleTest and stored in the directory ~/.keys. Then, a ping would look as follows.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible -i hosts.ini \
        servers \
        -u ec2-user \
        --private-key ~/.keys/ansibleTest.pem \
        -m ping

This would actually work, but this changes if we want to install a package like Docker. We will learn more about package installation and state in the next post, but naively, the command to actually install the package would be as follows.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible -i hosts.ini \
        servers \
        -u ec2-user \
        --private-key ~/.keys/ansibleTest.pem \
        -m yum -a 'name=docker state=present'        

Here, we use the yum module which is able to leverage the YUM package manager used by Amazon Linux to install, upgrade, remove and manage packages. However, this will fail, and you will receive a message like ‘You need to be root to perform this command’. This happens because Ansible will SSH into your machine using the ec2-user, and this user, by default, does not have the necessary privileges to run yum.

On the command line, you would of course use sudo to fix this. To instruct Ansible to do this, we have to add the switch -b to our command. Ansible will then use sudo to execute the modules as root.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible -i hosts.ini \
        servers \
        -b \
        -u ec2-user \
        --private-key ~/.keys/ansibleTest.pem \
        -m yum -a 'name=docker state=present'        

An additional parameter, --become-user, can be used to control which user Ansible will use to run the operations. The default is root, but any other user can be used as well (assuming, of course, that the user has the required privileges for the respective command).

The examples above still look like something that could easily be done with a simple shell script, looping over all hosts in an inventory file and invoking an ssh command. The real power of Ansible, however, becomes apparent when we start to talk about modules in the next post in this series.

Database programming with Python

Most of us probably started to use Python as a scripting language to quickly create working code for e.g. numerical and scientific calculations. But, of course, Python is much more than that. If you intend to use Python for more traditional applications, you will sooner or later need to interface with a database. Today, we will see how the Python DB-API can be used for that purpose.

The Python DB-API

Of course there are many databases that can be used with Python – various modules exist like mysql-connector for MySQL, Psycopg for PostgreSQL or sqlite3 for the embedded database SQLite. Obviously, all these drivers differ slightly, but it turns out that they follow a common design pattern which is known as the Python DB-API and described in PEP-249.

If you know Java, you have probably heard of JDBC, which is a standard to access relational databases from Java applications. The Python DB-API is a bit different – it is not a library, but it is a set of design patterns that drivers are supposed to follow.

When using JDBC, there is, for instance, an interface class called java.sql.Connection. Whatever driver and whatever database you use, this interface will always be the same for compliant drivers. For Python, the situation is a bit different. If you use SQLite, there will be a class sqlite3.Connection. If you use MySQL, there will be a class with the slightly cryptic name mysql.connector.connection_cext.CMySQLConnection. These are different classes, and they do not inherit from a common superclass or interface. Still, the Python DB-API dictates that these classes all have a common set of methods and attributes to make switching between different databases as easy as possible.

Before we start to look at an example, it is helpful to understand the basic objects of the DB-API class model. First, there is a connection, which, of course, represents a connection to a database. How to initially obtain a connection is specific to the database used, and we will see some examples further below.

When a connection is obtained, it is active or open. It can be closed by executing its close() method, which will render the connection unusable and roll back any uncommitted changes. A connection also has commit() and rollback() methods to support transaction control (however, the specification leaves the implementation some freedom with respect to features like auto-commit and isolation levels, and does not even mandate that implementations support transactions at all). The specification is also not fully clear on how to start transactions; it seems that most drivers automatically start a new transaction if a statement is executed and no transaction is in progress.

To actually execute statements and fetch results, a cursor is used. Again, it is implementation specific whether this is implemented as a real, server-side cursor or as a local cache within the client. Cursors are created by calling the connection's cursor() method, and a cursor is only valid within the context of its connection (i.e. if a connection is closed, all cursors obtained from it become unusable). Once we have a cursor, we can execute statements on it using either its execute method (which executes a statement with one set of parameters) or its executemany method, which executes a statement several times with different sets of parameters. How exactly parameters are referenced in a statement is implementation specific, and we will see some examples below.

The statement executed via a cursor can be an update or a select statement. In the latter case, we can access its results either via the fetchall() method which returns all rows of the result set as a sequence, or the fetchone() method which returns the next row. The API also defines a method fetchmany() which returns a specified number of rows at a time.

Using Python with SQLite

After this general introduction, let us now see how the API works in practice. For the sake of simplicity, we will first use the embedded database SQLite.

One of the nice things about SQLite is that the module that you will need to use it in a Python program – sqlite3 – is part of the Python standard library and therefore part of any standard Python installation. So there is no separate installation needed to use it, and we can start to play with it right away.

To use a database, the first thing we have to do is to establish a connection to it. As mentioned above, the process to do this is specific to the SQLite database. As SQLite identifies a database by the file where the data is stored, we basically only have to provide the file name to the driver to either establish a connection to an existing database or to create a new database (which will happen automatically if the file we refer to does not exist). Here is a code snippet that will establish a connection to a database stored in the file example.db in your home directory.

import sqlite3 as dblib
from pathlib import Path

home = str(Path.home())
db = home + "/example.db"
c = dblib.connect(db)

Now let us create a table. This sounds simple enough: just execute a CREATE TABLE statement. However, this will raise an error if the table already exists. Of course, we could use the IF NOT EXISTS clause to avoid this, but for the sake of demonstration let us choose a different approach.

We first try to drop the table. If the table does not exist, this will raise an exception which we can catch; if the table exists, it will be removed. In any case, we can now proceed to create the table. To execute these statements, we need a cursor as explained above, which we can get from the newly created connection. So our code looks as follows (of course, in production code, you would put more effort into making sure that the reason for the exception is what you expect):

cursor = c.cursor()
try:
    cursor.execute('''DROP TABLE books''')
except dblib.Error as err:
    pass
cursor.execute('''CREATE TABLE books
             (author text, title text)''')

Next, we want to insert data into our database. Suppose you wanted to insert a book called “Moby Dick” written by Herman Melville. So we want to execute an SQL statement like

INSERT INTO books
  (author, title)
VALUES
  ('Herman Melville', 'Moby Dick')
;

Of course, we could simply assemble this SQL statement in Python and pass it to the execute method of our cursor directly. But this is not the recommended way of doing things. Instead, one generally uses SQL host variables. These are variable parts of a statement which are declared inside the statement and, at runtime, are bound to Python variables. This is generally advisable, for two reasons. First, using the naive approach of manually assembling SQL statements and parameter values makes your program more vulnerable to SQL injection. Second, using host variables allows the driver to prepare the statement only once (i.e. to parse and tokenize the statement and prepare an execution plan) even if you insert multiple rows, thus increasing performance.

To use host variables with SQLite, you would execute a statement in which the values you want to insert are replaced by placeholders, using question marks.

INSERT INTO books
  (author, title)
VALUES
  (?, ?)
;

When you call the execute method of a cursor, you pass this SQL string along with the values for the placeholders, and the database driver will then replace the placeholders by their actual values, a process called binding.

However, when coding this, there is a subtlety we need to observe. The special characters used for the placeholders are not specified by the DB-API Python standard and might therefore be driver specific. Fortunately, even though the standard does not define the placeholders, it defines a way to obtain them that is the same for all drivers.

Specifically, the standard specifies several placeholder styles, and the style used by a driver can be queried by accessing the module-level attribute paramstyle. In our example, we use only two of these styles, called pyformat and qmark. Pyformat, the style used by e.g. MySQL, uses Python-like placeholders like %s, whereas qmark uses – well, you might have guessed that – question marks, as SQLite does. So to keep our code reusable, we first retrieve the placeholder style, translate this into the actual placeholder (a dictionary comes in handy here) and then assemble our statement.

knownMarkers = {'pyformat' : '%s', 'qmark' : '?'}
marker =  knownMarkers[dblib.paramstyle]

examples = [('Dickens', 'David Copperfield'),
            ('Melville', 'Moby Dick')]

sqlString =  '''INSERT INTO books
                       (author, title)
                 VALUES ''' + "(" + marker + "," + marker + ")"
cursor.executemany(sqlString, examples)
c.commit()

Note that we explicitly commit at the end, as closing the connection would otherwise roll back our changes. We also use the executemany method, which performs several insertions, binding the host variables from a sequence of value tuples.

Reading from our database is now very simple. Again, we first need to get a connection and a cursor. Once we have that, we assemble a SELECT statement and execute it. We then use the fetchone or fetchall method to retrieve the result set and can iterate through it as usual.

# Get all rows from table books
statement = "SELECT author, title from books;"
cursor.execute(statement)
rows = cursor.fetchall()

# Iterate through result set
for row in rows:
    print ("Author: " + row[0])
    print ("Title:  " + row[1])

As we can see, the order of the columns in the individual tuples within the result set is as in our SELECT statement. However, I have not found an explicit guarantee for this behaviour in the specification. If we want to make sure that we get the columns right, we can use the cursor.description attribute which the standard mandates and which contains a sequence of tuples, wherein each tuple represents a column in the result and contains a set of attributes like name and type.

We have now seen how we can use the DB-API to create database tables, to insert rows and to retrieve results. I have assembled the code that we have discussed in two scripts, one to prepare a database and one to read from it, which you can find here.

Using Python with MySQL

To illustrate the usage of the DB-API for different databases, let us now try to do the same thing with MySQL. Of course, there is some more setup involved in this case. First, the driver that we will use – the mysql-connector – is not part of the standard Python library and needs to be installed separately. Of course, this is done using

pip3 install mysql-connector-python

We will also need the MySQL command line client mysql and the admin client mysqladmin. On an Ubuntu system, you can install them using

sudo apt-get install  mysql-client

As we intend to run the MySQL server in a Docker container, you also need a Docker engine installed on your system. If you do not have that yet, the easiest way might be to install it using snap.

snap install docker

Next, we need to bring up an instance of the MySQL database using docker run, give it some time to come up, create a database called books and grant access rights to a new user that we will use to access our database from within Python. Here is a short script that performs all these steps and that you can also find here.

# Start docker container
docker run -d --name some-mysql \
           --rm \
           -p 3306:3306 \
           -e MYSQL_ROOT_PASSWORD=my-secret-root-pw \
           -e MYSQL_USER=chr \
           -e MYSQL_PASSWORD=my-secret-pw \
            mysql 

# Give the database some time to come up
mysqladmin --user=root --password=my-secret-root-pw \
           --host='127.0.0.1'  \
            --silent status

while [ $? != 0 ]
do
  sleep 5
  mysqladmin --user=root --password=my-secret-root-pw \
             --host='127.0.0.1'  \
             --silent status
done 

# Create database books and grant rights to user chr
mysqladmin --user=root \
           --password=my-secret-root-pw \
            --host='127.0.0.1'  create books
echo 'grant all on books.* to 'chr';' \
              | mysql --user=root \
                      --password=my-secret-root-pw \
                      --host='127.0.0.1' books

BE CAREFUL: if you have read my post on Docker networking, you will know that using the -p switch implies that we can reach the database from every machine in the local network – so if you are not in a fully trusted environment, you definitely want to change this script and the program that will follow to use real passwords, not the ones used in this post.
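
If, for instance, you only need to reach the database from the machine running the Docker engine, one simple mitigation is to publish the port on the loopback interface only. This is a variation of the docker run command from the script above; adjust it to your needs.

# Publish port 3306 on the loopback interface only
docker run -d --name some-mysql \
           --rm \
           -p 127.0.0.1:3306:3306 \
           -e MYSQL_ROOT_PASSWORD=my-secret-root-pw \
           -e MYSQL_USER=chr \
           -e MYSQL_PASSWORD=my-secret-pw \
            mysql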

Let us now see how we can access our newly created database from Python. Thanks to the DB-API, most of the code that we will have to use is actually the same as in the case of SQLite. Basically, there are two differences. First, the module that we have to import is of course different, but we can again use the alias dblib to be able to use the remainder of the code without changes.

import mysql.connector as dblib

The second change that we need to make is the way we obtain the initial database connection, as connecting to a MySQL database requires additional parameters like credentials. Here is the respective statement.

c = dblib.connect(user='chr', password='my-secret-pw',
                              host='127.0.0.1',
                              database='books')

The remainder of the code can now be taken over from the SQLite case unchanged. On my GitHub page, I have created a separate directory holding the code, and if you compare this code to the code that we used earlier for SQLite, you will see that in fact only those two parts of the code are different.

Of course, there is much more that could be said about database programming with Python. In reality, each database behaves differently, and you will have to deal with specifics like auto-commit, generated fields, different types, performance, result set sizes and so forth. However, the basics are the same thanks to the DB-API, and I hope that I could give you a good starting point for further investigations.

Building a CI/CD pipeline for Kubernetes with Travis and Helm

One of the strengths of Kubernetes is the ability to spin up pods and containers with a defined workload within seconds, which makes it the ideal platform for automated testing and continuous deployment. In this post, we will see how GitHub, Kubernetes, Helm and Travis CI play together nicely to establish a fully cloud based CI/CD pipeline for your Kubernetes projects.

Introduction

Traditional CI/CD pipelines require a fully equipped workstation, with tools like Jenkins, build environments, libraries, repositories and so forth installed on it. When you are used to working in a cloud based environment, however, you might be looking for alternatives that allow you to maintain your projects from everywhere and from virtually any PC with basic equipment. What are your options to make this happen?

Of course there are many possible approaches to realize this. You could, for instance, maintain a separate virtual machine running Jenkins and trigger your builds from there, maybe using Docker containers or Kubernetes as build agents. You could use something like GitLab CI with build agents on Kubernetes. You could install Jenkins X on your Kubernetes cluster. Or you could turn to Kubernetes native solutions like Argo or Tekton.

All these approaches, however, have in common that they require additional infrastructure, which means additional cost. Therefore I decided to stick to Travis CI as a CI engine and control my builds from there. As Travis runs builds in a dedicated virtual machine, I can use kind to bring up a cluster for integration testing at no additional cost.

The next thing I wanted to try out is a multi-staged pipeline based on the GitOps approach. Roughly speaking, this approach advocates the use of several repositories, one per stage, each of which reflects the actual state of the respective stage as infrastructure as code. Thus, you would have one repository for development, one for integration testing and one for production (knowing, of course, that real organisations typically have additional stages). Each repository contains the configuration (like Kubernetes YAML files or other configuration items) for the respective Kubernetes cluster. At every point in time, the cluster state is fully in sync with the state of the repository. Thus, if you want to make changes to a cluster, you would not use kubectl or the API to deploy directly into the cluster and update your repository after the fact; rather, you would change the configuration of the cluster stored in the repository and have a fully automated process in place which detects this change and updates the cluster.

The tool chain originally devised by the folks at Weaveworks requires access to a Kubernetes cluster, which, as described above, I wanted to avoid for cost reasons. Still, some of the basic ideas of GitOps can be applied with Travis CI as well.

Finally, I needed an example project. Of course, I decided to choose my bitcoin controller for Kubernetes, which is described in a series of earlier posts starting here.

Overall design and workflow

Based on these considerations, I came up with the following high-level design. The entire pipeline is based on three GitHub repositories.

  • The first repository, bitcoin-controller, represents the DEV stage of the project. It contains the actual source code of the bitcoin controller.
  • The second repository, bitcoin-controller-helm-qa, represents the QA stage. It does not contain source code, but a Helm chart that describes the state of the QA environment.
  • Finally, the third repository, bitcoin-controller-helm, represents the production stage and contains the final, tested and released Helm chart packages.

To illustrate the overall pipeline, let us take a look at the image below.

CIPipeline

The process starts on the left hand side of the diagram above when a developer pushes a change into the DEV repository. At this point, the Travis CI process will start, spin up a virtual machine, install Go and the required libraries and run the build and the unit tests. Then, a Docker image is built and pushed into the Docker Hub image repository, using the GitHub commit hash as a tag. Finally, the new tag is written into the Helm chart stored in the QA repository so that the Helm chart points to the now latest version of the Docker image.

This change in the bitcoin-controller-helm-qa repository now triggers a second Travis CI pipeline. Once the virtual machine has been brought up by Travis, we install kind, spin up a Kubernetes cluster, install Helm in this cluster, download the current version of the Helm charts and install the bitcoin controller using this Helm chart. As we have previously updated the Docker tag in the Helm chart, this will pull the latest version of the Docker image.

We then run the integration tests against our newly established cluster. If the integration tests succeed, we package our Helm chart and upload it into the bitcoin-controller-helm repository.

However, we do not want to perform this last step for every single commit, but only for releases. To achieve this, we check at this point whether the commit was a tagged commit. If yes, a new package is built using the tag as version number. If not, the process stops at this point and no promotion into bitcoin-controller-helm takes place.

Possible extensions

This simple approach can of course be extended in several directions. First, we could add an additional stage to also test our packaged Helm chart. In this stage, we would fully simulate a possible production environment, i.e. spin up a cluster at AWS, DigitalOcean or whatever your preferred provider is, deploy the packaged Helm chart and run additional tests. You could also easily integrate additional QA steps, like a performance test or static code analysis, into this pipeline.

Some organisations like to add manual approval steps before deploying into production. Unfortunately, Travis CI does not seem to offer an easy solution for this. To solve this, one could potentially use branches instead of tags to flag a certain code version as a release, and only allow specific users to push or merge into this branch.

Finally, we currently only store the Docker image which we then promote through the stages. This is fine for a simple project written in Go, where the Docker image is the only artifact we need to keep. For other projects, like a typical Java web application, you could use the same approach, but in addition store important artifacts like a WAR file in a separate repository, like Nexus or Artifactory.

Let us now dive into some more implementation details and pitfalls when trying to actually code this solution.

Build and deploy

Our pipeline starts when a developer pushes a code change into the DEV repository bitcoin-controller. At this point, Travis CI will step in and run our pipeline, according to the contents of the respective .travis.yml file. After some initial declarations, the actual processing is done by the stage definitions for the install, script and deploy phase.

install:
  - go get -d -t ./...

script:
  - go build ./cmd/controller/
  - go test -v  ./... -run "Unit" -count=1
  - ./travis/buildImage.sh

deploy:
  skip_cleanup: true
  provider: script
  script:  bash ./travis/deploy.sh
  on:
    all_branches: true

Let us go through this step by step. In the install phase, we run go get to install all required dependencies. Among other things, this will download the Kubernetes libraries that are needed by our project. Once this has been completed, we use the go utility to build and run the unit tests. We then invoke the script buildImage.sh.

The first part of the script is important for what follows – it determines the tag that we will be using for this build. Here are the respective lines from the script.

#
# Get short form of git hash for current commit
#
hash=$(git log --pretty=format:'%h' -n 1)
#
# Determine tag. If the build is from a tag push, use tag name, otherwise
# use commit hash
#
if [ "X$TRAVIS_TAG" == "X" ]; then
  tag=$hash
else
  tag=$TRAVIS_TAG
fi

Let us see how this works. We first use git log with the pretty format option to get the short form of the hash of the current commit (this works, as Travis CI will have checked out the code from GitHub and will have taken us to the root directory of the repository). We then check the environment variable TRAVIS_TAG, which is set by Travis CI if the build was triggered by pushing a tag to the server. If this variable is empty, we use the commit hash as our tag and treat the build as an ordinary build (we will see later that this build will not make it into the final stage, but will only go through unit and integration testing). If the variable is set, we use the name of the tag itself.

The rest of the script is straightforward. We run docker build using our tag to create an image locally, i.e. within the Docker instance of the Travis CI virtual machine used for the build. We also tag this image as latest to make sure that the latest tag does actually point to the latest version. Finally, we write the tag into a file for later use.
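
In condensed form, and assuming that the Dockerfile is located in the root directory of the repository, these steps correspond roughly to the following commands.

# Build the image and tag it both with the current tag and with latest
docker build -t christianb93/bitcoin-controller:$tag .
docker tag christianb93/bitcoin-controller:$tag \
           christianb93/bitcoin-controller:latest
# Store the tag in a file for the deploy stage
echo $tag > $TRAVIS_BUILD_DIR/tag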

Now we move into the deploy stage. Here, we use the option skip_cleanup to prevent Travis from cleaning up our working directory. We then invoke another script, deploy.sh. There, we read the tag again from the temporary file that we have created during the build stage and push the image to Docker Hub, using this tag once more.

#
# Login to Docker hub
#

echo "$DOCKER_PASSWORD" | docker login --username $DOCKER_USER --password-stdin

#
# Get tag
#
tag=$(cat $TRAVIS_BUILD_DIR/tag)

#
# Push images
#
docker push christianb93/bitcoin-controller:$tag
docker push christianb93/bitcoin-controller:latest

At this point, it is helpful to remember the use of image tags in Helm as discussed in one of my previous posts. Helm advocates the separation of charts (holding deployment information and dependencies) from configuration by moving the configuration into separate files (values.yaml) which are then merged back into the chart at runtime using placeholders. Applying this principle to image tags implies that we keep the image tag in a values.yaml file. To prepare for integration testing where we will use the Helm chart to deploy, we will now have to replace the tag name in this file by the current tag. So we need to check out our Helm chart using git clone and use our beloved sed to replace the image tag in the values file by its current value.
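
As a sketch, assuming that the QA repository lives under the same GitHub account and that the image tag appears in values.yaml in a line starting with tag:, this step could look like this.

# Check out the Helm chart for the QA stage
git clone https://github.com/christianb93/bitcoin-controller-helm-qa
# Replace the image tag in the values file by the current tag
sed -i "s/tag:.*/tag: $tag/" bitcoin-controller-helm-qa/values.yaml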

But this is not the only change that we want to make to our Helm chart. Remember that a Helm chart also contains versioning information – a chart version and an application version. However, we cannot simply use our tag here, as Helm requires that these version numbers follow the SemVer semantic versioning rules. So at this point, we need to define rules for how we compose our version numbers.

We do this as follows. Every release receives a version number like 1.2, where the first digit is the major release and the second digit is the minor release. In GitHub, releases are created by pushing a tag, and the tag name is used as version number (and thus has to follow this convention). Development releases are marked by appending a hyphen followed by dev and the commit hash to the current version. So if the latest version is 0.9 and we create a new build with the commit hash 64ed033, the new version number will be 0.9-dev64ed033.
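
A minimal sketch of this rule, assuming that the most recent release number is available in a variable latestRelease (read, for instance, from the existing Chart.yaml), could look as follows.

if [ "X$TRAVIS_TAG" == "X" ]; then
  # Development build: append -dev plus the short commit hash
  chartVersion="$latestRelease-dev$hash"     # e.g. 0.9-dev64ed033
else
  # Release build: the tag name is the version number
  chartVersion="$TRAVIS_TAG"                 # e.g. 1.2
fi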

So we update the values file and the Helm chart itself with the new image tag and the new version numbers. We then push the change back into the Helm repository. This will trigger a second Travis CI pipeline and the integration testing starts.

PipelineDetailsPartOne

Integration testing

When the Travis CI pipeline for the repository bitcoin-helm-qa has reached the install stage, the first thing that is being done is to download the script setupVMAndCluster.sh which is located in the code repository and to run it. This script is responsible for executing the following steps.

  • Download and install Helm (from the snap)
  • Download and install kubectl
  • Install kind
  • Use kind to create a test cluster inside the virtual machine that Travis CI has created for us
  • Init Helm and wait for the Tiller container to be ready
  • Get the source code from the code repository
  • Install all required Go libraries to be ready for the integration test

Most of these steps are straightforward, but there are a few details worth mentioning. First, this setup requires a significant data volume to be downloaded – the kind binary, the container images required by kind, Helm and so forth. To avoid that this slows down the build, we use the caching feature provided by Travis CI, which allows us to cache the contents of an arbitrary directory. If, for instance, we find that the kind node image is in the cache, we skip the download and instead use docker load to pre-load the image into the local Docker instance.
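
Here is a hypothetical sketch of this caching logic for the kind node image; the cache directory and the image version are assumptions and need to match the cache settings in .travis.yml and the kind release actually used.

CACHE_DIR="$HOME/cache"               # assumption: directory declared as cache in .travis.yml
NODE_IMAGE="kindest/node:v1.14.2"     # example version only
if [ -f "$CACHE_DIR/kind-node.tar" ]; then
  # Cache hit: pre-load the image into the local Docker instance
  docker load -i "$CACHE_DIR/kind-node.tar"
else
  # Cache miss: pull the image and store it in the cache for the next build
  docker pull $NODE_IMAGE
  docker save -o "$CACHE_DIR/kind-node.tar" $NODE_IMAGE
fi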

The second point to observe is that for integration testing, we need the code for the test cases, which is located in the code repository, not in the repository for which the pipeline has been triggered. Thus we need to manually clone the code repository. However, we want to make sure that we get the version of the test cases that matches the version of the Helm chart (which could, for instance, be an issue if someone changes the code repository while a build is in progress). Thus we need to figure out the git commit hash of the code version under test and run git checkout to use that version. Fortunately, we have put the commit hash as application version into the Helm chart while running the build and deploy pipeline, so we can use grep and awk to extract the commit hash.

tag=$(cat Chart.yaml | grep "appVersion:" | awk '{ print $2 }')
cd $GOPATH/src/github.com/christianb93
git clone https://github.com/christianb93/bitcoin-controller
cd bitcoin-controller
git checkout $tag

Once this script has completed, we have a fully functional Kubernetes cluster with Helm and Tiller installed running in our VM. We can now use the Helm chart to install the latest version of the bitcoin controller and run our tests. Once the tests complete, we perform a cleanup and run an additional script (promote.sh) to enter the final stage of the build process.

This script updates the repository bitcoin-controller-helm that represents the Helm repository with the fully tested and released versions of the bitcoin controller. We first examine the tag to figure out whether this is a stable version, i.e. a tagged release. If this is not the case, the script completes without any further action.
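
A sketch of this check, under the assumption that development versions can be recognized by the -dev marker introduced above, might look like this.

# Read the chart version and skip the promote for development builds
version=$(grep "^version:" Chart.yaml | awk '{ print $2 }')
case $version in
  *-dev*)
    echo "Version $version is a development build, skipping promote"
    exit 0
    ;;
esac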

If the commit is a release, we rename the directory in which our Helm chart is located (as Helm assumes that the name of the Helm chart and the name of the directory coincide) and update the chart name in the Chart.yaml file. We then remove a few unneeded files and use the Helm client to package the chart.

Next we clone into the bitcoin-controller-helm repository, place our newly packaged chart there and update the index. Finally, we push the changes back into the repository – and we are done.
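
Condensed into a few commands, and leaving out error handling, cleanup and the credentials needed for the final push, these two steps might look roughly like this (directory and file names are illustrative only).

# Rename the chart directory, adjust the chart name and package the chart
mv bitcoin-controller-helm-qa bitcoin-controller
sed -i "s/^name:.*/name: bitcoin-controller/" bitcoin-controller/Chart.yaml
helm package ./bitcoin-controller

# Place the package in the Helm repository and rebuild the index
git clone https://github.com/christianb93/bitcoin-controller-helm
cp bitcoin-controller-*.tgz bitcoin-controller-helm/
cd bitcoin-controller-helm
helm repo index .
git add .
git commit -m "Promote new chart version"
git push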

PipelineDetailsPartTwo