Automating provisioning with Ansible – building cloud environments

When you browse the module list of Ansible, you will find that the by far largest section is the list of cloud modules, i.e. modules to interact with a cloud environments to define, provision and maintain objects like virtual machines, networks, firewalls or storage. These modules make it easy to access the APIs of virtually every existing cloud provider. In this post, we will look at an example that demonstrates the usage of these modules for two cloud platforms – DigitalOcean and AWS.

Using Ansible with DigitalOcean

Ansible comes with several modules that are able to connect to the DigitalOcean API to provision machines, manage keys, load balancers, snapshots and so forth. Here, we will need two modules – the module digital_ocean_droplet to deal with individual droplets, and digital_ocean_sshkey_facts to manage SSH keys.

Before getting our hands dirty, we need to talk about credentials first. There are three types of credentials involved in what follows.

  • To connect to the DigitalOcean API, we need to identify ourselves with a token. Such a token can be created on the DigitalOcean cloud console and needs to be stored safely. We do not want to put this into a file, but instead assume that you have set up an environment variable DO_TOKEN containing its value and access that variable.
  • The DigitalOcean SSH key. This is the key that we create once and upload (its public part, of course) to the DigitalOcean website. When we provision our machines later, we pass the name of the key as a parameter and DigitalOcean will automatically add this public key to the authorized_keys file of the root user so that we can access the machine via SSH. We will also use this key to allow Ansible to connect to our machines.
  • The user SSH key. As in previous examples, we will, in addition, create a default user on each machine. We need to create a key pair and also distribute the public key to all machines.

Let us first create these keys, starting with the DigitalOcean SSH key. We will use the key file name do-default-key and store the key in ~/.ssh on the control machine. So run

ssh-keygen -b 2048 -t rsa -P "" -f ~/.ssh/do-default-key
cat ~/.ssh/

to create the key and print out the public key.

Then navigate to the DigitalOcean security console, hit “Add SSH key”, enter the public key, enter the name “do-default-key” and save the key.

Next, we create the user key. Again, we use ssh-keygen as above, but this time use a different file name. I use ~/.ssh/default-user-key. There is no need to upload this key to DigitalOcean, but we will use it later in our playbook for the default user that we create on each machine.

Finally, make sure that you have a valid DigitalOcean API token (if not, go to this page to get one) and put the value of this key into an environment variable DO_TOKEN.

We are now ready to write the playbook to create droplets. Of course, we place this into a role to enable reuse. I have called this role droplet and you can find its code here.

I will not go through the entire code (which is documented), but just mention some of the more interesting points. First, when we want to use the DigitalOcean module to create a droplet, we have to pass the SSH key ID. This is a number, which is different from the key name that we specified while doing the upload. Therefore, we first have to retrieve the key ID. For that purpose, we retrieve a list of all keys by calling the module digital_ocean_sshkey_facts which will write its results into the ansible_facts dictionary.

- name: Get available SSH keys
    oauth_token: "{{ apiToken }}"

We then need to navigate this structure to get the ID of the key we want to use. We do this by preparing a JSON query, which we put together in several steps to avoid issues with quoting. First, we apply the query string [?name=='{{ doKeyName }}'], where doKeyName is the variable that holds the name of the DigitalOcean SSH key, to navigate to the entry in the key list representing this key. The result will still be a list from which we extract the first (and typically only) element and finally access its attribute id holding the key ID.

- name: Prepare JSON query string
    jsonQueryString: "[?name=='{{ doKeyName }}']"
- name: Apply query String to extract matching key data
      keyData: "{{ ansible_facts['ssh_keys']|json_query(jsonQueryString) }}"
- name: Get keyId
      keyId: "{{ keyData[0]['id'] }}"

Once we have that, we can use the DigitalOcean Droplet module to actually bring up the droplet. This is rather straightforward, using the following code snippet.

- name: Bring up or stop machines
    oauth_token: "{{ apiToken }}"
    image: "{{osImage}}"
    region_id: "{{regionID}}"
    size_id: "{{sizeID}}"
    ssh_keys: [ "{{ keyId }}" ]
    state: "{{ targetState }}"
    unique_name: yes
    name: "droplet{{ item }}"
    wait: yes
    wait_timeout: 240
    - "createdByAnsible"
  loop:  "{{ range(0, machineCount|int )|list }}"
  register: machineInfo

Here, we go through a loop, with the number of iterations being governed by the value of the variable machineCount, and, in each iteration, bring up one machine. The name of this machine is a fixed part plus the loop variable, i.e. droplet0, droplet1 and so forth. Note the usage of the parameter unique_name. This parameter instructs the module to check that the supplied name is unique. So if there is already a machine with that name, we skip its creation – this is what makes the task idempotent!

Once the machine is up, we finally loop over all provisioned machines once more (using the variable machineInfo) and complete two additional step for each machine. First, we use add_host to add the host to the inventory – this makes it possible to use them later in the playbook. For convenience, we also add an entry for each host to the local SSH configuration, which allows us to log into the machine using a shorthand like ssh droplet0.

Once the role droplet has completed, our main playbook will start a second and a third play. These plays now use the inventory that we have just built dynamically. The first of them is quite simple and just waits until all machines are reachable. This is necessary, as DigitalOcean reports a machine as up and running at a point in time when the SSH daemon might not yet accept conditions. The next and last play simply executes two roles. The first role is the role that we have already put together in an earlier post which simply adds a standard non-root user, and the second role simply uses apt and pip to install a few basic packages.

A few words on security are in order. Here, we use the DigitalOcean standard setup, which gives you a machine with a public IP address directly connected to the Internet. No firewall is in place, and all ports on the machine are reachable from everywhere. This is obviously not a setup that I would recommend for anything else than a playground. We could start a local firewall on the machine by using the ufw module for Ansible as a starting point, and then you would probably continue with some basic hardening measures before installing anything else (which, of course, could be automated using Ansible as well).

Using Ansible with AWS EC2

Let us now see what we need to change when trying the same thing with AWS EC2 instead of Digital Ocean. First, of course, the authorization changes a bit.

  • The Ansible EC2 module uses the Python Boto library behind the scenes which is able to reuse the credentials of an existing AWS CLI configuration. So if you follow my setup instructions in an earlier post on using Python with AWS, no further setup is required
  • Again, there is an SSH key that AWS will install on each machine. To create a key, navigate to the AWS EC2 console, select “Key pairs” under “Network and Security” in the navigation bar on the left, create a key called ec2-default-key, copy the resulting PEM file to ~/.ssh/ and set the access rights 0700.
  • The user SSH key – we can use the same key as before

As the EC2 module uses the Boto3 Python library, you might have to install it first using pip or pip3.

Let us now go through the steps that we have to complete to spin up our servers and see how we realize idempotency. First, to create a server on EC2, we need to specify an AMI ID. AMI-IDs change with every new version of an AMI, and typically, you want to install the latest version. Thus we first have to figure out the latest AMI ID. Here is a snippet that will do this.

- name: Locate latest AMI
    region: "eu-central-1"
    owners: 099720109477
      name: "ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-????????"
      state: "available"
  register: allAMIs
- name: Sort by creation date to find the latest AMI
    latestAMI: "{{ allAMIs.images | sort(attribute='creation_date') | last }}"

Here, we first use the module ec2_ami_facts to get a list of all AMIs matching a certain expression – in this case, we look for available AMIs for Ubuntu 18.04 for EBS-backed machines with the owner 099720109477 (Ubuntu). We then use a Jinja2 template to sort by creation date and pick the latest entry. The result will be a dictionary, which, among other things, contains the AMI ID as image_id.

Now, as in the DigitalOcean example, we can start the actual provisioning step using the ec2 module. Here is the respective task.

- name: Provision machine
    exact_count: "{{ machineCount }}"
    count_tag: "managedByAnsible"
    image: "{{ latestAMI.image_id }}"
      managedByAnsible: true
    instance_type: "{{instanceType}}"
    key_name: "{{ ec2KeyName }}"
    region: "{{ ec2Region }}"
    wait: yes
  register: ec2Result

Idempotency is realized differently on EC2. Instead of using the server name (which Amazon will assign randomly), the module uses tags. When a machine is brought up, the tags specified in the attribute instance_tags are attached to the new instance. The attribute exact_count instructs the module to make sure that exactly this number of instances with this tag is running. Thus if you run the playbook twice, the module will first query all machines with the specified tag, figure out that there is already an instance running and will not create any additional machines.

This simple setup will provision all new machines into the default VPC and use the default security group. This has the disadvantage that in order for Ansible to be able to reach these machines, you will have to open port 22 for incoming connections in the attached security group. The playbook does not do this, but expects that you do this manually using for instance the EC2 console. Again, this is of course not a production-like.

Running the examples

I have put the full playbooks with all the required roles into my GitHub repository. To run the examples, carry out the following steps.

First, prepare the SSH keys and API tokens for EC2 and DigitalOcean as explained in the respective sections. Then, use the EC2 console to allow incoming traffic for port 22 from your local workstation in the default VPC, into which our playbook will launch the machines created on EC2. Next, run the following commands to clone the repository and execute the playbooks (if you have not used the SSH key names and locations as above, you will first have to edit the default variables in the roles ec2Instance and droplet accordingly).

git clone
cd partVII
# Turn off strict host key checking
# Run the EC2 example
ansible-playbook awsSite.yaml
# Run the DigitalOcean example
ansible-playbook doSite.yaml

Once the playbooks complete, you should now be able to ssh into the machine created on DigitalOcean using ssh droplet0 and into the machine created on EC2 using ssh aws0. Do not forget to delete all machines again when you are done to avoid unnecessary cost!

Limitations of our approach

The approach we have followed so far to manage our cloud environments is simple, but has a few drawbacks and limitations. First, the provisioning is done on the localhost in a loop, and therefore sequentially. Even if the provisioning of one machine only takes one or two minutes, this implies that the approach does not scale, as bringing up a large number of machines would take hours. One approach to fix that that I have seen on Stackexchange (unfortunately I cannot find the post any more, so I cannot give proper credit) is to use a static inventory file with a dummy group containing only the host names in combination with connection: local to execute the provisioning locally, but in parallel.

The second problem is more severe. Suppose, for instance, you have already created a droplet and now decide that you want 2GB of memory instead. So you might be tempted to go back, edit (or override) the variable sizeID and re-run the playbook. This will not give you any error message, but in fact, it will not do anything. The reason is that the module will only use the host name to verify that the target state has been reached, not any other attributes of the droplet (take a look at the module source code here to verify this). Thus, we have a successful playbook execution, but a deviation between the target state coded into the playbook and the real state.

Of course this could be fixed by a more sophisticated check – either by extending the module, or by coding this check into the playbook. Alternatively, we could try to leverage an existing solution that has this logic already implemented. For AWS, for instance, you could try to define your target state using CloudFormation templates and then leverage the CloudFormation Ansible module to apply this from within Ansible. Or, if you want to be less provider specific, you might want to give Terraform a try. This is what I decided to do, and we will dive into more details how to integrate Terraform and Ansible in the next post.

Leave a Comment

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s