Understanding cloud-init

For a recent project using Ansible to define and run KVM virtual machines, I had to debug an issue with cloud-init. This triggered me to do a bit of research on how cloud-init operates, and I decided to share my findings in this post. Note that this post is not a tutorial on how to use cloud-init or how to prepare configuration data, but an introduction to the structure and inner workings of cloud-init, which should put you in a position to understand most of the available official documentation.

Overview

Before getting into the details of how cloud-init operates, let us try to understand the problem it is designed to solve. Suppose you are running virtual machines in a cloud environment, booting off a standardized image. In most cases, you will want to apply some instance-specific configuration to your machine, which could be creating users, adding SSH keys, installing packages and so forth. You could of course try to bake this configuration into the image, but this easily leads to a large number of different images you need to maintain.

A more efficient approach would be to write a little script which runs at boot time and pulls instance-specific configuration data from an external source, for instance a little web server or a repository. This data could then contain things like SSH keys or lists of packages to install or even arbitrary scripts that get executed. Most cloud environments have a built-in mechanism to make such data available, either by running an HTTP GET request against a well-known IP or by mapping data as a drive (to dig deeper, you might want to take a look at my recent post on meta-data in OpenStack as an example of how this works). So our script would need to figure out in which cloud environment it is running, get the data from the specific source provided by that environment and then trigger processing based on the retrieved data.

This is exactly what cloud-init is doing. Roughly speaking, cloud-init consists of the following components.

(Figure: overview of the cloud-init components)

First, there are data sources that contain instance-specific configuration data, like SSH keys, users to be created, scripts to be executed and so forth. Cloud-init comes with data sources for all major cloud platforms, and a special “no-cloud” data source for standalone environments. Typically, data sources provide four different types of data, which are handled as YAML structures, i.e. essentially nested dictionaries, by cloud-init.

  • User data which is defined by the user of the platform, i.e. the person creating a virtual machine
  • Vendor data which is provided by the organization running the cloud platform. Actually, cloud-init will merge user data and vendor data before running any modules, so that user data can overwrite vendor data
  • Network configuration which is applied at an early stage to set up networking
  • Meta data which is specific to an instance, like a unique ID of the instance (used by cloud-init to figure out whether it runs the first time after a machine has been created) or a hostname

Cloud-init is able to process different types of data in different formats, like YAML formatted data, shell scripts or even Jinja2 templates. It is also possible to mix data by providing a MIME multipart message to cloud-init.
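
As a simple illustration, user data in cloud-config format could look like the following snippet, which would install a package, add an SSH key for the default user and run a command at the end of the boot process (the package name and the key are of course just placeholders).

#cloud-config
packages:
  - nginx
ssh_authorized_keys:
  - ssh-ed25519 AAAA... user@example.com
runcmd:
  - echo "provisioned by cloud-init" > /etc/motd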

This data is then processed by several functional blocks in cloud-init. First, cloud-init has some built-in functionality to set up networking, i.e. assigning IP addresses and bringing up interfaces. This runs very early, so that additional data can be pulled in over the network. Then, there are handlers which are code snippets (either built into cloud-init or provided by a user) which run when a certain format of configuration data is detected – more on this below. And finally, there are modules which provide most of the cloud-init functionality and take the result of reading from the data sources as input. Examples of cloud-init modules are listed below (a sample configuration addressing some of these modules follows the list).

  • Bootcmd to run a command during the boot process
  • Growpart which will resize partitions to fit the actual available disk space
  • SSH to configure host keys and authorized keys
  • Users and groups to create or modify users and groups
  • Phone home to make an HTTP POST request to a defined URL, which can for instance be used to inform a config management system that the machine is up
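
To see how these modules tie back to the configuration data, here is a hypothetical cloud-config snippet containing keys that would be picked up by some of the modules just listed (the URL is made up for illustration purposes).

#cloud-config
bootcmd:
  - echo "bootcmd was here" > /run/bootcmd-marker
growpart:
  mode: auto
  devices: ['/']
phone_home:
  url: http://192.168.122.1:8000/ready
  post: all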

These modules are not executed sequentially, but in four stages. These stages are linked into the boot process at different points, so that a module that, for instance, requires fully mounted file systems and networking can run at a later point than a module that performs an action very early in the boot process. The modules that will be invoked and the stages at which they are invoked are defined in a static configuration file /etc/cloud/cloud.cfg within the virtual machine image.
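
For reference, the module lists in /etc/cloud/cloud.cfg are plain YAML lists, one per stage. A heavily shortened excerpt might look like this (the actual lists shipped with a distribution are much longer and vary between distributions).

cloud_init_modules:
  - bootcmd
  - growpart
  - ssh
cloud_config_modules:
  - runcmd
  - timezone
cloud_final_modules:
  - scripts-user
  - phone-home
  - final-message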

Having discussed the high-level structure of cloud-init, let us now dig into each of these components in a bit more detail.

The startup procedure

The first thing we have to understand is how cloud-init is integrated into the system startup procedure. Of course this is done using systemd; however, there is a little twist. As cloud-init needs to be able to enable and disable itself at runtime, it does not use a static systemd unit file, but a generator, i.e. a little script which runs at an early boot stage and creates systemd units and targets. This script (which you can find here as a Jinja2 template that is evaluated when cloud-init is installed) goes through a couple of checks to see whether cloud-init should run and whether there are any valid data sources. If it finds that cloud-init should be executed, it adds a symlink to make the cloud-init target a precondition for the multi-user target.

When this has happened, the actual cloud-init execution takes place in several stages. Each of these stages corresponds to a systemd unit, and each invokes the same executable (/usr/bin/cloud-init), but with different arguments.

Unit               Invocation
cloud-init-local   /usr/bin/cloud-init init --local
cloud-init         /usr/bin/cloud-init init
cloud-config       /usr/bin/cloud-init modules --mode=config
cloud-final        /usr/bin/cloud-init modules --mode=final

The purpose of the first stage is to evaluate local data sources and to prepare the networking, so that modules running in later stages can assume a working networking configuration. The second, third and fourth stage then run specific modules, as configured in /etc/cloud/cloud.cfg at several points in the startup process (see this link for a description of the
various networking targets used by systemd).

The cloud-init executable which is invoked here is in fact a Python script, which runs the entry point “cloud-init”, resolving (via the usual setup.py entry point mechanism) to this function. From here, we then branch into several other functions (main_init for the first and second stage, main_modules for the third and fourth stage) which do the actual work.

The exact processing steps during each stage depend a bit on the configuration and data source as well as on previous executions due to caching. The following diagram shows a typical execution using the no-cloud data source in four stages and indicates the respective processing steps.

(Figure: typical processing steps in the four cloud-init stages)

For other data sources, the processing can be different. On EC2, for instance, the EC2 datasource will itself set up a minimal network configuration to be able to retrieve metadata from the EC2 metadata service.

Data sources

One of the steps that takes place in the functions discussed above is to locate a suitable data source that cloud-init uses to obtain user data, meta data, and potentially vendor data and networking configuration data. A data source is an object providing this data, for instance the EC2 metadata service when running on AWS. Data sources have dependencies on either a file system or a network or both, and these dependencies determine at which stages a data source is available. All existing data sources are kept in the directory cloudinit/sources, and the function find_sources in this package is used to identify all data sources that can be used at a certain stage. These data sources are then probed by invoking their update_metadata method. Note that data sources typically maintain a cache to avoid having to re-read the data several times (the caches are in /run/cloud-init and in /var/lib/cloud/instance/).

How the actual data retrieval is done (encoded in the data source specific method _get_data) of course depends on the data source. The “NoCloud” data source, for instance, parses DMI data, the kernel command line, seed directories and information from file systems with a specific label, while the EC2 data source makes an HTTP GET request to 169.254.169.254.
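
In the NoCloud case, the seed typically consists of two files called user-data and meta-data, placed for instance on an ISO image with the volume label cidata. A minimal meta-data file could look like this (instance ID and hostname are arbitrary examples).

instance-id: iid-local-01
local-hostname: cloud-init-test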

Network setup

One of the key activities that takes place in the first cloud-init stage is to bring up the network via the method apply_network_config of the Init class. Here, a network configuration can come from various sources, including – depending on the type of data source – an identified data source. Other possible sources are the kernel command line, the initramfs, or a system wide configuration. If no network configuration can be found, a distribution specific fallback configuration is applied which, for most distributions, is defined here. An example of such a fallback configuration on an Ubuntu guest looks as follows.

{
  'ethernets': {
     'ens3': {
       'dhcp4': True, 
       'set-name': 'ens3', 
       'match': {
         'macaddress': '52:54:00:28:34:8f'
       }
     }
   }, 
  'version': 2
}

Once a network configuration has been determined, a distribution specific method apply_network_config is invoked to actually apply the configuration. On Ubuntu, for instance, the network configuration is translated into a netplan configuration and stored at /etc/netplan/50-cloud-init.yaml. Note that this only happens if we are either in the first boot cycle for an instance or the data source has changed. This implies, for instance, that if you attach a disk to a different instance with a different MAC address, the network setup in the new instance might fail: the instance ID is cached on disk and not refreshed, so the netplan configuration is not recreated and the stale MAC address persisted in it is not updated. This is only one example of the subtleties that cached instance-specific data can cause – so if you ever clone a disk, make sure to use a tool like virt-sysprep to remove this sort of data.
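
Assuming the fallback configuration shown above, the rendered file /etc/netplan/50-cloud-init.yaml would look roughly like the following (cloud-init also writes a comment header explaining that the file is generated and how to disable this behaviour).

network:
    version: 2
    ethernets:
        ens3:
            dhcp4: true
            match:
                macaddress: '52:54:00:28:34:8f'
            set-name: ens3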

Handlers and modules

User data and meta data for cloud-init can be in several different formats, which can actually be mixed in one file or data source. User data can, for instance, be in cloud-config format (starting with #cloud-config), a shell script (starting with #!) which will simply be executed at startup, or even a Jinja2 template. Some of these formats require pre-processing, and doing this is the job of a handler.
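
As an example for the template format, user data using Jinja2 might look roughly like this: the first line tells cloud-init to render the file as a template, and the variables refer to the instance data that cloud-init collects (which variables are available depends on the data source and the cloud-init version).

## template: jinja
#cloud-config
runcmd:
  - echo "local hostname is {{ v1.local_hostname }}" > /etc/motd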

To understand handlers, it is useful to take a closer look at how the user data is actually retrieved from a data source and processed. We have mentioned above that in the first cloud-init stage, different data sources are probed by invoking their update_metadata method. When this happens for the first time, a data source will actually pull the data and cache it.

In the second processing stage, the method consume_data of the Init object will be invoked. This method will retrieve user data and vendor data from the data source and invoke all handlers. Each handler is called several times – once initially, once for every part of the user data and once at the end. A typical example is the handler for the cloud-config format, which (in its handle_part method) will, when called first, reset its state, then, during the intermediate calls, merge the received data into one big configuration and, in the final call, write this resulting configuration to a file for later use.

Once all handlers are executed, it is time to run all modules. As already mentioned above, modules are executed in one of the defined stages. However, modules are also executed with a specified frequency. This mechanism can be used to make sure that a module executes only once, or until the instance ID changes, or during every boot process. To keep track of the module execution, cloud-init stores special files called semaphore files in the directory /var/lib/cloud/instance/sem and (for modules that should execute independently of an instance) in /var/lib/cloud. When cloud-init runs, it retrieves the instance-ID from the metadata and creates or updates a link from /var/lib/cloud/instance to a directory specific for this instance, to be able to track execution per instance even if a disk is re-mounted to a different instance.
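
The frequency is defined together with the module lists in /etc/cloud/cloud.cfg: an entry can either be just the module name, in which case the default frequency of that module applies, or a list with an explicit frequency. A hypothetical excerpt could look like this.

cloud_final_modules:
  - scripts-user
  - [phone-home, always]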

Technically, a module is a Python module in the cloudinit.config package. Running a module simply means that cloud-init is going to invoke the function handle in the corresponding module. A nice example is the scripts_user module. Recall that the script handler discussed above will extract scripts (parts starting with #!) from the user data and place them as executable files in a directory on the file system. The scripts_user module will simply go through this directory and run all scripts located there.

Other modules are more complex, and typically perform an action depending on a certain set of keys in the configuration merged from user data and vendor data. The SSH module, for instance, looks for keys like ssh_deletekeys (which, if present, causes the deletion of existing host keys), ssh_keys to define keys which will be used as host keys and ssh_authorized_keys which contains the keys to be used as authorized keys for the default user. In addition, if the meta data contains a key public-keys containing a list of SSH keys, these keys will be set up for the default user as well – this is the mechanism that AWS EC2 uses to pull the SSH keys defined during machine creation into the instance.
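
In cloud-config format, using these keys might look like the following snippet (the authorized key is a placeholder, and the ssh_keys section with actual host key material is omitted here).

#cloud-config
ssh_deletekeys: true
ssh_genkeytypes: ['ed25519', 'rsa']
ssh_authorized_keys:
  - ssh-ed25519 AAAA... admin@example.com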

Debugging cloud-init

When you take a quick look at the source code of cloud-init, you will probably be surprised by the complexity that this quite powerful toolset has reached. As a downside, its behaviour can sometimes be unexpected, and you need ways to debug cloud-init.

If cloud-init fails completely, the first thing you need to do is to find an alternative way to log into your machine, either using a graphical console or an out-of-band mechanism, depending on what your cloud platform offers (or you might want to use a local testbed based on e.g. KVM, using an ISO image to provide user data and meta data – you might want to take a look at the Ansible scripts that I use for that purpose).

Once you have access to the machine, the first step is to use systemctl status to see whether the various services (see above) behind cloud-init have been executed, and journalctl --unit=… to get their output. Next, you can take a look at the log files and state files written by cloud-init. Here are a few files that you might want to check.

  • /var/log/cloud-init.log – the main log file
  • /var/log/cloud-init-output.log – contains additional output, like the network configuration or the output of ssh-keygen
  • The various state files in /var/lib/cloud/instance and /run/cloud-init/ which contain semaphores, downloaded user data and vendor data and instance metadata.

Another very useful option is the debug module. This is a module (not enabled by default) which will simply print out the merged configuration, i.e. meta data, user data and vendor data. Like other modules, this module is configured by supplying a YAML structure. To try this out, simply add the following lines to /etc/cloud/cloud.cfg.

debug:
  output: /var/log/cloud-init-debug.log
  verbose: true

This will instruct the module to print verbose output into the file /var/log/cloud-init-debug.log. As the module is not enabled by default, we either have to add it to the module lists in /etc/cloud/cloud.cfg and reboot or – much easier – use the single switch of the cloud-init executable to run only this module.

cloud-init single \
  --name debug \
  --frequency always

When you now look at the contents of the newly created file /var/log/cloud-init-debug.log, you will see the configuration that the cloud-init modules have actually received, after all merges are complete. With the same single switch, you can also re-run other modules (as in the example above, you might need to override the frequency to enforce the execution of the module).

And of course, if everything else fails: cloud-init is written in Python, so the source code is there, typically in /usr/lib/python3/dist-packages. So you can modify the source code and add debugging statements as needed, and then use cloud-init single to re-run specific modules. Happy hacking!

1 Comment

  1. Armin Doerzbach says:

    I like your post. I just wanted to mention one detail:
    The start process begins with /usr/lib/systemd/system-generators/cloud-init-generator, which enables or disables all the services involved in cloud-init and also checks for valid data sources by calling /usr/lib/cloud-init/ds-identify, which does some “mysterious” checks on files in /sys/ and other things to find out which cloud the VM is running on and then populates /run/cloud-init/cloud.cfg with this information.
