The nuts and bolts of Vertex AI – pipeline metadata

When you assemble and run a custom job, the Vertex AI platform is not aware of the inputs that you consume and the outputs that you create in this job. Consequently, it is up to you to create execution and artifact metadata and the corresponding relations between executions, input artifacts and output artifacts. For pipelines, the situation is different, as you explicitly declare the input and output artifacts of each component that is executed as part of a pipeline. You could therefore expect the platform to take over the task of reflecting this in the metadata, and in fact it does. Today, we take a closer look at the metadata that a pipeline run produces.

Pipeline jobs, executions and artifacts

As a starting point for what follows, let us first conduct a little experiment. Starting from the root directory of the cloned repository, run the following commands to clean up existing metadata (be careful, this will really delete all metadata in your Vertex AI project) and submit a pipeline.

cd pipelines
python3 ../metadata/list_metadata.py --delete
python3 submit_pipeline.py

Then wait a few minutes until the pipeline run is complete and display the metadata that this pipeline run has created.

python3 ../metadata/list_metadata.py --verbose

We can see that Vertex AI will create a couple of metadata objects automatically. First, we see artifacts representing the inputs and outputs of our components, like the model, the test and validation data and the metrics that we log. Then we see one execution of type system.ContainerExecution for each component that is being executed, and the artifacts are correctly linked to these executions.
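If you want to inspect these objects programmatically instead of using the list_metadata.py script, a minimal sketch with the aiplatform SDK could look like this; project, location and the filter values are assumptions based on the schema types mentioned above.

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# One execution of type system.ContainerExecution per executed component
for execution in aiplatform.Execution.list(
        filter='schema_title="system.ContainerExecution"'):
    print("Execution: ", execution.display_name)

# Artifacts created by the components, here the models
for artifact in aiplatform.Artifact.list(
        filter='schema_title="system.Model"'):
    print("Artifact: ", artifact.display_name, artifact.uri)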

There is also an execution of type system.Run that apparently represents the pipeline run. Its metadata contains information like the project and the pipeline run ID, but also the parameters that we have used when submitting the pipeline.

Finally, Vertex AI will create two nested contexts for us. The first context is of type system.PipelineRun and represents the run. All executions and artifacts for this run live inside this context. In addition, there is a context which represents the pipeline (type system.Pipeline), and the pipeline run is a child of this context. When we submit another run of the pipeline, a new run context will be created and will become a child of this pipeline context as well. Here is a diagram summarizing the structure that we can see so far.

Experiment runs and pipelines

Let us now see how this changes if we specify an experiment when submitting our pipeline (the submit method of the PipelineJob has a corresponding parameter). In this case, the SDK will locate the pipeline run context and associate it with the experiment, i.e. it will add it as a child of the metadata context that represents the experiment. Thus our pipeline run context has two parent contexts – the context representing the experiment and the context representing the pipeline.
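If you submit the pipeline directly via the SDK rather than through the submit_pipeline.py script, this looks roughly as follows; display name, template path and pipeline root are hypothetical placeholders.

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="my-pipeline",
    template_path="pipeline.json",
    pipeline_root="gs://my-bucket/pipeline_root",
)
# Passing an experiment associates the pipeline run context with the experiment
job.submit(experiment="my-experiment")

Let us try this using the scripts from the repository.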

python3 ../metadata/list_metadata.py --delete
python3 submit_pipeline.py --experiment=my-experiment
#
# Wait until completion
#
python3 ../metadata/list_metadata.py --verbose

This should give an output reflecting the following diagram.

If you now open the Vertex AI console and navigate to the “Experiments” tab, you will see our newly created experiment, and inside this experiment, there is a new run (the name of this run is a combination of the pipeline name and a timestamp). You can also see that the metadata attached to the pipeline execution, in particular the input parameters, shows up as parameters, and that all metrics that we log using log_metric on an artifact of type Metrics will automatically be displayed in the corresponding tab for the pipeline run. So, similar to what we have seen for custom jobs, you again have a central place from which you can access parameters, metrics and even the artifacts created during this pipeline run. This time, however, our code inside the pipeline components does not have to use any reference to the Google Cloud framework; we only interact with the KFP SDK, which makes local execution and unit testing a lot easier.
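To illustrate the KFP side of this, here is a minimal sketch of a component with a Metrics output; the component and the value it logs are made up for illustration.

from kfp import dsl
from kfp.dsl import Metrics, Output

@dsl.component(base_image="python:3.10")
def evaluate(metrics: Output[Metrics]):
    # In a real component, this value would be computed from the test data
    accuracy = 0.97
    # This metric will show up in the metrics tab of the pipeline run
    metrics.log_metric("accuracy", accuracy)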

Using tensorboards with Vertex AI pipelines

We have seen that logging individual metrics from within a pipeline is very easy – just add an artifact of type Metrics to your component and call log_metric on that, and Vertex AI will make sure that the data appears on the console. However, for time series, this is more complicated, as Vertex AI pipelines are apparently not yet fully integrated with the Vertex AI tensorboards.

What options do we have if we want to log time series data from within a component? One approach could be to simply use start_run(..., resume = True) to attach to the pipeline run and then log time series data as we have done from within a custom job. Unfortunately, that does not work, as start_run assumes that the run you are referring to is of type system.ExperimentRun, but our run is of type system.PipelineRun.

You could of course create a new experiment run and use that experiment run to log time series data. This works, but has the disadvantage that now every run of the pipeline will create two experiment runs on the console, which is at least confusing. Let us therefore briefly discuss how to use the tensorboard API directly to log data.

The tensorboard API is built around three main classes – a Tensorboard, a TensorboardExperiment and a TensorboardRun. A Tensorboard is simply that – an instance of a managed tensorboard on Vertex AI. Usually there is no need to create a tensorboard instance manually, as Vertex AI will make sure that there is a backing tensorboard if you create an experiment in the init function (the default is to use the same tensorboard as backing tensorboard for all of your experiments). You can access this tensorboard via the backing_tensorboard_resource_name attribute of an Experiment.
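As a minimal sketch (project, location and experiment name are placeholders, and I assume that the experiment has already been created with a backing tensorboard), retrieving this tensorboard could look like this.

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Get the tensorboard instance backing our existing experiment
experiment = aiplatform.Experiment("my-experiment")
tb = aiplatform.Tensorboard(experiment.backing_tensorboard_resource_name)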

Once you have access to the tensorboard instance, the next step is to create a tensorboard experiment. I am not sure how exactly Google has implemented this behind the scenes, but I tend to think of a tensorboard experiment as a logging directory in which all event files will be stored. If you make sure that the name of a tensorboard experiment matches the name of an existing Vertex AI experiment, then a link to the tensorboard will be displayed next to the experiment in the Vertex AI console. In addition, I found that there is a special label vertex_tensorboard_experiment_source that you will have to add with the value vertex_experiment to prevent the tensorboard experiment from being displayed as a separate line in the list of experiments.
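Continuing the sketch from above, creating a tensorboard experiment carrying this label could then look as follows; the experiment ID is again a placeholder and should match the name of the Vertex AI experiment.

# Create a tensorboard experiment whose ID matches the Vertex AI experiment
# and add the label so that it does not appear as a separate experiment
tb_experiment = aiplatform.TensorboardExperiment.create(
    tensorboard_experiment_id="my-experiment",
    tensorboard_name=tb.resource_name,
    labels={"vertex_tensorboard_experiment_source": "vertex_experiment"},
)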

Next, you will need to create a tensorboard run. This is very similar to an experiment run and – at least in a local tensorboard installation – corresponds to a directory in the logging dir where event files are stored.
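Again as part of our sketch, a tensorboard run inside this tensorboard experiment could be created like this; the run ID is a placeholder and would typically be derived from the pipeline run name.

# Create a tensorboard run inside the tensorboard experiment
tb_run = aiplatform.TensorboardRun.create(
    tensorboard_run_id="my-run",
    tensorboard_experiment_name=tb_experiment.resource_name,
)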

If you actually want to log time series data, you will do so by passing a label, for instance “loss”, a value and a step. Label and value are passed as a dictionary, and the step is an integer, so that a call looks like this.

#
# Create a tensorboard run
#
tb_run = ...
#
# Log time series data to it
#
tb_run.write_tensorboard_scalar_data(
    {"loss": 0.05},
    step=step,
)

All values with the same label – in our case “loss” – form a time series. However, before you can log data to a time series in this way, you will actually have to create a time series within the tensorboard run. This is a bit more complicated than it sounds, as creating a time series that already exists will fail, so you need to check upfront whether the time series exists. To reduce the number of API calls needed, you might want to keep track locally of which time series have already been created.
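The utility class discussed below takes care of this; just to illustrate the idea, a simplified helper could look like this. Note that I assume here that TensorboardTimeSeries.list accepts the run resource name as parent, so treat this as a sketch rather than production code.

from google.cloud import aiplatform

# Labels for which we already know that a time series exists
_known_series = set()

def log_scalar(tb_run, label, value, step):
    if label not in _known_series:
        # Check on the server whether the time series already exists
        existing = {
            ts.display_name
            for ts in aiplatform.TensorboardTimeSeries.list(
                tensorboard_run_name=tb_run.resource_name)
        }
        if label not in existing:
            tb_run.create_tensorboard_time_series(display_name=label)
        _known_series.update(existing)
        _known_series.add(label)
    tb_run.write_tensorboard_scalar_data({label: value}, step=step)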

In order to simplify the entire process and in order to allow for easier testing, I have put all of this into a utility class defined here. This class can be initialized once with the name of the experiment you want to use, a run name and your project ID and location and then creates the required hierarchy of objects behind the scenes. Note, however, that logging to a tensorboard is slow (I assume that the event file is stored on GCS so that appending a record is an expensive operation), so be careful not to log too many data points. In addition, even though our approach works well and gives you a convenient link to your time series in the Vertex AI console, the data you are logging in this way is not visible in the embedded view which is part of the Vertex AI console, but only in the actual tensorboard instance – I have not yet figured out how to make the data appear in the integrated view as well.

The standard pipeline that we use for our tests already uses tensorboard logging in this way if you submit it with an experiment name as above. Here is how the tensorboard instance will look once the training step (which logs the training loss) is complete.

Some closing remarks

Pipeline metadata is a valuable tool, but it does not relieve you from implementing additional mechanisms to allow for full traceability. If, for instance, you train a model on a dataset that you download from some location in your pipeline and then package and upload a new model version, the pipeline metadata will help you to reconstruct the pipeline run. However, the entry in the model registry has no obvious link to the pipeline run (except maybe the URI, which will typically be some location inside the pipeline root). You will still need to track the version of the Python code that defines your model and of the training script separately, and you also cannot rely on the pipeline metadata alone to document the version of data that originates from outside of Vertex AI.

I tend to think of pipeline metadata as a tool which is great for training – you can conduct experiment runs, attach metrics and evaluation results to runs, compare results across runs and so forth – but as soon as you deploy to production, you will need additional documentation.

You might, for instance, want to create a model card that you add to your model archive. This model card can be assembled as a Markdown or HTML artifact inside your pipeline, declared as a component output via Output[Markdown] (Vertex AI will even display the model card artifact for you).
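As an illustration, a sketch of such a component could look like this; the content and the accuracy value passed in are of course made up.

from kfp import dsl
from kfp.dsl import Markdown, Output

@dsl.component(base_image="python:3.10")
def create_model_card(accuracy: float, model_card: Output[Markdown]):
    # Write a very simple model card; a real card would contain much more
    # information on the data, the training process and the evaluation
    content = "\n".join([
        "# Model card",
        "",
        f"* accuracy on the test set: {accuracy:.3f}",
    ])
    with open(model_card.path, "w") as f:
        f.write(content)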

Of course this is just a toy example, and in general you might want to use a toolkit like the Google Model Card Toolkit to assemble your model card, typically using metadata and metrics that you have collected during your pipeline run. You can then distribute the model card to a larger group without depending on access to Vertex AI metadata and archive it, maybe even in your version control system.

Vertex AI metadata is also comparatively expensive. At the time of writing, Google charges 10 USD per GB and month. Even standard storage at European locations is around 2 cents per GB and month, and archive storage is even less expensive. So from time to time, you might want to clean up your metadata store and archive the data. As an example of how this could work, I have provided a script cleanup.py in the pipelines directory of my repository. This script removes all metadata (including experiments) as well as pipeline runs and custom job executions older than a certain number of days. In addition, you can choose to archive the artifact lineage into a file. Here is an example which will remove all data older than 5 days and write the lineage information into archive.dat.

python3 cleanup.py --retention_days=5 --archive=archive.dat

This closes our blog post for today. In the next post, we will turn our attention away from pipelines to networking and learn how you can connect jobs running on Vertex AI to your own VPCs and vice versa.
