The nuts and bolts of VertexAI – metadata, logging and metrics

One of the core functionalities of every machine learning platform is traceability – we want to be able to track artifacts like models, training jobs and input data and tie all of this together so that, given a model version, we can go all the way back to the training data that was used to train that version of the model. On VertexAI, this is handled via metadata, which is what we will discuss today.

Metadata items and their relations

Let us start with a short introduction to the data model behind metadata. First, all metadata is kept behind the scenes in a metadata store. However, this is not simply a set of key-value pairs. Instead, the actual metadata is attached to three types of entities that populate the metadata store – artifacts, executions and contexts. Each of these entities has a resource name, following the usual naming conventions (with the metadata store as the parent), and a metadata field which holds the actual metadata that you want to store. In addition, each item has a display name and refers to a schema – we will come back to this later – plus a few fields specific to the respective entity type.

Why these three types of metadata? The idea is that artifacts are the primary objects of interest. An artifact can be used as an input for a training run, like a dataset, or can be an output, like a model. Executions are the processes that consume and produce artifacts. Finally, there might be a need to group several executions together because they are somehow related, and this is the purpose of a context (at first glance, this seems to be based on MLMetadata which is part of TFX).

To model these ideas, the different types of metadata entities have relations that can be set and queried via API calls – together they form a graph, called the lineage graph. For instance, an execution has a method assign_input_artifacts to build a relation between an artifact and an execution, and a method get_input_artifacts to query for these artifacts. Other types of entities have different relations – we can add an execution to a context, and we can also build context hierarchies as a context can have children. Here is a graphical overview of the various relations that our metadata entities have.

Schemata, experiments and experiment runs

So far, this has been an abstract discussion. To make this concrete, we need to map this general pattern to the objects we are usually working with. An artifact, for instance, could be a model or a dataset – how do we distinguish between these types?

Instead of introducing dedicated objects for these different types of artifacts, VertexAI uses schemas. A schema describes the layout of the metadata but also serves to identify what type of entity we are looking at. Here is an API request that you can submit to get a list of all supported schemas.

TOKEN=$(gcloud auth print-access-token)
ENDPOINT="https://$GOOGLE_REGION-aiplatform.googleapis.com"
PARENT="projects/$GOOGLE_PROJECT_ID/locations/$GOOGLE_REGION"
curl \
  -H "Authorization: Bearer $TOKEN" \
  $ENDPOINT/v1/$PARENT/metadataStores/default/metadataSchemas

Not every schema makes sense for every type of metadata entity. In the output of the statement above, you will find a schemaType field for each of the schemas. This field tells us whether the respective schema describes an artifact, an execution or a context.

In the SDK code, the schemas that can be used for each entity type are encoded in the modules in aiplatform.metadata.schema.system. There is, for instance, a module artifact_schema.py which contains all schemas that can be used for artifacts (there are, however, some schemas in the output of the REST API that we call above which do not appear in the source code). If we go through these modules and map artifact types to schemas, we get the following picture.
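If you want to quickly see which schema titles the SDK knows about without digging through the source code, you can also inspect these modules programmatically. Here is a short sketch (doing this for the artifact schemas only; the execution and context modules work the same way):

import inspect

from google.cloud.aiplatform.metadata.schema.system import artifact_schema

# print the schema title for every schema class defined in artifact_schema.py
for name, cls in inspect.getmembers(artifact_schema, inspect.isclass):
    if cls.__module__ != artifact_schema.__name__:
        continue  # skip classes imported from other modules
    schema_title = getattr(cls, "schema_title", None)
    if schema_title is not None:
        print(f"{name}: {schema_title}")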

Let us quickly discuss the most important types of schemas. First, for the schema type Artifact, we see some familiar terms – there is the generic artifact type, there are models, datasets and metrics.

Let us now turn to executions. As already explained, an execution represents an actual job run. We see a custom job execution as well as a “Run” (which seems to be a legacy type from an earlier version of the platform) and a “ContainerExecution”. For contexts, we see four different options. Today, we will talk about experiments and experiment runs and leave pipelines and pipeline runs to a later post.

An experiment is supposed to be a group of executions that somehow belong together. Suppose, for instance, you want to try out different model architectures. For that purpose, you build a job to train the model and a second job to evaluate the outcomes. Then each architecture that you test would correspond to an experiment run. For each run, you execute the two jobs so that each run consists of two executions. Ideally, you want to be able to compare results and parameters for the different runs (if you have ever taken a look at MLFlow, this will sound familiar).

This fits nicely into our general pattern. We can treat our trained model as an artifact and associate it with the execution representing the training job. We can then associate the executions with the experiment run, which is modelled as a context. The experiment is a context as well and is the parent of the experiment run, and everything is tied together by the relations between the different entities sketched above.

Creating metadata entities

Time to try this out. Let us see how we can use the Python SDK to create metadata entities in our metadata store. We start with artifacts. An artifact is represented by an instance of the class aiplatform.metadata.artifact.Artifact. This class has a create method that we can simply invoke to instantiate an artifact.

# imports used in this and the following snippets
from google.cloud import aiplatform as aip
from google.cloud.aiplatform.metadata.schema.system import artifact_schema, context_schema, execution_schema

artifact = aip.metadata.artifact.Artifact.create(
    schema_title = artifact_schema.Artifact.schema_title,
    uri = f"gs://vertex-ai-{google_project_id}/artifacts/my-artifact",
    display_name = "my-artifact",
    project = google_project_id,
    location = google_region
)

Note the URI which is specific to the artifact. The idea of this field is of course to store the location of the actual artifact, but this is really just a URI – the framework will not make an attempt to check whether the bucket even exists.

Similarly, we can create an execution which is an instance of execution.Execution in the metadata package. There is a little twist, however – this fails with a gRPC error unless you explicitly set the credentials to None (I believe this happens because the credentials parameter is treated as optional in the create method but no default, not even None, is declared). So you need something like

execution = aip.metadata.execution.Execution.create(
    display_name = "my-execution",
    schema_title = execution_schema.CustomJobExecution.schema_title,
    project = google_project_id, 
    location = google_region,
    credentials = None
)

Once we have an execution and an artifact, we can now assign the artifact as either input or output. Let us declare our artifact as an output of the execution.

execution.assign_output_artifacts([artifact])    

Next, we create the experiment and the experiment run as contexts. Again there is a subtle point – we want our experiment to show up in the console as well, and for this to work, we have to declare a metadata field experiment_deleted. So the experiment is created as follows.

parent_context = aip.metadata.context.Context.create(
    schema_title = context_schema.Experiment.schema_title,
    display_name = "my-experiment",
    project = google_project_id,
    location = google_region,
    metadata = {"experiment_deleted": False}
)

We create the experiment run similarly (here no metadata is required) and then call add_context_children on the experiment context to declare the run a subcontext of the experiment. Finally, we tie the execution to the experiment run context via add_artifacts_and_executions. The full code can be found in create_metadata.py in the metadata directory of my repository.
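In condensed form, these remaining steps look roughly like this – a sketch based on the method names above; I am assuming here that context_schema exposes an ExperimentRun class analogous to Experiment and that add_artifacts_and_executions accepts the resource names as keyword arguments, so double-check against your SDK version.

# create the experiment run as a second context (no metadata required here)
run_context = aip.metadata.context.Context.create(
    schema_title = context_schema.ExperimentRun.schema_title,
    display_name = "my-experiment-run",
    project = google_project_id,
    location = google_region
)

# declare the run a subcontext of the experiment ...
parent_context.add_context_children([run_context])

# ... and tie the execution (and, if we want, artifacts) to the run
run_context.add_artifacts_and_executions(
    execution_resource_names = [execution.resource_name],
    artifact_resource_names = [artifact.resource_name]
)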

It is instructive to use the Vertex AI console to understand what happens when running this. So after executing the script, let us first navigate to the “Experiments” tab in the section “Model Development”. In the list of experiments, you should now see an entry called “my-experiment”. This entry reflects the context of type system.Experiment that we have created. If you click on this entry, you will be taken to a list of experiment runs where the subcontext of type system.ExperimentRun shows up. So the console nicely reflects the hierarchical structure that we have created. You could now take a look at the metrics created by this experiment run – we will see a bit later how this works.

Next, navigate to the “Metadata” tab. On this page, you will find all metadata entries of type “Artifact”. In particular, you should see the entry “my-artifact” that we have created. Clicking on it yields a graphical representation of the relation between the artifact and the execution that has created it, exactly as we have modelled it in our code.

This graph is still very straightforward in our case, but will soon start to become useful if we have more complex dependencies between various jobs, their inputs and outputs.

To inspect the objects that we have just created in more detail, I have put together a script list_metadata.py that uses the list class methods of the various classes involved to fetch all artifacts, executions and contexts and also prints out the relations between them. This script also has a flag --verbose to produce more output as well as a flag --delete which will delete all entries. The latter is useful for cleaning up – the metadata store grows quickly and needs to be purged on a regular basis (if you only want to clean up, you might also want to use the purge method to avoid too many API calls; remember that there is a quota on the API).
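The listing part of the script essentially boils down to the list class methods, roughly like this (a condensed sketch – I assume that list accepts project and location just like create does):

# list all artifacts, executions and contexts in the default metadata store
for artifact in aip.metadata.artifact.Artifact.list(
        project = google_project_id, location = google_region):
    print(f"Artifact: {artifact.display_name} ({artifact.schema_title})")

for execution in aip.metadata.execution.Execution.list(
        project = google_project_id, location = google_region):
    print(f"Execution: {execution.display_name}")
    for output in execution.get_output_artifacts():
        print(f"  produced: {output.display_name}")

for context in aip.metadata.context.Context.list(
        project = google_project_id, location = google_region):
    print(f"Context: {context.display_name} ({context.schema_title})")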

Tracking metadata lineage

In the previous section, we have seen how we can create generic artifacts, contexts and executions. Usually, however, this is not how the metadata lineage is actually built. Instead, the SDK offers a few convenience functions to track artifacts (or, as in the case of pipelines, does all of this automatically).

Let us start with experiments. Typically, experiments are created by adding them as a parameter when initializing the library.

aip.init(project = google_project_id, 
         location = google_region,
         experiment = "my-experiment")

Behind the scenes, this will result in a call to the set_experiment method of an object called the experiment tracker (which is initialized at module import time). This method will create an instance of a dedicated Experiment class defined in aiplatform.metadata.experiment_resources, create a corresponding context whose resource name equals the experiment name, and tie the experiment and the context together (if the experiment already exists, the existing context is reused). So while the context exists on the server side and is accessed via the API, the experiment is a purely local object managed by the SDK (at the time of writing, with version 1.39 of the SDK, there are comments in the code suggesting that this is going to change).

Next, we will typically start an experiment run. For that purpose, the API offers the start_run function which calls the corresponding method of the experiment tracker. This method is prepared to act as a Python context manager so that the run is started and completed automatically.

This method will again create an SDK object, this time an instance of ExperimentRun. The experiment run will then be added as a child to the experiment, and in addition the experiment run is associated directly with the experiment. So at this point, the involved entities are related as follows.

Note that if a run already exists, you can attach to it by passing resume = True as an additional parameter when you call start_run.
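In code, this looks like the following:

with aip.start_run(run = "my-experiment-run", resume = True) as experiment_run:
    # the experiment tracker now points to the existing run again
    ...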

We now have an experiment and an experiment run. Next, we typically want to create an execution. Again there is a convenience function start_execution implemented by the experiment tracker which will create an execution object that can be used as a context manager. In addition, this wraps the assign_input_artifacts and assign_output_artifacts methods of this execution so that all artifacts which will be attached to this execution will automatically also be attached to the experiment run. Here is a piece of code that brings all of this together (see the script create_lineage.py in the repository).

aip.init(project = google_project_id, 
         location = google_region,
         experiment = "my-experiment")

#
# do some training
#
...
#
# Start run
#
with aip.start_run(run = "my-experiment-run") as experiment_run:
    with aip.start_execution(display_name = "my-execution",
                             schema_title = "system.CustomJobExecution") as execution:

        # Reflect model in metadata and assign to execution
        #
        model  = aip.metadata.artifact.Artifact.create(
            schema_title = artifact_schema.Model.schema_title,
            uri = f"gs://vertex-ai-{google_project_id}/models/my-models",
            display_name = "my-model",
            project = google_project_id,
            location = google_region
        )
        execution.assign_output_artifacts([model])

Note that we start the run only once the actual training is complete, in alignment with the recommendation in the MLFlow Quickstart, to avoid the creation of invalid runs due to errors during the training process.

Logging metrics, parameters and time series data

So far, we have seen how we can associate artifacts with executions and build a lineage graph connecting experiments, experiment runs, executions and artifacts. However, a typical machine learning job will of course log more data, specifically parameters, training metrics like evaluation results and time series data like a training loss per epoch. Let us see what the SDK can offer for these cases.

Similar to functions like start_run which are defined on the package level, the actual implementation of the various logging functions is again a part of the experiment tracker. The first logging function that we will look at is log_params. This allows you to log a dictionary of parameter names and values (which must be integers, floats or strings). Behind the scenes, this will simply look up the experiment run, navigate to its associated metadata context and add the parameters to the metadata of this context, using the key _params. Data logged in this way will show up on the console when you display the details of an experiment run and select the tab “Parameters”. In Python, you can access the parameters by reconstructing the experiment run from the experiment name and the experiment run name and calling get_params on it.

experiment_run = aip.ExperimentRun(
    run_name = "my-experiment-run",
    experiment = "my-experiment")
parameters = experiment_run.get_params()

Logging metrics with log_metrics is very similar, the only difference being that a different key in the metadata is used.
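As a quick illustration (the metric names and values are of course just placeholders):

# inside an active run (i.e. within a start_run block), log evaluation results
aip.log_metrics({
    "accuracy" : 0.93,
    "eval_loss" : 0.27
})

# read them back later via a reconstructed experiment run, analogous to get_params
metrics = experiment_run.get_metrics()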

Logging time series data is a bit more interesting, as here a tensorboard instance comes into play. If you specify an experiment in the aiplatform.init function, this will invoke the method set_experiment of the experiment tracker. This method will, among the other things that we have already discussed, create a tensorboard instance and assign it as the so-called backing tensorboard of the experiment. When an experiment run is created, an additional artifact representing what is called the tensorboard run is created as well and associated with the experiment run (you might already have spotted this in the list of artifacts on the console). This tensorboard run is then used to log time series data, which then appears both on the console and in a dedicated tensorboard instance that you can reach by clicking on “Open Tensorboard” in the console. Note that Google charges 10 USD per GB of data per month for this instance, so you will want to clean up from time to time.
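Inside an active run, logging time series data then looks roughly like this (a sketch, with train_one_epoch being a hypothetical training function returning the epoch loss):

# log the training loss per epoch; this ends up in the tensorboard run backing the experiment run
for epoch in range(epochs):
    loss = train_one_epoch(model)
    aip.log_time_series_metrics({"loss" : loss}, step = epoch)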

Tensorboard experiment, tensorboard runs and logging to a tensorboard can also be done independently of the other types of metadata – we will cover this in a future post after having introduced pipelines.

Let us briefly touch upon two more advanced logging features. First, there is a function log_model which will store the model in a defined GCS bucket and create an artifact pointing to this model. This only works for a few explicitly supported ML frameworks. Similarly, there is a feature called autologging which automatically turns on MLFlow autologging, but again this only works if you use a framework which supports this, like PyTorch Lightning.
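In the supported cases, usage is roughly as follows (a sketch only – I have not verified the exact signatures, so treat the calls below as an assumption and consult the SDK documentation for your framework and version):

# turn on autologging - only effective for frameworks with MLFlow autologging support
aip.autolog()

# ... train a model with a supported framework, e.g. scikit-learn ...

# store the trained model on GCS and attach a corresponding artifact to the current run
aip.log_model(model)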

Associating a custom job with an experiment run

Let us now put everything that we have learned so far together and write a custom job which logs parameters, metrics and time series data into an experiment run. As we could, in general, have more than one job under an experiment run, our general strategy is as follows.

First, still outside of the job, we add the experiment parameter when initializing the framework, which will create the experiment (if it does not yet exist) and add it to the experiment tracker as seen above. Similarly, we start an experiment run – as the name of an experiment run needs to be unique (it is used as an ID internally), we build the experiment run name as a combination of the experiment name and a timestamp.
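The run name could, for instance, be built like this:

import time

# unique run name - experiment name plus a timestamp
timestamp = time.strftime("%Y%m%d-%H%M%S")
run_name = f"my-experiment-{timestamp}"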

Inside the actual training job, we can then call start_run again, this time passing resume = True so that the experiment tracker again points to our previously created run. We can then start an execution to which we can associate metadata, and we can log into the run as demonstrated above.

This approach is supported by the parameters experiment and experiment_run of the submit method (or the run method) of a custom job. When we set these parameters, the entries AIP_EXPERIMENT_NAME and AIP_EXPERIMENT_RUN_NAME will be added to the environment of the job, so that within the job, we can retrieve the name of the experiment and of the experiment run from there. In addition, the name of the custom job will be added as a metadata item to the context representing the run, using the key custom_jobs. So our code to prepare and then run (or submit) the job would be something like

EXPERIMENT = "my-experiment"
aip.init(project = google_project_id,
         location = google_region,
         experiment = EXPERIMENT)
...
with aip.start_run(run_name) as experiment_run:
    job = aip.CustomJob.from_local_script(
        display_name = "my-job",
        script_path = "train.py",
        container_uri = image,
        machine_type  = "n1-standard-4",
        base_output_dir = f"gs://{staging_bucket}/job_output/{timestamp}",
        project = google_project_id,
        location = google_region,
        staging_bucket = staging_bucket,
        environment_variables = {
            "GOOGLE_PROJECT_ID" : google_project_id,
            "GOOGLE_REGION" : google_region
        }
    )
    job.run(
        service_account = service_account,
        experiment = EXPERIMENT, 
        experiment_run = run_name
    )

Note that we also pass the Google project ID and the location to the training job as environment variables so that the job can call init with the same parameters. Within the training code, we would then do something like

experiment_name = os.environ.get("AIP_EXPERIMENT_NAME")
experiment_run_name = os.environ.get("AIP_EXPERIMENT_RUN_NAME")
# project and region are passed in as environment variables by the submitting script
google_project_id = os.environ.get("GOOGLE_PROJECT_ID")
google_region = os.environ.get("GOOGLE_REGION")

aip.init(project = google_project_id,
         location = google_region,
         experiment = experiment_name)

with aip.start_run(experiment_run_name, resume = True):
    ...
    aip.log_params({
        "epochs" : epochs,
        "lr" : lr,
    })
    #
    # Store model on GCS
    #
    uri = upload_model(model)
    #
    # Log artifact
    #
    with aip.start_execution(display_name = f"{experiment_run_name}-train",
                             schema_title = "system.CustomJobExecution") as execution:
        model_artifact = aip.metadata.artifact.Artifact.create(
            schema_title = "system.Model",
            uri = uri,
            display_name = "my-model",
            project = google_project_id,
            location = google_region
        )
        execution.assign_output_artifacts([model_artifact])


Of course, you could also log time series data from within the job into your tensorboard instance. Also note that in this setup, it might make sense to use run instead of submit, as otherwise the status of the run in the metadata will be updated to “Complete” as soon as we are done with the submission and leave the context, while the actual job is still running.

Let us try this. Make sure that you are in the directory metadata within the repository clone and run our test script by typing

python3 run_job.py

This should run a job (not just submit it as we have done previously) that in the background executes the training function in train.py and uses all of the various metadata features discussed so far. You should see our run in the console as soon as the job has started, but metrics will only be populated once it is really executing (provisioning of the container might take a few minutes).

In the Vertex AI metadata console, we can now easily navigate from the experiment run to parameters and metrics and even to the artifact, and we see the time series data updated in near real time on the console or in our tensorboard instance. With that, we have reached the end of this blog post (which turned out to be a bit longer than expected) – in the next post, we will start to turn our attention to pipelines.
