The nuts and bolts of VertexAI – models and model archives

Having completed our setup in the previous post, we will now dive into models. We will learn how a model is packaged and how we can upload models into the VertexAI model registry for later use, for instance in a pipeline job.

Overview

Before we dive into the details, let us first try to understand what a model stored in the VertexAI model registry actually contains. In VertexAI, a model consists of two components. First, there is a model archive. The exact format of the archive depends on the framework used; we will see later how to prepare an archive designed to work with PyTorch. Essentially, the archive contains the model and its weights plus some instructions on how to use it, in the form of a piece of code called the handler.

In addition, the model contains a runtime component, i.e. a Docker image. When we later deploy a model in the model registry to an endpoint, VertexAI will pull this image, add a reference to the model archive and run it.

At the end of the day, the Docker image can be an arbitrary image as long as it complies with some requirements that the platform defines. First, the image needs to expose an HTTP server on a defined port (the default being 8080, but this can be changed). The server needs to accept requests in a specified JSON format and return replies in a specified format (more on this below). Second, the server needs to expose a health endpoint – again on port 8080 – which VertexAI uses to determine whether the container is ready or needs to be restarted.

The idea is that the HTTP endpoint accepts features and returns predictions, but the details are left to the container. How exactly the container uses the archive is also up to you (theoretically, you could even put the model artifacts into the container itself and ignore the archive entirely, but of course the point is to separate runtime and weights so that you can reuse the container).
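Just to make this contract concrete, here is a minimal sketch of a server that would satisfy these requirements (using Flask, which is only an assumption for this sketch and not what we will use in this series; the routes match those that we will configure later).

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def health():
    # health endpoint - VertexAI only checks for a 200 status code
    return "", 200

@app.route("/predictions/model", methods=["POST"])
def predict():
    # VertexAI sends {"instances": [...]} and expects {"predictions": [...]}
    instances = request.get_json()["instances"]
    predictions = [0 for _ in instances]  # a dummy model that always predicts 0
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)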

VertexAI does offer a collection of prebuilt containers, but I was not able to make them work (I got strange error messages from gsutil, which is part of the prebuilt containers, so I decided to roll my own container). Like the prebuilt containers, the custom container that we will build is based on Torchserve.

Torchserve

In its essence, Torchserve is a framework to package models and expose prediction endpoints. It specifies a model archive format (using the extension .mar by convention), provides a tool to package models into archives and offers a runtime environment that serves prediction endpoints. We will use Torchserve to create a package locally, test it out locally and finally deploy the model to the VertexAI model registry. Here is a diagram showing the individual steps of training, archiving and uploading a model that we will now go through.

Training a model

First, we need a trained model. For our purposes, we use a very small binary classification model that we train to accept two numerical features (coordinates x and y in the plane) and return 1 if the point in the plane defined by them is above the diagonal and 0 if not.
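To make this concrete, such a model could look roughly like the following sketch (the class name and layer sizes are made up for illustration; the actual definition is in model.py in the repository).

import torch


class BinaryClassifier(torch.nn.Module):
    def __init__(self, hidden_dim=16):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(2, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, 1),
            torch.nn.Sigmoid(),
        )

    def forward(self, x):
        # x has shape (batch_size, 2) and holds the coordinates of the points
        return self.layers(x)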

I have put together a little script train.py that creates some random training data and runs the training (which only takes a few seconds, even on a CPU). From the root directory of my repository, navigate to the folder where the script is located and run it (do not forget to activate the virtual environment and set all the environment variables as described in the setup in the first post).

cd models
python3 train.py

Once the training is complete, the script serializes the model's state dictionary using torch.save:

torch.save(model.state_dict(), "model.bin")
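Condensed to its essentials, the training script does something like this (a sketch only; the class name, data size and hyperparameters are made up, the actual code is in train.py in the repository).

import torch

from model import BinaryClassifier  # class name as in the sketch above

# generate random points in the unit square and label them:
# 1 if the point is above the diagonal (y > x), 0 otherwise
X = torch.rand(1000, 2)
Y = (X[:, 1] > X[:, 0]).float().unsqueeze(1)

model = BinaryClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.BCELoss()

for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "model.bin")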

Building a custom handler

Next let us turn to our handler. In the Torchserve framework, a handler is a Python class that has two methods called initialize and handle. The first method is invoked once and gives the handler the opportunity to prepare for requests, for instance by initializing a model. The second method handles individual requests.

Of course you can implement your own handler from scratch, but I recommend using the BaseHandler which is part of the Torchserve package. This handler defines three methods that we will override in our own code.

The first method that we need is the preprocess method. This method accepts the body of the request in JSON format, which we assume to be a list of prediction requests. For simplicity, we only consider the first entry in the list, convert it into a tensor and return it.

The base handler will then take over again and invoke the inference method of our handler. This method is supposed to call the actual model and return the result. Finally, the postprocess method will be invoked which converts our output back into a list. The full code for our custom handler is here.
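In condensed form, a handler along these lines looks roughly as follows (a sketch; the actual handler.py in the repository may differ in details such as the threshold and output format).

import torch
from ts.torch_handler.base_handler import BaseHandler


class MyHandler(BaseHandler):

    def preprocess(self, data):
        # data is a list of prediction requests; for simplicity,
        # we only consider the first entry and turn it into a tensor
        return torch.tensor(data[0], dtype=torch.float32)

    def inference(self, data, *args, **kwargs):
        # invoke the model that the base handler has loaded in initialize
        with torch.no_grad():
            return self.model(data)

    def postprocess(self, data):
        # torchserve expects a list; we threshold the sigmoid output at 0.5
        return [str(1 if data.item() > 0.5 else 0)]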

What about loading the model? We have left this entirely to the base handler. It is instructive to look at the code of the base handler to see what happens in its initialize method. If the archive contains a Python model file, the model is loaded in what Torchserve calls eager mode by invoking a helper function which imports the Python model file and then loads the state dictionary. Thus, if we include a model file in our archive, the model binary is expected to be a PyTorch state dict, which is why our train script saves the model this way.

Preparing the archive

Once the training has completed and the state dictionary has been written, we can create our package. For that purpose, we use the Torchserve archiving tool.

torch-model-archiver \
  --serialized-file model.bin \
  --model-file model.py \
  --handler handler.py \
  --model-name "my-model" \
  --version 1.0

This will create a file my-model.mar in the current working directory. This is actually a ZIP file which you can unzip to find that it contains the three files that we have specified plus a manifest file in JSON format. Let us quickly go through the arguments. First, we specify the three files that we have discussed – the serialized model weights, the model as a Python source file and the handler. We also provide a model name which ends up in the manifest and determines the name of the output file. Finally, we specify a version which also goes into the manifest file.
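If you want to take a look yourself, a few lines of Python are enough to peek into the archive (the manifest should be located under MAR-INF/MANIFEST.json).

import json
import zipfile

with zipfile.ZipFile("my-model.mar") as archive:
    # list the files contained in the archive
    print(archive.namelist())
    # print the manifest created by the archiver
    with archive.open("MAR-INF/MANIFEST.json") as manifest:
        print(json.dumps(json.load(manifest), indent=2))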

Testing your model archive locally

Time to test our archive. For that purpose, we start a torchserve process locally and see what happens. We also install an additional package that torchserve needs to capture metrics.

pip3 install nvgpu
torchserve --start \
           --foreground \
           --model-store . \
           --ts-config torchserve.config \
           --models model=my-model.mar

Again, let us quickly discuss the parameters. First, torchserve actually runs a Java process behind the scenes. With the first two parameters, we instruct torchserve to start this process but stay in the foreground (if we skipped this, we would have to use torchserve --stop to stop the process again). Next, we specify a model store. This is simply a directory that contains the archives. With the next parameter, we specify the models to load. Here we only load one model and use the syntax

<model-name>=<archive file>

to make our model known internally under the name “model” (we do this because this is what VertexAI will do later as well).

Finally, we point torchserve to a config file that is contained in my repository (so make sure to run this from within the models directory). Apart from some runtime information like the number of threads we use, this has one very important parameter – service_envelope=json. It took me some time to figure out how this works, so here are the details.

If this parameter is set, the server will assume that the request comes as part of a JSON structure looking like this

{ 
  "instances" : [...]
}

i.e. a dictionary with a key instances whose value is a list. This list is supposed to contain the samples for which we want to run a prediction. The server will then strip off that envelope and simply pass on the list, so that in the preprocess method of our handler, we will only see that list. If we do not set this parameter, this envelope processing does not happen. Since VertexAI, to which we will later deploy, assumes the envelope, we would then be in trouble, which is why it is important to have that setting in the configuration file.
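For reference, a configuration file along these lines might look roughly like this (the thread and worker counts are made up for illustration; the actual torchserve.config is in the repository).

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
number_of_netty_threads=4
default_workers_per_model=1
service_envelope=json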

Let us now run a prediction:

curl \
        --data '{ "instances" : [[0.5, 0.35]]}' \
        --header 'Content-Type: application/json'  \
        http://127.0.0.1:8080/predictions/model

You should get back a JSON structure looking as follows (I have added line breaks for better readability)

{
  "predictions": ["0"]
}

Again, the server has wrapped the output of our postprocess method, which is a list, into a JSON envelope.

Note that the content type header is required, otherwise the request will fail. Also note that the path under which our handler is reachable consists of the fixed string predictions combined with the name of our model, taken from the “model=my-model.mar” parameter that we added when starting torchserve. In addition to the prediction endpoint, there is also a health endpoint on the same port.

curl http://127.0.0.1:8080/ping

Building and testing the container

Let us now discuss how we put this into a container. Our container will be based on the base image that we use for this series, but there are some other items that we need to add. Specifically, we need to add the config file, and we want to make sure that our container has a Java runtime so that torchserve can start the actual server process. We also need a shell script serving as entry point.

How do we make our model archive available in the container? When we run our container on VertexAI later, the platform will expose the location of the model archive in the environment variable AIP_STORAGE_URI, which contains a fully qualified GCS URI pointing to the directory that holds the archive. Thus we need an additional script to download the model file from there. I have put together a Python script download_model.py for that purpose (a sketch of what such a script needs to do follows right after the build commands below). Our entry point script uses this to download the model and then starts the torchserve process as we have done above. The build script therefore copies the download script and the entry point into the container, installs a Java runtime and the nvgpu package and defines our script as the entry point. Here are the commands to run the build script.

cd ../docker/prediction
./build.sh
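For illustration, here is roughly what the download script needs to do (a sketch using the google-cloud-storage package; the actual download_model.py in the repository might differ in details).

import os

from google.cloud import storage

# AIP_STORAGE_URI points to the GCS directory containing the archive,
# e.g. gs://vertex-ai-<project>/models
uri = os.environ["AIP_STORAGE_URI"]
bucket_name, _, prefix = uri.removeprefix("gs://").partition("/")

client = storage.Client()
for blob in client.list_blobs(bucket_name, prefix=prefix):
    if blob.name.endswith(".mar"):
        target = os.path.basename(blob.name)
        blob.download_to_filename(target)
        print(f"Downloaded gs://{bucket_name}/{blob.name} to {target}")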

As our container expects the archive in a GCS bucket, let us copy our archive to the bucket that we have prepared previously. Google expects us to provide a directory in which our archive is located, so let us add a directory models and place our model there.

cd ../../models
uri="gs://vertex-ai-$GOOGLE_PROJECT_ID/models/model.mar"
gsutil cp my-model.mar $uri

We are now ready to run our container locally. Here are a few things to consider. First, we need to make sure that the code running within our container uses the service account key for the run account. Thus, we first need to get a service account key and place it in our local directory.

sa="vertex-ai-run@$GOOGLE_PROJECT_ID.iam.gserviceaccount.com"
gcloud iam service-accounts keys create \
  key.json \
  --iam-account=$sa

We will then mount the current directory into the container and, inside the container, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the key. We also need to map port 8080, on which torchserve is listening, and set AIP_STORAGE_URI to the location of the model archive. VertexAI will later make this variable point to the directory containing the archive, so we do the same here.

uri="gs://vertex-ai-$GOOGLE_PROJECT_ID/models"
endpoint="$GOOGLE_REGION-docker.pkg.dev/$GOOGLE_PROJECT_ID"
repository="$endpoint/vertex-ai-docker-repo"
image="$repository/prediction:latest"
docker run -it \
           -e AIP_STORAGE_URI=$uri \
           -e GOOGLE_APPLICATION_CREDENTIALS="/credentials/key.json" \
           -v $(pwd):/credentials/ \
           -p 8080:8080 \
           $image

Time to send a prediction request. In a separate terminal, run

curl \
        --data '{"instances": [ [0.5, 0.35]]}' \
        --header "Content-Type: application/json" \
        http://127.0.0.1:8080/predictions/model

As a result, you should again see the prediction as before. So we have managed to build a stand-alone container that only needs a link to the model archive to work.

Uploading models

Time to upload our model to the registry. We already have our model archive on GCS, but still need to push our prediction container.

endpoint="$GOOGLE_REGION-docker.pkg.dev/$GOOGLE_PROJECT_ID"
repository="$endpoint/vertex-ai-docker-repo"
image="$repository/prediction:latest"
docker push $image

We can now upload the model. We have several options to do this. First, we could invoke the upload API call directly, using, for instance, curl. Alternatively, we could use `gcloud ai models upload`. However, we will do this in Python to make our first contact with the Google Python SDK for VertexAI.

The package that we need is already part of our requirements file and should therefore already be installed. Here is the code to import and initialize it, again using our project ID and location from the environment.

import os
import google.cloud.aiplatform as aip

google_project_id = os.environ.get("GOOGLE_PROJECT_ID")
google_region = os.environ.get("GOOGLE_REGION")
aip.init(project = google_project_id, location = google_region)

Note that with this way of initializing, the SDK will use the credentials pointed to by GOOGLE_APPLICATION_CREDENTIALS, i.e. our build service account.

In general, the module contains top-level classes for each of the relevant resources – models, pipeline jobs, custom jobs, metadata entities, endpoints and so forth. Most of these classes have class methods like create, list or get that reflect the basic API operations, in addition to more specific methods. Models are a bit of an exception because they do not have a create method but are created via their upload method. Let us now use that to upload our model. Here is an overview of the key parameters of Model.upload().

serving_container_image_uri – the URI of the container image that we will use for the model
artifact_uri – the URI of the model archive (without the file name model.mar and without a trailing slash)
model_id – an alphanumeric ID of the model that we can choose
serving_container_predict_route – the route under which our container expects prediction requests, i.e. “/predictions/model” in our case
serving_container_health_route – the route under which our container exposes its health endpoint, i.e. “/ping” in our case
display_name – a readable name used on the console to display the model
project – the ID of the Google project
location – the Google region

In addition, there are many other parameters that allow you to change the container specification (add environment variables, change ports, override the entrypoint) or manage versioning (we will cover this in our next post). Let us now assemble an upload request using the fields above.

registry = f"{google_region}-docker.pkg.dev"
repository = f"{google_project_id}/vertex-ai-docker-repo"
image = f"{registry}/{repository}/prediction:latest"
bucket = f"gs://vertex-ai-{google_project_id}"
aip.Model.upload(
    serving_container_image_uri = image,
    artifact_uri = f"{bucket}/models",
    model_id = "vertexaimodel",
    serving_container_predict_route = "/predictions/model",
    serving_container_health_route = "/ping",
    display_name = "my-model",
    project = google_project_id,
    location = google_region
)

When we run this (either by typing it into a Python shell or by running the script upload_model.py in the repository), we create what Google calls an LRO (long-running operation), as the upload might take a few minutes even for our small model (I assume that Google transfers the container image, which has almost 6 GB, into a separate registry). Once the upload completes, we should verify that the model has been created – you can do this either on the VertexAI model registry page directly or by using gcloud ai models list. In either case, you should see a new model (make sure to do the lookup in the proper location). You can also use

gcloud ai models describe vertexaimodel

to display some details of the model.

This is maybe a good point in time to discuss naming. Most objects in VertexAI can be given a display name which is used on the console and is supposed to be human readable. However, display names are typically not assumed to be unique. Instead, each resource receives an ID that is sometimes (as for models) alphanumeric, sometimes numeric. This ID, along with the type of the resource, the location and the project, identifies a resource uniquely. The combination of this data gives us what is called the resource name, which is a path looking something like

projects/<project number>/locations/<location>/models/vertexaimodel

This is what is displayed as “name” in the output of gcloud ai models describe.
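The Python SDK works with IDs and resource names as well. As a quick check, we can retrieve our model and print both (a short sketch, assuming the initialization from above).

import google.cloud.aiplatform as aip

# look up the model by its ID, relying on project and location from aip.init
model = aip.Model("vertexaimodel")
print(model.display_name)    # my-model
print(model.resource_name)   # projects/<project number>/locations/<location>/models/vertexaimodel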

That closes our post for today. Next time we will see how we can deploy our model to an endpoint, how we can use curl or a Python client to make predictions and how we can create more than one version of a model.

The nuts and bolts of VertexAI – overview

Today, all major cloud providers have established impressive machine learning capabilities on their respective platforms – Amazon has AWS SageMaker, Google has VertexAI and Microsoft has Azure Machine Learning. Being tired of spinning up and shutting down GPU-enabled virtual machines manually, I started to explore one of them a couple of months ago – Google's VertexAI. In this short series, I will guide you through the most important features of the platform in depth and explain how you can use them and how they work.

What is VertexAI?

VertexAI is a collection of Google services centered around machine learning on the Google cloud. The platform was introduced in 2021 and received a major update in 2023 when support for GenAI models was added. At its heart, the platform lets you define and manage datasets, train and version models, deploy them to prediction endpoints and track metadata along the way so that you can trace model versions back to training runs and to the data used. In addition, the Model Garden makes Google's own GenAI models available but also gives you access to various open-source models like Llama2 or Mistral-7B, and VertexAI Studio allows you to test and version your prompts and play with the models. And of course VertexAI lets you launch and manage Jupyter notebook instances.

Let us now take a closer look at some of the most relevant components. First, there are models. As we will see later, a model is essentially a versioned archive containing your model artifacts, stored in a model registry for easy access. Next, there are prediction endpoints which, at the end of the day, are containers that you deploy to run your models so that you can query them either online or in batch mode, running on Google-provided infrastructure.

To schedule training runs, you have several options. You can compose and submit a job, which again is essentially a container running either a pre-built model or your custom Python code. Alternatively, you can combine jobs into pipelines and let Google manage the dependencies between the individual jobs in the pipeline and the input and output data of each job for you.

When you define and run a pipeline, you consume and create artifacts like datasets or model versions. Experiments let you bundle training runs, and metadata allows you to track these artifacts, including a visualization of data lineage, so that you can reconstruct, for each artifact, during which pipeline run it was generated.

Finally, VertexAI also allows you to define managed datasets that the platform will store for you. You can even use AutoML, which means that, given some data, you can select a prebuilt model for standard tasks like classification or sentiment analysis and train this model on your data. Theoretically, this allows you to simply upload a tabular dataset, start a training run for a classification model, deploy the trained model and run a prediction without having to write a single line of code (I have to admit, however, that I was not convinced when I tried this – even on a small dataset, the runtime was much longer than a local run, and the training runs are really expensive, as you pay much more than you would if you simply ran a custom model in a container or virtual machine).

In this series, we will dive into most of these features.

  • First, we will learn how to work with models. We will train a model locally and see how we can package this model for upload into the model registry.
  • We will then see how the creation of endpoints and the deployment of models works
  • Having mastered this, we will build a simple training job that allows us to train our model on the VertexAI platform
  • Next, we will study experiments and metadata and see how we can use the tensor boards integrated into VertexAI to log metrics
  • We will then take a look at VertexAI pipelines that combine several jobs and learn how to compose and run pipelines covering the full ML life cycle
  • Then we will talk about networks and connectivity and see how you can connect pipelines and prediction endpoints to applications running in one of your VPCs.
  • Finally, we will take a short look at managed datasets and how they can be imported, used and exported

Initial setup

As always, this will be a hands-on exercise. Before you can follow along, however, there are of course some preparations and some initial setup steps to complete.

First, you will obviously need a Google account. I also assume some basic familiarity with the Google Cloud Platform, i.e. the console and gcloud (you should also be able to follow the examples if you have not worked on Google Cloud before, but I will not explain things like virtual machines, IAM, networks and so forth). You should also make sure that you have gcloud and gsutil installed.

Next, you should decide on a region and a project that you will use. Make sure that you stick to this region as data transfer between regions can be costly. Throughout this series, I will assume that you have two environment variables set that I will refer to at several points in the code.

export GOOGLE_PROJECT_ID=<the alphanumerical ID of your project>
export GOOGLE_REGION=<the region, like us-central1 or europe-west4>

The next few setup steps involve creating a dedicated GCS bucket for this series and two service accounts with the necessary access rights. You can either follow the instructions below step by step or simply clone my repository and run the script in the setup folder. Before you do any of this, please make sure that your gcloud client is authorized (verify with gcloud auth list).

VertexAI uses Google Cloud Storage buckets extensively for storing data and models. I recommend creating a dedicated bucket for Vertex AI in the region where you will also schedule training runs and deploy models.

gcloud storage buckets create \
    gs://vertex-ai-$GOOGLE_PROJECT_ID \
    --location=$GOOGLE_REGION 

Next, we will create two service accounts. Our first service account vertex-ai-run is the account that we will use to run jobs and containers on the platform. The second account vertex-ai-build is used when we assemble or submit jobs or upload models. In our setup, these service accounts have the same access rights, but in a more production-like setup you would of course separate those two accounts more carefully.

gcloud iam service-accounts create \
    vertex-ai-run \
    --display-name=vertex-ai-run \
    --description="A service account to run jobs and endpoints"

gcloud iam service-accounts create \
    vertex-ai-build \
    --display-name=vertex-ai-build \
    --description="A service account to assemble and submit jobs"

We will also need a docker repository in Artifact Registry to store our custom docker images.

gcloud artifacts repositories create \
    vertex-ai-docker-repo  \
    --repository-format=docker \
    --location=$GOOGLE_REGION \
    --description="Vertex AI custom images"

Now let us create the necessary policy bindings. For each of the service accounts, we will grant the role aiplatform.user that contains the necessary permissions to create, modify, read and delete the objects that we will work with. In addition, we will give both accounts the storage.legacyBucketOwner and storage.objectAdmin roles so that they can create and access objects in our bucket, as well as the artifactregistry.reader role so that they can pull our custom images.

accounts=(
    vertex-ai-run
    vertex-ai-build
)
project_roles=(
    aiplatform.user
    artifactregistry.reader
)
bucket_roles=(
    storage.objectAdmin
    storage.legacyBucketOwner
)
for account in ${accounts[@]}
do 
    sa="$account@$GOOGLE_PROJECT_ID.iam.gserviceaccount.com"
    bucket="gs://vertex-ai-$GOOGLE_PROJECT_ID"
    for role in ${project_roles[@]}
    do
        gcloud projects add-iam-policy-binding \
            $GOOGLE_PROJECT_ID \
            --member="serviceAccount:$sa" \
            --role="roles/$role"
    done
    for role in ${bucket_roles[@]}
    do
        gcloud storage buckets add-iam-policy-binding \
            $bucket \
            --member="serviceAccount:$sa" \
            --role="roles/$role"
    done
done 

Finally, as our build user will submit jobs using the run user, it needs the role serviceAccountUser on the run service account.

sa="vertex-ai-build@$GOOGLE_PROJECT_ID.iam.gserviceaccount.com"
gcloud iam service-accounts add-iam-policy-binding \
        vertex-ai-run@$GOOGLE_PROJECT_ID.iam.gserviceaccount.com  \
        --member="serviceAccount:$sa" \
        --role="roles/iam.serviceAccountUser"

As we will be using the build account locally, we need a JSON key for this account. So head over to the Service Accounts tab in the Google Cloud IAM console, select the project you will be using, find the service account vertex-ai-build that we have just created, select the tab “Keys” and add a service account key in JSON format. Store the key in a safe location and set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the file, for instance

export GOOGLE_APPLICATION_CREDENTIALS=$HOME/.keys/vertex-ai-build.json

Setting up our Python environment

If you have read a few of my previous posts, you will not be surprised that the language of choice for this series is Python (even though Google of course offers SDKs for many other languages as well). There are a couple of packages that we will need, and I recommend setting them up in a virtual environment created specifically for this purpose. So run the following commands in the root of the repository (if not done yet, this is the time to clone it using git clone https://github.com/christianb93/Vertex-AI)

python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt

Building the docker base container

On our journey through VertexAI, we will need various docker containers. I have organized these containers such that they are derived from a base container that contains a Python runtime and the dependencies that we will need. First, make sure that you have installed docker (if not done yet, here are the commands to do this on Ubuntu 22.04; for other distributions this might vary):

sudo apt install docker.io
sudo usermod -aG docker $USER
newgrp docker

To build the container and push it into the registry that we have created, first make sure that the environment variables for the Google project ID and the Google location are set as described above and that you have run

gcloud auth configure-docker $GOOGLE_REGION-docker.pkg.dev 

to add gcloud as a credential helper to your local docker installation. Then switch to the corresponding subdirectory of this repository, run the build script and trigger the push.

cd docker/base
./build.sh
docker push $GOOGLE_REGION-docker.pkg.dev/$GOOGLE_PROJECT_ID/vertex-ai-docker-repo/base:latest

Of course this will take some time, as you have to download all dependencies once and then push the container to the registry. You might even want to do this in a VM in the respective region to speed things up (but sooner or later we will still need the image locally as well).

Cost considerations

A few words on cost. Of course, we will have to consume Google resources in this series, and that will create some cost. Basically there are three major types of cost. First, Google will charge for data that we store on the platform. We will use GCS buckets, but the data sets that we have to store are so small that this should not be an issue – in my region, standard regional storage is about 2 cents per GB and month. We will also store images in the Artifact Registry. At the time of writing, Google charges 10 cents per GB and month. Our images will be around 10 GB, so that would be about 1 USD per month – still not dramatic, but you might want to clean up the images at some point. The metadata that we will create is considerably more expensive (10 USD per GB and month), but again our volumes will be small.

Second, there is a charge for data transfer – both out of the platform and across regions. Be careful at this point and avoid traffic between regions or continents – transferring large images in the GB range can quickly become costly (this is the reason why we hold all our data in one region). For transfers out, like downloading an image, there is a free tier for up to 200 GB / month which should be more than enough for our purposes.

Finally, there is a cost for all machines that we will need when running endpoints or batch jobs. We will usually use small machines for these purposes, like n1-standard-2, which will cost you roughly 10 cents per hour plus a few cents for disks. If you are careful and clean up quickly, that cost should be manageable.

There are a couple of things, however, that you should avoid. One of the most expensive operations on the platform is the AutoML feature, as Google will charge a flat fee of 20 USD per run, regardless of the machine types that you use. Do not do this: for our purposes, the flat fee is clearly far beyond the cost of the compute resources actually consumed. There is also a flat fee of 3 cents per pipeline run in addition to the compute resources, so you want to test your code locally before submitting it to avoid cost for failed runs.

With that our preparations are complete. In the next post, we will start to get our hands dirty and learn how to package and upload a model into the Vertex AI model registry.