The nuts and bolts of VertexAI – models and model archives

Having completed our setup in the previous post, we will now dive into models. We will learn how a model is packaged and how we can upload models into the VertexAI model registry for later use, for instance in a pipeline job.

Overview

Before we dive into the details, let us first try to understand what a model stored in the VertexAI model registry actually contains. In VertexAI, a model consists of two components. First, there is a model archive. The exact format of the archive depends on the framework used; we will see later how to prepare an archive designed to work with PyTorch. Essentially, the archive contains the model and its weights plus some instructions on how to use it, in the form of a piece of code called the handler.

In addition, the model contains a runtime component, i.e. a Docker image. When we later deploy a model in the model registry to an endpoint, VertexAI will pull this image, add a reference to the model archive and run it.

At the end of the day, the Docker image can be an arbitrary image as long as it complies with some requirements that the platform defines. First, the image needs to expose an HTTP server on a defined port (the default being 8080, but this can be changed). The server needs to accept requests in a specified JSON format and return replies in a specified format (more on this below). Second, the server needs to expose a health endpoint – again on port 8080 – which VertexAI uses to determine whether the container is ready or needs to be restarted.

The idea is that the HTTP endpoint accepts features and returns predictions, but the details are left to the container. How exactly the container uses the archive is also up to you (theoretically, you could even put the model artifacts into the container itself and ignore the archive entirely, but of course the idea is to separate runtime and weights so that you can reuse the container).

VertexAI does offer a collection of prebuilt containers, but I was not able to make them work (I got strange error messages from gsutil, which is part of the prebuilt containers, so I decided to roll my own container). Like the prebuilt containers, the custom container that we will build is based on Torchserve.

Torchserve

In its essence, Torchserve is a framework to package models and expose prediction endpoints. It specifies a model archive format (using the extension .mar by convention), provides a tool to package models into archives, and ships a runtime environment to serve a prediction endpoint. We will use Torchserve to create a package locally, test it out locally and finally deploy the model to the VertexAI model registry. Here is a diagram showing the individual steps of training, archiving and uploading a model through which we will now go.

Training a model

First, we need a trained model. For our purposes, we use a very small binary classification model that we train to accept two numerical features (coordinates x and y in the plane) and return 1 if the point defined by them lies above the diagonal and 0 if not.

I have put together a little script train.py that creates some random training data and runs the training (which only takes a few seconds, even on a CPU). So from the root directory of my repository navigate to the folder where the script is located and run it (do not forget to activate the virtual environment and set all the environment variables as described in the setup in the first post).

cd models
python3 train.py

Once the training is complete, the script will serialize the state dictionary and save the model using torch.save:

torch.save(model.state_dict(), "model.bin")
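To give a rough idea of what happens before that call, here is a minimal sketch of such a model and training loop. The class name BinaryClassifier and all hyperparameters below are made up for illustration – the actual train.py in the repository may look different.

import torch

class BinaryClassifier(torch.nn.Module):
    # a tiny network mapping two features to a probability
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(2, 8),
            torch.nn.ReLU(),
            torch.nn.Linear(8, 1),
            torch.nn.Sigmoid(),
        )

    def forward(self, x):
        return self.layers(x)

# random points in the unit square, labeled 1 if above the diagonal (y > x)
X = torch.rand(1000, 2)
Y = (X[:, 1] > X[:, 0]).float().unsqueeze(1)

model = BinaryClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.BCELoss()
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    optimizer.step()

# ... followed by the torch.save call shown above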

Building a custom handler

Next, let us turn to our handler. In the Torchserve framework, a handler is a Python class that has two methods called initialize and handle. The first method is invoked once and gives the handler the opportunity to prepare for requests, for instance by initializing a model. The second method handles individual requests.

Of course you can implement your own handler from scratch, but I recommend using the BaseHandler which is part of the Torchserve package. This handler defines three methods that we will have to implement in our own code.

The first method that we need is the preprocess method. This method accepts the body of the request in JSON format which we assume to be a list of prediction requests. For simplicity, we only consider the first entry in the list, convert it into a tensor and return it.

The base handler will then take over again and invoke the inference method of our handler. This method is supposed to call the actual model and return the result. Finally, the postprocess method will be invoked which converts our output back into a list. The full code for our custom handler is here.
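For illustration, here is a minimal sketch of what such a handler could look like. The class name and the exact pre- and postprocessing are assumptions – refer to the handler.py in the repository for the actual code.

import torch
from ts.torch_handler.base_handler import BaseHandler

class BinaryClassifierHandler(BaseHandler):

    def preprocess(self, data):
        # thanks to the JSON envelope (see below), data is the list taken from
        # the "instances" key; for simplicity we only use the first entry
        return torch.tensor(data[0], dtype=torch.float32)

    def inference(self, data, *args, **kwargs):
        # self.model has been loaded by the base handler's initialize method
        with torch.no_grad():
            return self.model(data)

    def postprocess(self, inference_output):
        # torchserve expects a list; convert the probability into a label
        return [str(int(inference_output.round().item()))]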

What about loading the model? We have left this entirely to the base handler. It is instructive to look at the code of the handler to see what happens in its initialize method. There we see that if the archive contains a Python model file, the model is loaded in what Torchserve calls eager mode by invoking this function, which imports the Python model file and then loads the state dictionary. Thus, if we include a model file in our archive, the model binary is expected to be a PyTorch state dict, which is why our train script uses this method for saving the model.
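In other words, assuming model.py defines the hypothetical BinaryClassifier class from the training sketch above, what Torchserve does during eager-mode loading roughly amounts to this:

import torch
from model import BinaryClassifier  # the class defined in model.py

model = BinaryClassifier()
# model.bin must therefore be a state dict, not a fully pickled model
model.load_state_dict(torch.load("model.bin", map_location="cpu"))
model.eval()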

Preparing the archive

Once the training completes and has written the state dictionary, we can create our package. For that purpose, we use the Torchserve archiving tool torch-model-archiver.

torch-model-archiver \
  --serialized-file model.bin \
  --model-file model.py \
  --handler handler.py \
  --model-name "my-model" \
  --version 1.0

This will create a file my-model.mar in the current working directory. This is actually a ZIP file which you can unzip to find that it contains the three files that we have specified plus a manifest file in JSON format. Let us go quickly through the arguments. First, we specify the three files that we have discussed – the serialized model weights, the model as Python source file and the handler. We also provide a model name which ends up in the manifest and determines the name of the output file. Finally, we specify a version which also goes into the manifest file.
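If you prefer to inspect the archive without unzipping it, a few lines of Python do the job as well (the exact file list depends on what you passed to the archiver):

import zipfile

with zipfile.ZipFile("my-model.mar") as archive:
    # typically model.bin, model.py, handler.py and MAR-INF/MANIFEST.json
    for name in archive.namelist():
        print(name)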

Testing your model archive locally

Time to test our archive. For that purpose, we start a torchserve process locally and see what happens. We also install an additional package that torchserve needs to capture metrics.

pip3 install nvgpu
torchserve --start \
           --foreground \
           --model-store . \
           --ts-config torchserve.config \
           --models model=my-model.mar

Again, let us quickly discuss the parameters. First, torchserve actually runs a Java process behind the scenes. With the first two parameters, we instruct torchserve to start this process but to stay in the foreground (if we skipped this, we would have to use torchserve --stop to stop the process again). Next, we specify a model store. This is simply a directory that contains the archives. With the next parameter we specify the models to load. Here we only load one model and use the syntax

<model-name>=<archive file>

to make our model known internally under the name “model” (we do this because this is what VertexAI will do later as well).

Finally, we point torchserve to a config file that is contained in my repository (so make sure to run this from within the models directory). Apart from some runtime settings like the number of threads to use, this file has one very important parameter – service_envelope=json. It took me some time to figure out how this works, so here are the details.

If this parameter is set, the server will assume that the request comes as part of a JSON structure looking like this

{ 
  "instances" : [...]
}

i.e. a dictionary with a key instances whose value is a list. This list is supposed to contain the samples for which we want to run a prediction. The server will then strip off that envelope and simply pass on the list, so that in the preprocess method of our handler, we only see that list. If we do not set this parameter, the server will not perform this envelope processing and expects the raw payload instead. That is fine locally, but when we later deploy to VertexAI, which always sends the envelope, we would be in trouble – this is why it is important to have that setting in the configuration file.
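To give you an idea, a minimal configuration file could look like the snippet below. The actual torchserve.config in my repository may contain different values and additional settings – the only line that really matters for VertexAI is the service envelope.

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
service_envelope=json
default_workers_per_model=1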

Let us now run a prediction:

curl \
        --data '{ "instances" : [[0.5, 0.35]]}' \
        --header 'Content-Type: application/json'  \
        http://127.0.0.1:8080/predictions/model

You should get back a JSON structure looking as follows (I have added line breaks for better readability)

{
  "predictions": ["0"]
}

Again, the server has wrapped the output of our postprocess method, which is a list, into a JSON envelope.

Note that the content type is required, otherwise the request will fail. Also note that the path under which our handler is reachable consists of the fixed string predictions combined with the name of our model taken from the “model=my-model.mar” parameter that we added when starting torchserve. In addition to the prediction endpoint, there is also a health endpoint under the same port.

curl http://127.0.0.1:8080/ping

Building and testing the container

Let us now discuss how we put this into a container. Our container will be based on the base image that we use for this series, but there are some other items that we need to add. Specifically, we need to add the config file, and we want to make sure that our container has a Java runtime so that torchserve can start the actual server process. We also need a shell script serving as the entry point.

How do we make our model archive available in the container? When we run our container on VertexAI later, the platform will expose the location of the model archive in the environment variable AIP_STORAGE_URI, which will contain a fully qualified GCS location. Thus we need an additional script to download the archive from there. I have put together a Python script download_model.py for that purpose. Our entry point script needs to use this to download the model and then start the torchserve process as we have done above. The build script will therefore take the download script and the entry point, copy them into the container, install a Java runtime and the nvgpu package, and define our script as the entry point. Here are the commands to run the build script.

cd ../docker/prediction
./build.sh
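For illustration, here is a rough sketch of what a download script like download_model.py could do. The script in the repository may be structured differently, and the target directory below is just a placeholder for wherever the entry point expects the model store to be.

import os
from urllib.parse import urlparse

from google.cloud import storage

def download_archives(target_dir="/model-store"):
    # AIP_STORAGE_URI points to the GCS "directory" containing the archive,
    # e.g. gs://vertex-ai-<project>/models
    uri = os.environ["AIP_STORAGE_URI"]
    parsed = urlparse(uri)
    bucket_name = parsed.netloc
    prefix = parsed.path.lstrip("/")
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        filename = os.path.basename(blob.name)
        if filename:  # skip directory placeholder objects
            blob.download_to_filename(os.path.join(target_dir, filename))

if __name__ == "__main__":
    download_archives()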

As our container expects the archive in a GCS bucket, let us copy our archive to the bucket that we have prepared previously. Google expects us to provide a directory in which our archive is located, so let us add a directory models and place our archive there.

cd ../../models
uri="gs://vertex-ai-$GOOGLE_PROJECT_ID/models/model.mar"
gsutil cp my-model.mar $uri

We are now ready to run our container locally. Here are a few things to consider. First, we need to make sure that the code running within our container uses the service account key for the run account. Thus, we first need to get a service account key and place it in our local directory.

sa="vertex-ai-run@$GOOGLE_PROJECT_ID.iam.gserviceaccount.com"
gcloud iam service-accounts keys create \
  key.json \
  --iam-account=$sa

We will then mount the current directory into the container and, inside the container, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the key. We also need to map port 8080 on which torchserve is listening, and we need to set AIP_STORAGE_URI to the location of the model. VertexAI will later make this point to the directory containing the archive, so we do this as well.

uri="gs://vertex-ai-$GOOGLE_PROJECT_ID/models"
endpoint="$GOOGLE_REGION-docker.pkg.dev/$GOOGLE_PROJECT_ID"
repository="$endpoint/vertex-ai-docker-repo"
image="$repository/prediction:latest"
docker run -it \
           -e AIP_STORAGE_URI=$uri \
           -e GOOGLE_APPLICATION_CREDENTIALS="/credentials/key.json" \
           -v $(pwd):/credentials/ \
           -p 8080:8080 \
           $image

Time to send a prediction request. In a separate terminal, run

curl \
        --data '{"instances": [ [0.5, 0.35]]}' \
        --header "Content-Type: application/json" \
        http://127.0.0.1:8080/predictions/model

As a result, you should again see the prediction as before. So we have managed to build a stand-alone container that only needs a link to the model archive to work.

Uploading models

Time to upload our model to the registry. We already have our model archive on GCS, but still need to push our prediction container.

endpoint="$GOOGLE_REGION-docker.pkg.dev/$GOOGLE_PROJECT_ID"
repository="$endpoint/vertex-ai-docker-repo"
image="$repository/prediction:latest"
docker push $image

We can now upload the model. We have several options to do this. First, we could invoke the upload API call using for instance curl. Alternatively, we could use `gcloud ai models upload`. However, we will do this in Python to make our first contact with the Google Python SDK for VertexAI.

The package that we need is already part of our requirements file and should therefore already be installed. Here is the code to import and initialize it, again using our project ID and location from the environment.

import os
import google.cloud.aiplatform as aip

google_project_id = os.environ.get("GOOGLE_PROJECT_ID")
google_region = os.environ.get("GOOGLE_REGION")
aip.init(project = google_project_id, location = google_region)

Note that with this way of initializing, the SDK will use the credentials pointed to by GOOGLE_APPLICATION_CREDENTIALS, i.e. our build service account.

In general, the module contains top-level classes for each of the relevant resources – models, pipeline jobs, custom jobs, metadata entities, endpoints and so forth. Most of these classes have class methods like create, list or get that reflect the basic API operations in addition to more specific methods. Models are a bit of an exception because they do not have a create method but are created via their upload method. Let us now use that to upload our model. Here is a table listing the most relevant parameters of this method.

serving_container_image_uri – the URI of the container image that we will use for the model
artifact_uri – the URI of the model archive (without the file name model.mar and without a trailing slash)
model_id – an alphanumeric ID of the model that we can choose
serving_container_predict_route – the route under which our container expects prediction requests, i.e. “/predictions/model” in our case
serving_container_health_route – the route under which our container exposes its health endpoint, i.e. “/ping” in our case
display_name – a readable name used on the console to display the model
project – the ID of the Google project
location – the Google region

Key parameters for Model.upload()

In addition, there are many other parameters that allow you to change the container specification (add environment variables, change ports, override the entrypoint) or manage versioning (we will cover this in our next post). Let us now assemble an upload request using the fields above.

registry = f"{google_region}-docker.pkg.dev"
repository = f"{google_project_id}/vertex-ai-docker-repo"
image = f"{registry}/{repository}/prediction:latest"
bucket = f"gs://vertex-ai-{google_project_id}"
aip.Model.upload(
    serving_container_image_uri = image,
    artifact_uri = f"{bucket}/models",
    model_id = "vertexaimodel",
    serving_container_predict_route = "/predictions/model",
    serving_container_health_route = "/ping",
    display_name = "my-model",
    project = google_project_id,
    location = google_region
)

When we run this (either type this into a Python shell or run the script upload_model.py in the repository), we create something that Google calls an LRO (long running operation), as the upload might take a few minutes, even for our small model (I assume that Google transfers the container image, which is almost 6 GB in size, into a separate registry). Once the upload completes, we should verify that the model has been created – you can do this either on the VertexAI model registry page directly or by using gcloud ai models list. In either case, you should see the newly created model (make sure to do the lookup in the proper location). You can also use

gcloud ai models describe vertexaimodel

to display some details of the model.
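If you prefer to stay in Python, the SDK offers the same information. A small sketch, assuming aip.init has been called as above and the model ID "vertexaimodel" from our upload:

# list all models in the project and location configured via aip.init
for model in aip.Model.list():
    print(model.display_name, model.resource_name)

# or look up our model directly via its ID
model = aip.Model(model_name="vertexaimodel")
print(model.resource_name)
print(model.version_id)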

This is maybe a good point in time to discuss naming. Most objects in VertexAI can be given a display name which is used on the console and is supposed to be human readable. However, display names are typically not assumed to be unique. Instead, each resource receives an ID that is sometimes (as for models) alphanumeric and sometimes numeric. This ID, along with the type of the resource, the location and the project, identifies a resource uniquely. The combination of this data gives us what is called the resource name, which is a path looking something like

projects/<project number>/locations/<location>/models/vertexaimodel

This is what the output of gcloud ai models describe displays as “name”.

That closes our post for today. Next time we will see how we can deploy our model to an endpoint, how we can use curl or a Python client to make predictions and how we can create more than one version of a model.
