The nuts and bolts of VertexAI – prediction endpoints and model versions

In the last post, we learned how to package a model and upload it to the model registry. Today, we will see how to deploy a model from the registry to an endpoint and use that endpoint to make predictions. We will also learn how a model can be exported again and how to create additional versions of a model.

Creating an endpoint

First, let us take a look at what VertexAI calls an endpoint. An endpoint is not a physical resource, but rather a logical container into which we can deploy models, all of which share the same URL. For each model that we deploy, we specify hardware resources during the deployment, which VertexAI then provisions under the given endpoint by bringing up one or more instances. VertexAI automatically scales up and down as needed and distributes the traffic among the available instances.

Note that it is also possible to deploy several models to the same endpoint. The URL for these models will be the same, but part of the traffic is routed to one model and the rest to the others. There is a traffic split parameter that allows you to influence this routing. This feature can, for instance, be used to realize a rolling upgrade of an existing model. We will get an idea of how exactly this works a bit later.

To get started, let us now see how we can create an endpoint. This is rather straightforward: we use the class method create on the Endpoint class.

import google.cloud.aiplatform as aip

endpoint = aip.Endpoint.create(
    display_name = "vertex-ai-endpoint",
    project = google_project_id,
    location = google_region
)

Note that we will specify all details about instances, machine types and so forth when we create the actual deployment. So far, our endpoint is just an empty shell that does not consume any resources.

Note that having the endpoint already allows us to derive the URL under which our models will be reachable. The endpoint again has a fully qualified resource name, and the URL under which we can reach it is

https://{google_region}-aiplatform.googleapis.com/v1/{endpoint_resource_name}:predict
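For instance, a small sketch (using the endpoint object that we have just created) to assemble this URL in Python could look like this:

# build the prediction URL from the endpoint's fully qualified resource name
# (projects/.../locations/.../endpoints/...)
predict_url = (
    f"https://{google_region}-aiplatform.googleapis.com/v1/"
    f"{endpoint.resource_name}:predict"
)
print(predict_url)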

If we have already created our endpoint previously, we can use the list method of the Endpoint class to get a list of all endpoints and search for ours. We can even supply a filter expression to the list method to only get back those endpoints with a given display name.

endpoints = aip.Endpoint.list(
    filter='display_name="vertex-ai-endpoint"'
)
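If the list is non-empty, we can simply reuse the first match instead of creating a new endpoint. A small sketch, assuming that display names are unique within our project:

if endpoints:
    # reuse the existing endpoint instead of creating a second one with the same display name
    endpoint = endpoints[0]
else:
    endpoint = aip.Endpoint.create(
        display_name = "vertex-ai-endpoint",
        project = google_project_id,
        location = google_region
    )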

Deploying a model

Next, let us deploy a model. This is done by calling deploy on an existing endpoint. To do that, we first need a reference to the model, and for that purpose, we need the full resource name of the model.

Instead of listing all models and searching for matches, we will take a different approach this time. As we have defined the model ID to be “vertexaimodel”, we only need the location and project to assemble the fully qualified resource name. Unfortunately, things are not quite that simple, as we need the project number instead of the project ID. So we first use a different Google API – the resource manager API – to search all projects for the one with our project ID, extract the project resource name, use that to assemble the model name and then get the model.

import google.cloud.resourcemanager_v3 as grm

# look up the project by its ID to obtain the project number
projects_client = grm.ProjectsClient()
projects = projects_client.search_projects(
    query=f"id={google_project_id}"
)
project = [p for p in projects][0]
# project.name is "projects/<project number>", which is what the model resource name needs
model_prefix = f"{project.name}/locations/{google_region}/models/"
model_name = f"{model_prefix}vertexaimodel"
model = aip.Model(
    model_name = model_name
)

We now have an endpoint and a model, so we can call the deploy method on the endpoint. When we do this, the most important parameters to specify are the model that we want to deploy, the machine type, the minimum and maximum number of replicas that we want, and the service account (in the usual e-mail address format) that we want to attach to the container.

endpoint.deploy(
    model = model,
    machine_type = "n1-standard-2",
    min_replica_count = 1,
    max_replica_count = 1,
    service_account = service_account
)

This again creates a long-running operation (LRO) and only returns once the deployment is complete (which can take some time).

To try this out, you can either run the commands above (and the obvious boilerplate code around them) manually or run the deploy.py script in the models directory of my repository. When the deployment is done, we can verify the result using curl. This is a bit tricky, because we first have to use gcloud to get the endpoint ID, then use gcloud once more to get the endpoint name and finally assemble that into a URL. We also need a bearer token which is included in the HTTP header.

ENDPOINT_ID=$(gcloud ai endpoints list \
  --filter="display_name=vertex-ai-endpoint" \
  --format="value(name)")
TOKEN=$(gcloud auth print-access-token)
BASE_URL="https://$GOOGLE_REGION-aiplatform.googleapis.com/v1"
ENDPOINT=$(gcloud ai endpoints describe \
  $ENDPOINT_ID \
  --format="value(name)")
URL=$BASE_URL/$ENDPOINT
echo "Using prediction endpoint $URL"
curl \
    --header 'Content-Type: application/json' \
    --header "Authorization: Bearer $TOKEN" \
    --data '{ "instances" : [[0.5, 0.35]]}' \
    $URL:predict

Of course the Endpoint class also has a predict method that you can call directly. Here is the equivalent of the invocation above in Python.

prediction = endpoint.predict(
    instances = [[0.5, 0.35]]
)

We nicely see the structure of the JSON envelope reflected in the arguments of the predict method. Note that there is also a raw prediction that returns information like the model version in the response headers instead of the body. You can access this method either by replacing “$URL:predict” in the curl command above with “$URL:rawPredict” or by using endpoint.raw_predict in the Python code, supplying the required headers like the content type yourself.

raw_prediction = endpoint.raw_predict(
    body = b'{"instances" : [[0.5, 0.35]]}',
    headers = {"Content-Type": "application/json"},
)

Creating model versions

At this point, it is helpful to pause for a moment and look at what we have done, using gcloud. Let us ask gcloud to describe our endpoint and see what we get (you might need some patience here, I have seen deployments take 20 minutes or more).

ENDPOINT_ID=$(gcloud ai endpoints list \
  --filter="display_name=vertex-ai-endpoint" \
  --format="value(name)")
gcloud ai endpoints describe $ENDPOINT_ID

We see that our endpoint now contains a list of deployed models. Each of these deployed models reflects the parameters that we have used for the deployment and has its own ID. We also see a section trafficSplit in the output which shows that at the moment, all traffic goes to the one (and only) deployed model.

When you repeat the curl statement above and take a closer look at the output, you will also see that VertexAI adds some data in addition to the prediction results (I assume that the ability to do this is the reason why VertexAI uses the JSON envelope). Specifically, we also see the deployed model ID in the output, as well as the model and the model version that have been used.
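The Python SDK surfaces the same metadata as attributes of the object returned by predict. A small sketch (the attribute names assume a recent version of the google-cloud-aiplatform SDK):

prediction = endpoint.predict(
    instances = [[0.5, 0.35]]
)
# the actual prediction results
print(prediction.predictions)
# the ID of the deployed model that has served the request
print(prediction.deployed_model_id)
# resource name and version of the underlying model
print(prediction.model_resource_name)
print(prediction.model_version_id)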

It is also interesting to take a look at the log files that our endpoint has created. Let us use the gcloud logging reader to do this.

ENDPOINT_ID=$(gcloud ai endpoints list \
  --filter="display_name=vertex-ai-endpoint" \
  --format="value(name)")
type_filter="resource.type=aiplatform.googleapis.com/Endpoint"
endpoint_filter="resource.labels.endpoint_id=$ENDPOINT_ID"
location_filter="resource.labels.location=$GOOGLE_REGION"
filter="$type_filter AND $location_filter AND $endpoint_filter"
gcloud logging read \
  "$filter" \
  --limit 200 \
  --order=asc \
  --format="value(jsonPayload)"

In the first few lines, we see that VertexAI points us to a model location that is not the bucket that we initially used for the upload, again showing that the model registry actually holds copies of our artifacts. We can also see the first ping requests coming in (and if we added more lines, we would also see the prediction requests in the access log).

It is nice to have a deployed model, but what happens if we update the model? This is where model versions come into play. When you upload a model, you can, instead of specifying a model ID, specify the ID of an existing model, called the parent. When you do this, the model registry will create a new version of the model. Some attributes like the display name or labels are the same for all versions, whereas other attributes like (obviously) the version number are updated. So the code to do an upload would look as follows.

model = aip.Model.upload(
    serving_container_image_uri = image,
    artifact_uri = f"{bucket}/models",
    parent_model = "vertexaimodel",
    serving_container_predict_route = "/predictions/model",
    serving_container_health_route = "/ping",
    project = google_project_id,
    location = google_region
)

Let us try this out. The script upload_model.py already contains some logic to check whether the model already exists and will create a new version if that is the case. So let us simply upload the same model again and check the result.

python3 upload_model.py
gcloud ai models describe vertexaimodel

We should see that the version number in the output has automatically been incremented to two. When we only use the model ID or the model name to refer to the model, we will automatically get this latest version (this works via what is called a version alias). We can, however, also access previous versions by appending the version ID to the model ID, like this:

gcloud ai models describe vertexaimodel@1
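The same versioned reference works from Python as well. A short sketch, reusing the model_prefix assembled earlier:

# refer to version one of the model explicitly by appending the version ID
old_model = aip.Model(
    model_name = f"{model_prefix}vertexaimodel@1"
)
print(old_model.version_id)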

Let us now deploy the updated version of the model. We use the same code as before, but this time, we pass the traffic split as a parameter so that half of the requests go to the old version and half to the new one.

python3 deploy.py --split=50
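Under the hood, this amounts to passing a traffic percentage to the deploy call. A sketch of what this could look like (the traffic_percentage parameter assumes a recent version of the Python SDK):

endpoint.deploy(
    model = model,
    machine_type = "n1-standard-2",
    min_replica_count = 1,
    max_replica_count = 1,
    service_account = service_account,
    # route 50% of the traffic to the newly deployed model, the rest
    # stays with the models that are already deployed
    traffic_percentage = 50
)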

When you now run predictions again as above and look for the version number in the output, you should see that some requests are answered by the new version of the model and some by the old version. Note that we have only specified the split for the newly deployed model and VertexAI has silently adjusted the traffic split for the existing instances as well. Here is how our deployment now looks.

This is also reflected in the output of gcloud ai endpoints describe – if you run this again as above, you will see that there are now two deployed models with different versions of the model and that the traffic has been updated to be 50 / 50. Note that you can also read the traffic split using the traffic_split property of the Endpoint class and use its update method to change the traffic split, for instance to implement a canary rollout approach.
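A possible sketch of such a manual adjustment (old_deployed_model_id and new_deployed_model_id are placeholders for the actual deployed model IDs of your endpoint):

# read the current traffic split, a dictionary mapping deployed model IDs to percentages
print(endpoint.traffic_split)
# shift most of the traffic to the new version, keeping a small share on the old one
endpoint.update(traffic_split = {
    old_deployed_model_id: 10,
    new_deployed_model_id: 90
})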

Undeploying models and deleting endpoints

Suppose you want to undeploy a model, maybe version one of our model that is now obsolete, or maybe all deployed models in an endpoint. There are several ways to do this. First, an endpoint provides the convenience method undeploy_all which undeploys all deployed models in this endpoint. Alternatively, let us see how we can iterate through all deployed models and undeploy them one by one. Assuming that we have access to the endpoint, this is surprisingly easy:

for m in endpoint.list_models():
    endpoint.undeploy(deployed_model_id = m.id)

There are some subtleties when manually adjusting the traffic split (an undeployment is not valid if all remaining models would have zero traffic afterwards, so we should start by undeploying all models with zero traffic first), which, however, are not relevant in our case. Once all models have been undeployed, we can also delete the endpoint using endpoint.delete(). In the models directory of the repository, you will find a script called undeploy.py that deletes the deployed models and the endpoint. Do not forget to run this when you are done, as the running instances incur some cost.
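If you do want to handle the general case, one possible sketch is to undeploy the zero-traffic models first and delete the endpoint at the end:

# undeploy models that no longer receive traffic first, then the remaining ones
deployed_models = sorted(
    endpoint.list_models(),
    key = lambda m: endpoint.traffic_split.get(m.id, 0)
)
for m in deployed_models:
    endpoint.undeploy(deployed_model_id = m.id)
# once the endpoint is empty, it can be deleted
endpoint.delete()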

Exporting models

To close this blog post, let us quickly discuss how we can export a model that is stored in the model registry. Not surprisingly, exporting means that VertexAI copies the model archive, the container image or both to a GCS bucket or an Artifact Registry repository path that we can choose.

To export a model, we must specify a supported export format. The list of export formats that a model supports can be found using gcloud (or alternatively with the Python client, by accessing the supported_export_formats property):

gcloud ai models describe \
  vertexaimodel \
  --format="json(supportedExportFormats)"
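The Python equivalent is just as short, a small sketch using the model reference from above:

# print the list of export formats that our model supports
print(model.supported_export_formats)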

For our model, we should only see one entry, corresponding to the format “custom-trained”, which allows us to export both the image and the model. Let us do this. Run the script export.py which essentially consists of the following statement.

model.export_model(
    export_format_id = "custom-trained",
    artifact_destination = f"gs://vertex-ai-{google_project_id}/exports",
    image_destination = f"{google_region}-docker.pkg.dev/{google_project_id}/vertex-ai-docker-repo/export:latest"
)

As long as we stay within the same region (which I highly recommend, as otherwise Google charges cross-regional fees for the data transfer), this should complete within one or two minutes. You should now see that in the provided GCS path, VertexAI has created a folder structure containing the model name and a time stamp and placed our archive there.

gsutil ls -R gs://vertex-ai-$GOOGLE_PROJECT_ID/exports/*
gcloud artifacts docker images list \
    $GOOGLE_REGION-docker.pkg.dev/$GOOGLE_PROJECT_ID/vertex-ai-docker-repo

If you compare the digest of the exported image, you will find that it is identical to that of the prediction image that we used as part of the model. We can also download the model archive and verify that it really contains our artifacts (the simple command below only works if we have done only one export).

uri=$(gsutil ls -R \
  gs://vertex-ai-$GOOGLE_PROJECT_ID/exports/* \
  | grep "model.mar")
(cd /tmp ; gsutil cp $uri model.zip ; unzip -c model.zip -x model.bin)

This will download the archive to a temporary directory, unzip it and print the files (except the binary model.bin) contained in it. You should recognize our model, our handler and the manifest that the torch archiver has created.

This concludes our post for today. We now have a fair understanding of how we can create and update models (I have not explained deletion, but that should be fairly obvious), how we can set up endpoints and how models are deployed to endpoints for prediction. In the next post in this series, we will turn to simple training jobs.
