Scaling ML overview
In-notebook training and prediction in Terra — as in the example notebook highlighted in Part I of this series — is useful for smaller-scale experimentation, often using a subset of the full training dataset.
Why scale?
However, biomedical research often involves the use of very large datasets, and the NN model architectures used may be very complex. In addition, it’s often not clear from the outset what model architectures and hyperparameters will give the best results. Experimentation in this space can be quite computationally expensive, requiring large memory allocations and many accelerators. And training experiments can take a long time to run— often multiple days (or much longer).
Challenges to scaling up
A Jupyter Cloud Environment that is sufficiently beefy for a deep learning training job may be expensive to keep running. So, even if you want to do single-node training, it may be useful to spin up the training job on a cloud service like Google Cloud Platform’s Vertex AI, and keep the notebook environment from which the job is launched in a lean configuration. You pay only for the resources you use — the Vertex AI training nodes are spun down as soon as the job has completed — and can keep your Terra notebook small and inexpensive. And, when you run a training job using cloud services, you don’t need to worry about the notebook environment automatically timing out during a long training run.
Additionally, for large-scale training, you may want to run on a distributed cluster of nodes, to increase the number of GPUs available to you. And, you may often want to do a hyperparameter tuning search, in order to get a sense of what model architecture params will give the best results. Vertex AI makes such jobs straightforward.
After your models are trained, you will often want to serve them (make them available at an endpoint for prediction). Deploying your trained models to the cloud means that things like scalable serving, management of multiple versions, and support for traffic splitting are handled for you.
For these reasons and more, it is often helpful to use services like Vertex AI in addition to in-notebook computation. Vertex AI brings together the Google Cloud services for building ML under one unified UI, API, and SDK. With Vertex AI, you can train and compare models using AutoML or your own custom code, and your models can be stored in one central model repository and deployed to the same Vertex AI endpoints.
The rest of this article explores these approaches in more detail. (Then, Part III of this series will introduce some other relevant Vertex AI and GCP services, like Vertex AI Pipelines and Dataflow.) In many cases, we’ll link to example notebooks that walk through how to configure and use Vertex AI services. As the examples show, it is straightforward to call out to Vertex AI services from a Terra notebook via a ‘native’ GCP project.
Where to find the sample notebooks
We’ll use a set of example notebooks that provide running image classification examples. The GitHub repository for the examples is here. Part I of this series introduced the training dataset and NN model architecture used in the examples.
Controlling costs
None of the example notebooks referenced in this article need to use GPUs on Terra. Instead, we'll use GPUs via the Vertex AI services, e.g. when running a training job on Vertex AI. That means the Terra notebook environment itself can be relatively inexpensive to run — and in fact the Terra notebook doesn’t need to stay running once you’ve kicked off a call to a service. You can always track the progress of a job via the GCP Cloud Console.
Training a model on Vertex AI
This example notebook shows how to configure and launch your training job on Vertex AI, and walks you through the process.
The notebooks demonstrate how you can pass a Python training script to define a training job, or package more complex code as a Python module.
job = aiplatform.CustomTrainingJob(
    display_name=MODEL_DISPLAY_NAME,
    script_path="task.py",
    container_uri=TRAIN_IMAGE,
    requirements=["your-import"],
    model_serving_container_image_uri=DEPLOY_IMAGE,
)
When you launch a training job, you pass the training params, and specify the container image on which to train, as well as the machine type and the number of GPUs to use.
model = job.run(
    model_display_name=MODEL_DISPLAY_NAME,
    args=CMDARGS,
    replica_count=1,
    machine_type=TRAIN_COMPUTE,
    accelerator_type=TRAIN_GPU.name,
    accelerator_count=TRAIN_NGPU,
    tensorboard=TENSORBOARD_INSTANCE,
    service_account=TRAINING_SA,
    sync=False,
)
You can monitor some info about the training job from the notebook as it runs, if you like, but you can also shut down the notebook without terminating the training run. So, you don’t need to worry about the Jupyter Cloud Environment idling out while the training is running.
You can then monitor the status of the running job from the GCP project’s Cloud Console, including detailed logging. (In many cases, you can also monitor training via TensorBoard, as described below).
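For example (a minimal sketch, assuming the job object from the snippet above; exact attributes can vary a bit by SDK version), once the job resource has been created you can check on it from the notebook like this:
print(job.resource_name)  # identifier for locating the job in the Cloud Console
print(job.state)          # current state, e.g. PIPELINE_STATE_RUNNING
# job.wait()              # optionally block until the training job completes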
Hyperparameter tuning
In the model development phase, it’s often important to perform a hyperparameter (HP) tuning search, to find the config that gives the best predictive accuracy. Hyperparameters can influence the model architecture (e.g., searching for the most optimal config for a set of hidden layers) or specify other training job inputs, e.g. searching for the best optimization function to use.
Often, the HP tuning training runs are done for fewer epochs than “full” training would require; then the top N best parameter sets are selected, and full training runs are done for those.
Depending upon the HP search space, the combinatorics can make this a very time-consuming process, and one that is best handled by an automated process, not ‘manual’ search. If multiple searches can be run in parallel, this will speed up the process significantly.
The Vertex AI HP tuning UI
Vertex AI makes parallelizable HP tuning jobs easy to set up. First, your training script or module needs to accept as arguments all of the hyperparameters that you want to include in your search. For example, if you want to try varying the learning rate or loss function, those must be included as training module params. Then, you configure a dictionary of hyperparams that specifies how each should be varied, and in what range, during the HP Tuning search.
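For example, a parameter spec dictionary (this is the pdict passed to the tuning job shown below; the hyperparameter names here are illustrative and must match arguments your training script accepts) might look like:
from google.cloud.aiplatform import hyperparameter_tuning as hpt

pdict = {
    "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
    "optimizer": hpt.CategoricalParameterSpec(values=["adam", "sgd"]),
}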
You will need to ensure that your training code supports unique checkpoint directories for each currently running trial, so that the multiple trials don’t stomp on each other. You can do this via an environment variable that holds the trial id:
trial_id = os.environ.get("CLOUD_ML_TRIAL_ID")
checkpoint_dir = f"{checkpoint_dir}/{trial_id}"
That is all that’s necessary— Vertex AI handles the rest. When you launch your hyperparameter tuning job, you specify these parameters:
- How the worker nodes should be configured (machine type, whether to use GPUs, training image)
- How many parallel trials you want your job to use (how many workers should be running at once)
- How many trials total to run. (You can also specify the HP tuning search algorithm to use— the default is a Bayesian-based search.)
If your project has the CPU and GPU quota available so you can make your HP tuning search highly parallelized, it will finish much more quickly.
# Create and run a HyperparameterTuningJob
custom_job = aiplatform.CustomJob(
    display_name=MODEL_DISPLAY_NAME,
    worker_pool_specs=worker_pool_specs,
    staging_bucket=BUCKET,
)
hp_job = aiplatform.HyperparameterTuningJob(
    display_name=MODEL_DISPLAY_NAME,
    custom_job=custom_job,
    metric_spec={'accuracy': 'maximize'},
    parameter_spec=pdict,
    max_trial_count=32,
    parallel_trial_count=4,
    search_algorithm=None,  # None selects the default (Bayesian-based) search
)
hp_job.run()
This notebook shows an example of how to set up and run an HP tuning search.
Note on HP tuning services: There are two HP tuning services provided by Vertex AI. Both use Vizier under the hood, but they are configured and used in different ways. The notebook examples use the service that is integrated with Vertex AI Training.
Multi-node distributed training
For large training jobs, single-node, multi-GPU configurations may not be sufficient. If that’s the case, you may want to distribute your training job across a cluster of nodes (each of which may be configured with multiple GPUs). Typically, it makes sense to hold off on moving to a distributed cluster configuration until it’s necessary, since you’ll see additional network latency— if training on a single node with multiple GPUs is feasible, that will be more efficient.
But when it’s helpful to train on a distributed cluster, Vertex AI makes this very straightforward. You’ll first need to change your training code in one respect: In a multi-worker scenario, it’s important that the workers don’t checkpoint to the same directory. Saving can contain collective ops, so all ‘workers’ must save, not just the ‘chief’. In the training module code, we can grab info about the cluster config from the environment, and use that to give each worker its own directories for checkpointing, model saving, etc.
tf_config = os.getenv("TF_CONFIG")
if tf_config:
    tf_config = json.loads(tf_config)
    if not _is_chief(tf_config["task"]["type"], tf_config["task"]["index"]):
        ...
        checkpoint_dir = os.path.join(checkpoint_dir, "worker-{}").format(
            tf_config["task"]["index"]
        )
Once that’s done, it’s straightforward to kick off the distributed training job:
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=MODEL_DISPLAY_NAME,
    python_package_gcs_uri=PYTHON_PACKAGE_GCS_URI,
    python_module_name='trainer.task',
    container_uri=TRAIN_IMAGE,
)
model = job.run(
    args=CMDARGS,
    replica_count=3,
    machine_type=TRAIN_COMPUTE,
    accelerator_type=TRAIN_GPU.name,
    accelerator_count=TRAIN_NGPU,
    sync=False,
)
This notebook shows an example of how to set up and run a multi-node training job.
Using Vertex AI’s Managed TensorBoard
TensorBoard supports multiple ML frameworks, and provides visualization and tooling for machine learning experimentation, including the ability to explore the model graph, run profiling, project embeddings, and more.
Vertex AI includes a managed TensorBoard service. After you create a TensorBoard instance, you can either upload log files to it directly, or use an integration with Vertex AI custom training to upload them from the training job. A number of the example workspace notebooks do the latter.
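As a minimal sketch (the display name is illustrative), creating a managed TensorBoard instance might look like this:
tensorboard = aiplatform.Tensorboard.create(display_name="my-experiments")
# Pass tensorboard.resource_name (plus a service account with the needed permissions)
# as the tensorboard= and service_account= args to job.run(), as in the training
# example above, to have logs uploaded from the training job.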
Then, you can visit the Vertex AI ‘Experiments’ panel in the Cloud Console. There, you can click on the ‘OPEN TENSORBOARD’ links to view the TensorBoard experiments for your training runs. (In the UI, the TensorBoard experiments are currently mixed in with other experiment information logged to the Vertex AI Metadata server.) Many of the example notebooks show how to do this as well.
Scaling out model serving (prediction)
After you’ve developed sufficiently good models, you will want to use them for prediction, and perhaps make them available for colleagues to do the same.
This can be accomplished at a small scale from within a Terra notebook (e.g., by loading a saved model), or by setting up a model server on an on-prem cluster.
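For example, small-scale in-notebook prediction might look something like this (a sketch, assuming a TensorFlow/Keras model saved to a GCS path; the path and the img_batch variable are illustrative):
import tensorflow as tf

model = tf.keras.models.load_model("gs://your-bucket/path/to/saved_model")
predictions = model.predict(img_batch)  # img_batch: a batch of preprocessed images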
However, at times you may find it useful to deploy a model in a more scalable manner. This might be the case if you’re developing a prototype and want to make a model available for serving, where the number of serving instances will autoscale according to demand. You might want to use GPUs and high-memory machines for inference, to reduce latency, or to set up traffic splitting between model versions as you run experiments.
Vertex AI’s support for model prediction can help with all of this. If you have a model that you want to serve via Vertex AI, you first upload it, then create a model Endpoint and deploy the model to the endpoint. Online prediction requests are then made to the endpoint. You can set up traffic splitting for a given endpoint, and for each deployed model you can configure the machine type, the container image, and the number and type of GPUs used, as well as the min and max number of serving instances, which controls the endpoint’s autoscaling.
You can control access to a given endpoint if you wish to make it available to colleagues. You can also create and use private endpoints, using VPC Network Peering to peer your network with the Vertex AI online prediction service— this may be useful for datasets with restricted access.
Creating a serving endpoint
This example notebook shows how to upload and deploy a trained model to an endpoint, and how to send prediction requests to the endpoint. For these examples, we’ve specified that the model is uploaded automatically, as part of the training process, after training completes. We’ve done this by accessing a special environment variable in the training code, AIP_MODEL_DIR, which allows the model to be saved to a location known to Vertex AI Training, from which it is uploaded. The job.run call shown above returns a model resource, associated with the uploaded model, and we can use that to create an endpoint to which the model is deployed for online prediction.
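In the training code, the automatic-upload step mentioned above looks roughly like this (a sketch for a Keras model; keras_model is an illustrative name for the trained model, and Vertex AI Training sets AIP_MODEL_DIR for the job):
import os

# Save to the location Vertex AI Training provides via AIP_MODEL_DIR, so the
# resulting model can be uploaded automatically when the job completes.
model_dir = os.environ.get("AIP_MODEL_DIR", "saved_model")
keras_model.save(model_dir)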
While not included in the example below, you can additionally specify the accelerator type and number to use for serving.
endpoint = model.deploy(
    deployed_model_display_name=MODEL_DISPLAY_NAME,
    traffic_split=TRAFFIC_SPLIT,
    machine_type=DEPLOY_COMPUTE,
    # accelerator_type=...
    accelerator_count=0,
    min_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
)
It is also possible to upload the model as a separate step— you may first wish to evaluate a trained model before you decide whether to take further action. We’ll see this approach in Part III of this article, with the Vertex AI Pipelines examples.
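As a rough sketch (the artifact path is illustrative; DEPLOY_IMAGE is the serving container image used above), that separate upload step might look like:
model = aiplatform.Model.upload(
    display_name=MODEL_DISPLAY_NAME,
    artifact_uri="gs://your-bucket/path/to/saved_model",
    serving_container_image_uri=DEPLOY_IMAGE,
)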
After the model is deployed, we can send prediction requests to it— either programmatically, or from the Cloud Console UI. A notebook example walks through how to do this.
predictions = endpoint.predict(instances=[img_array])
Bring your own data (don’t worry about building the model architecture)
The training sections above included scenarios for experimenting with and training your own model architectures.
It can also be useful to experiment with the results you get from the GCP AutoML and BigQuery ML (BQML) training services: often the generated models are very good, and may suffice for your purposes. And even if you suspect that a tailored model architecture might give better results, these services are a good way to generate a benchmark that your own model should outperform.
With these services, you bring your own data, and the model is constructed and trained for you. While not shown in the set of workspace notebooks, you can run these jobs via your ‘native’ GCP account directly from a Terra notebook too.
AutoML
The Vertex AI AutoML services search through a space of model architectures and hyperparameters to find a configuration that gives good results for your provided dataset, and then train the model with your dataset. The AutoML services include image classification and object detection, training with structured data (classification, regression, forecasting), and document and video analysis. You can interact programmatically with the service via the Vertex AI SDK. The Cloud Console’s UI makes it easy to create, evaluate, and use the trained models for both batch and online prediction.
If your modeling task is a good fit for one of the AutoML modeling scenarios (e.g., image classification), it can be very helpful to use AutoML results as a baseline. They demonstrate a minimum degree of accuracy that your own models should be able to achieve. In some cases, you may find that the AutoML models give close to the best performance.
The Vertex AI AutoML services will not be a good fit for all classes of model architectures, such as mixed input models. They can also be more restrictive in whether they allow the model to be exported for serving elsewhere (in addition to deployment to Vertex AI). But where there is a good fit, benchmarking with AutoML can be very helpful.
Example: AutoML Vision for image classification
Mosquito image data from the Debug project provides a good example of building an image classification model using AutoML Vision. For this data, the task is to predict the sex of the mosquito(s) shown in an image. Here’s a typical image:
To train using AutoML, first create a Vertex AI Dataset, e.g. by uploading a csv file that contains Cloud Storage (GCS) URIs for the input images and their labels, or by connecting to a BigQuery table with the data.
Such Datasets, once created, allow exploration of the dataset stats. This can be useful in helping to detect dataset imbalances, potential bias issues, correlations between input fields, etc.
You can use the Datasets in various contexts, including as input to an AutoML training task. For the ‘Debug’ task, we want to train for a single best label, but multi-label classification is also supported.
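While not shown in the workspace notebooks, a rough sketch of this flow via the Vertex AI SDK might look like the following (the display names, CSV path, and training budget are illustrative):
from google.cloud import aiplatform

dataset = aiplatform.ImageDataset.create(
    display_name="mosquito-images",
    gcs_source="gs://your-bucket/mosquito_labels.csv",
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
)

automl_job = aiplatform.AutoMLImageTrainingJob(
    display_name="mosquito-automl",
    prediction_type="classification",
    multi_label=False,  # train for a single best label
)
model = automl_job.run(
    dataset=dataset,
    model_display_name="mosquito-automl-model",
    budget_milli_node_hours=8000,  # roughly 8 node hours; adjust to your budget
    sync=False,
)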
After model training is complete, you can get the evaluation metrics for the model, and optionally deploy the model to Vertex AI to send it prediction requests.
AutoML Vision: model evaluation information
AutoML vision also supports object detection and segmentation.
AutoML and XAI (explanations of model predictions)
Compared to many other modeling approaches, it’s not as straightforward to understand why an ML model produces the results it does. But there are techniques that allow some analysis of a prediction; often you will see these techniques labeled “XAI”, for explainable AI. See more detail in this AI Explanations article.
Vertex AI’s AutoML services support XAI capabilities for some of the AutoML variants. After deploying a trained model for online serving, you can request an explanation of the prediction results. This is a useful aspect of using the AutoML service — in doing the XAI setup work for you, it makes it easier to explore why you’re getting the results you see.
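For a deployed model with explanations configured, requesting one looks much like requesting a prediction (a sketch, reusing the endpoint and img_array names from the serving example above):
explanation = endpoint.explain(instances=[img_array])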
For AutoML image classification, XAI support is currently in Preview. Vertex AI support for XAI is discussed in a bit more detail in Part III of this article.
AutoML Vision XAI, applied to the Debug dataset. The highlighted areas particularly contributed to the prediction of this mosquito as ‘male’.
BigQuery ML
BigQuery ML (BQML) lets you create and execute machine learning models in BigQuery using standard SQL queries on your own datasets, or any others that you have access to.
BQML supports many types of models, including supervised, unsupervised, and time-series. Unlike the AutoML services, not all of these models fall under the ‘deep learning’ umbrella.
Supervised models include linear and logistic regression, DNN models for classification and regression, and Boosted Trees. BQML also supports an AutoML Tables integration, as well as TensorFlow model importing.
Unsupervised models include K-means clustering, and PCA and Autoencoder to help with dimensionality reduction.
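As a rough sketch of what BQML training can look like when invoked from a notebook via the BigQuery Python client (the dataset, table, and column names are illustrative), a logistic regression model might be created like this:
from google.cloud import bigquery

client = bigquery.Client()
query = """
CREATE OR REPLACE MODEL `your_dataset.example_classifier`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['label']) AS
SELECT * FROM `your_dataset.training_table`
"""
client.query(query).result()  # runs the CREATE MODEL training job in BigQuery
# Once trained, ML.PREDICT and ML.EVALUATE queries can be run against the model.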
What’s next?
Part III of this series discusses ML Ops: what it is, why it is important for research in biomedical domains, and Vertex AI services and tooling that can help.
If you’re interested in further exploration of Vertex AI, these sample notebooks may be of interest.