Machine learning on Terra, Part III: ML Ops

Amy Unruh

When you try to ‘productionize’ or scale your ML analysis, what was built as the prototype is often only a small piece of what you need to pay attention to (where “productionize” indicates making a system or process more robust— it doesn’t necessarily need to be externally-facing). This article discusses some approaches that can help with other aspects.

Introduction: ML on Terra - Beyond the prototype 

In the past few years, there has been an explosion of activity in applying machine-learning-based techniques for research in biomedicine and related fields. Machine learning (ML) is increasingly becoming an important component of a researcher’s toolkit.

Part I and Part II of this series described how to scale and manage model training and serving, both directly from a Terra notebook and by leveraging Google Cloud Platform’s (GCP) Vertex AI.

However, often that’s only part of the story. Building the first proof-of-concept version of a machine learning system is usually pretty straightforward. But when you try to ‘productionize’ or scale, what was built as the prototype is often only a small piece of what you need to pay attention to (where “productionize” indicates making a system or process more robust— it doesn’t necessarily need to be externally-facing). This article discusses some approaches that can help with these other aspects.

Example notebook resources

We’ll use a set of example notebooks that provide running image classification examples. The GitHub repository for the examples is here. Part I of this series introduced the training dataset and NN model architecture used in the examples.

ML Ops: Challenges and solutions

The figure below shows a common perception when thinking about a machine learning application: many of us anticipate that the main challenge is going to be getting the models working properly and accurately.

A common perception of what's hardest when building an ML-based application

The reality is that building the model is only a small part of what you need to be paying attention to, and those other things can often require just as much attention and effort.

The reality is usually closer to this. (Credit: Hidden Technical Debt of Machine Learning Systems, D. Sculley, et al.)

Why it's hard to scale

Why do things become so much harder when trying to scale out an ML workflow? Here are just a few of the reasons:

  • Data cleaning and processing become very hard at scale. Dealing with data (getting it to where it needs to be, feature engineering, and so on) is a large part of the overall effort required to put a machine learning system into production.
  • It can also be hard to scale out training and serving infrastructure, to make sure that your system has sufficient resources when it needs them, and that it can scale back down when they’re not required.
  • Problems can arise from issues like model or data drift, or training/serving skew. Your production infrastructure needs to be able to detect and handle situations where a model is no longer sufficiently accurate, or new data might indicate model retraining is necessary, or online prediction data is not consistent with the feature engineering done for model training.
  • In a production environment, access control and security become even more important, as do governance and version management processes.

Guidance for scaling up

And so, we have the growing discipline of “ML Ops”, which has a goal of unifying machine learning system development and operations, to guide teams through the challenges of doing production machine learning, and automating for repeatability at scale. It takes both its name and some of its core principles and tooling from DevOps. However, it has its own challenges (for example, it’s necessary to manage the lifecycle of data and models, as well as code), and that has led ML Ops to evolve as a domain of its own. ML Ops helps you reduce the amount of time that it takes to reliably go from data ingestion to deploying your model in production, in a way that lets you monitor and understand your ML system.

ML Ops patterns and practices

ML Ops patterns and best practices can help address many of the problems listed above, including:

  • Formalize your ML workflows: for production, move away from stitched-together notebooks or monolithic scripts.
  • Your ML workflows should behave in the same way across environments. Workflow execution should also scale out resources when needed, and scale down when they’re not.
  • Designing for composability, modularity, and reuse of ML workflow building blocks will ensure that you can reliably reproduce and rerun your workflows.
  • Similarly, your infrastructure should support workflow monitoring, experiment tracking, versioning, and step caching. This typically requires making ML workflow metadata explicit.
  • The data scientists on your team will probably be prototyping in notebooks. You need well-defined processes to capture that work and move it out of notebooks for production use.
  • Mechanisms for supporting collaboration and role-based access control also become important. Informal methods of providing team members access won’t scale any more.

ML Ops in biomedical domains

ML Ops arguably becomes even more important in biomedical and healthcare domains than in many other areas. It’s typically necessary to handle very large volumes of data from multiple sources, and to train complex models on those large datasets; to track model and data versions and provenance, and support change management processes; to monitor and react to model performance and accuracy (and to analyze why a model is producing given results for an input); and to ensure that access control and data governance are properly handled while supporting collaboration.

In the rest of this article, we’ll discuss how to operationalize ML Ops concepts from Terra notebooks using Vertex AI tools and services. (The sections below don’t comprehensively include all the ML-Ops-oriented parts of Vertex AI).

Formalize your ML workflows with Vertex AI Pipelines

Google Cloud Platform’s (GCP) Vertex AI is a managed machine learning platform that aims to speed up experimentation with, and productionizing of, machine learning models. Vertex AI Pipelines can be viewed as the backbone of the Vertex AI ML Ops story.

Vertex AI Pipelines helps you to automate, monitor, and govern your ML systems by orchestrating your ML workflow in a serverless manner, and storing information about your workflow's artifacts and execution using Vertex ML Metadata.

You can define pipelines with either of two Python SDKs: KFP or TFX. For the examples in this article, we’ll use the KFP SDK. You then use the Vertex AI SDK to run pipelines and monitor their progress.

You can define, run, and modify these pipelines programmatically, e.g. directly from a Terra Notebook, using a connected GCP project (see examples below). You can also access the Cloud Console-based UI for many operations (with gcloud support coming soon).

The steps (components) in a Vertex Pipeline are container-based, which ensures that execution of a step is reproducible, and that step components can be reused and shared. There are sets of prebuilt components for accessing Vertex AI services as well as other GCP services, and it is straightforward to define your own ‘custom’ pipeline components using the KFP SDK. In this article, we’ll highlight example pipelines that consist of both pre-built and custom components.

You don't have to build your own containers

While the steps are container-based, you typically won’t have to build your own containers unless you want to. The SDK adds tooling so that you can specify library installs and the container command/entry point along with a suitable base image.

In the rest of this section, we’ll describe some Vertex AI Pipelines features in more detail, and then discuss how it compares to some other workflow management frameworks common to the biomedical domain.

Using pipelines to codify in-notebook ML workflows

In the example notebooks of Part I of this article, we trained a model, showed how to generate some model metrics, then uploaded the model to Vertex AI and deployed it to a Vertex AI Endpoint for scalable serving. That is a machine learning workflow, but its pieces are informally specified in notebook cells. It can be problematic to work this way for a number of reasons [1]. It can be hard to track notebook cell tweaks in a rigorous manner: changes to an ‘upstream’ cell can give rise to a situation where the indicated outputs of running downstream cells are no longer valid.

In addition, there is typically little support for logging metadata about the results of notebook cell execution. (However, later in this article, we’ll discuss using the Vertex AI Experiments API in-notebook to track such information). A workflow can be implicitly defined by a notebook, but the workflow and its steps are not first-class entities, and so it’s problematic to version them at finer granularity than the notebook.

As an alternative, we can define pipelines that specify an ML workflow. We can still work in a notebook environment, but by codifying a workflow as a pipeline, it becomes reliably reproducible: we can share both the pipeline definition and any components we defined, put them under source control, and know they will run the same elsewhere. When the pipelines run, they automatically log execution and resource metadata (as discussed further below), so that an ‘audit trail’ of the workflow’s execution is recorded.

This example notebook shows how to define and run Vertex AI Pipelines that support the ML workflows of Part I. In the process, the notebook shows how to generate YAML-based specifications of any custom components (steps) that have been defined. These YAML files can be put under source control, and then shared and loaded to define a step in some other context.

A Vertex AI Pipeline.

Below is an example of a Vertex AI pipeline; see its notebook for the full code. All the gcc_aip methods define pre-built Vertex AI pipeline steps. You can use these components without writing any additional code. The classif_model_eval_metrics step is a custom (user-defined) component, which extracts some model metrics info and makes a decision about whether or not the model should be deployed.

Note the ‘with dsl.Condition’ statement, which defines a conditional branch for the pipeline: it will upload and deploy the trained model based on the output of the classif_model_eval_metrics step.

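def image_classification_pl(
    project: str = PROJECT_ID,
    location: str = LOCATION,
    gcs_workdir: str = GCS_WORKDIR,
    # ...<other pipeline input parameters>...
):
    # Pre-built component: run the custom training job.
    training_job_run_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
        # ...<arguments elided; see the notebook>...
    )

    # Pre-built component: create the endpoint to deploy to.
    endpoint_create_op = gcc_aip.EndpointCreateOp(
        # ...<arguments elided>...
    )

    # Custom component: read model metrics and decide whether to deploy.
    model_eval_task = classif_model_eval_metrics(
        bucket_name, gcs_metrics_path, thresholds_dict_str
        # ...<any further arguments elided>...
    )

    # Upload and deploy only if the evaluation step approved the model.
    with dsl.Condition(
        model_eval_task.outputs["dep_decision"] == "true",
        # ...<other arguments elided>...
    ):
        model_upload_op = gcc_aip.ModelUploadOp(
            # ...<arguments elided>...
        )

        model_deploy_op = gcc_aip.ModelDeployOp(
            traffic_split={"0": 100},
            # ...<other arguments elided>...
        )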
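To make the conditional branch more concrete, here is a minimal sketch of the kind of decision logic a custom step such as classif_model_eval_metrics might contain. It is plain Python rather than a KFP component, and it is not the notebook's actual code: the function name deployment_decision, the metric name "accuracy", and the JSON layout of the thresholds string are all assumptions made for illustration.

```python
import json
from typing import Dict


def deployment_decision(metrics: Dict[str, float], thresholds_dict_str: str) -> str:
    """Return "true" if every metric meets its minimum threshold, else "false".

    The thresholds arrive as a JSON string, mirroring the thresholds_dict_str
    pipeline parameter. The string return value matches the comparison against
    "true" in the pipeline's condition. (Hypothetical helper.)
    """
    thresholds = json.loads(thresholds_dict_str)
    for name, minimum in thresholds.items():
        # A metric that was never reported counts as failing its threshold.
        if metrics.get(name, float("-inf")) < minimum:
            return "false"
    return "true"


# Hypothetical metrics from a training run.
reported = {"accuracy": 0.91}
print(deployment_decision(reported, '{"accuracy": 0.95}'))  # threshold not met
print(deployment_decision(reported, '{"accuracy": 0.85}'))  # threshold met
```

Returning a string rather than a boolean keeps the output easy to compare inside a pipeline condition.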

Metadata tracking

Vertex AI Pipelines does automatic logging of metadata as a pipeline executes, supporting workflow artifact and lineage tracking. For example, an ML model's lineage may include the training data, hyperparameters, and code that were used to create the model. Vertex AI Pipelines writes to the Vertex ML Metadata server, allowing for querying for a given pipeline’s information, and analysis and comparison across pipeline runs.

View lineage information by clicking on 'VIEW LINEAGE' after selecting an output Artifact of a Pipeline step.

This view lets you see how resources and other artifacts are connected by step executions; in a sense, it is the inversion of the execution graph above.
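One way to think about lineage is as a graph in which step executions connect input artifacts to output artifacts. The sketch below is a toy, in-memory illustration of walking that graph backwards from an artifact. It does not use the Vertex ML Metadata API, and the step and artifact names are hypothetical.

```python
from typing import Dict, List, Set

# Each execution records which artifacts it consumed and produced.
# (Hypothetical steps and artifact names.)
EXECUTIONS: Dict[str, Dict[str, List[str]]] = {
    "train": {"inputs": ["training_data", "hyperparameters"], "outputs": ["model"]},
    "evaluate": {"inputs": ["model", "test_data"], "outputs": ["metrics"]},
    "deploy": {"inputs": ["model", "endpoint"], "outputs": ["deployed_model"]},
}


def upstream_lineage(artifact: str, executions: Dict[str, Dict[str, List[str]]]) -> Set[str]:
    """Collect every artifact that directly or indirectly led to `artifact`."""
    lineage: Set[str] = set()
    frontier = [artifact]
    while frontier:
        current = frontier.pop()
        for record in executions.values():
            if current in record["outputs"]:
                for parent in record["inputs"]:
                    if parent not in lineage:
                        lineage.add(parent)
                        frontier.append(parent)
    return lineage


print(sorted(upstream_lineage("deployed_model", EXECUTIONS)))
```

Here the deployed model traces back through the model to the training data and hyperparameters, which is the kind of question a lineage view answers.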

Vertex AI Pipelines step caching

Vertex AI Pipelines supports step execution caching, which means that if nothing has changed about the inputs, output specification, or component specification, the results of the previous execution can be reused. (The cached result doesn't have a time-to-live (TTL), and can be reused as long as the cache entry is not deleted from Vertex ML Metadata. If caching is not appropriate for a scenario, you can disable execution caching for either a given step or for an entire pipeline).

Make sure to delete the metadata entry for Vertex AI resources

If you delete a Vertex AI resource (e.g., a model) but not the corresponding ML Metadata entry, caching will behave improperly, because Vertex AI Pipelines determines caching based on the existence of the ML Metadata entry and not of the resource.

This feature can be very useful for iterative development (and more). The example pipelines consider model metrics before deciding whether or not a model is accurate enough to deploy. Suppose we decide that we want to change the 'threshold' information that we're using to make the deployment decision. We can do this by cloning the pipeline, and changing one parameter.


The new pipeline run can use the cached versions of the upstream training and endpoint creation step executions— whose input parameters did not change. The ability to leverage caching saves a lot of development time, particularly for long-running steps such as model training.
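The caching behavior described above can be sketched in a few lines of plain Python. This is a conceptual toy only, not how Vertex AI Pipelines is implemented: the cache key here is simply a hash of a step's name, a version label standing in for its component specification, and its inputs. The functions train and evaluate are stand-ins.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

_cache: Dict[str, Any] = {}
executed_steps: List[str] = []  # records which steps actually ran


def cache_key(component_spec: str, inputs: Dict[str, Any]) -> str:
    """Hash the component spec together with its inputs."""
    payload = json.dumps({"spec": component_spec, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(name: str, fn: Callable[..., Any], inputs: Dict[str, Any], spec_version: str = "v1") -> Any:
    """Run a step, or reuse the earlier result if spec and inputs are unchanged."""
    key = cache_key(f"{name}:{spec_version}", inputs)
    if key in _cache:
        return _cache[key]
    executed_steps.append(name)
    result = fn(**inputs)
    _cache[key] = result
    return result


def train(epochs: int) -> Dict[str, float]:
    return {"accuracy": 0.9}  # stand-in for a long-running training job


def evaluate(accuracy: float, threshold: float) -> str:
    return "true" if accuracy >= threshold else "false"


def run_pipeline(epochs: int, threshold: float) -> str:
    metrics = run_step("train", train, {"epochs": epochs})
    return run_step("evaluate", evaluate, {"accuracy": metrics["accuracy"], "threshold": threshold})


first = run_pipeline(epochs=5, threshold=0.95)
second = run_pipeline(epochs=5, threshold=0.85)  # only the threshold changed
print(first, second, executed_steps)
```

In the second run only the evaluation step executes again, because the training step's inputs did not change.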


Other workflow frameworks

It’s useful to compare Vertex AI Pipelines to other workflow frameworks commonly used for bioinformatics, such as dsub, WDL and Cromwell, and Nextflow, particularly in conjunction with the Cloud Life Sciences API, and to discuss when you’d want to use each. This will be the topic of an upcoming article.

Data pre-processing and feature engineering

It’s often required to do some pre-processing of a dataset, to select and transform raw data into a form useful for training.

As an example, for many TF.Keras-based image classification tasks, the Keras training code will use a utility, tensorflow.keras.preprocessing.image_dataset_from_directory, to create an input dataset, with the labels (classes) inferred from the directory names. To get image data into the format expected by image_dataset_from_directory, the images need to be organized in subdirectories according to label.

This is one category of preprocessing that you may need to do, but there are many other types of feature engineering that may be necessary in order to convert training data into a form useful for the training task. Maybe you want to do some bucketing, or create feature crosses, or process texts to create an input vocabulary, or reduce input dimensionality, or derive some statistics that will be used as additional inputs, etc.
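As a small illustration of two of the transformations mentioned above, here is a sketch of bucketing and feature crossing in plain Python. The helper names, the bucket boundaries, and the example feature values are all hypothetical; in practice you would more likely use your ML framework's preprocessing utilities.

```python
import bisect
from typing import Sequence


def bucketize(value: float, boundaries: Sequence[float]) -> int:
    """Map a numeric value to the index of the bucket it falls into.

    With boundaries [18, 40, 65], the buckets are
    (-inf, 18), [18, 40), [40, 65), [65, +inf).
    """
    return bisect.bisect_right(boundaries, value)


def feature_cross(*parts: object) -> str:
    """Combine several categorical values into a single crossed feature."""
    return "_x_".join(str(part) for part in parts)


AGE_BOUNDARIES = [18, 40, 65]  # hypothetical boundaries

age_bucket = bucketize(37.0, AGE_BOUNDARIES)
crossed = feature_cross(age_bucket, "lung")  # hypothetical second feature
print(age_bucket, crossed)
```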

In many cases your ML framework will provide some support for feature engineering, including libraries like tf.keras.preprocessing and TensorFlow Transform (TFT), and the TensorFlow tf.feature_column methods. (Where you can push some feature engineering to the model graph, this helps address training-serving skew, by ensuring that your processing is done consistently across both training and prediction on new instances: another useful ML Ops design pattern).

For the image classification scenario above, suppose the initial data format is a BigQuery table with two fields per row: the Google Cloud Storage (GCS) path to an image, and its label, and that we want to organize the images by subdirectories before we run the training job [2]. The image processing is an “embarrassingly parallel” activity: we can process each row in the original table independently, as we copy each image to its correctly-named subdirectory. Dataflow is a good fit for such pre-processing.


Dataflow, Google’s managed Apache Beam service, executes a wide variety of data processing patterns, for both streaming and batch processing. It can scale out to many worker nodes and process a large dataset quickly.
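To show why this work parallelizes so easily, here is a sketch of the per-row logic in plain Python: each (image path, label) row is mapped to a destination path independently of every other row. This is not Beam or Dataflow code, and it only plans the copies rather than performing them. The bucket name, paths, and labels are hypothetical, and a local thread pool stands in for distributed workers.

```python
import posixpath
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

DEST_ROOT = "gs://example-bucket/organized"  # hypothetical destination


def destination_path(row: Tuple[str, str], dest_root: str) -> Tuple[str, str]:
    """Map one (source_path, label) row to (source_path, destination_path)."""
    source_path, label = row
    filename = posixpath.basename(source_path)
    return source_path, posixpath.join(dest_root, label, filename)


# Hypothetical rows, as they might be read from the source table.
rows: List[Tuple[str, str]] = [
    ("gs://example-bucket/raw/img_001.png", "class_a"),
    ("gs://example-bucket/raw/img_002.png", "class_b"),
    ("gs://example-bucket/raw/img_003.png", "class_a"),
]

# No row depends on any other, so the work can be spread across workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    copy_plan = list(pool.map(lambda row: destination_path(row, DEST_ROOT), rows))

for source, destination in copy_plan:
    print(source, "->", destination)
```

The resulting layout, one subdirectory per label, is the structure that image_dataset_from_directory expects.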

Using Dataflow for ML pre-processing

For the example above, we can create a Dataflow pipeline (not to be confused with Vertex AI Pipelines) to organize a set of image files.  We can then add a data-processing step to a Vertex AI Pipeline that runs this Dataflow pipeline— that is, it does the Dataflow-based processing as the first step, and outputs the resultant GCS directory, which can be used as input to the training step.

We can do this via two of the prebuilt components for Vertex AI and GCP: DataflowPythonJobOp and WaitGcpResourcesOp. The figure below shows such a pipeline, with the two new steps added: one to launch the Dataflow job, and the other to wait for it to complete.


An example Vertex AI Pipeline with Dataflow-based preprocessing steps added.

Feature stores

If your team is doing a lot of feature engineering, it can be challenging to share the results with each other, and to develop a consistent process for feature generation. By using a feature store, you can reuse already-generated features instead of re-deriving them for each new model or training scenario, increasing the velocity of development and helping to prevent training-serving skew— since you can use the derived features consistently for both training and serving.

The Vertex AI Feature Store provides a centralized repository for organizing, storing, and serving ML features at large scale. It is a fully managed solution; it manages and scales the underlying infrastructure for you, such as storage and compute resources. This means that you can focus on the feature computation logic instead of worrying about the challenges of deploying features into production. This notebook may be helpful if you want to explore further.

Experiment tracking

As noted above, one issue with developing and experimenting in notebooks is that cells can be unobtrusively changed, and a notebook can support many iterations of development, where if you are not careful, it is not always clear in retrospect which incarnation of the notebook cells produced a given set of results.

One approach that can help is being rigorous about logging persistent metadata about your experiments, how they relate to each other, and their results. That way, information about the experiments you’ve done is recorded outside the notebook environment and can be retrieved and analyzed later.

Examples of experiment tracking

This notebook shows how you can use Vertex AI Experiments, part of the Vertex AI SDK, to log information about an experiment or series of experiments, and track information about the parameters used for (training) runs or workflows, and the metrics associated with them. An ‘Experiment’ is essentially a semantic grouping of information about a set of training runs (or other activities), and you can compare runs within a given experiment or across experiments. Information is logged to the Vertex ML Metadata server.


A view from the Cloud Console of some of the Experiment data collected for a training run. This information can also be queried directly from a notebook.

Vertex AI Experiments may be particularly helpful in the prototyping stage, when you’re trying to track the results of many different options or configs, and have not yet formalized your workflow. As shown in the notebook examples, you can retrieve the logged information via a pandas dataframe for analysis and comparison.
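The underlying idea of recording each run's parameters and metrics in one place, and then comparing runs, can be sketched without any cloud services. The ExperimentLog class below is a hypothetical in-memory stand-in, not the Vertex AI Experiments API; the run names, parameters, and metric values are made up for illustration.

```python
import json
from typing import Any, Dict, List


class ExperimentLog:
    """In-memory stand-in for an experiment tracker (hypothetical helper)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.runs: List[Dict[str, Any]] = []

    def log_run(self, run_name: str, params: Dict[str, Any], metrics: Dict[str, float]) -> None:
        """Record the parameters and resulting metrics of one run."""
        self.runs.append({"run": run_name, "params": dict(params), "metrics": dict(metrics)})

    def best_run(self, metric: str) -> Dict[str, Any]:
        """Return the run with the highest value of the given metric."""
        return max(self.runs, key=lambda run: run["metrics"][metric])

    def to_json(self) -> str:
        """Serialize the log so it can be stored outside the notebook."""
        return json.dumps({"experiment": self.name, "runs": self.runs}, indent=2)


log = ExperimentLog("image-classification-prototype")
log.log_run("run-1", {"learning_rate": 0.01, "epochs": 5}, {"accuracy": 0.87})
log.log_run("run-2", {"learning_rate": 0.001, "epochs": 10}, {"accuracy": 0.91})

best = log.best_run("accuracy")
print(best["run"], best["params"])
```

Serializing the log is the important part: the record then survives later edits to the notebook cells that produced it.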

Explainable AI (XAI)

‘Deep learning’ techniques can often give more accurate results than approaches like linear models and decision trees. However, these more complex models tend to be more opaque as well: it’s harder to understand why a model has generated a particular prediction. The ability to do this can be particularly important for biomedical and related domains.

Explainable AI, often abbreviated as XAI, can help with this, at least to an extent. This GCP AI Explanations article provides a survey of some of the available techniques.

Feature Attributions

One form of XAI is instance-level feature attributions, which provide a per-feature attribution score proportional to the feature's contribution to the model's prediction. You can use this information to see how much each feature in the data contributed to the predicted result, verify that the model is behaving as expected, recognize bias in your models, and get ideas for ways to improve your model and your training data.

Vertex Explainable AI integrates feature attributions into Vertex AI, for the following types of models:

  • AutoML image models (classification models only). An example of this was shown in Part I of this article, using a mosquito image from the ‘Debug’ dataset.
  • AutoML tabular models (classification and regression models only)
  • Custom-trained image models. (‘Custom-trained’ means that you run your own training code, rather than use an AutoML service. We’re using custom training for the Vertex AI training examples in the notebooks accompanying this article.)
  • Custom-trained models based on tabular data.

One nice aspect of these Vertex Explainable AI integrations is the UI support in the Cloud Console, which makes it easy to explore prediction explanations. For example, for AutoML tabular models, feature attributions are displayed in Google Cloud Console as “feature importance”. You can see model feature importance for the model overall, and local feature importance for both online and batch predictions.

For AutoML Vision models, you can choose to use the integrated gradients (IG) method or the XRAI method, which combines the integrated gradients method with additional steps to determine which regions of the image contribute the most to a given class prediction. For AutoML Tabular models, the Sampled Shapley method is used. (These models are meta-ensembles of trees and neural networks, so IG is not appropriate).


AutoML Vision XAI, applied to the Debug dataset.
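To give a feel for the integrated gradients method mentioned above, here is a minimal sketch on a tiny hand-written function instead of a neural network. It approximates the path integral of the gradient from a baseline to the input with a midpoint Riemann sum, and estimates gradients by finite differences. This illustrates the general technique only and is not how Vertex Explainable AI implements it; toy_model, the input, and the all-zeros baseline are assumptions chosen for the example.

```python
from typing import Callable, List, Sequence


def numerical_gradient(f: Callable[[Sequence[float]], float], x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Estimate the gradient of f at x by central differences."""
    grad = []
    for i in range(len(x)):
        upper, lower = list(x), list(x)
        upper[i] += eps
        lower[i] -= eps
        grad.append((f(upper) - f(lower)) / (2 * eps))
    return grad


def integrated_gradients(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate integrated gradients along the straight path baseline -> x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numerical_gradient(f, point)):
            totals[i] += g
    # Scale the average gradient by each feature's distance from the baseline.
    return [(xi - b) * total / steps for xi, b, total in zip(x, baseline, totals)]


def toy_model(v: Sequence[float]) -> float:
    """A stand-in 'model': one interaction term plus one linear term."""
    return v[0] * v[1] + 2.0 * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
print(attributions)
print(sum(attributions), toy_model(x) - toy_model(baseline))
```

A useful sanity check is that the attributions sum (approximately) to the difference between the model's output at the input and at the baseline.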

Example-based Explanations

Example-based Explanations is a new service from Vertex AI, currently in Preview. The service works by first representing the examples in a latent (embeddings) space, and subsequently finding approximate nearest neighbors in that space. The service requires a model that translates the raw data into an embedding space. It enables analogy-based explanations for data with applications in error analysis, model debugging, and batch labeling of new data. In turn, this can lead to more accurate and robust models, and efficient data labeling pipelines.

Example-based explanations can retrieve similar data in your training dataset for any given prediction. This is particularly useful for debugging errors, since similar examples can expose why a particular example was mispredicted.


Using example-based explanations to find images similar to one that was misclassified.

For example, suppose we had a trained image classification model that misclassified a bird as a plane. Using example-based explanations, we can retrieve other images in the dataset which the model views as similar to our misclassified image. In examining the most similar images in the dataset, we can see that both our misclassified image of a bird and the similar images of airplanes are dark silhouettes, which highlights a potential lack of bird silhouettes in the training data. Example-based explanations have shown us that to improve our model, we need to augment our training data with more silhouette images of birds.
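The retrieval step behind this kind of explanation can be sketched as a nearest-neighbor search over embeddings. The sketch below uses exact cosine similarity on tiny hand-made vectors, whereas the service described above uses approximate nearest neighbors over embeddings produced by a model. The example ids and embedding values are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_examples(query: Sequence[float], training: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the ids of the k training examples most similar to the query."""
    scored = [(cosine_similarity(query, embedding), example_id) for example_id, embedding in training.items()]
    scored.sort(reverse=True)
    return [example_id for _, example_id in scored[:k]]


# Hypothetical embeddings of training images.
training_embeddings = {
    "plane_silhouette_1": [0.9, 0.1, 0.0],
    "plane_silhouette_2": [0.8, 0.2, 0.1],
    "bird_in_daylight": [0.1, 0.9, 0.3],
}

# Hypothetical embedding of the misclassified bird silhouette.
misclassified_bird = [0.85, 0.15, 0.05]

neighbors = nearest_examples(misclassified_bird, training_embeddings, k=2)
print(neighbors)
```

In this toy data the silhouette of the bird sits closest to the plane silhouettes, which is the pattern that would prompt adding bird silhouettes to the training set.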

Vertex AI Workbench

The Vertex AI Workbench (previously named “GCP Notebooks”) aims to be a single development environment for the entire data science workflow. This environment may be of interest when you’re working in your ‘native’ GCP project outside Terra.

You can use Vertex AI Workbench's notebook-based environments to query and explore data, develop and train a model, and run your code as part of a pipeline.

The notebook environments come with many commonly-used packages and libraries pre-installed (including everything necessary to connect to and work with other Vertex AI and GCP services), support both Python and R, and provide variants with both TensorFlow Enterprise and PyTorch pre-installed.

There are two Workbench notebook environments available:

  • User-managed Notebooks: Notebooks reside in the user’s project, and since these notebooks are VM instances under the hood, quotas like CPU, GPU, and Network are from the user’s project, specifically their Compute Engine quotas.

  • Managed Notebooks: Notebooks reside in Google-managed projects, often called Tenant Projects. Currently, the user cannot control the Tenant project configuration, but resources can be increased for a given project if need be. In the future, a user will be able to request resource increases for Managed Notebooks directly from the GCP Cloud Console.

The Managed notebooks are the suggested choice going forward, unless a user needs the additional config afforded by the user-managed notebooks.

New features in managed notebooks

The managed notebooks provide a number of new features, including:


This article motivated the growing ML Ops field, described some useful ML Ops patterns and practices, and introduced some ML Ops-flavored Vertex AI services and tools. We pointed to some example notebooks that highlight some of these services.

If you’re interested in further exploration of Vertex AI, these sample notebooks may be of interest.


  1. That said, this article will also discuss tooling for automated notebook execution.
  2. We would only need to do this step once per data set; after we’ve run it, we can remove the step from subsequent pipelines— or leverage step caching to avoid executing it each time. 
