Machine learning in Terra, Part I: Train in a Terra notebook

Amy Unruh


This series of articles describes how Terra, standalone or in conjunction with Google Cloud Platform’s (GCP) Vertex AI, can support your ML-based analysis. Vertex AI brings together the Google Cloud services for building ML under one unified UI and API plus SDK, and is compliant with HIPAA and other standards.

Part I (this article) covers the basics of training a model in a Terra notebook.

ML on Terra: Overview

There has been an explosion of activity in applying machine-learning-based techniques for research in biomedicine and related fields. Machine learning (ML) is increasingly becoming an important component of a researcher’s toolkit.

Recent ML projects in biomedicine

Documentation goals and objectives

These articles won't cover which ML techniques to use for which problems. Instead, we'll discuss how you can leverage Terra and GCP integrations to make your research process easier, more reproducible, and more shareable. The series will primarily cover how to facilitate deep learning using neural network (NN) architectures. Deep learning tasks often require large datasets and can be computationally intensive, but they often give more accurate results than other ML techniques.

Topics in this series of articles

  • How to use accelerators (e.g. GPUs) for model training and serving (prediction) in a Terra notebook
  • How to scale out (distributed) training, serving, handling large datasets, and more
  • Support for experimentation and iterative development (including experiment logging, setting up hyperparameter tuning searches and TensorBoard servers, and leveraging cached executions)
  • How to use Vertex AI Pipelines to support reproducibility and reuse, lineage and artifact tracking, and collaboration
  • Data preprocessing, feature engineering, and why feature stores can be useful
  • Tooling for analysis such as deriving dataset statistics, evaluation of trained models, and explanations of model predictions (XAI)
  • Monitoring the performance of models deployed for serving (prediction)

Part I (this article) covers training a model in a Terra notebook.

About the ML task and dataset used for the examples

The running examples in this article show training a Keras image classification model.

PatchCamelyon dataset: tissue samples

The PatchCamelyon benchmark consists of 327,680 color images (96 x 96px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue.
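To get a feel for what those numbers mean in practice, the figures above can be turned into a quick memory estimate. The sketch below is illustrative only. The loader assumes the data is read through the TensorFlow Datasets catalog entry `patch_camelyon`, which may not be how the example notebook reads its data, and the helper names (`uncompressed_bytes`, `to_gib`, `load_split`) are hypothetical.

```python
"""Rough size estimate and a hedged loading sketch for PatchCamelyon."""

NUM_IMAGES = 327_680        # image count quoted in this article
IMAGE_SHAPE = (96, 96, 3)   # 96 x 96 px, 3 color channels


def uncompressed_bytes(num_images=NUM_IMAGES, shape=IMAGE_SHAPE, bytes_per_value=1):
    """Bytes needed to hold every image in memory at once.

    bytes_per_value is 1 for uint8 pixels and 4 for float32.
    """
    height, width, channels = shape
    return num_images * height * width * channels * bytes_per_value


def to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


def load_split(split="train", batch_size=64):
    """Stream one split in batches instead of loading it all into memory.

    ASSUMPTION: the data comes from the TensorFlow Datasets entry
    'patch_camelyon'. TensorFlow is imported here, not at module level,
    so the size helpers above work without it installed.
    """
    import tensorflow as tf
    import tensorflow_datasets as tfds

    ds = tfds.load("patch_camelyon", split=split, as_supervised=True)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Stored as uint8 pixels, the full set comes to roughly 8.4 GiB, and four times that if converted to float32. That is one reason to stream batches from disk rather than hold the whole dataset in notebook memory.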

Figure: sample tissue images from the dataset.

Model architecture

For the examples, we’ll use one of Keras' prebuilt model architectures, Xception. Other prebuilt image classification architectures could be used in the same way. Training uses transfer learning, starting from model weights pretrained on the 'imagenet' dataset. For the PatchCamelyon dataset, the model then benefits from additional fine-tuning after a few epochs of transfer learning.
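The two-stage approach described above (transfer learning first, then fine-tuning) might look something like the following sketch. It is not taken from the example notebook. The epoch counts, learning rates, classification head, and the names `Phase`, `two_phase_plan`, `build_model`, and `run_plan` are all assumptions chosen for illustration, and `tf.keras.layers.Rescaling` assumes a reasonably recent TensorFlow 2 release.

```python
from typing import NamedTuple


class Phase(NamedTuple):
    name: str
    epochs: int
    train_base: bool       # whether the pretrained Xception layers are updated
    learning_rate: float


def two_phase_plan(transfer_epochs=3, fine_tune_epochs=5):
    """Hypothetical schedule: frozen base first, then a low-rate fine-tune."""
    return [
        Phase("transfer", transfer_epochs, train_base=False, learning_rate=1e-3),
        Phase("fine_tune", fine_tune_epochs, train_base=True, learning_rate=1e-5),
    ]


def build_model(input_shape=(96, 96, 3), weights="imagenet"):
    """Xception base plus a small binary classification head (assumed)."""
    import tensorflow as tf  # imported lazily so the plan helpers need no TF

    base = tf.keras.applications.Xception(
        include_top=False, weights=weights, input_shape=input_shape
    )
    inputs = tf.keras.Input(shape=input_shape)
    # Xception expects inputs scaled from [0, 255] to [-1, 1].
    x = tf.keras.layers.Rescaling(scale=1 / 127.5, offset=-1)(inputs)
    x = base(x, training=False)  # keep BatchNorm layers in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs), base


def run_plan(model, base, train_ds, val_ds, plan):
    """Train through each phase, recompiling when `trainable` changes."""
    import tensorflow as tf

    for phase in plan:
        base.trainable = phase.train_base
        # Recompile so the changed `trainable` flag takes effect.
        model.compile(
            optimizer=tf.keras.optimizers.Adam(phase.learning_rate),
            loss="binary_crossentropy",
            metrics=["accuracy"],
        )
        model.fit(train_ds, validation_data=val_ds, epochs=phase.epochs)
```

The fine-tuning phase uses a much smaller learning rate so that the pretrained weights are adjusted gradually instead of being overwritten.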

Machine Learning on Terra instructions

You can get started training an ML model directly in Terra from a notebook running in a Cloud Environment. Increasing the number of cores and GPUs (see Customizing your Cloud Environment) lets you boost your notebook power without needing to directly use other Google Cloud Platform (GCP) services. We recommend starting with in-notebook training as the default, then checking whether it remains sufficient for your purposes.

Step 1: Attach GPUs to your Jupyter Cloud Environment

To run a training job of any complexity, you’ll want to use GPUs. You’ll need to attach GPUs to your Cloud Environment as you set it up. See Starting and customizing your Jupyter app to learn more about options and how to set them up in Terra.
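Once the environment is running, it can be worth confirming from the notebook that the GPUs are actually visible. The sketch below is one way to do that, assuming the `nvidia-smi` command-line tool is present in your Cloud Environment image (it returns an empty list if not). The function names are hypothetical.

```python
import re
import shutil
import subprocess

# `nvidia-smi -L` prints one line per device, for example:
#   GPU 0: Tesla T4 (UUID: GPU-...)
_GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ")


def parse_gpu_list(text):
    """Return (index, model name) pairs parsed from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2)))
    return gpus


def list_gpus():
    """List attached GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return parse_gpu_list(result.stdout)
```

Calling `list_gpus()` in a notebook cell should return one entry per attached GPU. An empty list suggests the environment was created without GPUs, or that the driver tools are not installed.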

Step 2: Train in a Jupyter notebook

Once you’ve set up your Cloud Environment with GPUs, you can run distributed training jobs right from a notebook. To learn more about notebooks in Terra, see Interactive analysis with Jupyter notebooks.

If you’re using TensorFlow/Keras, you can accomplish this via tf.distribute.MirroredStrategy. This strategy is typically used for training on one instance with multiple GPUs (i.e. a Cloud Environment notebook using >1 GPU). TensorFlow makes it straightforward to do this via its support for distribution strategies, and other ML platforms provide similar support.

For an example of distributed training of an image classification model on Terra, see 01_keras_pcam.ipynb. Because the underlying model architecture is fairly complex and the image dataset is fairly large, this in-notebook training example requires more memory than many other examples do; use a Cloud Environment with at least 2 cores.

When you run the notebook in a Cloud Environment that has more than one GPU, you’ll see that this configuration is detected and that tf.distribute.MirroredStrategy is used for model training. You’re then taking advantage of the multiple GPUs to run single-node distributed training.
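A minimal sketch of this kind of detection logic follows. It is not the notebook's actual code. The helper names and the batch-size scaling rule are assumptions, while `tf.config.list_physical_devices`, `tf.distribute.MirroredStrategy`, and `tf.distribute.get_strategy` are standard TensorFlow calls.

```python
def strategy_kind(num_gpus):
    """Decide which strategy to use from the number of visible GPUs."""
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    return "mirrored" if num_gpus > 1 else "default"


def global_batch_size(per_replica_batch, num_replicas):
    """Scale the batch so each replica (GPU) still sees per_replica_batch."""
    return per_replica_batch * max(1, num_replicas)


def make_strategy():
    """Return MirroredStrategy when more than one GPU is visible."""
    import tensorflow as tf  # imported lazily so the helpers above need no TF

    gpus = tf.config.list_physical_devices("GPU")
    if strategy_kind(len(gpus)) == "mirrored":
        return tf.distribute.MirroredStrategy()
    # Default (no-op) strategy: works on a single GPU or on CPU only.
    return tf.distribute.get_strategy()


def compile_under_strategy(build_fn):
    """Build and compile a Keras model inside the strategy's scope.

    build_fn is a hypothetical zero-argument function returning a Keras model.
    Variables must be created inside scope() to be mirrored across GPUs.
    """
    strategy = make_strategy()
    with strategy.scope():
        model = build_fn()
        model.compile(
            optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
        )
    return strategy, model
```

After calling `compile_under_strategy`, you could batch the input dataset with `global_batch_size(64, strategy.num_replicas_in_sync)` so that adding GPUs increases throughput without shrinking the per-GPU batch.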

When to move to using Vertex AI services

At some point you might want to create a ‘native’ GCP project and access Vertex AI services from your notebook code. Reasons for using a native GCP project (rather than working in a Terra workspace) might include:

  • Your workflow has "outgrown" single-node training, and you want to further scale and distribute training, or run hyperparameter tuning jobs.
  • You want to serve predictions from an endpoint that is always up, autoscales with traffic, allows traffic splitting, and for which you can control access. You may also want to use Vertex AI’s support for explaining predictions.
  • As you ‘harden’ your ML workflow, you may want to take advantage of some of Vertex AI’s “ML Ops”-related services, including Vertex AI Pipelines, Model Monitoring, Feature Store, and others.
  • In some contexts, you may find it more straightforward and cost-effective to be able to launch a training job to run on Vertex AI, and then suspend your Terra notebook while it runs. This may be especially true for long runs. You’ll be billed only for the training VM(s) while they’re in use, and you don’t need to worry about monitoring for training completion or preventing your notebook from idling out (see Preventing runaway costs with Cloud Environment autopause).
  • You want to use some of Vertex AI’s AutoML products, which let you bring your own data, and automatically train a model architecture appropriate to your data.
  • You may want to use other GCP services not directly supported by Terra, such as Dataflow.

We’ll describe how to use these services in the next articles in this series.

What’s next?

Part II of this series discusses how to scale training and serving using Vertex AI.

Part III discusses ML Ops: what it is, why it is important for research in biomedical domains, and Vertex AI services and tooling that can help.

