An introduction to workflows - one of two analysis modes you can use on the Terra platform. A workflow (also called a pipeline) is a series of steps performed by an external compute engine, often used for automated, bulk analysis (such as aligning genomic reads). Pipelines run on Terra are written in Workflow Description Language (WDL), a workflow specification language designed to be easy for humans to read and write.
Overview: Running a workflow in a Terra workspace
Running a workflow (pipeline) in a Terra workspace requires the following.
"Can compute" access to the workspace
You need permission to perform operations that incur Google Cloud costs (such as running workflows) in a workspace. You have this permission if someone shares a workspace with you as a "Can-Compute Writer." If you create or clone a workspace using your own Billing project, you are its Owner by default and can run workflows.
One or more workflows
If you clone a workspace that already contains workflows (see Showcase workspaces in the Library), these tools will be in your copy as well. If the Workflows tab of your workspace is empty, you can import workflows from the Terra library (code and workflows section).
Input data
Input data files can be located in the workspace Google bucket or an external bucket and linked to the workspace via metadata in a data table.
What happens when you run a workflow in Terra?
A workflow in its simplest form is a task consisting of:
- Path(s) of input files to read from Cloud Storage.
- A Docker image to run.
- Commands (the workflow) to run in the Docker image.
- Cloud resources to use (number of CPUs, amount of memory, disk size and type).
- Path(s) of output files/directories to write to Cloud Storage.
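These pieces map directly onto a WDL task. Here is a minimal sketch; the task name, command, file names, and resource values are illustrative, not taken from any particular Terra workflow:

```wdl
version 1.0

task CountLines {
    input {
        File input_file                      # path of an input file read from Cloud Storage
    }
    command <<<
        wc -l < ~{input_file} > line_count.txt
    >>>
    runtime {
        docker: "ubuntu:20.04"               # Docker image to run the command in
        cpu: 1                               # cloud resources: CPUs,
        memory: "2 GB"                       # memory,
        disks: "local-disk 10 HDD"           # and disk size/type
    }
    output {
        File line_count = "line_count.txt"   # output file written back to Cloud Storage
    }
}
```

Each block corresponds to one bullet above: `input` declares the Cloud Storage inputs, `command` holds the commands run inside the container, `runtime` names the Docker image and cloud resources, and `output` lists the files to copy back out.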
To run a workflow in Terra, you will:
- Specify the path(s) of input files from Cloud Storage.
- Specify runtime options, including the Docker image.
- Submit the workflow to Terra.
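Concretely, the input paths you specify are collected into a JSON object that maps fully qualified input names to `gs://` paths. A hypothetical example, assuming a WDL task named `CountLines` with a `File` input called `input_file` (the bucket and file names are illustrative):

```json
{
  "CountLines.input_file": "gs://my-workspace-bucket/data/sample1.fastq"
}
```

In Terra, the Workflows page builds this mapping for you: each field in the inputs form corresponds to one key in this JSON, and values can be typed in directly or drawn from columns in a data table.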
Behind the scenes, Terra takes care of the details:
- Terra sends the built-in Cromwell server a packet of information containing the workflow code and inputs.
- Cromwell - a Workflow Management System geared towards scientific workflows - parses the workflow and starts dispatching individual jobs to PAPI (the Pipelines API).
- PAPI executes the tasks on Google Compute Engine (GCE) virtual machines and writes the output to the workspace bucket.
The Pipelines API will:
- Create a Compute Engine virtual machine.
- Download the Docker image.
- Download the input files.
- Run a new Docker container with the specified image and command.
- Upload the output files.
- Destroy the Compute Engine virtual machine (VM).
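The lifecycle above can be sketched in shell-style pseudocode. This is illustrative only - PAPI performs these steps through its own service APIs, not a literal script, and the image, bucket, and file names are made up:

```
# 1. Create a Compute Engine VM (done via the Compute Engine API)
# 2. Download the Docker image
docker pull ubuntu:20.04
# 3. Localize the input files from the workspace bucket
gsutil cp gs://my-workspace-bucket/data/sample1.fastq .
# 4. Run a container with the specified image and command
docker run -v "$PWD":/work ubuntu:20.04 bash -c 'wc -l < /work/sample1.fastq > /work/line_count.txt'
# 5. Delocalize the output files back to the bucket
gsutil cp line_count.txt gs://my-workspace-bucket/outputs/
# 6. Delete the VM (done via the Compute Engine API)
```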
Overview of workflow submission in Terra, from Genomics in the Cloud by Geraldine A. Van der Auwera and Brian D. O'Connor (O'Reilly Media)
Cromwell and PAPI
Cromwell is an open-source (BSD 3-clause) execution engine written in Java that supports running WDL on three types of platform: a local machine (e.g., your laptop), a local cluster/compute farm accessed via a job scheduler (e.g., GridEngine), or a cloud platform (e.g., Google Cloud or AWS).
Pipelines API (aka "PAPI") is a Google Cloud service that provides an easy way to launch and monitor tasks running in the cloud.
Practice pipelining with the Workflows Quickstart
One way to get up and running quickly is to clone and run the workflows in a featured Showcase workspace. The Workflows Quickstart tutorial is a self-guided tutorial that includes everything you need to get hands-on experience running workflows.
Copy the Terra Workflows Quickstart workspace to your own Billing project and work through the three exercises.
Part 1 - Run a preconfigured workflow on a single sample from the Workflows page
Part 2 - Set up and run a workflow on two samples (you'll create a set in the process)
Part 3 - Run downstream analysis on a set of samples