Pipelining with workflows

Allie Hajian

This article introduces workflows, one of two analysis modes you can use on the Terra platform. Workflows (aka pipelines) are a series of steps performed by an external compute engine. They are often used for automated, bulk analysis such as aligning genomic reads. Pipelines run on Terra are written in Workflow Description Language (WDL), a workflow processing language that is easy for humans to read and write.

Overview: Workflows versus interactive analysis

How do you choose which analysis mode to use when analyzing data in Terra? Broadly, it depends on your use case - in particular, whether your analysis can be automated to run in the background or needs interaction to run. 

Three reasons to use a workflow instead of an interactive analysis

1. You want to run multiple tools/commands in a specific order
2. You want to be able to run it multiple times in the same way (maybe with a few parameter changes)
3. You want to run something that takes a long time and doesn't require your input once it's running

In short, if you can determine up front all the inputs/parameter values that need to be applied, and you want to be able to hit play and go out for lunch, use a workflow.

Practice pipelining with the Terra on GCP Workflows tutorial

To learn the basics of running a workflow in Terra, try the Workflows Quickstart tutorial, a self-guided tutorial that includes everything you need to get hands-on workflows experience. The Terra on GCP Workflows Quickstart is the second in a series of three tutorials that walk through a mock study of the correlation between height and grades for a cohort of 7th, 8th, and 9th graders.

Hands-on workflows practice

You will first run a preconfigured workflow, then set up and run the same workflow from a blank configuration card. As a bonus, you can run a follow-up third workflow to analyze data generated by the first exercises.

Terra on GCP Quickstart tutorial workspace | Step-by-step guide

Work through the exercises in your own copy of the Terra on GCP Quickstart workspace. You should complete the Terra (GCP) Quickstart: Data tables tutorial first. 

Workflows quickstart flow

Diagram showing the three main steps to complete the T101 Workflows Quickstart tutorial. Step 1 is 'Run preconfigured workflow (student data)'. Step 2 is 'Configure and run workflow (student data)'. Step 3 is '(bonus) Run a follow-up workflow on output data'. Each step is represented by a blue rectangle with blue arrows connecting the steps, in order.

What do you need to run a workflow in a Terra workspace?

Running a workflow (pipeline) in a Terra workspace requires the following: 

"Can compute" access to the workspace  

You need permission to perform any operation that incurs a Google Cloud cost (e.g., running workflows) in a workspace. You have this permission if someone shares a workspace with you as a "can-compute" writer. If you create or copy a workspace using your own Billing project, you are the owner by default and can run workflows.

One or more workflows (WDLs)

If you clone a workspace that already contains workflows (see Featured Workspaces in the Library), these tools will be in your copy of that workspace. If the Workflows tab of your workspace is empty, you can import workflows from the code and workflows section of the Terra Library.

Input data

Input data files can be located in the workspace Google bucket or an external bucket or data repository, and linked to the workspace by metadata in the data table.
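
As a rough illustration of how table metadata can link files to a workflow, the sketch below models a data table as a list of rows and fills in workflow inputs from one row at a time. It is a toy model, not Terra's implementation: the table contents, bucket names, the input names (`align.reads_file`, `align.threads`), and the `resolve_inputs` helper are all hypothetical, and the `this.<column>` reference style is assumed here only to show the idea of pointing an input at a table column.

```python
# Toy model of a data table: each row is one entity, and file columns
# hold Cloud Storage URIs. All names and paths here are hypothetical.
sample_table = [
    {"sample_id": "s1", "reads": "gs://example-workspace-bucket/s1.fastq"},
    {"sample_id": "s2", "reads": "gs://example-external-bucket/s2.fastq"},
]


def resolve_inputs(row, input_config):
    """Replace each 'this.<column>' reference with the value from one table row."""
    resolved = {}
    for input_name, reference in input_config.items():
        if reference.startswith("this."):
            column = reference[len("this."):]
            resolved[input_name] = row[column]
        else:
            # Anything else is passed through as a literal value.
            resolved[input_name] = reference
    return resolved


# Hypothetical workflow input names mapped to table columns or literals.
input_config = {"align.reads_file": "this.reads", "align.threads": "4"}

# One set of resolved inputs per table row, i.e. one workflow run per sample.
per_sample_inputs = [resolve_inputs(row, input_config) for row in sample_table]
```

Because each row resolves independently, the same configuration can drive one run per sample whether the file sits in the workspace bucket or in an external one.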

What happens when you run a workflow in Terra?

A workflow in its simplest form is a task consisting of

  • Input file path(s) to read from Cloud Storage
  • A Docker image to run
  • Commands (the workflow) to run in the Docker image
  • Cloud resources to use (number of CPUs, amount of memory, disk size, and type)
  • Output file/directory path(s) to write to Cloud Storage
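
The five ingredients listed above can be pictured as a single record. The following is a minimal sketch in Python, not a format that Terra or WDL actually uses: the class names, field names, default resource values, and the example image and paths are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resources:
    """Cloud resources requested for a task (default values are illustrative)."""
    cpus: int = 1
    memory_gb: float = 3.75
    disk_gb: int = 10
    disk_type: str = "HDD"


@dataclass
class TaskSpec:
    """One task: inputs, a Docker image, a command, outputs, and resources."""
    input_paths: list
    docker_image: str
    command: str
    output_paths: list
    resources: Resources = field(default_factory=Resources)

    def validate(self):
        """Return a list of problems; an empty list means the spec looks complete."""
        problems = []
        for path in self.input_paths + self.output_paths:
            if not path.startswith("gs://"):
                problems.append(f"not a Cloud Storage path: {path}")
        if not self.docker_image:
            problems.append("missing Docker image")
        if not self.command.strip():
            problems.append("missing command")
        return problems


# Hypothetical example task that counts lines in a file.
task = TaskSpec(
    input_paths=["gs://example-bucket/reads.fastq"],
    docker_image="ubuntu:20.04",
    command="wc -l reads.fastq > count.txt",
    output_paths=["gs://example-bucket/count.txt"],
    resources=Resources(cpus=2, memory_gb=8, disk_gb=50),
)
```

Calling `task.validate()` on the example returns an empty list, while a spec with a local path, no image, or a blank command reports each problem.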

To run a workflow in Terra, you will

  • Specify the path(s) of input files from Cloud Storage
  • Specify compute parameters (disk size, cost-saving options, etc.)
  • Submit the workflow to Terra

Behind the scenes, Terra takes care of the details

  • Terra communicates with Google Cloud to set up the container (Docker) to run the workflow code.
  • Terra sends the built-in Cromwell server a packet of information containing the workflow code and inputs.
  • Cromwell - a Workflow Management System geared towards scientific workflows - parses the workflow and starts dispatching individual jobs to PAPI (the Pipelines API).
  • PAPI executes the tasks on Google Compute Engine (GCE) and writes the output to the workspace bucket.
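
To give a feel for what "parsing a workflow and dispatching individual jobs" can mean, here is a small Python sketch that orders tasks by their dependencies and hands them to a submit function one at a time. This is not Cromwell's actual scheduler; it swaps in a plain topological sort from Python's standard `graphlib` module, and the task names and the `dispatch` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "align": set(),
    "sort": {"align"},
    "index": {"sort"},
    "call_variants": {"sort", "index"},
}


def dispatch_order(task_dependencies):
    """Return task names in an order where every dependency comes first."""
    return list(TopologicalSorter(task_dependencies).static_order())


def dispatch(task_dependencies, submit_job):
    """Hand each task to submit_job (a stand-in for sending a job to a backend)."""
    for task_name in dispatch_order(task_dependencies):
        submit_job(task_name)


# Record the submission order instead of contacting any real service.
submitted = []
dispatch(workflow, submitted.append)
```

A real engine can also run independent tasks in parallel and retry failures; the sketch only shows the ordering constraint.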

The Pipelines API will

  • Create a Compute Engine virtual machine
  • Download the container (i.e., Docker) image
  • Download the input files
  • Run a new Docker container with the specified image and commands
  • Upload the output files
  • Destroy the Compute Engine virtual machine (VM)
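
The lifecycle above can be sketched as a short simulation. Nothing here calls Google Cloud: `FakeVM` and `run_task` are hypothetical stand-ins that only record each step in order. Destroying the VM even when the command fails is an assumption of this sketch rather than something stated in this article.

```python
class FakeVM:
    """Stand-in for a Compute Engine VM; it only records what was asked of it."""

    def __init__(self, log):
        self.log = log
        self.log.append("create VM")

    def step(self, description):
        self.log.append(description)

    def destroy(self):
        self.log.append("destroy VM")


def run_task(image, command, inputs, outputs, fail_on_run=False):
    """Simulate one task's lifecycle and return the ordered log of steps."""
    log = []
    vm = FakeVM(log)
    try:
        vm.step(f"pull image {image}")
        for path in inputs:
            vm.step(f"download {path}")
        if fail_on_run:
            raise RuntimeError("command failed")
        vm.step(f"run '{command}' in {image}")
        for path in outputs:
            vm.step(f"upload {path}")
    except RuntimeError as error:
        log.append(f"error: {error}")
    finally:
        # Assumption: the VM is torn down whether or not the command succeeded.
        vm.destroy()
    return log


# Hypothetical image, command, and bucket paths.
steps = run_task(
    "ubuntu:20.04",
    "echo hi",
    ["gs://example-bucket/in.txt"],
    ["gs://example-bucket/out.txt"],
)
```

Printing `steps` lists the six stages in the same order as the bullets above, from VM creation to VM destruction.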

Diagram schematizing what happens when you run a workflow on Terra. A green hexagon represents an individual workspace in Terra. Within this workspace are the workflow's inputs, commands, and data. Arrows indicate that the workflow's commands can come from Dockstore, the Broad Methods Repository, or the Terra Library. When a workflow is run, it is sent to Terra's execution engine, Cromwell. Cromwell then starts individual jobs on a virtual machine using Google's Pipelines API (PAPI). Data for these jobs are pulled from the Terra workspace or from an external Google Cloud bucket. If any jobs output new data, those outputs are written back to the Terra workspace.
Overview of workflow submission in Terra, from "Genomics in the Cloud" by Geraldine A. Van der Auwera and Brian D. O'Connor (O'Reilly press)

Bonus: What are Cromwell and PAPI?

Cromwell is an open source (BSD 3-clause) execution engine written in Java that supports running WDL on three types of platforms: local machines (e.g., your laptop), a local cluster/compute farm accessed via a job scheduler (e.g., GridEngine), or a cloud platform (e.g., Google Cloud or Amazon AWS).

Pipelines API (aka "PAPI") is a Google Cloud service that provides an easy way to launch and monitor tasks running in the cloud.
