An introduction to workflows - one of two analysis modes you can use on the Terra platform. A workflow (also called a pipeline) is a series of steps performed by an external compute engine, often used for automated, bulk analysis (such as aligning genomic reads). Pipelines run on Terra are written in Workflow Description Language (WDL), a workflow specification language designed to be easy for humans to read and write.
Overview: Running a workflow in a Terra workspace
Running a workflow (pipeline) in a Terra workspace requires the following.
"Can compute" access to the workspace
You need permission to perform operations that incur Google Cloud costs (such as running workflows) in a workspace. You have this permission if someone shares a workspace with you as a "Can-Compute Writer." If you create or clone a workspace using your own Billing project, you are its Owner by default and can run workflows.
One or more workflows
If you clone a workspace that already contains workflows (see Showcase workspaces in the Library), these tools will be in your copy as well. If the Workflows tab of your workspace is empty, you can import workflows from the Terra library (code and workflows section).
Input data
Input data files can be located in the workspace Google bucket or an external bucket and linked to the workspace via metadata in a data table.
What happens when you run a workflow in Terra?
A workflow in its simplest form is a task consisting of:
- Path(s) of input files to read from Cloud Storage.
- A Docker image to run.
- Commands (the workflow) to run in the Docker image.
- Cloud resources to use (number of CPUs, amount of memory, disk size and type).
- Path(s) of output files/directories to write to Cloud Storage.
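These pieces map directly onto a WDL task. Here is a minimal sketch; the task name, command, file names, and resource values are illustrative, not taken from any particular Terra workflow:

```wdl
version 1.0

task CountLines {
    input {
        File input_file                      # path of an input file read from Cloud Storage
    }
    command <<<
        wc -l < ~{input_file} > line_count.txt
    >>>
    runtime {
        docker: "ubuntu:20.04"               # Docker image to run the command in
        cpu: 1                               # cloud resources: CPUs,
        memory: "2 GB"                       # memory,
        disks: "local-disk 10 HDD"           # and disk size/type
    }
    output {
        File line_count = "line_count.txt"   # output file written back to Cloud Storage
    }
}
```

Each block corresponds to one bullet above: `input` declares the Cloud Storage inputs, `command` holds the commands run inside the container, `runtime` names the Docker image and cloud resources, and `output` lists the files to copy back out.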
To run a workflow in Terra, you will:
- Specify the path(s) of input files from Cloud Storage.
- Specify runtime options, including the Docker image.
- Submit the workflow to Terra.
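Concretely, the input paths you specify are collected into a JSON object that maps fully qualified input names to `gs://` paths. A hypothetical example, assuming a WDL task named `CountLines` with a `File` input called `input_file` (the bucket and file names are illustrative):

```json
{
  "CountLines.input_file": "gs://my-workspace-bucket/data/sample1.fastq"
}
```

In Terra, the Workflows page builds this mapping for you: each field in the inputs form corresponds to one key in this JSON, and values can be typed in directly or drawn from columns in a data table.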
Behind the scenes, Terra takes care of the details:
- Terra sends the built-in Cromwell server a packet of information containing the workflow code and inputs.
- Cromwell - a Workflow Management System geared towards scientific workflows - parses the workflow and starts dispatching individual jobs to PAPI (the Pipelines API).
- PAPI executes the tasks on Google Compute Engine (GCE) virtual machines and writes the output to the workspace bucket.
The Pipelines API will:
- Create a Compute Engine virtual machine.
- Download the Docker image.
- Download the input files.
- Run a new Docker container with the specified image and command.
- Upload the output files.
- Destroy the Compute Engine virtual machine (VM).
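The lifecycle above can be sketched in shell-style pseudocode. This is illustrative only - PAPI performs these steps through its own service APIs, not a literal script, and the image, bucket, and file names are made up:

```
# 1. Create a Compute Engine VM (done via the Compute Engine API)
# 2. Download the Docker image
docker pull ubuntu:20.04
# 3. Localize the input files from the workspace bucket
gsutil cp gs://my-workspace-bucket/data/sample1.fastq .
# 4. Run a container with the specified image and command
docker run -v "$PWD":/work ubuntu:20.04 bash -c 'wc -l < /work/sample1.fastq > /work/line_count.txt'
# 5. Delocalize the output files back to the bucket
gsutil cp line_count.txt gs://my-workspace-bucket/outputs/
# 6. Delete the VM (done via the Compute Engine API)
```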
Overview of workflow submission in Terra, from Genomics in the Cloud by Geraldine A. Van der Auwera and Brian D. O'Connor (O'Reilly Media)
Cromwell and PAPI
Cromwell is an open-source (BSD 3-clause) execution engine written in Java that supports running WDL on three types of platform: a local machine (e.g., your laptop), a local cluster/compute farm accessed via a job scheduler (e.g., GridEngine), or a cloud platform (e.g., Google Cloud or AWS).
Pipelines API (aka "PAPI") is a Google Cloud service that provides an easy way to launch and monitor tasks running in the cloud.
Practice pipelining with the Workflows Quickstart
One way to get up and running quickly is to clone and run the workflows in a featured Showcase workspace. The Workflows Quickstart tutorial is a self-guided tutorial that includes everything you need to get hands-on experience running workflows.
Copy the Terra Workflows Quickstart workspace to your own Billing project and work through the three exercises.
Part 1 - Run a preconfigured workflow on a single sample from the Workflows page
Part 2 - Set up and run a workflow on two samples (you'll create a set in the process)
Part 3 - Run downstream analysis on a set of samples