Overview: Workflows (pipelining)

User Ed

Learn about workflows, one of two analysis modes you can use on the Terra platform. Workflows (also known as pipelines) are a series of steps performed by an external compute engine. They are often used for automated, bulk analysis, such as aligning genomic reads. Pipelines run on Terra are written in Workflow Description Language (WDL), a workflow processing language that is easy for humans to read and write.

Overview: Workflows versus interactive analysis

How do you choose which analysis mode to use when analyzing data in Terra? Broadly, it depends on your use case - in particular, whether your analysis can be automated to run in the background or needs interaction to run.

Three reasons to run a workflow instead of an interactive analysis

1. Run multiple tools/commands in a specific order
2. Run it multiple times in the same way (maybe with a few parameter changes)
3. Run something that takes a long time and doesn't require your input once it's running

In short, if you can determine up front all the inputs/parameter values that need to be applied, and you want to be able to hit play and go out for lunch, use a workflow.

What do you need to run a workflow in Terra?

Running a workflow (pipeline) in a Terra workspace requires the following.

Permissions: You are a workspace owner or writer

Any operation that incurs an Azure Cloud cost in a workspace, including running a workflow, requires owner or writer permission on that workspace.

One or more workflows (WDLs)

If you clone a workspace created after December 1, 2023 that includes one or more workflows (see Featured Workspaces in the Library), the workflows will be in your copy of that workspace. You will see them once you launch the Workflows app.

If the Workflows tab of your workspace is empty, you can choose from a list of curated workflows or import workflows from GitHub or Dockstore. See How to find a workflow for more details and step-by-step instructions.

Input data in a data table

Input data files can be located in workspace cloud storage, external cloud storage, or a data repository and linked to the workspace by metadata in the data table.
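
As a rough illustration of how a data table links rows to input files, the Python sketch below parses a small tab-separated table and maps each row ID to a file path. The column names (`sample_id`, `reads_file`) and the URLs are hypothetical; this is not Terra's own table format or API, only a way to picture the metadata.

```python
import csv
import io

# Hypothetical data table exported as TSV: one row per sample, with a
# column holding the cloud path of each input file. Column names and
# URLs are made up for illustration.
TABLE_TSV = (
    "sample_id\treads_file\n"
    "sample_1\thttps://example.invalid/container/sample_1.fastq.gz\n"
    "sample_2\thttps://example.invalid/container/sample_2.fastq.gz\n"
)


def read_table(tsv_text):
    """Parse a TSV data table into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def inputs_for(rows, column):
    """Map each row ID to the file path stored in the chosen column."""
    return {row["sample_id"]: row[column] for row in rows}


rows = read_table(TABLE_TSV)
workflow_inputs = inputs_for(rows, "reads_file")
for sample_id, path in workflow_inputs.items():
    print(sample_id, "->", path)
```

The files themselves stay wherever they are stored; the table only holds their paths, which is why the same table can point at workspace storage, external storage, or a data repository.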

Hands-on practice: Intro to Terra on Azure Workflows Quickstart

The Workflows Quickstart is a self-guided tutorial that includes everything you need to get hands-on running workflows in Terra on Azure. It’s the second in a series of three quickstart tutorials intended to help you get up to speed on basic Terra on Azure functionality without spending a lot of time or money.

Intro to Terra Quickstart workspace | Terra on Azure - Workflows Quickstart Guide

What happens when you run a workflow in Terra?

  • A workflow consists of

    • Input file path(s) to read from Cloud Storage
    • A Docker image to run (includes all the libraries and dependencies)
    • Commands (the workflow) to run in the Docker image
    • Cloud resources to use (number of CPUs, amount of memory, disk size, and type)
    • Output file/directory path(s) to write to Cloud Storage
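
One way to picture the five ingredients listed above is as a single record. The Python sketch below is only a mental model, not how Terra or WDL stores a workflow; the class names, default values, image name, and command are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resources:
    """Cloud resources requested for a task (illustrative defaults)."""
    cpus: int = 1
    memory_gb: float = 2.0
    disk_gb: int = 10
    disk_type: str = "SSD"


@dataclass
class WorkflowSpec:
    """The five ingredients of a workflow, gathered in one place."""
    input_paths: List[str]
    docker_image: str
    commands: List[str]
    resources: Resources = field(default_factory=Resources)
    output_paths: List[str] = field(default_factory=list)

    def missing_parts(self) -> List[str]:
        """Return the names of required parts that are still empty."""
        required = {
            "input_paths": self.input_paths,
            "docker_image": self.docker_image,
            "commands": self.commands,
            "output_paths": self.output_paths,
        }
        return [name for name, value in required.items() if not value]


spec = WorkflowSpec(
    input_paths=["https://example.invalid/inputs/sample_1.fastq.gz"],
    docker_image="example/aligner:1.0",  # hypothetical image name
    commands=["align --in sample_1.fastq.gz --out sample_1.bam"],  # hypothetical command
    resources=Resources(cpus=4, memory_gb=16, disk_gb=100),
)
print(spec.missing_parts())  # output_paths was left empty on purpose
```

Leaving out any one of the parts means the workflow cannot run unattended, which is what the `missing_parts` check is meant to highlight.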

    To run a workflow in Terra, you will

    • Choose the data to run on
    • Specify the path(s) of input files from Cloud Storage
    • Specify compute parameters (disk size, cost-saving options, etc.)
    • Specify whether and where you want to write output metadata to the input table
    • Submit the workflow to Terra
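
The choices above can be thought of as one bundle that is assembled before submission. The following Python sketch gathers them into a dictionary; the function name, keys, and column names are assumptions made for illustration and do not correspond to a Terra API.

```python
def build_submission(rows, input_column, compute, output_column=None):
    """Collect the choices made before submitting a workflow."""
    if not rows:
        raise ValueError("choose at least one row of data to run on")
    return {
        "row_ids": [row["id"] for row in rows],               # data to run on
        "input_paths": [row[input_column] for row in rows],   # input file paths
        "compute": dict(compute),                             # compute parameters
        "write_outputs_to": output_column,                    # None: do not write back
    }


# Hypothetical rows selected from a data table.
rows = [
    {"id": "sample_1", "reads_file": "https://example.invalid/sample_1.fastq.gz"},
    {"id": "sample_2", "reads_file": "https://example.invalid/sample_2.fastq.gz"},
]
submission = build_submission(
    rows, "reads_file", {"disk_gb": 50}, output_column="aligned_bam"
)
print(submission["row_ids"])
```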

    Behind the scenes, Terra takes care of the details

    • Communicates with Azure Cloud to set up the container (Docker) to run the workflow code
    • Sends the built-in Cromwell server a packet of information containing the workflow code and inputs
    • Cromwell, a Workflow Management System geared towards scientific workflows, parses the workflow and starts dispatching individual tasks to the Task Execution Service (TES).

    Each user will have their own Cromwell service in the workspace. All collaborators can see each other's submissions and workflow configurations.

    Note that running multiple Cromwells in the same workspace will increase its "size" and may increase your costs. See Overview: Costs and billing (Azure) for more details.

    The Task Execution Service (TES) will

    • Manage pools of VMs in Azure Batch to execute tasks
    • Call on Azure Batch to execute the tasks
    • Ensure VMs are only running when needed by user tasks

    Azure Batch will

    • Download the container (i.e., Docker) image
    • Localize the input files to the virtual machine (VM)
    • Run a new Docker container with the specified image and commands
    • Upload the output files
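
The hand-offs described above (Terra to Cromwell, Cromwell to TES, TES to Azure Batch) can be sketched as a chain of function calls that record what happens in order. This toy Python model is a simplification under stated assumptions: every function and field name is hypothetical, tasks run one after another, and VM handling is reduced to a request and a release. The real services are more involved and can run independent tasks concurrently.

```python
def azure_batch_run(task, log):
    """Stand-in for Azure Batch: the four steps it performs per task."""
    log.append(f"batch: download image {task['image']}")
    log.append(f"batch: localize inputs {task['inputs']}")
    log.append(f"batch: run container for {task['name']}")
    log.append(f"batch: upload outputs of {task['name']}")


def tes_execute(task, log):
    """Stand-in for TES: a VM is in use only while the task needs it."""
    log.append(f"tes: request VM for {task['name']}")
    azure_batch_run(task, log)
    log.append(f"tes: release VM for {task['name']}")


def cromwell_run(workflow, log):
    """Stand-in for Cromwell: parse the workflow, then dispatch each task."""
    log.append(f"cromwell: parse workflow {workflow['name']}")
    for task in workflow["tasks"]:
        log.append(f"cromwell: dispatch {task['name']} to TES")
        tes_execute(task, log)


def submit(workflow):
    """Stand-in for Terra: hand the workflow and inputs to Cromwell."""
    log = [f"terra: send {workflow['name']} and inputs to Cromwell"]
    cromwell_run(workflow, log)
    return log


# Hypothetical two-task workflow.
workflow = {
    "name": "align_reads",
    "tasks": [
        {"name": "align", "image": "example/aligner:1.0", "inputs": ["sample_1.fastq.gz"]},
        {"name": "sort", "image": "example/sorter:1.0", "inputs": ["sample_1.bam"]},
    ],
}
log = submit(workflow)
for line in log:
    print(line)
```

Reading the printed log from top to bottom gives the same order of responsibilities as the lists above, repeated once per task.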

    Bonus: What are Cromwell and Azure Batch?

    Cromwell is an open source (BSD 3-clause) execution engine written in Java that supports running WDL on three types of platforms: local machines (e.g., your laptop), a local cluster/compute farm accessed via a job scheduler (e.g., GridEngine), or a cloud platform (e.g., Azure, Google Cloud or Amazon AWS).

    Azure Batch is an Azure Cloud service that provides an easy way to launch and monitor tasks running in the cloud.
