Workflow setup: VM and other options

Allie Hajian

This article describes Terra's workflow runtime options, including cost-saving options: what they are and how to specify them. If this is your first time running a workflow, the default runtime options are usually adequate.

Workflow options overview

Terra offers several ways to adjust the way your workflow runs. You will configure all of these runtime options in the workflow submission form. 

Screenshot showing an example workflow submission form, highlighting the configuration options available on this form. An orange rectangle and the number '1' highlight the first section of this form, where you can select your workflow's version and see its source and synopsis. Another orange rectangle and the number '2' highlight the second section, which contains checkboxes that you can use to select whether to use call caching, delete intermediate outputs, use reference disks, retry with more memory, and ignore empty outputs.

The configuration form displays default values provided by the workflow author. 

Running your first workflow? Use the defaults! If you are just getting familiar with running workflows, you can always use the default runtime options. They are chosen to be the simplest and most economical settings for most users.

1. Workflow information (snapshot, source and synopsis)

In this first section of the workflow submission form, you can use the snapshot drop-down menu to see all available versions of your workflow. You can choose to use the most up-to-date version or a previous version (if you need to maintain consistency, for example). Terra will automatically run the version you choose.

This section also lists the workflow tools repository (source) and a synopsis (if available). 

2. Money-saving options

There are several features in Terra designed to help save money when running a workflow. 

2.1. Call caching

Call caching allows Terra's execution engine (Cromwell) to detect when a job has been run in the past so that it doesn't have to re-compute results. The call caching feature in Terra can save you time and money when you are repeating all or parts of a workflow analysis. 
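The idea behind call caching can be illustrated with a minimal sketch. This is not Cromwell's implementation, which is more involved; it only assumes that a call is identified by a fingerprint of its task name, command, Docker image, and inputs. The names `call_fingerprint`, `CallCache`, and `count_reads`, and the example values, are hypothetical.

```python
import hashlib
import json


def call_fingerprint(task_name, command, docker_image, inputs):
    """Return a deterministic hash of everything that defines a call (simplified)."""
    payload = json.dumps(
        {"task": task_name, "command": command, "docker": docker_image, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CallCache:
    """Toy in-memory cache keyed by call fingerprint (hypothetical helper)."""

    def __init__(self):
        self._results = {}
        self.hits = 0
        self.misses = 0

    def run(self, task_name, command, docker_image, inputs, compute):
        key = call_fingerprint(task_name, command, docker_image, inputs)
        if key in self._results:
            # Same call seen before: reuse the stored result instead of recomputing.
            self.hits += 1
            return self._results[key]
        self.misses += 1
        result = compute(inputs)
        self._results[key] = result
        return result


def count_reads(inputs):
    # Stand-in for an expensive job.
    return {"read_count": len(inputs["reads"])}


cache = CallCache()
args = ("count_reads", "wc -l reads.txt", "ubuntu:22.04")
first = cache.run(*args, {"reads": ["r1", "r2", "r3"]}, count_reads)
second = cache.run(*args, {"reads": ["r1", "r2", "r3"]}, count_reads)  # identical call
third = cache.run(*args, {"reads": ["r1", "r2"]}, count_reads)  # changed input
```

In this sketch the second call is served from the cache, while the third is recomputed because one input changed, so its fingerprint differs.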

2.2. Delete intermediate outputs

Deleting intermediate outputs allows you to save storage costs by automatically deleting outputs from intermediate steps when the workflow successfully completes. This feature is most useful when these intermediate outputs are not used in a downstream analysis.

Note that complex workflows can have a large number of intermediate outputs, which can dramatically increase the storage costs of a project. For example, intermediate files made up roughly 85% of the storage costs for a recent large-scale project, even though no one ever used these files.
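To get a feel for the numbers, here is a rough back-of-the-envelope sketch. The storage price and the data sizes are assumptions chosen purely for illustration (they are not Terra or Google Cloud quotes), and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(size_gb, price_per_gb_month):
    """Storage cost for one month at a flat per-GB price."""
    return size_gb * price_per_gb_month


PRICE_PER_GB_MONTH = 0.02  # assumed price in USD, for illustration only
final_gb = 1_500           # assumed size of the final outputs you keep
intermediate_gb = 8_500    # assumed size of the intermediate outputs

cost_if_kept = monthly_cost(final_gb + intermediate_gb, PRICE_PER_GB_MONTH)
cost_if_deleted = monthly_cost(final_gb, PRICE_PER_GB_MONTH)
savings_fraction = (cost_if_kept - cost_if_deleted) / cost_if_kept

print(f"Keeping intermediates:  ${cost_if_kept:.2f} per month")
print(f"Deleting intermediates: ${cost_if_deleted:.2f} per month")
print(f"Share of cost avoided:  {savings_fraction:.0%}")
```

With these assumed sizes, intermediate files account for 85% of the stored data, so deleting them avoids about 85% of the monthly storage cost.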

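Call caching and deleting intermediate outputs cannot be combined. These two options save money in different ways: call caching reuses the stored results of earlier runs, while deleting intermediate outputs removes intermediate files once the workflow completes successfully.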

To learn more about call caching and when to use it, see this article.

To learn how to save storage costs by deleting intermediate outputs, see this article.

2.3. Reference disks

If your workflow uses human or mouse genome reference files (e.g., HG 19 or MM 10), Terra can automatically attach a disk containing HG 19/HG 38 references to your Google virtual machine (VM). When the ‘Use Reference Disks’ checkbox is selected, the execution engine examines the job inputs to see whether any of them correspond to reference files available on a reference disk image, and uses the copy on the disk when they do. This saves the time and compute resources that your workflow would otherwise spend localizing large reference inputs.

For more details, including the full reference disk manifests, see Reference Disks in Terra.
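The matching step described above can be pictured with a small sketch. It assumes the manifest is a simple mapping from a reference file's cloud path to its location on the attached disk; the real manifest format may differ, and every path, bucket name, and function name below is hypothetical.

```python
def plan_localization(inputs, manifest):
    """Split job inputs into those found on the reference disk and those to download.

    inputs:   mapping of input name -> cloud path
    manifest: mapping of cloud path -> path on the attached reference disk (assumed shape)
    """
    from_disk = {}
    to_download = []
    for name, cloud_path in inputs.items():
        if cloud_path in manifest:
            from_disk[name] = manifest[cloud_path]
        else:
            to_download.append(name)
    return from_disk, to_download


# Hypothetical manifest and job inputs, for illustration only.
manifest = {
    "gs://example-references/hg38/genome.fasta": "/mnt/ref/hg38/genome.fasta",
}
job_inputs = {
    "ref_fasta": "gs://example-references/hg38/genome.fasta",
    "sample_bam": "gs://example-workspace/sample1.bam",
}

from_disk, to_download = plan_localization(job_inputs, manifest)
```

Here only `sample_bam` would still need to be localized; `ref_fasta` is read from the attached disk.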

2.4. Retry with more memory

If a task fails because its VM runs out of memory, Terra will automatically retry it with more memory, provided that this option is selected and maxRetries is set to a value greater than 0 in the task's runtime section of your WDL script.

For more details, see the Out of Memory Retry documentation. 
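The retry behavior can be simulated with a short sketch. It assumes each retry multiplies the requested memory by a fixed factor; the factor used here (2.0) is an assumption for illustration, since this article does not state the actual value. `SimulatedOutOfMemory`, `run_with_memory_retry`, and `align_reads` are hypothetical names.

```python
class SimulatedOutOfMemory(Exception):
    """Raised by the toy task when it is given too little memory."""


def run_with_memory_retry(task, memory_gb, max_retries, multiplier=2.0):
    """Run task(memory_gb), retrying with more memory after out-of-memory failures.

    Returns the task result and the list of memory sizes that were tried.
    """
    attempts = []
    for attempt in range(max_retries + 1):
        attempts.append(memory_gb)
        try:
            return task(memory_gb), attempts
        except SimulatedOutOfMemory:
            if attempt == max_retries:
                raise  # no retries left (always the case when max_retries is 0)
            memory_gb = memory_gb * multiplier


def align_reads(memory_gb):
    # Toy task that needs at least 6 GB to finish.
    if memory_gb < 6:
        raise SimulatedOutOfMemory(f"needed 6 GB, had {memory_gb} GB")
    return "aligned"


result, attempts = run_with_memory_retry(align_reads, memory_gb=4, max_retries=2)

# With max_retries=0 there is no second attempt, mirroring the maxRetries requirement.
try:
    run_with_memory_retry(align_reads, memory_gb=4, max_retries=0)
    failed_without_retries = False
except SimulatedOutOfMemory:
    failed_without_retries = True
```

The second call shows why maxRetries matters: with no retries allowed, the task simply fails on its first out-of-memory error.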

2.5. Ignore empty outputs

If your workflow outputs a null or empty value to a data table, selecting this option will prevent Terra from creating a new column to store that empty output. This can prevent your tables from becoming too large and sparse, and therefore makes it easier to find the interesting data within your tables.
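The effect of this option can be sketched as follows. The sketch assumes that "empty" means a null value, an empty string, or an empty list or dictionary; Terra's exact definition may differ. The function names, column names, and file path are hypothetical.

```python
def is_empty(value):
    """Treat None and zero-length strings, lists, and dicts as empty (assumed definition)."""
    if value is None:
        return True
    return isinstance(value, (str, list, dict)) and len(value) == 0


def write_outputs(row, outputs, ignore_empty_outputs):
    """Return a copy of a data table row with workflow outputs added as columns."""
    updated = dict(row)
    for column, value in outputs.items():
        if ignore_empty_outputs and is_empty(value):
            continue  # skip: no column is created for an empty output
        updated[column] = value
    return updated


row = {"sample_id": "sample1"}
outputs = {"aligned_bam": "gs://example-workspace/sample1.bam", "optional_report": None}

with_option = write_outputs(row, outputs, ignore_empty_outputs=True)
without_option = write_outputs(row, outputs, ignore_empty_outputs=False)
```

With the option on, the row gains only the `aligned_bam` column; with it off, an `optional_report` column holding an empty value is also created.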

Video and tutorial workflow resources 

To learn more about using data tables to organize your data and scale your analysis, see Managing data with workspace tables.

To understand how to adjust data tables, see How to modify and edit data tables.

For hands-on practice with data tables, try the Data Tables QuickStart.

To learn more about how to update workflows to the latest version, see Updating workflows to the latest version.

For a video tutorial on configuring a workflow, see this video walkthrough of the Workflows Quickstart - Part II.

To learn about best practices to run your workflow at scale, see the WARP pipelines team's guidelines for cost optimization.

Hands-on practice setting up and running a workflow analysis (Note: To run these practice exercises, you will need to clone the linked workspace to your own billing project.)

To practice setting up and running workflows, work through the Terra-Workflows-QuickStart workspace. The hands-on tutorial should take about half an hour to complete and cost less than a dime (GCP costs).
