How to set up a workflow analysis


This article explains how to set up (configure) your workflow analysis in Terra. Note that this article is intended primarily for analyses using data from the workspace data table. Differences for using direct paths are noted in hint boxes.     

To learn how to automate some of the setup process by using a JSON file (especially useful if you anticipate using much the same configuration many times), see Getting workflows up and running faster with a JSON file. 

Workflow setup overview

To run a workflow on Terra, you will first set up all the options and input variables in the configuration form (also called the "submission form").  

Select workflow options
whether to use caching or delete intermediate files, whether to get inputs from the data table or use full paths, etc.

Specify inputs 
e.g. reference files, compute parameters, and data file names and locations

Define output options (optional)
Designate whether you want the workflow to write links to the output data files (stored in the workspace bucket by default) to the data table

The workflow configuration form: Step-by-step instructions

You'll set up a workflow to run by filling in or modifying the configuration form (screenshot below). The form lists Inputs and settings the workflow expects, and displays default values provided by the workflow author. 

How to get to the workflow configuration form

You can start an analysis one of two ways: by first selecting the data to analyze and then choosing the workflow, or by first choosing the workflow and then selecting the data. 

  • Start by selecting the data

    1. Go to the Data page
    2. Select the data to analyze
    3. Click the three vertical dots (top right)
    4. Choose Open with > Workflows

  • Start by selecting the workflow

    1. Go to the Workflows page
    2. Click the name of the workflow from the available cards

The configuration form will look roughly like the screenshot below. What you see may differ slightly from the screenshots depending on how you get to the form and which options you choose. Scroll down for an explanation of each option and step-by-step instructions. 

Configure-workflow_Configuration-card_Screen_shot.png

1. Select the workflow snapshot (version)

You will see all available versions in this dropdown. You can choose to use the most up-to-date version of the workflow, or a previous version (if you need to maintain consistency, for example). Terra will automatically run the version you choose.

2. Choose to use full paths or the data table for inputs

Advantages of using the data table

Data tables make it easier to scale and automate 
Using the data table lets you reference multiple files from the data table without hard coding values or having to adjust your workflow configuration when you add more data to the table.

Data tables keep all associated data together
Data files are connected no matter where in the cloud they reside, even data from different sources, including generated data. 

  • Data table configuration

    If you choose this option, your next steps will be to select the "root entity type" (the table that holds the input data) and the input data to analyze. Your configuration form will look like this.

    Configure-workflows_Use-table_Screen_shot.png
  • Full file paths configuration

    If you choose this option, you can go straight to step 5 of the configuration form (the call caching and delete intermediate outputs options). Your configuration form will look like this.

    Configure-workflows_File-paths_Screen_shot.png

When NOT to use a data table
Although we recommend using data tables, there are situations where you may not want to use one: if you cannot fit your data into the data table in a way that makes sense for your analysis, or if you want to test a new method in Terra quickly, with as little setup as possible.

To learn more about setting up a data table, see Organizing data with workspace tables.

3. Select root entity type from dropdown (data table only)

The root entity type is the smallest piece of data a workflow can use as input. Selecting from this dropdown (Step 1 in the form) tells the workflow which table to go to for links to the input data.

How to know the root entity type

If you can run your workflow on a single entity (like a specimen or sample)
The root entity type is that entity (i.e. specimen or sample)

If your workflow takes an array as input and cannot run on a single file
The root entity type is a set table (i.e. sample_set or specimen_set)

If you're running a somatic workflow (on tumor/normal pairs)
The root entity type is pair 

Hint: what entity type does your workflow accept?

Find the input entity type your WDL expects in the Inputs section of the workflow configuration form. A short WDL sketch after this list shows how the two cases look in code. 

  • If the Input Type is "File", your root entity will be a table of single entities (i.e. sample_set or specimen_set):Configure-workflows_Type-File_Screen_shot.png

  • If the Input Type is "Array[File]", the root entity type is an entity_set table or a pairs table (somatic workflows):Configure-workflows_Type-Arrays_Screen_shot.png

  • If you are running a somatic workflow, the Input Type will be Array[File]:
    Configure-workflows_Type-Pairs_Screen_shot.png
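
For illustration, here is a minimal WDL sketch (the workflow and variable names are hypothetical) showing how the two input shapes above look in code:

    version 1.0

    # A "File" input means the workflow runs once per single entity,
    # so the root entity type is a single-entity table (e.g. specimen).
    # An "Array[File]" input means the workflow runs once per set,
    # so the root entity type is a set table (e.g. specimen_set).
    workflow EntityTypeExample {
      input {
        File input_cram          # Input Type "File"
        Array[File] input_crams  # Input Type "Array[File]"
      }
    }

In practice a workflow would declare one shape or the other; both are shown here only to compare the two cases.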

Editing expected input entity types (advanced topic)

To edit a workflow script, you will need to work outside Terra. To learn more about creating and editing workflows, see Create, edit, and share a new workflow.

Editing the WDL script can change the expected input configuration. You will be able to see this by clicking on the workflow in the Workflows tab and looking at the Inputs section.

  • The example below is from the workflow that generates a "Panel of Normals" (PoN). When generating a PoN, this WDL script expects the following input types (sketched as a WDL input block after this example).

    A set of BAM files representing the list of normal samples. Since the purpose of this workflow is to create a PoN from a set of files, this input is handled as an Array.

    A reference file. Since a single reference file can be useful in a variety of tasks, this input is handled as a File.

    The name of a database used for informing the PoN generation (in this case, the gnomAD database is used to inform the tool of the allelic fractions within this germline resource). Since this task does not need to localize the entire gnomAD database, it is sufficient to designate an input as a String matching the name of the database. The name of the PoN file is also just a String.

    EntityTypes.png
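
Sketched as a WDL input block, the inputs described above might look like this (identifiers are illustrative; check the workflow's own Inputs section for the real names):

    version 1.0

    workflow CreatePanelOfNormals {
      input {
        Array[File] normal_bams  # set of BAMs from the normal samples -> Array
        File ref_fasta           # a single reference file -> File
        String gnomad            # name of the germline resource (gnomAD) -> String
        String pon_name          # name of the output PoN file -> String
      }
    }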

4. Select data to analyze (only for inputs from data table)

If you are using full paths for your input data, this option will not appear on your configuration form and you can skip down to step 5 (Call caching/Delete intermediate outputs options). Note: your setup form will look slightly different depending on whether you started your analysis from the Data page or the Workflows page. 

Option 1: Starting from the Data table (example screenshots)

If you started by selecting data, these should already be filled in. Beside the blue "Select Data" button you should see the data you have selected. Open the section below that corresponds to your use-case for a screenshot of what to expect, depending on your input. 

  • Input is a single entity (or several single entities)

    Note that if you run on more than one data file, Terra will create a set of those particular entities by default. It will name the set "workflow-name" + "run date". To give the set a more meaningful name, you must start your analysis from the workflow configuration card.

    Configure-workflow_Select-data-specimen-default_Screem_shot.png 

  • Input is a set of single entities (from an entity_set table)

    Note that the root entity type is still the single entity. Terra will submit as many jobs as there are members in the set to run in parallel:
    Configure-workflow_Select-data-specimens-run-set_Screen_shot.png
  • Input is an array of entities

    If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. In the screenshot below, the workflow is running one job on an array of entities defined in the specimen_set table:
    Configure-workflow_Select-data-specimen-set_Screen_shot.png
  • Input is a tumor/normal pair (somatic workflows)

    The root entity type is pair. In the screenshot below, Terra will run two tumor/normal pair workflows in parallel and will create a pair_set table that includes those two specific pairs:
    Configure-workflow_Data-table-input-Pairs_Screen_shot.png  

Option 2: Starting from the workflows page (example screenshots)

If you started your analysis setup from the workflow page, you will first need to select the data to analyze by clicking the blue button in Step 2 of the configuration form.

Configure-workflows_Select-data_Screen_shot.png

Once you do, you'll be taken to a form to select data. Click each use-case below to see screenshots of what to expect.

  • Input is a single entity (or several single entities)

    Terra will process all the entities in the data table by default. You can also select exactly which entities to process. If you analyze more than one data file, Terra will create a set of those inputs and you'll be able to name the set containing those particular entities:
    Configure-workflows_Select-data_Specimen_Step-1_Screen_shot.png

    After making your selection, make sure to click OK at the bottom of the form. Your workflow configuration form will now look like this:
    Configure-workflows_Select-data_Specimens_Step-2_Screen_shot.png

  • Input is a set of single entities (data files) from an entity_set table

    Note that the root entity type is still the single entity. Your Select Data form will look like this (if you have some entity_sets already defined): Configure-workflows_Select-data_Specimens-in-set_Step-1_Screen_shot.png

    Terra will submit as many jobs as there are members in the set to run in parallel:
    Configure-workflow_Select-data-specimens-run-set_Screen_shot.png

  • Input is an array of single entities

    In this case, the workflow accepts an array (set) of entities as input. The root entity type is entity_set. Your Select Data form will look like this:

    Configure-workflows_Select-data-arrays-input_Screen_shot.png

    If you choose one set from the specimen_set table, your configuration form will look like this:
    Configure-workflow_Select-data-specimen-set_Screen_shot.png

  • Input is a tumor/normal pair (somatic workflows)

    You can choose exactly which tumor-normal pairs you want to analyze in the Select Data form:
    Configure-workflows_Select-data-from-workflow_Pairs_Screen_shot.png
    Note that if you select more than one pair, you can name the new set that Terra will automatically generate.

    Terra will analyze the selected tumor/normal pairs in parallel and will create a pair_set that includes those three selected pairs:
    Configure-workflow_Data-table-input-Pairs_Screen_shot.png

5. Call caching/Delete intermediate outputs options

Configure-workflows_call-caching-option_Screen_shot.png

These two options save storage costs in two different ways, and cannot be combined.

To learn more about call caching and when to use it, see this article.

To learn how to save storage costs by deleting intermediate outputs, see this article.

6. Configure inputs

Attributes are the integers, strings, or files that correspond to input variables in the workflow. You'll specify inputs by filling in the Attributes field in the setup form.   

Configure-workflows_Attributes_Screen_shot.png

Some common attribute formats

Integer - No formatting required
String - Quotes required, e.g. "my string"
Boolean - Quotes required. Case insensitive, so "true", "TRue", and "TrUE" are all equivalent.
File - This type can be referenced from the Google bucket, data model, or workspace attribute section.
Array[X] - Enter list items separated by commas, e.g. "a","b","c" or 1,2,3 or "true","True","TruE","TRUE"
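
As a concrete illustration, here is what you might type into the Attribute field for a few hypothetical variables (left: the variable as listed in the form; right: what you enter):

    myWorkflow.threads       3
    myWorkflow.sample_name   "my string"
    myWorkflow.run_qc        "true"
    myWorkflow.input_bam     this.bam
    myWorkflow.intervals     "chr1","chr2","chr3"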

The format for input data files in the cloud depends on whether you are using the data table (directions below) or direct paths (scroll down for directions), per your choice in step 2 above. 

Option 1: Format for input data in the data table

If you're using the data table for inputs, you'll need to connect the attributes field to specific columns in the data table or files in the workspace resource table. Each of these options has a particular format. If you start typing in the proper format, all available options will appear in a dropdown. 

The exception is if you are using nested variables (see the advanced formatting section below for formatting arrays and tumor/normal pairs). 

Step 1: Specify input data

1.1. For each variable that needs to be connected to a column in the data table, start typing this. in the attribute field.

1.2. Once you start typing, you'll see a dropdown with all the available options from the
relevant table. Choose the right variable from the dropdown.

What is in the dropdown?
The formatting this. points to whatever table you selected as the "root entity type" in the configuration form. The dropdown menu will list all columns containing data or metadata from the root entity table.

Single entity example
In the screenshot below, there are five items in the dropdown after typing this. in the InputCram attribute field (circled). Each corresponds to a column in the root entity table. Scroll down to select the one corresponding to the InputCram variable, this.CRAM.

Configure-workflows_this.-formatting_Screen_shot.png

If you don't see the right input variable in the dropdown
Check your root entity type to make sure you specified the right table. This can be tricky if you are using interconnected tables!

For example, if you're running multiple workflows in parallel on a group in a specimen_set table, the root entity type is specimen. You only use the specimen_set to choose which specimens to process. It's not where the input files are! 

Advanced formatting examples (other inputs)

The this. prefix tells Terra to look first in the table you set as your root entity. In the case of nested tables, such as array and pair inputs, the root entity table gives the ID of the entity; pointing to the data itself requires additional formatting. 

  • Array input example
    In the case of array inputs, the root entity type is a _set table (for example, a specimen_set), but the data files are actually in the single entity table (i.e. the specimen table).

    In this case, you will use the format this.specimens.data-file-name 

  • Tumor/normal pairs example
    The pair table contains a control_sample_id, a case_sample_id, and their corresponding BAM files. Your WDL task requires both the case_sample_bam and the control_sample_bam inputs.

    You'd use this.case_sample.case_sample_bam and this.control_sample.control_sample_bam, where case_sample and control_sample are columns in the pair table. 

This formatting gives you the flexibility to reference any entity, including entities in nested tables. If your desired input is a single file, the syntax simply points directly at the file. If your desired input is a set of files nested inside another table, the syntax must first point to the correct table, and then to the desired files within. Looking at the Type and Attributes columns is a quick way to check how your workflow is set up. The patterns are summarized in the sketch below.
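
The patterns from this section, summarized (assuming a hypothetical specimen table with a cram column; the workspace. prefix is described in Step 2 below):

    Input shape                        Root entity type   Attribute expression
    Single file per entity             specimen           this.cram
    Array of files (one job per set)   specimen_set       this.specimens.cram
    Tumor BAM in a pair workflow       pair               this.case_sample.case_sample_bam
    Workspace-wide file (reference)    any                workspace.ref_fasta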

Step 2: Specify workspace-wide inputs (the workspace data table)

Storing an input file as a workspace attribute in the Workspace data table is convenient if you are using a file over and over again in multiple workflows. If the file path changes, you only have one place to update, similar to global variables in scripting. You can reference it by typing workspace. plus the attribute key (e.g. workspace.ref_fasta or workspace.ref_dict). 

2.1. For each workspace-wide variable (like the genome reference sequence file, for example, or the GATK Docker), start typing workspace. (make sure to include the period!).

2.2. Select the workspace resource file from the dropdown.

Configure-workflow_Specify-workspace-data_Screen_shot.png

Some workflows may require additional types of input
You may need to select one of several possible analysis options in the case of branched workflows, or you may have the opportunity to specify runtime options, like the amount of memory and disk space provided to each task. You will find fields for these options, if available, in the Inputs section of the configuration form.

Option 2: Format when using direct paths as inputs

Use "gs://url-to-file-in-bucket" to reference a file in a Google bucket directly. 

Formatting requirement - The quotes are necessary if you are directly referencing a file URL.
Workflow_hardcoded_attribute_Screen_Shot.png
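
For example, a hypothetical FASTA file in a bucket named my-bucket would be entered as:

    "gs://my-bucket/references/Homo_sapiens_assembly38.fasta"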

7. Configure outputs (inputs from data table only)

When using the data table, you can choose what you want to do with the workflow outputs. Generated files are stored in the workspace bucket by default, but you can have the workflow write links to the output files right in the data table. You'll specify the column name under which the output links will be added to the data table; this can be an existing column or a new one (for a newly generated data file).

You do this in the Outputs tab using the same formatting as for inputs.

7.1. For each output variable, start typing this. in the attribute field.

Once you start typing, you'll see a dropdown that lists all columns in the root entity data table. 

7.2. Choose an existing column or type in a new name to add a new column of data to the table. 

Configure-workflow_Write-outputs-to-data-table_Screen_shot.png 
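
For example, to write the CRAM produced by a hypothetical aligner workflow into the aligner_output_cram column shown later in this article, the output attribute would be:

    myWorkflow.output_cram    this.aligner_output_cram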

Why write outputs to the data table?
Writing to the data table associates generated output with the input data file (the output files are written alongside the input files in the table) and helps organize your outputs in a way that is meaningful to you. It also makes it easy to use the data for downstream analysis.

Where are generated output files stored?

Either way, your analysis outputs are stored in the workspace Google bucket by default.

Note: The default folders are named by the submission ID, and finding the files in folders designated by random strings of numbers and letters (e.g. 83add0f2-ae9b-4a97-9995-104b82e5631f) can be challenging. Compare the examples below.

Outputs in Google bucket (file folder is random string)
Managing-data-with-tables_Generated-data-in-bucket_Screen_shot.png

 

Outputs in the data table (clear associations)
Here's the same output file in the data table. Running the workflow generated the aligner_output_crai and aligner_output_cram columns. Note that the unique collaborator_sample_id references the entire row, associating the generated data with the primary data.

Managing-data-with-tables_Generated-data_Screen_shot.png

When you might not want to write outputs to the table
Although we recommend writing outputs to the data table, there are situations where you may not want to: if you cannot fit your data into the data table in a way that makes sense for your analysis, or if you want to test a new method in Terra quickly, with as little setup as possible.

Be careful not to overwrite in the data table
If you use the same output name for multiple runs, Terra will overwrite the links in the data table with the most recent output data link. Note that data from a previous run will still exist in the workspace bucket, but it will be harder to find. 

To be able to compare results from different configurations, you'll want to give your outputs a name that indicates which is which.
configure-workflows_Multiple-test-outputs_Screen_shot.png

How to verify workflow output files

If your output attributes have the format "this.your_filename", the workflow will write output metadata to the "your_filename" column of the data table. You'll see the additional metadata for these output files in the data table after a successful run. 

For example, after completing Exercise 1 in the Workflows-QuickStart tutorial, you'll see the sample table now contains three new columns. Each column corresponds to a different output filetype: outputBai, outputBam, and output_validation_report. The cells include links to files in the workspace Google bucket for each sample. This data is now available for downstream use by other workflows in your workspace (see Exercise 3 of the Workflows Quickstart). 

Whether or not you write to the data table, you can find the output files in your workspace Google bucket by clicking the "Files" icon in the left column of the Data tab:
Data-Google-bucket-Files_Screen_Shot.png

Note about output file folder names
Each time you launch a workflow, Terra will assign a unique submission ID to that submission. This submission ID is also the name of the output folder in the workspace Google bucket. Outputs from multiple submissions of the same workflow in the same workspace will not be overwritten, since they land in different submission ID folders. The sketch below shows the typical layout.
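
As an illustration, output files for a given run typically land under a path of roughly this shape (workflow and task names are hypothetical; the submission ID is the example used earlier):

    gs://<workspace-bucket>/
      83add0f2-ae9b-4a97-9995-104b82e5631f/   (submission ID)
        myWorkflow/                           (workflow name)
          <workflow-id>/
            call-alignTask/
              <output files>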

Isn't there an easier way? Yes! Use JSONs

It's tedious (not to mention error-prone) to type in every attribute by hand. JSON files can vastly simplify the process. To learn how to use a JSON file to configure inputs so you don't have to do it manually, see Getting workflows up and running faster with a JSON file. It's especially useful if you anticipate running much the same configuration many times over. 

One nice aspect of how Terra manages workflows and their configurations is that it allows you to export your workflow config (JSON) back to the Method Repository and share it with others. Conversely, you can import any published workflow config to your own workspace. That can take a lot of the guesswork out of configuring someone else's workflows to run on your own data.

Video and tutorial workflow resources 

To learn more about using data tables to organize your data and enable you to scale your
analysis, see Managing data with workspace tables.

To understand how to adjust data tables, see Making, modifying, and deleting tables.

For hands-on practice with data tables, try the Data Tables QuickStart.

To learn more about how to update workflows to the latest version, see this article.

To see a video tutorial on configuring a workflow, click here.

Hands-on practice setting up and running a workflow analysis
To practice setting up and running workflows, work through the Terra-Workflows-QuickStart workspace. It should take about half an hour to complete the hands-on tutorial and cost less than a dime (GCP costs).

(Note that to run the exercises you will need to clone the workspace to your own billing project.)
