How to set up a workflow analysis

Allie Hajian

This article explains how to set up (configure) your workflow analysis in the Terra UI. Note that this article is intended primarily for analyses using data from the workspace data table. Differences for using direct paths are noted in "hints" boxes.     

To learn how to automate some of the setup process by using a JSON file (especially useful if you anticipate using much the same configurations many times), see this article. 

Workflow setup overview

To run a workflow on Terra, you will "configure" the workflow, which just means setting up all the variables and options in the configuration form (also called the "submission form").  

  • Select workflow options such as whether to use caching or delete intermediate files, whether to get inputs from the data table or use full paths, etc. 

  • Specify inputs (e.g. reference files, compute parameters, and data file names and locations)

  • Define output options (optional) - You can designate whether you want the workflow to write links to the output data files to the data table

The workflow configuration form: Step-by-step instructions

You'll set up a workflow to run by filling in or modifying the configuration form (screenshot below). The form lists Inputs and settings the workflow expects, and displays default values provided by the workflow author. 

How to get to the workflow configuration form

You can get to the form in either of two ways:

Start by selecting data

1. Go to the Data page
2. Select the data to analyze
3. Click the three vertical dots (top right)
4. Choose Open with > Workflows

or

Start by selecting a workflow

1. Go to the Workflows page
2. Click the name of the workflow from the available cards

What you see may differ slightly from the screenshots below depending on how you get to the form and which options you choose. The configuration form will look roughly like this:

Configure-workflow_Configuration-card_Screen_shot.png

1. Select the workflow snapshot (version)

You will see all available versions in this dropdown. You can choose to use the most up-to-date version of the workflow, or a previous version (if you need to maintain consistency, for example). Terra will automatically run the version you choose in the UI.

To learn more about how to update workflows to the latest version, see this article.

2. Choose whether to use full paths or the data table for inputs



Advantages of using the data table

  Data tables make it easier to scale and automate
Using the data table lets you reference multiple files without hard-coding values or adjusting your workflow configuration every time you add more data to the table.

Data tables keep all associated data together
No matter where in the cloud your data reside, even data from different sources, the table keeps everything associated in one place.

Although we recommend using data tables, there are situations where you may not want to: if you cannot fit your data into the data table in a way that makes sense for your analysis, or if you want to test a new method in Terra quickly, with as little setup as possible.

Note: the interface form will change slightly based on your choice.

Full file paths configuration
If you choose this option, you can go straight to step five of the configuration form (call caching and delete intermediate files options). Your configuration form will look like this:

Configure-workflows_File-paths_Screen_shot.png

Data table configuration
If you choose this option, your next steps will be to select the "root entity type" (the table that holds the input data) and the input data to analyze. Your configuration form will look like this:

Configure-workflows_Use-table_Screen_shot.png



Additional data tables resources

  To learn more about using data tables to organize your data and scale your analysis, see this article.

To understand how to make, modify, or delete data tables, see this article.

For hands-on practice with data tables, try the Data Tables QuickStart.

3. Select the root entity type from the dropdown (data table only)



What's the "root entity type" of your workflow?

  The "root entity type" is the smallest piece of data a workflow can use as input. Selecting from this dropdown (Step 1 in the form) tells the workflow where to go (which table) for links for the input data.
  • If you can run your workflow on a single entity (like a specimen or  sample) 
    The root entity type is that entity (i.e. specimen or sample)

  • If your workflow takes an array as input and cannot run on a single file
    The root entity type is a set table (i.e. sample_set or specimen_set)

  • If you're running a somatic workflow (on tumor/normal pairs)
    The root entity type is pair 


Hint: what entity type does your workflow accept?

Find the input entity type your WDL expects in the Inputs section of the workflow configuration form. 

  • If the Input Type is "File", your root entity will be a table of single entities (i.e. sample or specimen):
Configure-workflows_Type-File_Screen_shot.png

  • If the Input Type is "Array[File]", the root entity type is an entity_set table or a pairs table (somatic workflows):
Configure-workflows_Type-Arrays_Screen_shot.png

  • If you are running a somatic workflow, the input Type will be Array[File]:
Configure-workflows_Type-Pairs_Screen_shot.png
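As a rough WDL sketch (the workflow and variable names below are hypothetical), the two cases differ only in the declared input type:

```wdl
version 1.0

workflow ExampleInputs {
  input {
    # Type "File": the workflow can run once per single entity,
    # so the root entity type is a single-entity table (e.g. specimen)
    File input_cram

    # Type "Array[File]": the workflow needs the whole collection at once,
    # so the root entity type is a set table (e.g. specimen_set)
    Array[File] input_crams
  }
}
```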


Editing expected input entity types

  To edit a workflow script, you will need to work outside the Terra UI. To learn more about creating and editing workflows, see this article.

Editing the WDL script can change the expected input configuration. You can see this by clicking on the workflow in the Workflows tab and looking at the Inputs section.

Editing expected input entity types (example)

The example below is from the workflow that generates a "Panel of Normals" (PoN). When generating a PoN, this WDL script expects some of the following input types:
  • A set of BAM files representing the list of normal samples. Since the purpose of this workflow is to create a PoN from a set of files, this input is handled as an Array.
  • A reference file. Since a single reference file can be useful in a variety of tasks, this input is handled as a File.
  • The name of a database used to inform the PoN generation (in this case, the gnomAD database tells the tool the allelic fractions within this germline resource). Since this task does not need to localize the entire gnomAD database, it is sufficient to designate the input as a String matching the name of the database. The name of the PoN file is also just a String.
EntityTypes.png
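Sketched as WDL input declarations (the names below are illustrative, not necessarily those in the actual script), these types would look like:

```wdl
version 1.0

workflow CreatePanelOfNormals {
  input {
    Array[File] normal_bams      # the set of normal-sample BAMs, handled as an Array
    File ref_fasta               # a single reference file, handled as a File
    String gnomad_resource_name  # name of the germline resource database, a String
    String pon_name              # name of the output PoN file, also a String
  }
}
```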

4. Select data to analyze (only for inputs from data table)

If you are using full paths for your input data, this option will not appear on your configuration form. Note: your setup form may look slightly different depending on whether you started your analysis from the data tab or the Workflows tab. 

(option 1) Starting from the Data table (example screenshots)

If you started by selecting data, these parts should already be filled in. Beside the blue "Select Data" button you should see the data you have selected. Open the section below that corresponds to your use-case for a screenshot of what to expect.

Input is a single entity (or several single entities)

Note that if you run on more than one data file, Terra will create a set of those particular entities by default. It will name the set "workflow-name" + "run date". To give the set a more meaningful name, you must start your analysis from the workflow configuration card.

Configure-workflow_Select-data-specimen-default_Screem_shot.png 

Input is a set of single entities (data files) from an entity_set table

Note that the root entity type is still the single entity. Terra will submit as many jobs as there are members in the set to run in parallel:
Configure-workflow_Select-data-specimens-run-set_Screen_shot.png

Input is an array of entities

If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. In the screenshot below, the workflow is running one job on an array of entities defined in the specimen_set table:
Configure-workflow_Select-data-specimen-set_Screen_shot.png

Input is a tumor/normal pair (somatic workflows)

The root entity type is pair. In the screenshot below, Terra will run two tumor/normal pair workflows in parallel and will create a pair_set table that includes those two specific pairs:
Configure-workflow_Data-table-input-Pairs_Screen_shot.png  

(option 2) Starting from the workflows page (example screenshots)

If you started your analysis setup from the workflow page, you will first need to select the data to analyze by clicking the blue button in Step 2 of the configuration form.

Configure-workflows_Select-data_Screen_shot.png

Once you do, you'll be taken to a form to select data. Click each use-case below to see screenshots of what to expect.

Input is a single entity (or several single entities)

Terra will process all the entities in the data table by default. You can also select exactly which entities to process. If you analyze more than one data file, Terra will create a set of those inputs and you'll be able to name the set containing those particular entities:
Configure-workflows_Select-data_Specimen_Step-1_Screen_shot.png

After making your selection, make sure to click OK at the bottom of the form. Your workflow configuration form will now look like this:
Configure-workflows_Select-data_Specimens_Step-2_Screen_shot.png

Input is a set of single entities (data files) from an entity_set table

Note that the root entity type is still the single entity. Your Select Data form will look like this (if you have some entity_sets already defined): Configure-workflows_Select-data_Specimens-in-set_Step-1_Screen_shot.png

Terra will submit as many jobs as there are members in the set to run in parallel:
Configure-workflow_Select-data-specimens-run-set_Screen_shot.png

Input is an array of single entities

In this case, the workflow accepts an array (set) of entities as input. The root entity type is entity_set. Your Select Data form will look like this:

Configure-workflows_Select-data-arrays-input_Screen_shot.png

If you choose one set from the specimen_set table, your configuration form will look like this:
Configure-workflow_Select-data-specimen-set_Screen_shot.png

Input is a tumor/normal pair (somatic workflows)

You can choose exactly which tumor-normal pairs you want to analyze in the Select Data form:
Configure-workflows_Select-data-from-workflow_Pairs_Screen_shot.png
Note that if you select more than one pair, you can name the new set that Terra will automatically generate.

Terra will analyze the selected tumor/normal pairs in parallel and will create a pair_set that includes those three selected pairs:
Configure-workflow_Data-table-input-Pairs_Screen_shot.png

5. Call caching/Delete intermediate outputs options

Configure-workflows_call-caching-option_Screen_shot.png

These two options save storage costs in two different ways, and cannot be combined.

To learn more about call caching and when to use it, see this article. To learn how to save storage costs by deleting intermediate outputs, see this article.

6. Configure Inputs



Configuring inputs and outputs in the UI

  You'll tell Terra where to get the inputs by filling in the Attributes field in the setup form. Attributes are the integers, strings, or files that correspond to variables in the workflow. You'll specify them by choosing the Inputs tab in the configuration form and filling in the field in the last column of the Inputs or Outputs tab.  

Configure-workflows_Attributes_Screen_shot.png

What formatting you use will depend on whether you use the data table (directions below) or direct paths (scroll down). 

Configure-workflows_Inputs-tab_Screen_shot.png

Using the data table

You'll connect the input assignments on the Inputs tab to specific columns in the data table or variables in the workspace resource table.  



Formatting requirements for input attributes

  For each variable that needs to be connected to a column in the data table, start typing the prefix this. in the attribute field.

For workspace-wide variables (like the genome reference sequence file, for example, or
the GATK Docker), start typing workspace. 

Once you start typing, you'll see a dropdown with all the available options from the
relevant table:

  • this. points to whatever table you selected as the "root entity type" in the
    configuration form

  • workspace. always points to workspace resources in the "Workspace Data" table.

Single entity example
In the screenshot below, there are five items in the dropdown after typing this. in the attribute field. Each corresponds to a column in the root entity table. Scroll down to select the one corresponding to the InputCram variable, this.CRAM.

Configure-workflows_this.-formatting_Screen_shot.png

 



Advanced formatting examples (other inputs)

  The this. prefix tells Terra to look in the table you set as your root entity. Additional formatting tells your workflow exactly where the data are for array and pair inputs.

Array input example
In the case of array inputs, the root entity type is a _set table, but the data files are actually in the single entity table.

In this case, you will use the format this.specimens.data-file-name 

Tumor/normal pairs example
The pair table contains a control_sample_id, a case_sample_id, and their corresponding BAM files. Your WDL task requires the case_sample_bam input. You'd use this.case_sample_id.case_sample_bam.  

This expression syntax gives you the flexibility to reference any entity, including entities in nested tables. If your desired input is a single file, the syntax points directly at the file. If your desired input is a set of files nested inside another table, the syntax must first point to the correct table, and then to the desired files within it. Looking at the Type and Attributes columns is a quick way to check how your workflow is set up.
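Putting these patterns together, typical attribute expressions look like the following (the column names here are hypothetical):

```
this.CRAM                             # a column in the root entity table
this.specimens.cram_file              # a column in the member table behind a _set root entity
this.case_sample_id.case_sample_bam   # a column reached through a pair table reference
workspace.ref_fasta                   # a key in the workspace data (resource) table
```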




If you don't see the right input variable in the dropdown

  Check your root entity type to make sure you specified the right table. This can be tricky if you are using interconnected tables!

For example, if you're running multiple workflows in parallel on a group in a specimen_set table, the root entity type is specimen. You only use the specimen_set to choose which data to process.




Some workflows may require additional types of input

  You may need to select one of several possible analysis options in the case of branched workflows, or you may have the opportunity to specify runtime options like the amount of memory and disk space provided to each task, for example. You will find fields for these options, if available, in the Inputs section of the configuration form.

Workspace-wide inputs (the workspace data table)

Storing an input file as a workspace attribute in the Workspace Data table is convenient when you use the same file over and over in multiple workflows. If the file path changes, you have only one place to update, similar to a global variable in scripting. You reference the attribute by typing workspace. plus the attribute key (e.g. workspace.ref_fasta or workspace.ref_dict).



Workspace attribute formatting

  Integer - No formatting required

String - Quotes required, e.g. "my string"

Boolean - Quotes required. Case insensitive, so "true", "TRue", and "TrUE" are all equivalent.

File - This type can be referenced from the Google bucket, the data model, or the workspace attributes section.

Array[X] - Enter lists of these attributes with a comma between items, e.g. "a","b","c" or 1,2,3 or "true","True","TruE","TRUE"
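For example, a hypothetical set of attribute values following these rules might look like:

```
read_length    3                         # Integer: no quotes
sample_prefix  "NA12878"                 # String: quoted
run_qc         "true"                    # Boolean: quoted, case insensitive
ref_fasta      "gs://my-bucket/ref.fa"   # File: quotes needed when referencing a bucket URL directly
chromosomes    "chr1","chr2","chrX"      # Array[String]: comma-separated, each item quoted
```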

Using direct paths as inputs

Use "gs://url-to-file-in-bucket" to reference a file in a Google bucket directly. 

Formatting requirement - The quotes are necessary if you are directly referencing a file URL.
Workflow_hardcoded_attribute_Screen_Shot.png

 

7. Configure outputs (data table only)

When using the data table, you can choose what to do with the workflow outputs. Generated files are stored in the workspace bucket by default, but you can also have the workflow write links to the output files right in the data table. Specifically, you'll determine the column name under which the outputs will be added to the data table. Click the "Use defaults" link to use the output variable names, or specify something different: either a column that already exists, or a new name that the system will use to create a new column.


Why write outputs to the data table?

  Writing to the data table allows you to associate generated output with the input data file,
as well as organize your outputs in a way that is meaningful to you and makes it easy to
use the data for downstream analysis.

Where are generated output files stored?
Whether or not you write links to the data table, your analysis outputs are stored in the workspace Google bucket by default. Note: the default folders are named by the submission ID, and finding files in folders designated by random strings of numbers and letters (e.g. 83add0f2-ae9b-4a97-9995-104b82e5631f) can be challenging. See the example below.

Workflow outputs in the Google bucket (file folder is random string)
Managing-data-with-tables_Generated-data-in-bucket_Screen_shot.png

Workflow outputs in the data table (clear associations)
Here's the same output file in the data table. Running the workflow generated the aligner_output_crai and aligner_output_cram columns. Note that the unique collaborator_sample_id references the entire row, associating the generated data with the primary data.

Managing-data-with-tables_Generated-data_Screen_shot.png

When you might not want to write outputs to the table
Although we recommend writing outputs to the data table, there are situations where you may not want to: if your outputs do not fit into the data table in a way that makes sense for your analysis, or if you are quickly testing a new method in Terra with as little setup as possible.




Considerations for naming outputs

  If you use the same output name for multiple runs, Terra will overwrite the links in the data table with the most recent output data link. Note that data from a previous run will still exist in the workspace bucket, but it will be harder to find. 

To be able to compare results from different configurations, you'll want to give your outputs a name that indicates which is which:

configure-workflows_Multiple-test-outputs_Screen_shot.png   

How to verify workflow output files

If your output attributes have the format "this.your_filename", the workflow will write output metadata to the "your_filename" column of the data table. You'll see the additional metadata for these output files in the data table after a successful run. 

For example, after completing Exercise 1 in the Workflows-QuickStart practice workspace, you will see that the sample table now contains three extra columns. These include links to files in the workspace Google bucket - outputBai, outputBam, and output_validation_report. This data is now available for downstream use by other workflows in your workspace (as you'll see in Exercise 3). 

Whether or not you write to the data table, you can find the output files in your workspace Google bucket by clicking the "Files" icon in the left column of the Data tab:
Data-Google-bucket-Files_Screen_Shot.png

Note about output file folders: Each time you Launch a workflow, Terra will assign a unique submission ID to that submission. This submission ID is also the name of the output folder in the workspace Google bucket. Outputs from multiple submissions of the same workflow in the same workspace will not be overwritten since they are in different submission ID folders.



Video and tutorial workflow resources 

  To see a video tutorial of configuring a workflow, click here.

Hands-on practice setting up and running a workflow analysis
To practice modifying attributes and running workflows, work through the Terra-Workflows-QuickStart workspace. It should take about half an hour to complete the hands-on tutorial.

Note that to run the exercises you will need to clone the workspace to your own billing project.

Isn't there an easier way? Yes! Use JSONs

It's tedious (not to mention error-prone) to type in every attribute by hand. JSON files can vastly simplify the process. To learn how to use a JSON file to configure inputs so you don't have to do it manually, see this article. It's especially useful if you anticipate running much the same configuration many times over.
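As a rough sketch (the workflow and variable names below are hypothetical), an inputs JSON simply maps fully qualified workflow variable names to attribute expressions or literal values:

```json
{
  "MyWorkflow.input_cram": "this.CRAM",
  "MyWorkflow.ref_fasta": "workspace.ref_fasta",
  "MyWorkflow.sample_name": "this.specimen_id",
  "MyWorkflow.memory_gb": 8
}
```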

One nice aspect of how Terra manages workflows and their configurations is that it allows you to export your workflow config (JSON) back to the Method Repository and share it with others. Conversely, you can import any published workflow config to your own workspace. That can take a lot of the guesswork out of configuring someone else's workflows to run on your own data.

