How to set up a workflow analysis
This article explains how to set up (configure) your workflow analysis in the Terra UI. Note that this article is intended primarily for analyses using data from the workspace data table. Differences for using direct paths are noted in "hints" boxes.
To learn how to automate some of the setup process by using a JSON file (especially useful if you anticipate using much the same configurations many times), see this article.
Contents
Workflow setup (configuration) overview
The workflow configuration form
Configure Inputs in the UI
Configure Outputs (data table option only)
How to verify workflow output files
Resources and next steps
Isn't there an easier way? Yes! Use JSONs
Workflow setup overview
To run a workflow on Terra, you will "configure" the workflow, which just means setting up all the variables and options in the configuration form.
- Select workflow options such as whether to use caching or delete intermediate files, whether to get inputs from the data table or use full paths, etc.
- Specify inputs (i.e. reference files, compute parameters, and data file names and locations)
- Define output options (optional) - You can choose whether the workflow writes links to the generated output files back to the data table
How to set up your workflow step-by-step - The configuration form
You'll set up a workflow to run by filling in or modifying the configuration form (screenshot below). The form lists Inputs and settings the workflow expects, and displays default values provided by the workflow author.
How to get to the workflow configuration form
- Select data in the Data page and choose "Open with workflow" after clicking the three vertical dots at top right
- Select the workflow you want to run from the Workflows tab of your workspace
What you see may differ slightly from the screenshots below depending on how you get to the form and which options you choose. The configuration form will look roughly like this:
1. Select the workflow snapshot (version)
You will see all available versions in this dropdown. You can choose to use the most up-to-date version of the workflow, or a previous version (if you need to maintain consistency, for example). Terra will automatically run the version you choose in the UI.
To learn more about how to update workflows to the latest version, see this article.
2. Choose whether to use full paths or the data table for inputs
Although we recommend using data tables, there are situations where you may not want to use them; in those cases, choose the full file paths option instead.
Full file paths configuration
If you choose this option, you can go straight to step five of the configuration form (call caching and delete intermediate files options). Your configuration form will look like this:
Data table configuration
If you choose this option, your next steps will be to select the "root entity type" (the table that holds the input data) and the input data to analyze. Your configuration form will look like this:
To learn more about using data tables to organize your data and enable you to scale your analysis, see this article. To understand how to make, modify, or delete data tables, see this article. For hands-on practice with data tables, try the Data Tables QuickStart.
3. Select the root entity type from the dropdown (data table only)
The "root entity type" is the smallest piece of data a workflow can use as input. Selecting the root entity type tells Terra which data table holds the input data for your workflow.
Hint: what entity type does your workflow accept?
Find the input entity type your WDL expects in the Inputs section of the workflow configuration form.
- If the Input Type is "File", your root entity will be a table of single entities (i.e.
sample_set
orspecimen_set
): - If the Input Type is "Array[File]", the root entity type is an entity_set table or a pairs table (somatic workflows):
- If you are running a somatic workflow, the input Type will be Array[File]:
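For instance, here's a minimal WDL sketch of the two cases (the variable names are illustrative, not from any particular workflow):

```wdl
# Root entity is a single-entity table (e.g. a specimen table):
# the workflow takes one file per table row.
input {
  File input_cram
}

# Root entity is a set table (e.g. specimen_set): the workflow
# takes an array of files, one per member of the set.
input {
  Array[File] input_crams
}
```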
4. Select data to analyze (only for inputs from data table)
If you are using full paths for your input data, this will not appear on your configuration form.
If you're using data tables, your configuration form will look slightly different depending on whether you started your analysis from the Data tab or the Workflows tab.
Starting a workflow analysis from the Data table (examples)
If you started by selecting your data in the data tab, these parts should already be filled in. Beside the blue "Select Data" button you should see the data you have selected. Open the section below that corresponds to your use-case for a screenshot of what to expect.
Input is a single entity (or several single entities)
Input is a set of single entities (data files) defined in an entity_set table

Input is an array of entities (i.e. a set from the specimen_set table)
Input is a tumor/normal pair (somatic workflows)
The root entity type is pair. In the screenshot below, Terra will run two tumor/normal pairs workflows in parallel and will create a pair_set table that includes those two specific pairs:
Starting a workflow analysis from the workflows page
If you started your analysis setup from the workflow configuration card, you will need to select the data to analyze by clicking the blue button in Step 2 of the configuration form.
Once you do, you'll be taken to a form to select data. Click each use-case below to see screenshots of what to expect.
Input is a single entity (or several single entities)

After making your selection, make sure to click OK at the bottom of the form. Your workflow configuration form will now look like this:
Input is a set of single entities (data files) from an entity_set table

Terra will submit as many jobs as there are members in the set to run in parallel:
Input is an array of single entities
If you choose one set from the specimen_set table, your configuration form will look like this:
Input is a tumor/normal pair (somatic workflows)

Note that if you select more than one pair, you can name the new set that Terra will automatically generate.
Terra will analyze the selected tumor/normal pairs in parallel and will create a pair_set that includes those three selected pairs:
5. Call caching/Delete intermediate outputs options
These two options save storage costs in two different ways, and cannot be combined.
To learn more about call caching and when to use it, see this article. To learn how to save storage costs by deleting intermediate inputs, see this article.
Attributes are the integers, strings, or files that correspond to variables in the workflow.
6. Configure Inputs
You'll configure workflow inputs by filling in the attributes field in the Inputs tab of the configuration form. What formatting you use will depend on whether you use the data table (directions below) or direct paths (scroll down).
Using the data table
You'll connect the input assignments on the Inputs tab to specific columns in the data table or variables in the workspace resource table.
For each variable that needs to be connected to a column in the data table, start typing this. in the attribute field. For workspace-wide variables (like the genome reference sequence file, for example), start typing workspace. instead. Once you start typing, you'll see a dropdown with all the available options from the data table (or the workspace attributes).
Single entity example
In the screenshot below, there are five items in the dropdown after typing this. in the attribute field. Each corresponds to a column in the root entity table. Scroll down to select the one corresponding to the InputCram variable, this.CRAM.
Array input example
In this case, you will use the format this.specimens.cram, i.e. this. plus the set's membership column (specimens) plus the column that holds the file.
Tumor/normal pairs example
For pairs, use this.case_sample.cram for the tumor member and this.control_sample.cram for the normal member. This expression gives you the flexibility to reference any entity, including using nested references.
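Taken together, the attribute expressions look like this (the column names are examples; yours will match your own tables):

```
this.CRAM                  # a column in the root entity table (single entity)
this.specimens.cram        # array input: the cram column for every member of a specimen_set
this.case_sample.cram      # the tumor member of a pair (somatic workflows)
this.control_sample.cram   # the normal member of a pair
workspace.ref_fasta        # a workspace-wide attribute (see below)
```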
Check your root entity type to make sure you specified the right table. This can be tricky. For example, if you're running multiple workflows in parallel on a group in a specimen_set, each parallel workflow runs on a single specimen, so the root entity type is specimen and the inputs use this.cram, not this.specimens.cram.
You may need to select one of several possible analysis options in the case of branched workflows.
Workspace-wide inputs (the workspace data table)
Storing an input file as a workspace attribute in the Workspace data table is convenient if you are using a file over and over again in multiple workflows. If the file path changes, you only have one place to update, similar to global variables in scripting. You can call this by typing workspace. plus the attribute key (i.e. workspace.ref_fasta or workspace.ref_dict).
Using direct paths as inputs
Use "gs://url-to-file-in-bucket" to reference a file in a Google bucket directly.
Formatting requirement - The quotes are necessary if you are directly referencing a file URL.
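For example (a made-up bucket and path, for illustration):

```
"gs://my-example-bucket/references/Homo_sapiens_assembly38.fasta"
```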
To edit a workflow script, you will need to work outside the Terra UI. Note that editing the WDL script can change the expected input configuration; you will be able to see those changes reflected in the Inputs section of the configuration form.
Editing expected input entity types (example)
- A set of BAM files representing the list of normal samples. Since the purpose of this workflow is to create a PoN from a set of files, this input is handled as an Array.
- A reference file. Since a single reference file can be useful in a variety of tasks, this input is handled as a File.
- The name of a database used for informing the PoN generation (in this case, the gnomAD database is used to inform the tool of the allelic fractions within this germline resource). Since this task does not need to localize the entire gnomAD database, it is sufficient to designate an input as String matching the name of the database. The name of the PoN file is also just a String.
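Here's a minimal WDL sketch of an input block matching that description (the workflow and variable names are illustrative, not the actual workflow's):

```wdl
version 1.0

workflow CreatePanelOfNormals {
  input {
    Array[File] normal_bams  # the set of normal-sample BAMs (an Array)
    File        ref_fasta    # a single reference file
    String      gnomad       # name of the germline resource database
    String      pon_name     # name of the output PoN file (also a String)
  }
  # ... tasks that build the PoN from normal_bams would follow ...
}
```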

7. Configure outputs (data table only)
Where are generated output files stored?
By default, Terra writes generated output files to the workspace Google bucket, in folders named by the submission ID. Finding the files in those folders can be cumbersome.
Advantages of using a data table
Writing the output links to the data table keeps generated files clearly associated with the primary input data. Although we recommend using data tables, there are situations where you may not want to write outputs there; the files will still be in the workspace bucket either way.
Compare output default storage versus writing to the data table

Workflow outputs in the data table (clear associations)
Here's the same output file in the data table. Running the workflow generated the aligner_output_crai and aligner_output_cram columns. Note that the unique collaborator_sample_id references the entire row, associating the generated data with the primary data.
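An illustrative row (the sample ID and paths are made up) shows how the generated columns line up with the primary data:

| collaborator_sample_id | CRAM | aligner_output_cram | aligner_output_crai |
|---|---|---|---|
| SAMPLE_001 | gs://.../SAMPLE_001.cram | gs://.../SAMPLE_001.aligned.cram | gs://.../SAMPLE_001.aligned.cram.crai |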
If you use the same output name for multiple runs, Terra will overwrite the links in the data table each time. To be able to compare results from different configurations, you'll want to give your outputs unique names.
How to verify workflow output files
If your output attributes have the format "this.your_filename", the workflow will write output metadata to the "your_filename" column of the data table. You'll see the additional metadata for these output files in the data table after a successful run.
For example, after completing Exercise 1 in the Workflows-QuickStart practice workspace, you will see that the sample table now contains three extra columns. These include links to files in the workspace Google bucket - outputBai, outputBam, and output_validation_report. This data is now available for downstream use by other workflows in your workspace (as you'll see in Exercise 3).
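For example, the output attributes behind those three columns look like this:

```
this.outputBam                 # writes the BAM link to the outputBam column
this.outputBai                 # writes the index link to the outputBai column
this.output_validation_report  # writes the report link to its own column
```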
Whether or not you write to the data table, you can find the output files in your workspace Google bucket by clicking on the "Files" icon in the left column of the Data tab:
Note about output file folders: Each time you launch a workflow, Terra will assign a unique submission ID to that submission. This submission ID is also the name of the output folder in the workspace Google bucket. Outputs from multiple submissions of the same workflow in the same workspace will not be overwritten, since they are in different submission ID folders.
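A full output path will look roughly like this (the exact folder layout below the submission ID depends on the workflow; this is a hypothetical example):

```
gs://<workspace-bucket>/<submission-id>/<workflow-name>/<workflow-id>/call-<task-name>/output.bam
```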
Resources and next steps
To practice modifying attributes and running workflows, work through the Terra-Workflows-QuickStart workspace. Note that to run the exercises you will need to clone the workspace to your own billing project.
Isn't there an easier way? Yes! Use JSONs
It's tedious (not to mention error-prone) to type in every attribute by hand. JSON files can vastly simplify the process. To learn how to use a JSON file to configure your workflow so you don't have to do it manually, see this article. It's especially useful if you anticipate running much the same configuration many times over.
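As a sketch, an inputs JSON simply pairs each workflow variable with the value or expression you would otherwise type into the attribute field (the workflow and variable names here are hypothetical):

```json
{
  "MyWorkflow.inputCram": "this.CRAM",
  "MyWorkflow.refFasta": "workspace.ref_fasta",
  "MyWorkflow.memoryGb": 8
}
```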
One nice aspect of how Terra manages workflows and their configurations is that it allows you to export your workflow config (JSON) back to the Method Repository and share it with others. Conversely, you can import any published workflow config to your own workspace. That can take a lot of the guesswork out of configuring someone else's workflows to run on your own data.