Configuring a workflow means specifying the values or files for the variables your workflow needs to run: for example, defining Inputs (reference files, compute parameters, and input data file names and locations) and Outputs.
This article explains how to use the Terra interface to configure your workflow to run. To learn how to configure with a JSON file instead of doing it manually (especially useful if you anticipate reusing much the same configuration many times), see this article.
- What is there to configure in order to run a workflow?
- How to configure the workflow card
- What are input and output attributes?
- Input, output and workspace attribute formats by type
- Setting workflow inputs
- Configuring outputs to write to a data table
- Practice configuring workflow inputs and outputs
- QuickStart Exercise 2: Setting inputs and outputs in the Terra interface
- How to verify workflow output files
- Next steps - Hands-on workflows practice
- Isn't there a simpler way? Yes! Use JSONs
What is there to configure in order to run a workflow?
Usually you will configure at least two things: 1) any parameters expected by the tools that are not already specified within the workflow script, and 2) what input data to feed to the workflow to analyze. Some workflows require additional input: you may need to select one of several possible analysis options in the case of branched workflows, or you may be able to specify runtime options such as the amount of memory and disk space provided to each task.
You can set up the configuration with or without the Terra data tables. Although we recommend using tables, there are situations where you may not want to: if your data does not fit into a data table in a way that makes sense for your analysis, or if you want to test a new method in Terra quickly, with as little setup as possible.
How to configure the workflow card
You can turn the use of data tables on or off within the workflow card form. Either way, your analysis outputs will be placed in the workspace Google bucket after execution. One advantage of using data tables when running an analysis is the ability to specify which table will reference your outputs after execution. This helps organize your outputs in a way that is meaningful to you, since you can annotate the table column headers.
Selecting the workflow configuration card lists the inputs and parameter settings the workflow expects from you, and displays any default values provided by the workflow author. You must enter values for required inputs; optional inputs can be filled in or left blank. There is also a tab to set Outputs. To learn more about how to configure a method, see this tutorial.
One key advantage of how Terra manages workflows and their configurations is that you can export your workflow config (JSON) back to the Method Repository and share it with others. Conversely, you can import any published workflow config into your own workspace. That can take a lot of the guesswork out of configuring someone else's workflow to run on your own data.
What are input and output attributes?
Attributes are the integers, strings, or files that correspond to variables in the workflow. They are in the last column of the Inputs or Outputs tab (see screenshot of Inputs above).
Input, output and workspace attribute formats by type
Input and output attribute types determine the formatting requirements in the workflows card. See the Inputs diagram above for examples.
Workspace attributes have slightly different formatting:
- Integer - No formatting required. e.g. 42
- String - Quotes required. e.g. "my_string"
- Boolean - Quotes required. Case insensitive, so "true", "True", and "TrUE" are all the same.
- File - Can be referenced from the Google bucket, data model, or workspace attribute section. See the Referencing files section below for details.
- Array[X] - Lists of these attributes can be entered with a comma between each item. e.g. "value1", "value2", "value3"
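As an illustration, workspace attribute values of each type might look like the following (the names and values here are hypothetical, not taken from a real workspace):

```
Integer:        10
String:         "hg38"
Boolean:        "true"
File:           gs://my-bucket/references/ref.dict
Array[String]:  "sampleA", "sampleB", "sampleC"
```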
Setting workflow inputs
Configuring Inputs from a Google bucket (direct reference)
Use "gs://url-to-file-in-bucket" to reference a file in a Google bucket directly. Note that the quotes are necessary when you directly reference a file URL, but not when you reference a file using a data table or the workspace data table (described below).
Configuring Inputs from a data table
Referencing files from a data table is useful when you want to avoid hard-coding values or adjusting your workflow configuration each time you add more data to the table. You can call the files listed under a column using the format this. plus the column title. The keyword this. tells Terra to look in the table you set as your root entity. If you set your root entity to "sample" when you imported the workflow to your workspace, Terra will look in the "sample" table for an attribute (a column) with the name you specify. For example, this.sample_id looks in the "sample" table for the "sample_id" attribute.
This expression also gives you the flexibility to reach into attributes that exist on whatever entity the workflow configuration is running on. For example, say your workflow runs on a pair. The pair table contains a control_sample_id, a case_sample_id, and their corresponding BAM files. If your WDL task requires the case_sample_bam input, you'd use this.case_sample_bam.
Configuring Inputs from the workspace data table
Storing an input as a workspace attribute in the Workspace data table is convenient when you use the same file over and over again in multiple workflows. If the file path changes, you only have one place to update, similar to a global variable in scripting. You can call this by typing workspace. plus the attribute key, for example, workspace.ref_dict. If you type workspace. into the workflow configuration, all the available workspace attributes will auto-populate below. See how to format workspace attribute values here.
Configuring outputs to write to the data table
Writing outputs to the data table is optional because your outputs will go directly into your bucket by default. If you want links to the output file destination in your data table, you need to configure it in the workflow card Outputs attributes using the same formatting (workspace., this., etc.).
To do this, choose a name for the column (or use a pre-existing column) and type this. in front of it. For example, this.analysis_ready_bam will write the BAM to the column called analysis_ready_bam in the sample table (if you chose to run this method on a sample). If the column header doesn't exist yet, the workflow will create it after executing the submission.
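Putting it together, an entry on the Outputs tab pairs a workflow output variable with a table-writing attribute, along these lines (the workflow and variable names here are illustrative, not from a real configuration):

```
Variable:   MyWorkflow.analysis_ready_bam
Attribute:  this.analysis_ready_bam
```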
Practice configuring workflow inputs and outputs
As a reminder, the inputs and outputs are defined in the WDL. Terra interprets the WDL and provides you an input "form" to fill out. The outputs part of the form is optional. To access the "form" with more details on the inner workings of the WDL, click on the workflow card from within the Workflows page.
Below is an example of what you will see after clicking on the workflow card. There are separate forms for the WDL script, inputs, and outputs:
How many workflows and tasks do you see?
Answer: The WDL includes one workflow (CramToBamFlow) with two tasks (CramToBamTask and ValidateSameFile). There are inputs listed for the workflow and separately for the tasks, followed by outputs for the workflow.
In the Inputs tab, you can tell the difference between workflow inputs and task inputs by looking at the name in the first column. The first four inputs, named CramToBamFlow, are the workflow inputs according to the WDL.
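For orientation, a skeletal WDL with this structure might look like the sketch below. The input and output names are assumptions for illustration, and the task bodies are omitted entirely, so consult the actual script in the workflow card for the real definitions:

```wdl
workflow CramToBamFlow {
  File input_cram        # assumed input name, for illustration only
  String SampleName      # assumed to match the SampleName variable in the form

  call CramToBamTask {
    input: input_cram = input_cram, SampleName = SampleName
  }
  call ValidateSameFile {
    input: input_bam = CramToBamTask.outputBam
  }

  output {
    File outputBam = CramToBamTask.outputBam
    File output_validation_report = ValidateSameFile.report
  }
}
```

Notice that the workflow-level inputs (input_cram, SampleName) are the ones surfaced at the top of the Inputs form, while inputs declared inside a task appear under that task's name.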
Workflows QuickStart: Specifying inputs and outputs in the Terra interface
To get a hands-on feel for configuring workflows, make your own copy of the Terra-Workflows-QuickStart. You will have to configure the workflow by filling in the attributes form in Exercise 3.
In the "Inputs" and "Outputs" tabs of the configuration form, you can add (or modify) attribute values by typing the attributes in the right-hand column and saving. See the screenshot below for an example of what your inputs will look like before and after you have configured the workflow:
Don't forget to hit the Save button after filling in the attributes!
You will go through a similar process for the Outputs.
By default, output data will be written to the workspace Google bucket. Note that if you want to write metadata for the output files to the workspace data table, you will need to use the format "this.filename" in the Output attribute.
Successful workflow runs require matching attributes
Example: Matching sample_ids in the Data Table
It's straightforward to check that the attributes in the data table and the workflow card match. In the Workflow inputs, the variable "SampleName" corresponds to the attribute "this.sample_id" (see screenshot):
The prefix "this." tells us that the variable comes from the data table. So the sample_id in the table of the Data tab will be used as the SampleName in the CramToBamFlow workflow. Looking at the data table (screenshot below) confirms that "sample_id" is the column header for the samples. Because it matches the attribute, the workflow will be able to find the right input file when launched, and the run will succeed.
How to verify workflow output files
If your output attributes have the format "this.your_filename", the workflow will write output metadata to the "your_filename" column of the data table. You'll see the additional metadata for these output files in the data table after a successful run.
For example, after completing Exercise 1 in the Workflows-QuickStart practice workspace, you will see that the sample table now contains three extra columns (outputBai, outputBam, and output_validation_report) that reference files in the workspace Google bucket. This data is now available for downstream use by other workflows in your workspace (as you'll see in Exercise 3).
Whether or not you write to the data table, you can find the output files in your workspace Google bucket by clicking on the "Files" icon in the left column of the Data tab:
Note about output file folders: Each time you Launch a workflow, Terra will assign a unique submission ID to that submission. This submission ID is also the name of the output folder in the workspace Google bucket. Outputs from multiple submissions of the same workflow in the same workspace will not be overwritten since they are in different submission ID folders.
- To see a video tutorial of configuring a workflow, click here.
- To practice modifying attributes and running workflows, work through the Terra-Workflows-QuickStart workspace. It should take about half an hour to complete the hands-on tutorial.
Note that to run the exercises you will need to clone the workspace to your own billing project.
Isn't there a simpler way? Yes! Use JSONs
It's tedious (not to mention error-prone) to type in every attribute by hand. JSON files can vastly simplify the process. To learn how to use a JSON to configure your workflow so you don't have to do it manually, see this article. It's especially useful if you anticipate running much the same configuration many times over.
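As a sketch of what such a JSON looks like, each key pairs a fully qualified workflow variable with the attribute expression you would otherwise type into the form (the workflow, column, and attribute names here are hypothetical):

```json
{
  "CramToBamFlow.input_cram": "this.cram_path",
  "CramToBamFlow.SampleName": "this.sample_id",
  "CramToBamFlow.ref_dict": "workspace.ref_dict",
  "CramToBamFlow.ref_fasta": "gs://my-bucket/references/ref.fasta"
}
```

Note that this., workspace., and direct "gs://" references can all be mixed in one file, just as they can in the form.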