How to configure workflow inputs

Allie Hajian

This article explains how to set up (configure) and run a workflow using inputs from a table.

Workflow setup: Inputs overview

To run a workflow on Terra, you need to specify all required workflow input variables. This article walks through the setup process when using inputs stored in the data table.

To learn how to set up inputs with direct paths (the URL of the data file in the cloud), see the direct-path instructions later in this article.

Input options

You can specify the data that your workflow operates over in two ways:

  • Data table columns. In most cases, we recommend specifying your workflow inputs by referring to columns in a Terra data table within the same workspace. Using data tables makes it easy to scale and automate your analysis, because you can reference multiple files without relying on error-prone hard-coded values.
    You also won't have to adjust your workflow's configuration if you add more data to the table. In addition, data tables keep your workflow inputs and outputs together, no matter where in the cloud they reside. 
  • Hard-coded file paths. Alternatively, you can hard-code paths to your files' location on the cloud.

When NOT to use a data table

Although we generally recommend using data tables, you may not want to if:
- you cannot fit your data into the data table in a way that makes sense for your analysis, or
- you want to test a new method in Terra quickly, with as little setup as possible.

To learn more about setting up a data table, see Organizing data with workspace tables.

Step 1: Select data

Note: These instructions are for running a workflow with inputs from a data table. For details of how to set up inputs with direct paths, scroll down to the bottom of this section. You can start by selecting data (on the Data page) or by selecting the tool (on the Workflows page).

  • Starting from the Data page

    1.1. Go to the Data tab.

    1.2. Select the table that includes links to the input data files.

    1.3. Select the rows with the entities to analyze.

    1.4. Click Open with.

    1.5. Choose Workflows.

    1.6. Choose the workflow to run to expose the workflow configuration form.

  • Starting from the Workflows page

    1.1. Go to the Workflows page.

    1.2. Click the name of the workflow from the available cards to expose the configuration form.
    Gif of a Terra workspace showing the steps to open a workflow configuration page from the Workflows tab. First, the Workflows tab is selected, followed by the workflow to be run. The workflow configuration page opens, where you can select data and specify inputs and outputs.

    1.3. Select the root entity type from the Step 1 dropdown. The root entity type is the table that contains the inputs required by the workflow.

    Selecting the right root entity table
    Note: The dropdown includes all the tables in your workspace. If you have more than one table and don't know which one is right, see Selecting the root entity type or When to use a set table for workflow inputs for guidance.

    1.4. Click the blue Step 2 Select Data button in the configuration form to select the specific entities to analyze.

    Screenshot of workflow configuration card showing step 1 - select root entity type - and step 2 - select data with a circle around step 2

    1.5. Follow the prompts in the form to select the data to analyze. See screenshots of what to expect for different use cases below.

    Single entities (samples, specimens etc.)

    If your workflow runs on a single entity, you can process all rows of data or choose specific rows to process. If you analyze more than one data file, Terra creates a set of those inputs and you can name the set containing those particular entities:

    Screenshot of select data popup with radio button process all two rows selected and a field labeled all two specimens will be saved as a specimen set containing the set name AH-test-11-22-2020

    After making your selection, click OK at the bottom of the form. 

    A group of single entities

    If your workflow runs on a single entity, but you are running on a set of single entities, the root entity type is still the single entity table. You can choose to run on all entities, pick specific entities to run on, or choose to run on the entities in a predefined set. Terra will submit as many jobs as there are members in the set to run in parallel. Your Select Data form will look like this (assuming you have some subsets already defined as entity_set tables).

    Screenshot of select data popup with the radio button choose existing sets selected and three sets listed below in the column specimen_set_id

    Arrays

    If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. See When to use sets as inputs to a workflow for more information. Your Select Data form will look like this.

    Screenshot of select data popup with the radio button choose existing sets selected and three sets listed below in the column specimen_set_id. The box to the left of the human_all set is checked off

    Tumor/normal pairs (somatic workflows)
    You can choose exactly which tumor-normal pairs you want to analyze in the Select Data form:

    Screenshot of select data popup with the process all three radio button selected and the field titled all pairs will be saved as a new set with the name filled in with test-set-11-22-2020

    Note: If you select more than one pair, you can name the new set that Terra will automatically generate. Terra will analyze the selected tumor/normal pairs in parallel and will create a pair_set that includes those selected pairs.

    1.6. Click the blue "OK" button at the bottom right to confirm your data selection.

Verify data selection

Beside the blue "Select Data" button, you should see the data you selected. Find your use case below to see a screenshot of what to expect.

  • If your workflow runs on a single entity (or several single entities), your form will look like this when you've selected the data.

    Screenshot of the top of the workflow configuration form with a circle around step 2, select data and the note 3 selected specimens will create a new set named 1-single-input-workflow-2020-11-12T20-41-34

    Note: If you run on more than one data file, Terra creates a set of those particular entities by default. It names the set "workflow-name" + "run date". To give the set a more meaningful name, start your analysis from the workflow configuration card.

  • If your workflow runs on single entities and you run a set of single entities, the root entity type is still the single entity. Terra will submit as many jobs as there are members in the set to run in parallel. Your form will look like this.

    Screenshot of the top of the workflow configuration form with a circle around step 2, select data and the note specimens from one set
  • If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. In the screenshot below, the workflow is running one job on an array of entities defined in the specimen_set table.

    Screenshot of the top of the workflow configuration form with a circle around step 2, select data and the note one selected specimen set
  • If you run a somatic workflow, the (typical) root entity type is pair. In the screenshot below, Terra will run two tumor/normal pairs workflows in parallel and will create a pair_set table that includes those two specific pairs.

    Screenshot of the top of the workflow configuration form with a circle around step 2, select data and the note two selected pairs will create a new set named PairsSetInputMethod_2020-11-23T20-12-16
  • If you use file paths for your input data, you can enter the full path directly into the attribute field. Your configuration form will look like this.

    Screenshot of input tab of workflow configuration card with the attributes for the InputCram and RefDict variables circled and full gs paths in the attribute field

    Format when using direct paths as inputs
    Use "gs://url-to-file-in-bucket" to reference a file in a Google bucket directly.

    Formatting requirement - The quotes are necessary if you directly reference a file URL.

    Closeup of hardcoded attribute with a circle highlighting the full path name included in double quotes in the attribute field
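
The quoting rule for direct paths can be sketched as a small helper. This is only an illustration of the formatting requirement described above, not part of Terra itself; the function name and the bucket path are hypothetical.

```python
def quote_direct_path(path: str) -> str:
    """Return a gs:// URL wrapped in double quotes, as the attribute field expects.

    Hypothetical helper for illustration; Terra does not provide this function.
    """
    # Remove surrounding whitespace and any quotes the caller already added.
    stripped = path.strip().strip('"')
    if not stripped.startswith("gs://"):
        raise ValueError(f"expected a gs:// URL, got: {stripped!r}")
    return f'"{stripped}"'


# Hypothetical bucket and file name.
print(quote_direct_path("gs://my-bucket/samples/sample1.cram"))
```

The printed value, including the double quotes, is what you would type into the attribute field.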

Step 2. Specify fixed workflow attributes

Attributes are the integers, strings, or files that correspond to input variables in the workflow. You specify inputs by filling in the Attributes fields for all required variables in the setup form.

Screenshot of inputs tab of workflow configuration card with attribute column circled at far right and four filled in attributes along with two blank attributes

2.1. Fill in fixed attributes. These include variables like disk or memory size or Docker image URLs.

Some common attribute formats

Integer - No formatting required
String - Quotes required, e.g., "my string"
Boolean - Quotes required. Case insensitive, so "true", "TRue", and "TrUE" are the same.
File - This type can be referenced from the Google bucket, data model, or workspace attribute section.
Array[X] - Lists of these attributes can be entered with a comma between each item, e.g., "a","b","c" or 1,2,3 or "true","True","TruE","TRUE"
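
As a rough sketch of the formats listed above, the following helper turns Python values into the text you would type into an attribute field. It follows the rules as stated in this article (no formatting for integers, quotes for strings and Booleans, commas between array items). The function name is hypothetical, and File attributes that reference a data table or workspace attribute are not covered here.

```python
def format_attribute(value) -> str:
    """Format a Python value as fixed-attribute text, per the list above.

    Hypothetical helper for illustration only.
    """
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return f'"{str(value).lower()}"'
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        return f'"{value}"'
    if isinstance(value, (list, tuple)):
        # Array[X]: format each item, separated by commas.
        return ",".join(format_attribute(item) for item in value)
    raise TypeError(f"unsupported attribute type: {type(value).__name__}")


print(format_attribute(50))
print(format_attribute("my string"))
print(format_attribute(True))
print(format_attribute(["a", "b", "c"]))
```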

Step 3. Specify flexible workflow attributes from a data table

3.1. For each variable that comes from a data table (either an entity table or the workspace data table), click into the attribute field.

3.2. Once you click in, you'll see a drop-down menu with all the available options from both the root entity table and the workspace table (i.e., workspace-level resources). Choose the right variable from the dropdown. HINT: look across the row to see what the variable is!

What is in the drop-down menu?

Input data files
Attributes that begin with this. are taken from the table you selected as the "root entity type" in the configuration form. The drop-down menu will list all columns in the root entity table.

Workspace-level resources
Attributes that begin with workspace. are from the workspace data table.
Storing a file as a workspace attribute in the workspace data table is convenient if you use it repeatedly in multiple workflows. If the file path changes, you only have one place to update, similar to global variables in scripting. Workspace attributes are referenced with the format workspace. plus the attribute key (e.g., workspace.ref_fasta or workspace.ref_dict).
The drop-down menu will list all workspace-level resource files in the workspace data table.
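
To make the two prefixes concrete, here is a minimal sketch that reports where an attribute value comes from based on how it starts. The function name and the returned labels are hypothetical; the only rules assumed are the this. and workspace. prefixes described above.

```python
def attribute_source(expression: str) -> str:
    """Classify an attribute by its prefix (illustrative helper only)."""
    if expression.startswith("this."):
        return "root entity table"
    if expression.startswith("workspace."):
        return "workspace data table"
    # Anything else is treated as a fixed value typed in by hand.
    return "fixed value"


for example in ("this.CRAM", "workspace.ref_fasta", '"gs://url-to-file-in-bucket"'):
    print(example, "->", attribute_source(example))
```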

  • Format: this.CRAM. In the screenshot below, there are five items in the drop-down menu after clicking into the InputCram (circled) attribute field. Each corresponds to a column in either the root entity or the workspace data table. Scroll down to select the one corresponding to the InputCram variable, this.CRAM.

    Screenshot of input tab of workflow configuration form with the first variable - InputCram - circled and the entry this.cram circled in the dropdown menu



  • Required format (must be typed in exactly like the example)

    this.your-entity+s.your-variable-name

    Screenshot of input tab of workflow configuration form with attributes for the first two variables - r1-fastq and sample_id - circled and the entries this.specimens.r1_fasta and this.specimens.specimens_id circled in the dropdown menu

    If your workflow runs on an array of entities, the format is slightly different!
    Note: This option will not show up automatically in the drop-down menu.

    To learn more, see Configuring workflow inputs: sets and pairs tables.

  • Required format (must be typed in exactly as shown)

    this.case_sample.your-variable-name or this.normal_sample.your-variable-name

    Screenshot of input tab of workflow configuration form with attributes for the first two variables - tumor_reads and tumor_reads_index - circled and the values this.case_sample.bam and this.case_sample.bam_index in the attributes field.

    If you run a somatic workflow, the format is slightly different.
    Note: This option will not show up automatically in the drop-down menu.

    To learn more, see Configuring workflow inputs: sets and pairs tables.
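
The three formats above (single entity, set of entities, and tumor/normal pair) can be summarized in a short sketch. The helper names are hypothetical and the entity and column names are only examples; the sketch simply assembles the strings in the patterns shown in this section.

```python
def single_entity_ref(column: str) -> str:
    """Single entity: this.<column>, e.g. this.CRAM."""
    return f"this.{column}"


def set_member_ref(entity: str, column: str) -> str:
    """Set of entities: this.<entity>s.<column> (the entity name plus "s")."""
    return f"this.{entity}s.{column}"


def pair_ref(role: str, column: str) -> str:
    """Tumor/normal pair: this.case_sample.<column> or this.normal_sample.<column>."""
    if role not in ("case_sample", "normal_sample"):
        raise ValueError("role must be 'case_sample' or 'normal_sample'")
    return f"this.{role}.{column}"


print(single_entity_ref("CRAM"))
print(set_member_ref("specimen", "r1_fastq"))  # hypothetical column name
print(pair_ref("case_sample", "bam"))
```

Because the set and pair formats do not appear in the drop-down menu, they need to be typed by hand, which is where a typo is most likely.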

If you don't see the right input in the drop-down menu, check your root entity type to make sure you specified the right table. This can be tricky if you are using interconnected (nested) tables!

For example, if you run multiple workflows in parallel on a group in a specimen_set table, the entity type is specimen. You only use the specimen_set to choose what specimens to process. It's not where the input files are, so it's not the root entity type! 
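
The effect of the root entity type on the number of jobs can be modeled with a few lines of code. This is a simplified mental model based on the behavior described in this article (one job per set member when the root entity is the single entity, one job per set when the root entity is the set); it is not how Terra is implemented, and the set and specimen names are hypothetical.

```python
from typing import Dict, List


def jobs_submitted(root_entity_type: str, selected_sets: Dict[str, List[str]]) -> int:
    """Estimate how many jobs a submission launches (illustrative model only)."""
    if root_entity_type.endswith("_set"):
        # The workflow takes an array as input: one job per selected set.
        return len(selected_sets)
    # The workflow runs on single entities: one job per distinct member.
    members = {m for member_list in selected_sets.values() for m in member_list}
    return len(members)


# Hypothetical specimen_set with three specimens.
selection = {"human_all": ["specimen_1", "specimen_2", "specimen_3"]}
print(jobs_submitted("specimen", selection))
print(jobs_submitted("specimen_set", selection))
```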

  • If you import data from the Terra Data Repository, Gen3, or another repository, take note of the formatting in the data table.

    If you use data with a pfb or tdr prefix, you must include the prefix in the attribute field. For example, the table shown below contains column names that start with pfb:

    Screenshot of sample table with the column header pfb:cram_drs_uri in the second column containing links to the input data

    When specifying inputs from this data table in a workflow, the formatting should be this.pfb:COLUMN_NAME. Note: The proper format should show up in the drop-down menu.
    Screenshot of the inputs tab of the workflow configuration card with the attribute this.pfb:cram_drs_uri circled from the dropdown for the variable input_cram_file
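
A quick way to catch a dropped prefix is to compare the attribute against the table's column headers. The sketch below assumes a simple this.<column> reference into a single table; the function name and the second column header are hypothetical.

```python
from typing import Iterable


def matches_table_column(expression: str, column_headers: Iterable[str]) -> bool:
    """Check that a this.<column> attribute names an existing column header.

    Illustrative helper only; the prefix (such as pfb:) is part of the header.
    """
    if not expression.startswith("this."):
        return False
    column = expression[len("this."):]
    return column in set(column_headers)


# pfb:cram_drs_uri is the header from the example table; pfb:sample_id is made up.
headers = ["pfb:cram_drs_uri", "pfb:sample_id"]
print(matches_table_column("this.pfb:cram_drs_uri", headers))  # prefix included
print(matches_table_column("this.cram_drs_uri", headers))      # prefix dropped
```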

Data in interdependent tables require more complex formatting. If your desired input is a single file, the syntax points directly at the file. If your desired input is a set of files nested inside a folder, the syntax must first point to the correct folder, then point to the desired files within. Looking at the Type and Attributes columns is a quick way to check how your workflow is set up.

Some workflows may require additional types of input

You may need to select one of several possible analysis options in the case of branched workflows, or you may have the opportunity to specify runtime options like the amount of memory and disk space provided to each task. You will find fields for these options, if available, in the Inputs section of the configuration form.

Isn't there an easier way? Yes! Use a JSON parameter file

Typing in every attribute by hand is tedious and error-prone. JSON parameter files can vastly simplify the process. To learn how to use a JSON file to configure inputs so you don't have to do it manually, see Getting workflows up and running faster with a JSON file. It's especially useful if you anticipate running the same configuration many times over.

One nice aspect of how Terra manages workflows and their configurations is that it allows you to export your workflow config (JSON) back to the Method Repository and share it with others. Conversely, you can import any published workflow config to your own workspace. That can take a lot of guesswork out of configuring someone else's workflows to run on your own data.
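
As a rough illustration of what a JSON parameter file might contain, the sketch below builds one from a Python dictionary. It assumes that each key has the form WorkflowName.variable_name and that each value is the same text you would type into the attribute field; check a JSON downloaded from your own configuration form for the exact layout. MyWorkflow, disk_size_gb, docker_image, and the image name are hypothetical.

```python
import json

# Assumed layout: "WorkflowName.variable" -> attribute text as typed in the form.
inputs = {
    "MyWorkflow.InputCram": "this.CRAM",               # column in the root entity table
    "MyWorkflow.RefDict": "workspace.ref_dict",        # workspace-level resource
    "MyWorkflow.disk_size_gb": "50",                   # integer: no quotes
    "MyWorkflow.docker_image": '"my-registry/my-image:1.0"',  # string: quoted
}


def to_inputs_json(values: dict) -> str:
    """Serialize the inputs mapping as pretty-printed JSON text."""
    return json.dumps(values, indent=2, sort_keys=True)


print(to_inputs_json(inputs))
```

Keeping the configuration in a file like this also makes it easier to review changes between runs than retyping attributes in the form.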

Next steps: Video and tutorial workflow resources 

Hands-on practice setting up and running a workflow analysis

To practice setting up and running workflows, try the T101 Workflows QuickStart workspace. It should take about half an hour to complete the hands-on tutorial and cost less than a dime.

(Note: To run the exercises, you need to clone the workspace under your own billing project.) 
