Workflow setup: Configuring inputs

Allie Hajian

This article explains how to set up (configure) and run a workflow using inputs from a table.

Workflow setup: Inputs overview

To run a workflow on Terra, you need to specify all required workflow input variables. This article walks through the setup process using a data table for inputs. To learn how to set up inputs with direct paths (the URL of the data file in the cloud), see the section "Using full paths (direct links) for file inputs" at the end of this article.

Inputs options
Inputs include data file names and locations, reference files, and compute parameters.

Parameter files 
WDL workflows require special configuration files that tell them what parameters to use for different inputs, like file paths (URLs for cloud data), strings, etc. These configuration files are in a special format called JSON. See Getting workflows up and running faster with a JSON file.
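
As a rough illustration, a WDL inputs JSON is a flat object that maps fully qualified input names (workflow name, a dot, then the input name) to values. The Python sketch below builds one. The workflow name CramToBam, its input names, and the bucket paths are hypothetical placeholders, and the way Terra expects table references to be written inside the JSON is covered in the linked article rather than here.

```python
import json


def build_inputs_json(workflow_name, values):
    """Return JSON text mapping '<workflow>.<input>' names to values."""
    qualified = {f"{workflow_name}.{name}": value for name, value in values.items()}
    return json.dumps(qualified, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical workflow and inputs, for illustration only.
    example = build_inputs_json(
        "CramToBam",
        {
            "input_cram": "gs://my-bucket/sample1.cram",
            "ref_fasta": "gs://my-bucket/reference.fasta",
            "disk_size_gb": 50,
        },
    )
    print(example)
```

Keeping the input values in one mapping makes it easier to reuse the same configuration across runs instead of retyping each attribute.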

Why use tables for workflow inputs

Using data tables for inputs makes it easy to scale and automate your analysis. With data tables, you can reference multiple files without hard coding values or having to adjust your workflow configuration when you add more data to the table.

Data tables keep all associated data together
A data table links to data files wherever they reside in the cloud, so data from different sources, including data generated by your analyses, stay connected.

When NOT to use a data table
Although we recommend using data tables, there are situations where you may not want to: if you cannot fit your data into a data table in a way that makes sense for your analysis, or if you want to test a new method in Terra quickly, with as little setup as possible.

To learn more about setting up a data table, see Organizing data with workspace tables.

Step 1: Select data

Note that these instructions are for running a workflow with inputs from the data table. For details on setting up inputs with direct paths, see "Using full paths (direct links) for file inputs" at the end of this article. You can start either by selecting data (on the Data page) or by selecting the workflow (on the Workflows page).

  • 1.1. Go to the Data page.

    1.2. Select the table that includes links to the input data files.

    1.3. Select the rows with the entities to analyze.

    1.4. Click the three vertical dots in the blue circle (top right).

    1.5. Choose Open with > Workflows.

    1.6. Choose the workflow to run to expose the workflow configuration form.

    workflow-config-form-from-Data-page.gif

  • 1.1. Go to the Workflows page.

    1.2. Click the name of the workflow from the available cards to expose the configuration form.
    Workflow-config-form-from-Workflows-page.gif

    1.3. Select the root entity type from the Step 1 dropdown. The root entity type is the table that contains the inputs required by the workflow.

    Selecting the right root entity table
    Note that the dropdown includes all the tables in your workspace. If you have more than one table and don't know which one is the right one, see Selecting the root entity type or When to use a set table for workflow inputs for guidance.

    1.5. Click the blue Step 2 Select Data button in the configuration form to select the specific entities to analyze.

    Configure-workflows_Select-data_Screen_shot.png

    1.6. Follow the prompts in the form to select the data to analyze. See the screenshots below for what to expect in different use cases.

    Single entities (samples, specimens etc.)
    If your workflow will run on a single entity, you can process all rows of data or choose specific rows to process. If you analyze more than one data file, Terra will create a set of those inputs and you'll be able to name the set containing those particular entities:

    Configure-workflows_Select-data_Specimen_Step-1_Screen_shot.png

    After making your selection, make sure to click OK at the bottom of the form. 

    A group of single entities
    If your workflow runs on a single entity but you want to process a set of single entities, the root entity type is still the single entity table. You can choose to run on all entities, pick specific entities to run on, or run on the entities in a pre-defined set. Terra will submit one job per member of the set and run the jobs in parallel. Your Select Data form will look like this (assuming you have already defined some subsets as entity_set tables).
    Configure-workflows_Select-data-arrays-input_Screen_shot.png

    Arrays
    If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. See When to use sets as inputs to a workflow for more information. Your Select Data form will look like this.

    Configure-workflows_Select-data_Specimens-in-set_Step-1_Screen_shot.png

    Tumor/normal pairs (somatic workflows)
    You can choose exactly which tumor-normal pairs you want to analyze in the Select Data form:

    Configure-workflows_Select-data-from-workflow_Pairs_Screen_shot.png

    Note that if you select more than one pair, you can name the new set that Terra will automatically generate. Terra will analyze the selected tumor/normal pairs in parallel and will create a pair_set that includes those selected pairs.

    1.7. Click the blue "OK" button at the bottom right to confirm your data selection. 

Verify data selection

Beside the blue "Select Data" button you should see the data you have selected. Find the case below that matches your workflow to see a screenshot of what to expect.

  • If your workflow runs on a single entity (or several single entities), your form will look like this when you've selected the data. 

    Configure-workflow_Select-data-specimen-default_Screem_shot.png

    Note that if you run on more than one data file, Terra will create a set of those particular entities by default. It will name the set "workflow-name" + "run date". To give the set a more meaningful name, you must start your analysis from the workflow configuration card. 

  • If your workflow runs on single entities and you are running a set of single entities, the root entity type is still the single entity. Terra will submit as many jobs as there are members in the set to run in parallel. Your form will look like this.

    Configure-workflow_Select-data-specimens-run-set_Screen_shot.png
  • If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. In the screenshot below, the workflow is running one job on an array of entities defined in the specimen_set table.

    Configure-workflow_Select-data-specimen-set_Screen_shot.png
  • If you are running a somatic workflow, the (typical) root entity type is pair. In the screenshot below, Terra will run the workflow on two tumor/normal pairs in parallel and will create a pair_set table that includes those two specific pairs.

    Configure-workflow_Data-table-input-Pairs_Screen_shot.png
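
The difference between the cases above comes down to how many jobs are submitted and what each job receives. The following Python sketch is only a conceptual model of that behavior, not Terra's implementation. The function names and entity IDs are hypothetical.

```python
def jobs_for_single_entities(entity_ids):
    """Root entity type is the single entity table: one job per selected entity."""
    return [[entity_id] for entity_id in entity_ids]


def jobs_for_entity_set(entity_ids):
    """Root entity type is entity_set: one job that receives the whole array."""
    return [list(entity_ids)]


if __name__ == "__main__":
    selected = ["specimen_1", "specimen_2", "specimen_3"]  # hypothetical IDs
    print(len(jobs_for_single_entities(selected)), "parallel jobs")
    print(len(jobs_for_entity_set(selected)), "job on an array of entities")
```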

Step 2: Specify fixed workflow attributes

Attributes are the integers, strings, or files that correspond to input variables in the workflow. You'll specify inputs by filling in the Attributes fields for all required variables in the setup form.

Set-up-workflow_Specify-attributes_Screen_shot.png

2.1. Fill in fixed attributes. These include variables like disk or memory size or Docker image URLs.

Some common attribute formats
Integer - No formatting required.
String - Quotes required, e.g. "my string".
Boolean - Quotes required. Case insensitive, so "true", "TRue", and "TrUE" are the same.
File - This type can be referenced from the Google bucket, data model, or workspace attribute section.
Array[X] - Lists of these attributes can be entered with a comma between each item, e.g. "a","b","c" or 1,2,3 or "true","True","TruE","TRUE".
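
To make the formatting rules above concrete, here is a small Python sketch that turns a Python value into the text you would type into an attribute field. The helper format_attribute is hypothetical (it is not part of Terra), and it covers only literal integers, strings, Booleans, and arrays of those. File references taken from a table are not handled.

```python
def format_attribute(value):
    """Format a literal value following the attribute rules listed above."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return '"true"' if value else '"false"'
    if isinstance(value, int):
        return str(value)  # integers need no formatting
    if isinstance(value, str):
        return f'"{value}"'  # strings need quotes
    if isinstance(value, (list, tuple)):
        # arrays are comma-separated items, each formatted by its own rule
        return ",".join(format_attribute(item) for item in value)
    raise TypeError(f"unsupported attribute type: {type(value).__name__}")


if __name__ == "__main__":
    print(format_attribute("my string"))
    print(format_attribute(["a", "b", "c"]))
    print(format_attribute([1, 2, 3]))
```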

Step 3: Specify inputs from table

3.1. For each variable that comes from a data table (either an entity table or the workspace data table), click into the attribute field.

3.2. Once you click in, you'll see a dropdown with all the available options from both the root entity table and the workspace table (i.e. workspace-level resources). Choose the right variable from the dropdown. HINT: look across the row to see what the variable is!

What is in the dropdown?
Input data files
Attributes that begin with this. are taken from the table you selected as the "root entity type" in the configuration form. The dropdown menu will list all columns in the root entity table. 

Workspace-level resources
Attributes that begin with workspace. are from the workspace data table. The dropdown menu will list all workspace-level resource files in the workspace data table. 
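
One way to picture the two prefixes is as lookups into two different tables. The Python sketch below is a simplified mental model under that assumption, not how Terra resolves attributes internally. The function resolve_attribute and the example column names and paths are hypothetical.

```python
def resolve_attribute(expression, row, workspace):
    """Look up 'this.<column>' in the row or 'workspace.<key>' in workspace data."""
    prefix, _, key = expression.partition(".")
    if prefix == "this":
        return row[key]  # a column of the selected root entity row
    if prefix == "workspace":
        return workspace[key]  # a workspace-level resource
    raise ValueError(f"expected a 'this.' or 'workspace.' prefix: {expression!r}")


if __name__ == "__main__":
    # Hypothetical table row and workspace data.
    row = {"CRAM": "gs://my-bucket/specimen_1.cram"}
    workspace = {"ref_fasta": "gs://my-bucket/reference.fasta"}
    print(resolve_attribute("this.CRAM", row, workspace))
    print(resolve_attribute("workspace.ref_fasta", row, workspace))
```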

  • Format: this.CRAM. In the screenshot below, there are five items in the dropdown after clicking into the InputCram (circled) attribute field. Each corresponds to a column in either the root entity table or the workspace data table. Scroll down to select the one corresponding to the InputCram variable, this.CRAM.

    Configure-workflow_Choose-attribute-from-dropdown_Screen_shot.png



  • Required format (must be typed in exactly like the example)

    this.your-entity+s.your-variable-name

    Configure-workflows_Arrays-as-input_Attributes-field_Screen_shot.png

    If your workflow runs on an array of entities, the format is slightly different! Note that this option will not show up automatically in the dropdown.

    To learn more, see Configuring workflow inputs: sets and pairs tables.

  • Required format (must be typed in exactly as shown)

    this.case_sample.your-variable-name or this.normal_sample.your-variable-name

    Configure-workflows_Pairs-attributes_Screen_shot.png

    If you are running a somatic workflow, the format is slightly different. Note that this option will not show up automatically in the dropdown.

    To learn more, see Configuring workflow inputs: sets and pairs tables.
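
The set and pair formats above add one extra hop: the expression first follows a column that points at other entities, then reads a column from those entities. The Python sketch below models that idea with nested dictionaries. It is an illustration only. The column name specimens assumes a hypothetical specimen entity (following the your-entity+s pattern), and the data values are made up.

```python
def resolve_nested(expression, row):
    """Walk a dotted 'this.' expression through linked rows; lists fan out."""
    segments = expression.split(".")
    if segments[0] != "this":
        raise ValueError(f"expected a 'this.' prefix: {expression!r}")
    current = row
    for segment in segments[1:]:
        if isinstance(current, list):
            # a set column holds many member rows: read the column from each
            current = [member[segment] for member in current]
        else:
            current = current[segment]
    return current


if __name__ == "__main__":
    # Hypothetical set row and pair row with their linked entities inlined.
    set_row = {"specimens": [{"CRAM": "s1.cram"}, {"CRAM": "s2.cram"}]}
    pair_row = {
        "case_sample": {"CRAM": "tumor.cram"},
        "normal_sample": {"CRAM": "normal.cram"},
    }
    print(resolve_nested("this.specimens.CRAM", set_row))
    print(resolve_nested("this.case_sample.CRAM", pair_row))
```

In this model, a set expression yields an array of values (one per member), while a pair expression yields a single value from the linked sample.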

If you don't see the right input in the dropdown, check your root entity type to make sure you specified the right table. This can be tricky if you are using interconnected (nested) tables!

For example, if you're running multiple workflows in parallel on a group in a specimen_set table, the root entity type is specimen. You only use the specimen_set to choose which specimens to process. It's not where the input files are, so it's not the root entity type!

Working with data from a data repository

If you are importing data from the Terra Data Repository, Gen3, or another repository, take note of the formatting in the data table.

If you are using data with a pfb or tdr prefix
Configure-workflows-inputs_pfb-namespace-in-data-table_Screen_shot.png

You must include the prefix in the attribute field. Note that the proper format should show up in the dropdown menu.
Configure-workflow-inputs_pfb-namespace-in-dropdown_Screen_shot.png

Why use this formatting? This formatting gives you the flexibility to reference any entity, including using nested tables.

Note that data in interdependent tables will require more complex formatting. If your desired input is a single file, the syntax simply points directly at the file. If your desired input is a set of files nested inside a folder, the syntax must first point to the correct folder, then point to the desired files within it. Looking at the Type and Attributes columns is a quick way to check how your workflow is set up.

Why use workspace data tables?

Storing a file as a workspace attribute in the Workspace data table is convenient if you use it over and over again in multiple workflows. If the file path changes, you only have one place to update, similar to global variables in scripting. Workspace attributes are referenced with the format workspace. plus the attribute key (e.g. workspace.ref_fasta or workspace.ref_dict).

Some workflows may require additional types of input
You may need to select one of several possible analysis options in the case of branched workflows, or you may have the opportunity to specify runtime options like the amount of memory and disk space provided to each task. You will find fields for these options, if available, in the Inputs section of the configuration form.

 

Isn't there an easier way? Yes! Use JSONs

It's tedious (not to mention error-prone) to type in every attribute by hand. JSON files can vastly simplify the process. To learn how to use a JSON file to configure inputs so you don't have to do it manually, see Getting workflows up and running faster with a JSON file. It's especially useful if you anticipate running much the same configuration many times over.

One nice aspect of how Terra manages workflows and their configurations is that it allows you to export your workflow config (JSON) back to the Method Repository and share it with others. Conversely, you can import any published workflow config to your own workspace. That can take a lot of the guesswork out of configuring someone else's workflows to run on your own data.

Using full paths (direct links) for file inputs

If you are using file paths for your input data, you can enter the full path directly into the attribute field. Your configuration form will look like this.

Set-up-workflow_Input-files-full-paths_Screen_shot.png

Format when using direct paths as inputs
Use "gs://url-to-file-in-bucket" to reference a file in a Google bucket directly.

Formatting requirement - The quotes are necessary if you are directly referencing a file URL.
Workflow_hardcoded_attribute_Screen_Shot.png
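
Because a missing quote is an easy mistake to make, a quick format check before pasting a path can help. The Python sketch below is a hypothetical helper (Terra does not provide it). It only checks the shape of the text, assuming a valid entry is a double-quoted gs:// URL with a bucket and an object name, and it does not confirm that the file exists.

```python
def is_quoted_gs_path(text):
    """Return True if text is a double-quoted gs:// URL with bucket and object."""
    if len(text) < 2 or not (text.startswith('"') and text.endswith('"')):
        return False  # the surrounding quotes are required
    url = text[1:-1]
    if not url.startswith("gs://"):
        return False
    bucket, _, object_name = url[len("gs://"):].partition("/")
    return bool(bucket) and bool(object_name)


if __name__ == "__main__":
    # Hypothetical bucket and file names.
    print(is_quoted_gs_path('"gs://my-bucket/data/sample1.cram"'))
    print(is_quoted_gs_path("gs://my-bucket/data/sample1.cram"))
```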

Video and tutorial workflow resources 

Hands-on practice setting up and running a workflow analysis
To practice setting up and running workflows, work through the Terra-Workflows-QuickStart workspace. It should take about half an hour to complete the hands-on tutorial and cost less than a dime (GCP costs).

(Note that to run the exercises you will need to clone the workspace under your own billing project.) 

  • To learn more about using data tables to organize your data and enable you to scale your analysis, see Managing data with workspace tables.

  • To learn more about how to update workflows to the latest version, see this article.

  • To see a video tutorial on configuring a workflow, click here
