This article explains how to set up (configure) and run a workflow using inputs from a table.
- For hands-on practice setting up and running a workflow analysis, try the T101 Workflows QuickStart tutorial. It should take about half an hour to complete the tutorial exercises and cost less than a dime (Google Cloud costs).
- To learn how to automate some of this setup step by using a JSON file (especially useful if you anticipate using similar configurations many times), see Getting workflows up and running faster with a JSON file.
- To learn how to configure additional cost-saving options in Terra, see Workflow setup: virtual machine (VM) options.
Workflow setup: Inputs overview
To run a workflow on Terra, you need to specify all required workflow input variables. This article walks through the setup process when using inputs stored in the data table.
To learn how to set up inputs with direct paths (the URL for the data file in the cloud), see Using full paths (direct links) for file inputs at the bottom of this article.
Inputs options
Inputs include input data file names and locations as well as reference files and compute parameters.
WDL workflows require special configuration files that tell them what parameters to use for different inputs, like file paths (URLs for cloud data), strings, etc. These configuration files are in a special format called JSON. See Getting workflows up and running faster with a JSON file.
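As a rough illustration of what such a JSON configuration holds, here is a minimal Python sketch that builds one and prints it. The workflow name `MyWorkflow`, the input names, and the bucket path are hypothetical placeholders, not taken from any real workflow. The sketch assumes only that each key names a workflow input and each value is the parameter to use for it.

```python
import json

# Hypothetical inputs for a hypothetical workflow called "MyWorkflow".
# Each key names an input variable; each value is the parameter to use.
inputs = {
    "MyWorkflow.input_cram": "gs://my-bucket/sample1.cram",  # file path (URL for cloud data)
    "MyWorkflow.sample_name": "sample1",                     # string
    "MyWorkflow.disk_size_gb": 50,                           # integer
}

# Serialize to JSON text, which is what would be saved as the configuration file.
inputs_json = json.dumps(inputs, indent=2, sort_keys=True)
print(inputs_json)
```

Reading the text back with `json.loads` returns the same dictionary, so the file can be edited by hand or generated by a script.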
Why use tables for workflow inputs?
Using data tables for inputs makes it easy to scale and automate your analysis.
With data tables, you can reference multiple files without hard coding values or having to adjust your workflow configuration when you add more data to the table.
Data tables keep all associated data together
Data files are connected no matter where in the cloud they reside, even data from different sources, including generated data.
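To make the scaling point concrete, here is a small Python sketch that models a data table as a list of rows. The table contents, column names, and bucket paths are all hypothetical. It is not how Terra stores tables. It only illustrates why a configuration that names a column, rather than a specific file, does not need to change when you add data.

```python
# A data table modeled as a list of rows; column names and paths are hypothetical.
sample_table = [
    {"sample_id": "s1", "cram": "gs://bucket-a/s1.cram"},
    {"sample_id": "s2", "cram": "gs://bucket-b/s2.cram"},
]

# The configuration names a column, not a specific file.
config = {"input_cram": "cram"}


def inputs_for_rows(table, config):
    """Return one dict of workflow inputs per table row."""
    return [
        {variable: row[column] for variable, column in config.items()}
        for row in table
    ]


before = inputs_for_rows(sample_table, config)

# Adding data means adding a row; the configuration is untouched.
sample_table.append({"sample_id": "s3", "cram": "gs://bucket-a/s3.cram"})
after = inputs_for_rows(sample_table, config)

print(len(before), len(after))
```

Note that the rows point at files in two different buckets, which mirrors the point that files stay connected no matter where in the cloud they reside.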
When NOT to use a data table
Although we generally recommend using data tables, you may not want to if you cannot fit your data into a table in a way that makes sense for your analysis, or if you want to test a new method in Terra quickly, with as little setup as possible.
To learn more about setting up a data table, see Organizing data with workspace tables.
Step 1: Select data
Note: These instructions are for running a workflow with inputs from the data table. For details of how to set up inputs with direct paths, scroll down to the bottom of the article.
You can start by selecting data (on the Data page) or by selecting the workflow (on the Workflows page). The steps for each starting point are listed in turn below.
1.1. Go to the Data tab.
1.2. Select the table that includes links to the input data files.
1.3. Select the rows with the entities to analyze.
1.4. Click Open with.
1.5. Choose Workflows.
1.6. Choose the workflow to run to expose the workflow configuration form.
1.1. Go to the Workflows page.
1.2. Click the name of the workflow from the available cards to expose the configuration form.
1.3. Select the root entity type from the Step 1 dropdown. The root entity type is the table that contains the inputs required by the workflow.
Selecting the right root entity table
Note: The dropdown includes all the tables in your workspace. If you have more than one table and don't know which one is right, see Selecting the root entity type or When to use a set table for workflow inputs for guidance.
1.4. Click the blue Step 2 Select Data button in the configuration form to select the specific entities to analyze.
1.5. Follow the prompts in the form to select the data to analyze. See screenshots of what to expect for different use cases below.
Single entities (samples, specimens etc.)
If your workflow runs on a single entity, you can process all rows of data or choose specific rows to process. If you analyze more than one data file, Terra creates a set of those inputs and you can name the set containing those particular entities:
After making your selection, click OK at the bottom of the form.
A group of single entities
If your workflow runs on a single entity, but you are running on a set of single entities, the root entity type is still the single entity table. You can choose to run on all entities, pick specific entities to run on, or choose to run on the entities in a predefined set. Terra will submit as many jobs as there are members in the set to run in parallel. Your Select Data form will look like this (assuming you have some subsets already defined as entity_set tables).
If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. See When to use sets as inputs to a workflow for more information. Your Select Data form will look like this.
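The difference between the two cases above comes down to how many jobs are submitted. Here is a minimal Python sketch of that logic. The function name `plan_jobs` and the specimen IDs are hypothetical, and the sketch models only the grouping described above, not Terra's actual submission code.

```python
def plan_jobs(selected_entities, workflow_takes_array):
    """Group selected entity IDs into the inputs of each job that would be submitted."""
    if workflow_takes_array:
        # Root entity type is the set table: one job receives the whole array.
        return [list(selected_entities)]
    # Root entity type is the single entity table: one job per entity, run in parallel.
    return [[entity] for entity in selected_entities]


specimens = ["specimen_1", "specimen_2", "specimen_3"]

per_entity_jobs = plan_jobs(specimens, workflow_takes_array=False)
array_jobs = plan_jobs(specimens, workflow_takes_array=True)

print(len(per_entity_jobs))  # three jobs, one per specimen
print(len(array_jobs))       # one job that takes all three specimens
```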
Tumor/normal pairs (somatic workflows)
You can choose exactly which tumor-normal pairs you want to analyze in the Select Data form:
Note: If you select more than one pair, you can name the new set that Terra will automatically generate. Terra will analyze the selected tumor/normal pairs in parallel and will create a pair_set that includes those selected pairs.
1.7. Click the blue "OK" button at the bottom right to confirm your data selection.
Verify data selection
Beside the blue "Select Data" button, you should see the data you selected. Click the appropriate tab below to see a screenshot of what to expect.
If your workflow runs on a single entity (or several single entities), your form will look like this when you've selected the data.
Note: If you run on more than one data file, Terra creates a set of those particular entities by default. It names the set "workflow-name" + "run date". To give the set a more meaningful name, start your analysis from the workflow configuration card.
If your workflow runs on single entities and you run a set of single entities, the root entity type is still the single entity. Terra will submit as many jobs as there are members in the set to run in parallel. Your form will look like this.
If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. In the screenshot below, the workflow is running one job on an array of entities defined in the
If you run a somatic workflow, the (typical) root entity type is pair. In the screenshot below, Terra will run two tumor/normal pairs workflows in parallel and will create a pair_set table that includes those two specific pairs.
Step 2. Specify fixed workflow attributes
Attributes are the integers, strings, or files that correspond to input variables in the workflow. You specify inputs by filling in the Attributes fields for all required variables in the setup form.
2.1. Fill in fixed attributes. These include variables like disk or memory size or Docker image URLs.
Some common attribute formats
Integer - No formatting required
String - Quotes required. e.g., "my string"
Boolean - Quotes required. Case insensitive, so capitalization variants such as "TrUE" are treated the same.
File - This type can be referenced from the Google bucket, data model, or workspace attribute section.
Array[X] - Lists of these attributes can be entered with a comma between each item. e.g.,
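If you generate attribute values with a script, the formatting rules listed above can be sketched as a small Python helper. The function name `format_attribute` is hypothetical and this is not part of Terra. It assumes that array items are simply joined with commas and it does not add any enclosing brackets, so check the result against what the form accepts. File attributes are not handled here because they are references rather than formatted values.

```python
import json


def format_attribute(value):
    """Format a Python value as text for an attribute field (sketch of the listed rules)."""
    if isinstance(value, bool):
        # Boolean: quotes required. Checked before int because bool is a subclass of int.
        return '"true"' if value else '"false"'
    if isinstance(value, int):
        # Integer: no formatting required.
        return str(value)
    if isinstance(value, str):
        # String: quotes required.
        return json.dumps(value)
    if isinstance(value, (list, tuple)):
        # Array[X]: items separated by commas (assumption: no brackets added here).
        return ",".join(format_attribute(item) for item in value)
    raise TypeError(f"unsupported attribute type: {type(value).__name__}")


print(format_attribute(50))
print(format_attribute("my string"))
print(format_attribute(True))
print(format_attribute([1, 2, 3]))
```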
Step 3. Specify inputs from table
3.1. For each variable that comes from a data table (either an entity table or the workspace data table), click into the attribute field.
3.2. Once you click in, you'll see a drop-down menu with all the available options from both the root entity table and the workspace table (i.e., workspace-level resources). Choose the right variable from the dropdown. HINT: look across the row to see what the variable is!
What is in the drop-down menu?
Input data files
Attributes that begin with this. are taken from the table you selected as the "root entity type" in the configuration form. The drop-down menu will list all columns in the root entity table.
Attributes that begin with workspace. are from the workspace data table. The drop-down menu will list all workspace-level resource files in the workspace data table.
In the screenshot below, there are five items in the drop-down menu after clicking into the InputCram (circled) attribute field. Each corresponds to a column in either the root entity or the workspace data table. Scroll down to select the one corresponding to the InputCram variable, this.CRAM.
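One way to think about the two prefixes is as a lookup rule: this. reads a column from the current row of the root entity table, and workspace. reads a key from the workspace data table. Here is a minimal Python sketch of that rule. The function name `resolve_attribute`, the `ref_fasta` key, and the bucket paths are hypothetical, and the sketch handles only the simple one-level form, not nested tables.

```python
def resolve_attribute(expression, row, workspace_data):
    """Look up the value that a 'this.' or 'workspace.' attribute refers to."""
    prefix, _, key = expression.partition(".")
    if not key:
        raise ValueError(f"expected 'this.<column>' or 'workspace.<key>', got {expression!r}")
    if prefix == "this":
        # Column in the current row of the root entity table.
        return row[key]
    if prefix == "workspace":
        # Workspace-level resource in the workspace data table.
        return workspace_data[key]
    raise ValueError(f"unknown prefix {prefix!r} in {expression!r}")


# Hypothetical row from the root entity table and hypothetical workspace data.
row = {"CRAM": "gs://example-bucket/sample1.cram"}
workspace_data = {"ref_fasta": "gs://example-bucket/reference.fasta"}

print(resolve_attribute("this.CRAM", row, workspace_data))
print(resolve_attribute("workspace.ref_fasta", row, workspace_data))
```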
Required format (must be typed in exactly like the example)
If your workflow runs on an array of entities, the format is slightly different!
Note: This option will not show up automatically in the drop-down menu.
To learn more, see Configuring workflow inputs: sets and pairs tables.
Required format (must be typed in exactly as shown)
If you run a somatic workflow, the format is slightly different.
Note: This option will not show up automatically in the drop-down menu.
To learn more, see Configuring workflow inputs: sets and pairs tables.
If you don't see the right input in the drop-down menu, check your root entity type to make sure you specified the right table. This can be tricky if you are using interconnected (nested) tables!
For example, if you run multiple workflows in parallel on a group in a specimen_set table, the entity type is specimen. You only use the specimen_set to choose what specimens to process. It's not where the input files are, so it's not the root entity type!
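To see why the wrong root entity type hides the input you are looking for, here is a small Python sketch that builds the drop-down options from the columns of the chosen table. The function name `dropdown_options` and all column names (including the `specimens` column on the set table) are hypothetical. The sketch assumes only what is described above: the menu lists the root entity table's columns with a this. prefix plus the workspace-level entries with a workspace. prefix.

```python
def dropdown_options(root_entity_columns, workspace_keys):
    """List the attribute options offered for a given root entity table."""
    return (
        [f"this.{column}" for column in root_entity_columns]
        + [f"workspace.{key}" for key in workspace_keys]
    )


# Hypothetical tables: the input files live on specimen, not on specimen_set.
tables = {
    "specimen": ["cram", "cram_index"],
    "specimen_set": ["specimens"],
}
workspace_keys = ["ref_fasta"]

right = dropdown_options(tables["specimen"], workspace_keys)
wrong = dropdown_options(tables["specimen_set"], workspace_keys)

print("this.cram" in right)  # the input shows up with specimen as root entity type
print("this.cram" in wrong)  # it is missing with specimen_set as root entity type
```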
Working with data from a data repository
If you import data from the Terra Data Repository, Gen3, or another repository, take note of the formatting in the data table.
If you use data with a pfb or tdr prefix
You must include the prefix in the attribute field. Note: The proper format should show up in the drop-down menu.
Why use this formatting? This formatting gives you the flexibility to reference any entity, including using nested tables.
Note: Data in interdependent tables require more complex formatting. If your desired input is a single file, the syntax points directly at the file. If your desired input is a set of files nested inside a folder, the syntax must first point to the correct folder, then point to the desired files within. Looking at the Type and Attributes columns is a quick way to check how your workflow is set up.
Why use workspace data tables?
Storing a file as a workspace attribute in the Workspace data table is convenient if you use it over and over again in multiple workflows. If the file path changes, you only have one place to update, similar to global variables in scripting. Workspace attributes are specified by the format workspace. plus the attribute key (i.e.
Some workflows may require additional types of input
You may need to select one of several possible analysis options in the case of branched workflows, or you may have the opportunity to specify runtime options like the amount of memory and disk space provided to each task. You will find fields for these options, if available, in the Inputs section of the configuration form.
Isn't there an easier way? Yes! Use JSONs
It's tedious (not to mention error-prone) to type in every attribute by hand. JSON files can vastly simplify the process. To learn how to use a JSON file to configure inputs so you don't have to do it manually, see Getting workflows up and running faster with a JSON file. It's especially useful if you anticipate running the same configuration many times over.
One nice aspect of how Terra manages workflows and their configurations is that it allows you to export your workflow config (JSON) back to the Method Repository and share it with others. Conversely, you can import any published workflow config to your own workspace. That can take a lot of guesswork out of configuring someone else's workflows to run on your own data.
Using full paths (direct links) for file inputs
If you use file paths for your input data, you can enter the full path directly into the attribute field. Your configuration form will look like this.
Format when using direct paths as inputs
Use "gs://url-to-file-in-bucket" to reference a file in a Google bucket directly.
Formatting requirement - The quotes are necessary if you directly reference a file URL.
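If you prepare direct paths in a script before pasting them into the form, a small helper can add the required quotes and catch obvious mistakes. Here is a minimal Python sketch. The function name `quote_direct_path` is hypothetical, and the only checks it makes are the ones described above: the value is a gs:// URL and it is wrapped in double quotes.

```python
def quote_direct_path(path):
    """Wrap a Google bucket URL in double quotes for use in an attribute field."""
    # Tolerate surrounding whitespace and quotes that are already present.
    cleaned = path.strip().strip('"')
    if not cleaned.startswith("gs://"):
        raise ValueError(f"expected a gs:// URL, got {path!r}")
    return f'"{cleaned}"'


print(quote_direct_path("gs://url-to-file-in-bucket"))
```

Passing a value that is already quoted returns it unchanged, so the helper is safe to apply more than once.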
Video and tutorial workflow resources
Hands-on practice setting up and running a workflow analysis
To practice setting up and running workflows, try the T101 Workflows QuickStart workspace. It should take about half an hour to complete the hands-on tutorial and cost less than a dime (Google Cloud costs).
(Note: To run the exercises, you need to clone the workspace under your own billing project.)