This article explains how to set up (configure) and run a workflow using inputs from a table.
- To learn how to automate part of this setup with a JSON file (especially useful if you expect to reuse much the same configuration many times), see Getting workflows up and running faster with a JSON file.
- To learn how to configure additional cost-saving options in Terra, see Workflow setup: Runtime options.
Workflow setup: Inputs overview
To run a workflow on Terra, you will need to specify all required workflow input variables. This article walks through the setup process using the data table for inputs. To learn how to set up inputs with direct paths (the URL for the data file in the cloud), see Using full paths (direct links) for file inputs at the bottom of this article.
Inputs options
Inputs include data file names and locations as well as reference files and compute parameters.
WDL workflows require special configuration files that tell them what parameters to use for different inputs, like file paths (URLs for cloud data), strings, etc. These configuration files are in a special format called JSON. See Getting workflows up and running faster with a JSON file.
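As a rough illustration of what such a configuration file holds, the sketch below builds an inputs JSON for a hypothetical workflow. The workflow name (CramToBam), the RefFasta, DiskSizeGb, and OutputPrefix inputs, and the ref_fasta workspace key are assumptions made up for this example. The point is only the general shape: one key per workflow input, mapped either to a table reference or to a fixed value.

```python
import json

# Hypothetical workflow name and inputs, used only to show the overall shape.
inputs = {
    "CramToBam.InputCram": "this.CRAM",           # column in the root entity table
    "CramToBam.RefFasta": "workspace.ref_fasta",  # workspace-level file (assumed key)
    "CramToBam.DiskSizeGb": 100,                  # fixed integer value
    "CramToBam.OutputPrefix": "my string",        # fixed string value
}

# Serialize to JSON text, the format used for workflow input configurations.
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

How fixed values should be written in a real file may differ from this sketch, so compare against a JSON downloaded from your own workspace before relying on it.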
Why use tables for workflow inputs
Using data tables for inputs makes it easy to scale and automate your analysis. With data tables, you can reference multiple files without hard coding values or having to adjust your workflow configuration when you add more data to the table.
Data tables keep all associated data together
Data files are connected no matter where in the cloud they reside, even data from different sources, including generated data.
When NOT to use a data table
Although we recommend using data tables, there are situations where you may not want to: if you cannot fit your data into a data table in a way that makes sense for your analysis, or if you want to test a new method in Terra quickly, with as little setup as possible.
To learn more about setting up a data table, see Organizing data with workspace tables.
Step 1: Select data
Note that these instructions are for running a workflow with inputs from the data table. For details of how to set up inputs with direct paths, scroll down to the bottom of the article. You can start by selecting data (on the Data page) or by selecting the tool (on the Workflows page).
1.1. Go to the Data page.
1.2. Select the table that includes links to the input data files.
1.3. Select the rows with the entities to analyze.
1.4. Click the three vertical dots in the blue circle (top right).
1.5. Choose Open with > Workflows.
1.6. Choose the workflow to run to expose the workflow configuration form.
1.1. Go to the Workflows page.
1.2. Click the name of the workflow from the available cards to expose the configuration form.
1.3. Select the root entity type from the Step 1 dropdown. The root entity type is the table that contains the inputs required by the workflow.
Selecting the right root entity table
Note that the dropdown includes all the tables in your workspace. If you have more than one table and don't know which one is the right one, see Selecting the root entity type or When to use a set table for workflow inputs for guidance.
1.5. Click the blue Step 2 Select Data button in the configuration form to select the specific entities to analyze.
1.6. Follow the prompts in the form to select the data to analyze. The screenshots below show what to expect for different use cases.
Single entities (samples, specimens etc.)
If your workflow will run on a single entity, you can process all rows of data or choose specific rows to process. If you analyze more than one data file, Terra will create a set of those inputs and you'll be able to name the set containing those particular entities:
After making your selection, make sure to click OK at the bottom of the form.
A group of single entities
If your workflow will run on a single entity, but you are running on a set of single entities, the root entity type is still the single entity table. You can choose to run on all entities, pick specific entities to run on, or choose to run on the entities in a pre-defined set. Terra will submit as many jobs as there are members in the set to run in parallel. Your Select Data form will look like this (assuming you have some subsets already defined as entity_set tables).
If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. See When to use sets as inputs to a workflow for more information. Your Select Data form will look like this.
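The difference between the two set cases above can be sketched in a few lines. This is a conceptual illustration only, not Terra's actual submission logic; the function name plan_jobs and the entity IDs are hypothetical.

```python
from typing import Dict, List


def plan_jobs(root_is_set: bool, members: List[str]) -> List[Dict[str, List[str]]]:
    """Return one job description per workflow run that would be submitted."""
    if root_is_set:
        # Root entity type is the set table: a single job receives the whole array.
        return [{"entities": list(members)}]
    # Root entity type is the single-entity table: one job per member, run in parallel.
    return [{"entities": [member]} for member in members]


sample_set = ["sample_1", "sample_2", "sample_3"]  # hypothetical entity IDs

per_entity_jobs = plan_jobs(False, sample_set)  # three jobs, one entity each
array_jobs = plan_jobs(True, sample_set)        # one job holding all three entities
print(len(per_entity_jobs), len(array_jobs))
```

The same set of entities therefore yields either many parallel jobs or one job, depending only on which table is chosen as the root entity type.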
Tumor/normal pairs (somatic workflows)
You can choose exactly which tumor-normal pairs you want to analyze in the Select Data form:
Note that if you select more than one pair, you can name the new set that Terra will automatically generate. Terra will analyze the selected tumor/normal pairs in parallel and will create a pair_set that includes those selected pairs.
1.7. Click the blue "OK" button at the bottom right to confirm your data selection.
Verify data selection
Beside the blue "Select Data" button you should see the data you have selected. Click the appropriate tab below to see a screenshot of what to expect.
If your workflow runs on a single entity (or several single entities), your form will look like this when you've selected the data.
Note that if you run on more than one data file, Terra will create a set of those particular entities by default. It will name the set "workflow-name" + "run date". To give the set a more meaningful name, you must start your analysis from the workflow configuration card.
If your workflow runs on single entities and you are running a set of single entities, the root entity type is still the single entity. Terra will submit as many jobs as there are members in the set to run in parallel. Your form will look like this.
If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. In the screenshot below, the workflow is running one job on an array of entities defined in the
If you are running a somatic workflow, the (typical) root entity type is pair. In the screenshot below, Terra will run workflows on two tumor/normal pairs in parallel and will create a pair_set table that includes those two specific pairs.
Step 2. Specify fixed workflow attributes
Attributes are the integers, strings, or files that correspond to input variables in the workflow. You'll specify inputs by filling in the Attributes fields for all required variables in the setup form.
2.1. Fill in fixed attributes. These include variables like disk or memory size or Docker image URLs.
Some common attribute formats
- Integer - No formatting required.
- String - Quotes required, e.g. "my string"
- Boolean - Quotes required. Case insensitive, so variants such as "TrUE" are treated the same.
- File - This type can be referenced from the Google bucket, data model, or workspace attribute section.
- Array[X] - Lists of these attributes can be entered with a comma between each item.
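The Integer, String, and Boolean rules above can be summarized as a small helper that turns a Python value into the text you would type into an attribute field. This is a minimal sketch; format_attribute is a hypothetical name, and it deliberately covers only those three types.

```python
def format_attribute(value):
    """Render a Python value as text for an attribute field (sketch only)."""
    # Check bool before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return f'"{str(value).lower()}"'  # Boolean: quotes required
    if isinstance(value, int):
        return str(value)                 # Integer: no formatting required
    if isinstance(value, str):
        return f'"{value}"'               # String: quotes required
    raise TypeError(f"unsupported attribute type: {type(value).__name__}")


print(format_attribute(50))
print(format_attribute("my string"))
print(format_attribute(True))
```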
Step 3. Specify inputs from table
3.1. For each variable that comes from a data table (either an entity table or the workspace data table), click into the attribute field.
3.2. Once you click in, you'll see a dropdown with all the available options from both the root entity table and the workspace table (i.e. workspace-level resources). Choose the right variable from the dropdown. HINT: look across the row to see what the variable is!
What is in the dropdown?
Input data files
Attributes that begin with this. are taken from the table you selected as the "root entity type" in the configuration form. The dropdown menu will list all columns in the root entity table.
Attributes that begin with workspace. are from the workspace data table. The dropdown menu will list all workspace-level resource files in the workspace data table.
In the screenshot below, there are five items in the dropdown after clicking into the InputCram (circled) attribute field. Each corresponds to a column in either the root entity or the workspace data table. Scroll down to select the one corresponding to the InputCram variable, this.CRAM.
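To make the two prefixes concrete, the sketch below shows how a this. or workspace. reference could be looked up. It illustrates the idea only and is not how Terra resolves attributes internally; the resolve function, the example row, the ref_fasta key, and the bucket paths are all hypothetical.

```python
def resolve(expression, row, workspace_data):
    """Look up an attribute reference in a table row or in workspace data."""
    if expression.startswith("this."):
        # "this." refers to a column of the current row in the root entity table.
        return row[expression[len("this."):]]
    if expression.startswith("workspace."):
        # "workspace." refers to a key in the workspace data table.
        return workspace_data[expression[len("workspace."):]]
    raise ValueError(f"not a table reference: {expression!r}")


# Hypothetical row of the root entity table and hypothetical workspace data.
row = {"CRAM": "gs://example-bucket/sample_1.cram"}
workspace_data = {"ref_fasta": "gs://example-bucket/reference.fasta"}

print(resolve("this.CRAM", row, workspace_data))
print(resolve("workspace.ref_fasta", row, workspace_data))
```

Because the reference names a column rather than a file, the same configuration applies to every selected row without being edited.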
Required format (must be typed in exactly like the example)
If your workflow runs on an array of entities, the format is slightly different! Note that this option will not show up automatically in the dropdown.
To learn more, see Configuring workflow inputs: sets and pairs tables.
Required format (must be typed in exactly as shown)
If you are running a somatic workflow, the format is slightly different. Note that this option will not show up automatically in the dropdown.
To learn more, see Configuring workflow inputs: sets and pairs tables.
If you don't see the right input in the dropdown, check your root entity type to make sure you specified the right table. This can be tricky if you are using interconnected (nested) tables!
For example, if you're running multiple workflows in parallel on a group in a specimen_set table, the entity type is specimen. You only use the specimen_set to choose what specimens to process. It's not where the input files are, so it's not the root entity type!
Working with data from a data repository
If you are importing data from the Terra Data Repository, Gen3, or another repository, take note of the formatting in the data table.
If you are using data with a pfb or tdr prefix
You must include the prefix in the attribute field. Note that the proper format should show up in the dropdown menu.
Why use this formatting? This formatting gives you the flexibility to reference any entity, including using nested tables.
Note that data in interdependent tables will require more complex formatting. If your desired input is a single file, the syntax simply points directly at the file. If your desired input is a set of files nested inside a folder, the syntax must first point to the correct folder, then point to the desired files within it. Looking at the Type and Attributes columns is a quick way to check how your workflow is set up.
Why use workspace data tables?
Storing a file as a workspace attribute in the Workspace data table is convenient if you use it over and over again in multiple workflows. If the file path changes, you only have one place to update, similar to global variables in scripting. Workspace attributes are referenced using the format workspace. plus the attribute key.
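The "global variable" analogy can be sketched as follows: two configurations refer to the same workspace key, so updating the path once changes what both resolve to. The workflow names, the ref_fasta key, the bucket paths, and the lookup helper are hypothetical and exist only for this illustration.

```python
# Hypothetical workspace data table with a single reference file.
workspace_data = {"ref_fasta": "gs://example-bucket/ref_v1.fasta"}

# Two hypothetical workflow configurations that share the same reference.
config_a = {"AlignReads.RefFasta": "workspace.ref_fasta"}
config_b = {"CallVariants.RefFasta": "workspace.ref_fasta"}


def lookup(expression, data):
    """Return the workspace value that a 'workspace.<key>' reference points to."""
    prefix = "workspace."
    if not expression.startswith(prefix):
        raise ValueError(f"not a workspace reference: {expression!r}")
    return data[expression[len(prefix):]]


# The file moves: update the path in exactly one place.
workspace_data["ref_fasta"] = "gs://example-bucket/ref_v2.fasta"

resolved = [
    lookup(expression, workspace_data)
    for config in (config_a, config_b)
    for expression in config.values()
]
print(resolved)
```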
Some workflows may require additional types of input
You may need to select one of several possible analysis options in the case of branched workflows, or you may have the opportunity to specify runtime options like the amount of memory and disk space provided to each task, for example. You will find fields for these options, if available, in the Inputs section of the configuration form.
Isn't there an easier way? Yes! Use JSONs
It's tedious (not to mention error-prone) to type in every attribute by hand. JSON files can vastly simplify the process. To learn how to use a JSON file to configure inputs so you don't have to do it manually, see Getting workflows up and running faster with a JSON file. It's especially useful if you anticipate running much the same configuration many times over.
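One way to picture the benefit of reusing much the same configuration: keep a base set of inputs and change only what differs between runs. The sketch below assumes a hypothetical workflow (CramToBam) with a hypothetical DiskSizeGb input; with_overrides is an illustrative helper, not part of any Terra tooling.

```python
import json

# Hypothetical base configuration shared by every run.
base_config = {
    "CramToBam.InputCram": "this.CRAM",
    "CramToBam.DiskSizeGb": 100,
}


def with_overrides(base, overrides):
    """Return a new configuration; the base dictionary is left unchanged."""
    merged = dict(base)
    merged.update(overrides)
    return merged


# Same configuration, but with more disk for a run on larger files.
big_disk_config = with_overrides(base_config, {"CramToBam.DiskSizeGb": 500})
print(json.dumps(big_disk_config, indent=2))
```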
One nice aspect of how Terra manages workflows and their configurations is that it allows you to export your workflow config (JSON) back to the Method Repository and share it with others. Conversely, you can import any published workflow config to your own workspace. That can take a lot of the guesswork out of configuring someone else's workflows to run on your own data.
Using full paths (direct links) for file inputs
If you are using file paths for your input data, you can enter the full path directly into the attribute field. Your configuration form will look like this.
Format when using direct paths as inputs
Use "gs://url-to-file-in-bucket" to reference a file in a Google bucket directly.
Formatting requirement - The quotes are necessary if you are directly referencing a file URL.
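The quoting rule is easy to get wrong when pasting paths by hand. The helper below is a minimal sketch that wraps a bucket path in the required quotes and rejects anything that does not start with gs://. The function name format_direct_path and the example path are hypothetical.

```python
def format_direct_path(url: str) -> str:
    """Wrap a Google bucket path in the double quotes the attribute field expects."""
    if not url.startswith("gs://"):
        raise ValueError(f"expected a gs:// path, got {url!r}")
    return f'"{url}"'


# Hypothetical bucket path, for illustration only.
print(format_direct_path("gs://example-bucket/sample_1.cram"))
```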
Video and tutorial workflow resources
Hands-on practice setting up and running a workflow analysis
To practice setting up and running workflows, work through the Terra-Workflows-QuickStart workspace. It should take about half an hour to complete the hands-on tutorial and cost less than a dime (GCP costs).
(Note that to run the exercises you will need to clone the workspace under your own billing project.)