This article explains how to set up (configure) and run a workflow using inputs from a table.
Workflow setup: Inputs overview
To run a workflow on Terra, you need to specify all required workflow input variables. This article walks through the setup process when using inputs stored in the data table.
To learn how to set up inputs with direct paths (the URL for the data file in the cloud), scroll down to the bottom of the article.
Inputs options
You can specify the data that your workflow operates over in two ways:
- Data table columns. In most cases, we recommend specifying your workflow inputs by referring to columns in a Terra data table within the same workspace. Using data tables makes it easy to scale and automate your analysis, because you can reference multiple files without relying on error-prone hard-coded values. You also won't have to adjust your workflow's configuration if you add more data to the table. In addition, data tables keep your workflow inputs and outputs together, no matter where in the cloud they reside.
- Hard-coded file paths. Alternatively, you can hard-code paths to your files' locations in the cloud.
When NOT to use a data table
Although we generally recommend using data tables, you may not want to if:
- you cannot fit your data into the data table in a way that makes sense for your analysis, or
- you want to test a new method in Terra quickly, with as little setup as possible
To learn more about setting up a data table, see Organizing data with workspace tables.
Step 1: Select data
Note: These instructions are for running a workflow with inputs from a data table. For details of how to set up inputs with direct paths, scroll down to the bottom of this section.
You can start by selecting data (on the Data page) or by selecting the tool (on the Workflows page).
- Starting from the Data page:
1.1. Go to the Data tab.
1.2. Select the table that includes links to the input data files.
1.3. Select the rows with the entities to analyze.
1.4. Click Open with.
1.5. Choose Workflows.
1.6. Choose the workflow to run to expose the workflow configuration form.
- Starting from the Workflows page:
1.1. Go to the Workflows page.
1.2. Click the name of the workflow from the available cards to expose the configuration form.
1.3. Select the root entity type from the Step 1 dropdown. The root entity type is the table that contains the inputs required by the workflow.
Selecting the right root entity table
Note: The dropdown includes all the tables in your workspace. If you have more than one table and don't know which one is right, see Selecting the root entity type or When to use a set table for workflow inputs for guidance.
1.4. Click the blue Step 2 Select Data button in the configuration form to select the specific entities to analyze.
1.5. Follow the prompts in the form to select the data to analyze. See screenshots of what to expect for different use cases below.
Single entities (samples, specimens etc.)
If your workflow runs on a single entity, you can process all rows of data or choose specific rows to process. If you analyze more than one data file, Terra creates a set of those inputs and you can name the set containing those particular entities:
After making your selection, click OK at the bottom of the form.
A group of single entities
If your workflow runs on a single entity, but you are running on a set of single entities, the root entity type is still the single entity table. You can choose to run on all entities, pick specific entities to run on, or choose to run on the entities in a predefined set. Terra will submit as many jobs as there are members in the set to run in parallel. Your Select Data form will look like this (assuming you have some subsets already defined as entity_set tables).
Arrays
If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. See When to use sets as inputs to a workflow for more information. Your Select Data form will look like this.
Tumor/normal pairs (somatic workflows)
You can choose exactly which tumor/normal pairs you want to analyze in the Select Data form.
Note: If you select more than one pair, you can name the new set that Terra will automatically generate. Terra will analyze the selected tumor/normal pairs in parallel and will create a pair_set that includes those selected pairs.
1.7. Click the blue "OK" button at the bottom right to confirm your data selection.
Verify data selection
Beside the blue "Select Data" button, you should see the data you selected. Click the appropriate tab below to see a screenshot of what to expect.
- If your workflow runs on a single entity (or several single entities), your form will look like this when you've selected the data.
  Note: If you run on more than one data file, Terra creates a set of those particular entities by default. It names the set "workflow-name" + "run date". To give the set a more meaningful name, start your analysis from the workflow configuration card.
- If your workflow runs on single entities and you run a set of single entities, the root entity type is still the single entity. Terra will submit as many jobs as there are members in the set to run in parallel. Your form will look like this.
- If the workflow accepts an array (set) of entities as input, the root entity type is entity_set. In the screenshot below, the workflow is running one job on an array of entities defined in the specimen_set table.
- If you run a somatic workflow, the (typical) root entity type is pair. In the screenshot below, Terra will run two tumor/normal pair workflows in parallel and will create a pair_set table that includes those two specific pairs.
- If you use file paths for your input data, you can enter the full path directly into the attribute field. Your configuration form will look like this.
Format when using direct paths as inputs
Use "gs://url-to-file-in-bucket" to reference a file in a Google bucket directly.
Formatting requirement - The quotes are necessary if you directly reference a file URL.
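The quoting rule above can be checked before pasting a path into the attribute field. The following is a minimal sketch, not part of Terra: the helper name format_direct_path is hypothetical, and it only assumes what this article states, namely that a direct path is a gs:// URL wrapped in double quotes.

```python
def format_direct_path(url: str) -> str:
    """Wrap a gs:// URL in double quotes, as the attribute field requires.

    Hypothetical helper for illustration; it is not a Terra API.
    """
    # Drop surrounding whitespace and any quotes the caller already added,
    # so the function can be applied to an already-quoted path safely.
    cleaned = url.strip().strip('"')
    if not cleaned.startswith("gs://"):
        raise ValueError(f"expected a gs:// URL, got: {cleaned!r}")
    return f'"{cleaned}"'


# Placeholder path taken from the article's own example.
print(format_direct_path("gs://url-to-file-in-bucket"))
```

The printed value, including the quotes, is what would be typed into the attribute field.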
Step 2. Specify fixed workflow attributes
Attributes are the integers, strings, or files that correspond to input variables in the workflow. You specify inputs by filling in the Attributes fields for all required variables in the setup form.
2.1. Fill in fixed attributes. These include variables like disk or memory size or Docker image URLs.
Some common attribute formats
- Integer - No formatting required.
- String - Quotes required, e.g., "my string".
- Boolean - Quotes required. Case insensitive, so "true", "TRue", and "TrUE" are the same.
- File - This type can be referenced from the Google bucket, data model, or workspace attribute section.
- Array[X] - Lists of these attributes can be entered with a comma between each item, e.g., "a","b","c" or 1,2,3 or "true","True","TruE","TRUE".
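As an illustration of the formats listed above, here is a small sketch that turns Python values into the text you would type into an attribute field. It follows only the rules stated in this article (integers bare, strings and Booleans quoted, array items separated by commas). The function name format_attribute is hypothetical and is not part of Terra.

```python
def format_attribute(value) -> str:
    """Render a Python value the way the article says to type it in the form."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return f'"{str(value).lower()}"'   # Boolean: quoted, case insensitive
    if isinstance(value, int):
        return str(value)                  # Integer: no formatting required
    if isinstance(value, str):
        return f'"{value}"'                # String: quotes required
    if isinstance(value, (list, tuple)):
        # Array[X]: each item formatted by its own rule, joined with commas
        return ",".join(format_attribute(item) for item in value)
    raise TypeError(f"unsupported attribute type: {type(value).__name__}")


print(format_attribute(50))
print(format_attribute("my string"))
print(format_attribute(True))
print(format_attribute(["a", "b", "c"]))
print(format_attribute([1, 2, 3]))
```

File attributes are left out on purpose, because they are entered either as a quoted gs:// path or as a reference to a table column.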
Step 3. Specify flexible workflow attributes from a data table
3.1. For each variable that comes from a data table (either an entity table or the workspace data table), click into the attribute field.
3.2. Once you click in, you'll see a drop-down menu with all the available options from both the root entity table and the workspace table (i.e., workspace-level resources). Choose the right variable from the dropdown. HINT: look across the row to see what the variable is!
What is in the drop-down menu?
Input data files
Attributes that begin with this. are taken from the table you selected as the "root entity type" in the configuration form. The drop-down menu will list all columns in the root entity table.
Workspace-level resources
Attributes that begin with workspace. are from the workspace data table.
Storing a file as a workspace attribute in the workspace data table is convenient if you use it over and over again in multiple workflows. If the file path changes, you only have one place to update, similar to global variables in scripting. Workspace attributes are specified by the format workspace. plus the attribute key (e.g., workspace.ref_fasta or workspace.ref_dict).
The drop-down menu will list all workspace-level resource files in the workspace data table.
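The prefix convention described above can be summarized in a few lines of code. This is only a sketch for reading attribute values. The function name attribute_source is hypothetical, and treating everything else as a fixed value is an assumption based on Step 2 of this article.

```python
def attribute_source(expression: str) -> str:
    """Say where Terra would look up an attribute, judging by its prefix."""
    if expression.startswith("this."):
        # Column of the table chosen as the root entity type
        return "root entity table"
    if expression.startswith("workspace."):
        # Key in the workspace data table (workspace-level resource)
        return "workspace data table"
    # Assumption: anything else is a fixed value typed in directly
    return "fixed value"


for expr in ("this.CRAM", "workspace.ref_fasta", '"gs://url-to-file-in-bucket"'):
    print(expr, "->", attribute_source(expr))
```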
- Format: this.CRAM. In the screenshot below, there are five items in the drop-down menu after clicking into the InputCram (circled) attribute field. Each corresponds to a column in either the root entity or the workspace data table. Scroll down to select the one corresponding to the InputCram variable, this.CRAM.
- If your workflow runs on an array of entities, the format is slightly different!
  Required format (must be typed in exactly like the example): this.your-entity+s.your-variable-name, where your-entity+s is the entity name with an s appended.
  Note: This option will not show up automatically in the drop-down menu.
  To learn more, see Configuring workflow inputs: sets and pairs tables.
- If you run a somatic workflow, the format is slightly different.
  Required format (must be typed in exactly as shown): this.case_sample.your-variable-name or this.normal_sample.your-variable-name
  Note: This option will not show up automatically in the drop-down menu.
  To learn more, see Configuring workflow inputs: sets and pairs tables.
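Because the set and pair formats have to be typed by hand, it can help to generate them rather than write them from memory. The sketch below builds the three formats described above from an entity name and a column name. All function names are hypothetical, and the specimen and CRAM values are reused from this article purely as examples.

```python
def single_entity_expr(column: str) -> str:
    """Format for a workflow that runs on a single entity: this.COLUMN"""
    return f"this.{column}"


def entity_set_expr(entity: str, column: str) -> str:
    """Format for a workflow that takes an array of entities.

    The entity name gets an "s" appended, e.g. specimen -> specimens.
    """
    return f"this.{entity}s.{column}"


def pair_exprs(column: str) -> tuple:
    """Formats for a somatic workflow: (case sample, normal sample)."""
    return (f"this.case_sample.{column}", f"this.normal_sample.{column}")


print(single_entity_expr("CRAM"))
print(entity_set_expr("specimen", "CRAM"))
print(pair_exprs("CRAM"))
```

The generated strings are entered without quotes, since they refer to table columns and are not string values.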
If you don't see the right input in the drop-down menu, check your root entity type to make sure you specified the right table. This can be tricky if you are using interconnected (nested) tables!
For example, if you run multiple workflows in parallel on a group in a specimen_set table, the entity type is specimen. You only use the specimen_set to choose what specimens to process. It's not where the input files are, so it's not the root entity type!
- If you import data from the Terra Data Repository, Gen3, or another repository, take note of the formatting in the data table.
  If you use data with a pfb or tdr prefix, you must include the prefix in the attribute field. For example, the table shown below contains column names that start with pfb:
  When specifying inputs from this data table in a workflow, the formatting should be this.pfb:COLUMN_NAME. Note: The proper format should show up in the drop-down menu.
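The prefixed format can be sketched the same way. This example assumes only what the article states: the prefix is followed by a colon and then the column name. The function name prefixed_column_expr is hypothetical, and COLUMN_NAME is the article's own placeholder.

```python
from typing import Optional


def prefixed_column_expr(column: str, prefix: Optional[str] = None) -> str:
    """Build a this. reference, keeping a repository prefix such as pfb or tdr."""
    if prefix is None:
        return f"this.{column}"
    # The prefix and the column name are joined by a colon.
    return f"this.{prefix}:{column}"


print(prefixed_column_expr("COLUMN_NAME", prefix="pfb"))
print(prefixed_column_expr("COLUMN_NAME"))
```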
Data in interdependent tables require more complex formatting. If your desired input is a single file, the syntax points directly at the file. If your desired input is a set of files nested inside a folder, the syntax must first point to the correct folder, then point to the desired files within. Looking at the Type and Attributes columns is a quick way to check how your workflow is set up.
Some workflows may require additional types of input
You may need to select one of several possible analysis options in the case of branched workflows, or you may have the opportunity to specify runtime options like the amount of memory and disk space provided to each task. You will find fields for these options, if available, in the Inputs section of the configuration form.
Isn't there an easier way? Yes! Use a JSON parameter file
Typing in every attribute by hand is tedious and invites errors. JSON parameter files can vastly simplify the process. To learn how to configure inputs with a JSON file instead of entering them manually, see Getting workflows up and running faster with a JSON file. It's especially useful if you anticipate running the same configuration many times over.
One nice aspect of how Terra manages workflows and their configurations is that it allows you to export your workflow config (JSON) back to the Method Repository and share it with others. Conversely, you can import any published workflow config to your own workspace. That can take a lot of guesswork out of configuring someone else's workflows to run on your own data.
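To give a rough idea of what such a parameter file might contain, the sketch below writes one with Python's standard json module. Several things here are assumptions, not confirmed by this article: the key layout (workflow name, a dot, then the variable name), the workflow name MyWorkflow, and the variable names other than InputCram. The linked JSON article describes the format Terra actually expects.

```python
import json


def build_inputs(workflow_name: str, attributes: dict) -> dict:
    """Prefix each variable with the workflow name (assumed key layout)."""
    return {f"{workflow_name}.{name}": value for name, value in attributes.items()}


# Hypothetical workflow and variable names; the values use the this. and
# workspace. formats described in Step 3.
inputs = build_inputs(
    "MyWorkflow",
    {
        "InputCram": "this.CRAM",
        "ref_fasta": "workspace.ref_fasta",
        "ref_dict": "workspace.ref_dict",
    },
)

inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)

# Save the file so it can be uploaded in the workflow configuration form.
with open("inputs.json", "w") as handle:
    handle.write(inputs_json)
```

Keeping a file like this under version control gives one place to review the whole configuration, which is harder to do field by field in the form.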
Next steps: Video and tutorial workflow resources
Hands-on practice setting up and running a workflow analysis
To practice setting up and running workflows, try the T101 Workflows QuickStart workspace. It should take about half an hour to complete the hands-on tutorial and cost less than a dime.
(Note: To run the exercises, you need to clone the workspace under your own billing project.)
- To learn more about using data tables to organize your data and scale your analysis, see Managing data with workspace tables.
- To learn more about how to update workflows, see Updating workflows to the latest version.
- To see a video tutorial on configuring a workflow, click here.
- To learn how to configure additional cost-saving options in Terra, see Workflow setup: virtual machine (VM) options.