Set tables are a useful way to organize data when you want to group files for (repeat) analysis, or when your analysis requires multiple input files to produce a single output. Read on to learn when to use a set table in each case.
How do you know when to use a set table in a workflow setup?
The answer depends on your analysis and the kind of data your workflow expects. You could use a set table when you:
- Run a workflow on a particular subgroup (set) of single entities
In this case, the workflow generates one output file for each entity. You can run the workflow on a single entity, but if you run it on a set of entities, Terra launches several jobs - one for each entity in the set - in parallel. See Analyzing many single entities in parallel below.
- Run a workflow that takes in many data files (an array) to generate a single output
See Workflows that have to run on sets below.
How you set up (configure) your workflow in Terra depends on how your workflow fits into a specific use case.
Option 1: Analyzing many single entities in parallel
Even if your workflow can run on a single sample, sometimes you might want to analyze the same set of samples together:
- To group samples that share certain characteristics (e.g., same species, same developmental age, same sequencing method)
- To test the workflow (run many times) on a small group of the same samples
In this case, your workflow will run on a single sample, but you want to analyze many at once.
Why use a set as input rather than running on many individual entities?
The samples are independent and each sample has its own unique output (see illustration above). However, they share the same workflow setup or reference files, so it's easier to group them.
- You set up your workflow only once.
- Instead of selecting each sample row by row each time, you run on the entire set.
The same workflow will run in parallel as many times as you have samples in the set.
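The fan-out pattern described above can be sketched in Python. This is only an illustrative analogy (Terra runs each job as an independent cloud submission, not local threads), and the sample IDs and output names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def run_workflow(sample_id):
    # Stand-in for one workflow job; in Terra, each job runs
    # independently in the cloud and writes its own output.
    return f"{sample_id}.output.bam"

# Hypothetical set of three samples.
sample_set = ["sample_A", "sample_B", "sample_C"]

# One job per entity in the set, launched in parallel;
# each entity produces its own output file.
with ThreadPoolExecutor() as pool:
    outputs = list(pool.map(run_workflow, sample_set))

print(outputs)  # one output per sample, in set order
```

The key point the sketch captures: the jobs are independent of one another, so the number of outputs equals the number of entities in the set.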
Setting up a sample_set of single samples to run in parallel
1. Start in the Workflows page.
2. Select the workflow you want to run on.
3. In Step 1, choose the input (root entity) table from the drop-down menu. Note: Choose the single-entity table, not the entity_set table (even if you already defined a set to run on):
4. In Step 2, click the blue button to select the data.
5. In the data selection form, choose specific entities to process (Terra will create a set for you) or choose an existing set of entities (if you already created an entity_set table).
For example, if your workflow runs on one specimen and you don't already have a group to run on, you can choose specific specimens from the specimens table and Terra automatically generates a set for you. This screenshot is taken from the Data Tables Quickstart.
You select the subset you want and name the set something meaningful for you in the Select Data screen (circled in the screenshot below).
If you already have a specimen_set table, you can select existing sets of specimens in workflow setup Step 2 and run the same workflow on all the samples in the set at once.
Using a set table helps you keep track of the sets you run and lets you easily rerun an analysis on exactly the same samples, without manually selecting them each time.
6. Click the blue OK button (bottom right) to save your selection.
7. Click the blue Save button.
8. Click the blue Run Analysis button to launch your workflow.
Creating set tables on the fly for workflow analysis
You can manually create set tables using a spreadsheet editor (learn more in How to add a Table to a Terra workspace). Or you can create a set table on the fly as you set up and run your workflow.
To learn how to generate a set automatically to run a workflow on a subset of data, see the Data Tables QuickStart Part 3, or watch this video:
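If you build a set table in a spreadsheet editor instead, the upload file is a tab-separated membership file that maps a set name to its member entities. The sketch below generates one in Python; the set name and sample IDs are hypothetical, and you should check the article linked above for the exact load-file conventions your table needs:

```python
import csv
import io

# Hypothetical set name and sample IDs; in a real workspace the
# IDs must match rows in your existing sample table.
set_name = "test_batch_1"
samples = ["NA12878", "NA12891", "NA12892"]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
# Membership-style header: the first column names the set table
# (membership:sample_set_id), the second names the member entity.
writer.writerow(["membership:sample_set_id", "sample"])
for sample_id in samples:
    writer.writerow([set_name, sample_id])

print(buf.getvalue())
```

Saving this output as a .tsv file and uploading it through the workspace Data page creates a sample_set whose members are the listed samples.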
Option 2: Workflows that have to run on sets (array inputs)
Some workflows take in multiple sample inputs and write a single output for the whole set. For example, the Optimus pipeline takes in multiple lanes of a sample but outputs one file that corresponds to the sample, not the individual read lanes. Another example, familiar to cancer researchers, is the CNV_Somatic_Panel_workflow, which generates a single Panel of Normals (PoN) from a list of normal samples. The PoN is used in variant calling to filter out systematic errors that occur when reads are processed.
Your workflow runs only on multiple samples and generates a single output that maps to the group, not to the individual samples.
In this case, the samples are connected (dependent) in sets; each set has its own output. It's not possible to generate output from a single sample.
Example: Processing multiple lanes of sequencing as a set
Let's say you have a sample data table where each row represents one lane of sequencing for a given sample, and the table may contain lanes from more than one sample. To process these data, you select all the lanes of sequencing that belong to a single sample. After processing, you generate one output: a file that represents the whole sample, not the individual lanes of sequencing.
Why use a set for the root entity type?
- To process all the lanes in a single workflow submission, as opposed to running an independent workflow for each lane of sequencing
- To write the single output representing multiple lanes of sequencing back to the set table, to better organize the outputs for each sample
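The many-to-one relationship behind this setup can be sketched in Python: lane-level rows are grouped by sample, and each group maps to exactly one output. The lane IDs and file paths below are hypothetical:

```python
from collections import defaultdict

# Hypothetical lane-level rows, as (lane_id, sample, fastq path).
lanes = [
    ("lane1", "mouse_1", "gs://my-bucket/mouse_1_L001.fastq"),
    ("lane2", "mouse_1", "gs://my-bucket/mouse_1_L002.fastq"),
    ("lane3", "mouse_2", "gs://my-bucket/mouse_2_L001.fastq"),
]

# Group the lanes by sample; each resulting set corresponds to
# one workflow submission and one set-level output.
sets = defaultdict(list)
for lane_id, sample, fastq in lanes:
    sets[sample].append(fastq)

for sample, fastqs in sets.items():
    print(f"{sample}: {len(fastqs)} lane(s) -> 1 merged output")
```

In Terra, this grouping is what the set table records, so the workflow receives all of a sample's lanes as one array input and writes its single output back to the set's row.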
Another option: Arrays in an entity table
You can use an array in an entity table for workflows that take arrays as input. In the example below, the sample table includes five mouse sample files. Each data file is a different flow lane from the sequencer, but the same mouse sample.
The two screenshots below show two different ways to provide the same five data files as an input array to a workflow.
Array in sample table
The same array in a sample set table
See How to upload an array of files to a table for details on how to create arrays inside entity tables.
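As a rough sketch (the bucket path and sample name are hypothetical, and the linked article is the authority on the exact format), an array column can be written into a single TSV cell as a JSON-style list:

```python
import json

# Hypothetical flow-lane files for one mouse sample.
files = [f"gs://my-bucket/mouse_1_L00{i}.fastq" for i in range(1, 6)]

# One row of a sample table: the sample ID in the first column,
# and all five file paths serialized as a JSON-style array in the
# second column (a single TSV cell).
row = "\t".join(["mouse_1", json.dumps(files)])
print(row)
```

Either representation (an array cell in the sample table, or a sample_set of five rows) delivers the same five files to a workflow input of array type.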