Set tables are a useful way to organize data when you want to group files for (repeat) analysis or if your analysis requires multiple input files to produce a single output. Learn when to use a set table for a workflow in both cases below.
Overview: Two ways to use set tables in a workflow setup
If your workflow inputs are stored in a workspace table, you'll need to choose a "root entity" when you set up your workflow. The root entity is the data table that contains your data inputs (read more in Selecting the root entity type). Tables can be either an "entity" table (such as a sample table) or a "set" (a sample_set table).
How do you know when and how to use a set table?
The answer depends on your analysis and on what kind of data your workflow expects.
- Running a workflow on a particular subgroup (set) of single entities
This option generates one output file for each entity. Terra will run several jobs - one for each entity in the set - in parallel. See Analyzing many single entities in parallel below.
- Running a workflow that takes in many data files to generate a single output
This option generates one output file for the whole set of inputs. See Workflows that have to run on sets below.

How you set up (configure) your workflow in Terra depends on which use case your workflow fits into.
Option 1: Analyzing many single entities in parallel
Even if your workflow can run on a single sample, there may be times when you want to analyze the same set of samples together:
- To group samples that share certain characteristics (e.g., the same species, developmental age, or sequencing method)
- To test the workflow by running it many times on the same small group of samples
In this case, your workflow will run on a single sample, but you want to analyze many at once.
Why use a set as input rather than running on many individual entities
The samples are independent and each sample will have its own unique output (see illustration above). However, they share the same workflow setup or reference files, so it's easier to group them:
- You only have to set up your workflow once
- You run on the entire set instead of selecting each sample row by row every time
The same workflow will run in parallel as many times as you have samples in the set.
Setting up a sample_set of single samples to run in parallel
1. Start in the Workflows page.
2. Select the workflow you want to run.
3. In Step 1, choose the entity table from the dropdown. Note that it should be the single entity, not the entity_set (even if you have already defined a set to run on):
4. In Step 2, click the blue button to select the data.
5. In the data selection form, you can choose specific entities to process (Terra will create a set for you) or choose an existing set of entities (if you have already created an entity_set table).
For example, if your workflow runs on one specimen and you don't already have a group to run on, you can choose specific specimens from the specimens table and Terra will automatically generate a set for you. This screenshot is taken from the Data Tables Quickstart.
You select the subset you want and name the set something meaningful for you in the Select Data screen (circled in the screenshot below).
If you already have a specimen_set table, you can select existing sets of specimens in workflow setup Step 2 and run the same workflow on all the samples in the set at once.
Using a set table helps you keep track of the sets you run and lets you easily rerun an analysis on the exact same samples, avoiding setting up your samples manually each time.
6. Click the blue OK button (bottom right) to save your selection.
7. Click the blue Save button.
8. Click the blue Run Analysis button to launch your workflow.
Creating set tables on the fly for workflow analysis
You can manually create set tables using a spreadsheet editor (learn more in How to add a Table to a Terra workspace). But you can also create a set table on the fly as you set up and run your workflow.
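If you prefer to build the load file programmatically rather than in a spreadsheet editor, the sketch below writes a set table as a tab-separated "membership" file, where the first column header is membership:sample_set_id and the second column names the member entity. The sample IDs and set name here are hypothetical placeholders; substitute the IDs from your own sample table.

```python
import csv

# Hypothetical sample IDs to group into one set (replace with IDs from your sample table)
sample_ids = ["sample_01", "sample_02", "sample_03"]
set_name = "test_batch_1"  # choose a name that is meaningful to you

# Write a tab-separated membership file: one row per (set, member) pair
with open("sample_set_membership.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["membership:sample_set_id", "sample"])
    for sid in sample_ids:
        writer.writerow([set_name, sid])
```

You can then upload the resulting TSV in the workspace Data tab, and the sample_set will appear alongside your sample table.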
To learn how to generate a set automatically to run a workflow on a subset of data, see the Data Tables Quickstart Part 3, or watch this video:
Option 2: Workflows that have to run on sets (array inputs)
Some workflows take in multiple sample inputs and write a single output for the whole set. For example, the Optimus pipeline takes in multiple lanes of a sample but outputs one file that corresponds to the sample, not the individual read lanes. Another example, familiar to cancer researchers, is the CNV_Somatic_Panel_workflow, which generates a single Panel of Normals (PoN) from a list of normal samples. The PoN is used in variant calling to filter out systematic errors that occur when reads are processed.
Your workflow runs only on multiple samples and generates a single output that maps to the group, not to the individual samples.
In this case, the samples are connected (dependent) in sets; each set has its own output. It's not possible to generate output from a single sample.
Example: Processing multiple lanes of sequencing as a set
Let's say you have a sample data table where each row represents one lane of sequencing for a given sample, and the table may contain lanes from multiple samples. To process these data, you'll need to select all the lanes of sequencing that belong to a single sample. After processing, you'll generate one output: a file that represents the whole sample, not the individual lanes of sequencing.
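As a concrete sketch of this scenario (with hypothetical IDs and file paths), the lane-level sample table and the set that groups one sample's lanes might look like this:

```
# sample table: one row per lane of sequencing
entity:sample_id    fastq                          lane
sampleA_L001        gs://bucket/A_L001.fastq.gz    1
sampleA_L002        gs://bucket/A_L002.fastq.gz    2
sampleB_L001        gs://bucket/B_L001.fastq.gz    1

# sample_set table: groups the lanes that belong to each sample
membership:sample_set_id    sample
sampleA                     sampleA_L001
sampleA                     sampleA_L002
sampleB                     sampleB_L001
```

Running the workflow with sample_set as the root entity then processes all of sampleA's lanes in one job and all of sampleB's lanes in another.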
Why use a set for the root entity type
In this case, using a set table as the root entity is beneficial for two reasons: you can process all the lanes in a single workflow submission, rather than running an independent workflow for each lane of sequencing, and you can write the single output representing multiple lanes back to the set table, which keeps the outputs organized by sample.
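When the root entity is a set, the workflow's array input can be filled from an attribute of all member samples, and the set-level output written back to the set's own row. A minimal sketch of the attribute expressions in the workflow setup form, assuming hypothetical input/output names (fastq, merged_bam) and a sample_set root entity:

```
# Input (Array[File]): one entry per member of the set
input_fastqs = this.samples.fastq

# Output (single File): written back to the set's row in the sample_set table
merged_bam   = this.merged_bam
```

Here this refers to the sample_set row the workflow is running on, and this.samples expands to its member samples, so this.samples.fastq collects each member's fastq attribute into an array.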