When and how to use a set table for a workflow

Set tables are a useful way to organize data when 1) you want to group specific input data for (repeat) analysis or 2) your analysis requires multiple input files to produce a single output. Below, learn when to use a set table for a workflow and how to set up your workflow to run in each case.

How do you know when to use a set table in a workflow setup?

There are two use cases, so the answer depends on your analysis and what kind of data your workflow expects.

1. When running a workflow on a particular subgroup (set) of single entities

You will know this is your use case if it is possible to run the workflow on a single entity. When running on more than one entity, the workflow generates one output file for each entity.

Running on a set means running many workflows in parallel - one for each set member. This makes it easy to run the workflow - or downstream workflows - on the same subset of the entire data table, again and again. See Analyzing many single entities in parallel below.

2. When a workflow takes in many data files (an array) to generate a single output

In this case, the set acts as a proxy for an array. Note that it is possible to include an array in a single cell in a data table. But if each input is already a separate row - in a sample table, for example - it's easier to run the workflow on the set of those rows. See Workflows that have to run on sets below.

Analyzing many single entities in parallel

In this case, your workflow can run on a single sample, but you want to analyze many samples at once.

Reasons to analyze many samples at once

To group samples that share certain characteristics (e.g., same species, same developmental age, or same sequencing method)
To test the workflow (run many times) on a small group of the same samples

Why use a set as input rather than running on many individual entities?

The samples are independent, and each sample has its own unique output (see illustration above). However, they share the same workflow setup or reference files, so it's easier to set up the workflow if you group them.

You set up your workflow only once.
You don't have to select each sample manually - row by row - each time.

The same workflow will run in parallel as many times as you have samples in the set.
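The fan-out described above can be pictured with a short Python sketch. This is only a conceptual illustration, not Terra's implementation: the dictionaries, the `expand_set` helper, and the `gs://example-bucket` paths are all hypothetical stand-ins for a sample table and a sample_set table.

```python
# Hypothetical stand-in for a sample table: one row per sample.
sample_table = {
    "sample_1": {"fastq": "gs://example-bucket/sample_1.fastq.gz"},
    "sample_2": {"fastq": "gs://example-bucket/sample_2.fastq.gz"},
    "sample_3": {"fastq": "gs://example-bucket/sample_3.fastq.gz"},
}

# Hypothetical stand-in for a sample_set table: set name -> member sample IDs.
sample_set_table = {"pilot_set": ["sample_1", "sample_3"]}


def expand_set(set_id):
    """Return one independent job description per member of the set."""
    jobs = []
    for sample_id in sample_set_table[set_id]:
        jobs.append({"entity": sample_id, "inputs": sample_table[sample_id]})
    return jobs


jobs = expand_set("pilot_set")
for job in jobs:
    # Each job has its own inputs and would produce its own output.
    print(job["entity"], job["inputs"]["fastq"])
```

Selecting the set once yields one job per member, which is why the workflow only needs to be configured a single time.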

How to set up a sample_set of single samples to run in parallel

1. Start in the Workflows page.

2. Select the workflow you want to run.

3. In Step 1, choose the input (root entity) table from the drop-down menu.

Note: The root entity should be the single entity, not the entity_set (even if you already defined a set to run on).

4. In Step 2, click the blue Select Data button.

5. In the data selection form, choose specific entities to process (Terra will create a set for you) or choose an existing set of entities (if you already created an entity_set table).

For example, if your workflow runs on one specimen and you don't already have a group to run on, you can choose specific specimens from the specimens table, check the Selected [entities] will be saved as a new [entities]_set named___ box, and Terra automatically generates a set for you. This screenshot is taken from the Data Tables Quickstart.

You select the subset you want and name the set something meaningful for you in the Select Data screen (circled in the screenshot below).
If you already have a specimen_set table, you can select existing sets of specimens in workflow setup Step 2 and run the same workflow on all the samples in the set at once.

Using a set table helps you keep track of the sets you run and lets you easily rerun an analysis on exactly the same samples, without manually selecting your samples each time.

6. Click the blue OK button (bottom right) to save your selection.

7. Click the blue Save button.

8. Click the blue Run Analysis button to launch your workflow.

Creating set tables on the fly for workflow analysis

You can manually create set tables using a spreadsheet editor (learn more in How to add a Table to a Terra workspace). Or you can create a set table on the fly as you set up and run your workflow.

To learn how to generate a set automatically to run a workflow on a subset of data, see the Data Tables QuickStart Part 3, or watch this video.
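For the manual route, a set table is typically described by a tab-separated file that pairs each set name with one member per row. The sketch below writes such a file in Python. The header layout (`membership:sample_set_id` followed by `sample`) is an assumption about the load-file format, and the set and sample names are hypothetical; check How to add a Table to a Terra workspace for the exact header your table needs.

```python
import csv
import io


def membership_tsv(set_id, sample_ids):
    """Build tab-separated text with one row per (set, member) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Assumed header layout: set ID column, then the member entity column.
    writer.writerow(["membership:sample_set_id", "sample"])
    for sample_id in sample_ids:
        writer.writerow([set_id, sample_id])
    return buffer.getvalue()


# Hypothetical set name and member IDs.
tsv_text = membership_tsv("pilot_set", ["sample_1", "sample_3"])
print(tsv_text)
```

The resulting text can be saved with a .tsv extension or pasted into a spreadsheet editor for review before upload.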

Workflows that have to run on sets (array inputs)

Some workflows take in multiple sample inputs and write a single output for the whole set. For example, the Optimus pipeline takes in multiple lanes of a sample but outputs one file that corresponds to the sample, not the individual read lanes. Another example, familiar to cancer researchers, is the CNV_Somatic_Panel_workflow, which generates a single Panel of Normals (PoN) from a list of normal samples. The PoN is used in variant calling to filter out systematic errors that occur when reads are processed.

Your workflow runs only on multiple samples, and generates a single output that maps to the group, not the individual samples.

The root entity is a sample_set table. In this case, the samples are connected (dependent) in sets; each set has its own output. It's not possible to generate output from a single sample.

Example: Processing multiple lanes of sequencing as a set

Let's say you have a sample data table where each row represents a lane of sequencing for a given sample, and you might have multiple sample types in the table. To process these data, you need to select all the lanes of sequencing that belong to only one sample. After processing, you generate one output, a file that represents the whole sample, not just the individual lanes of sequencing.
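The grouping step in this example can be sketched in Python as follows. This is an illustration only: the `parent_sample` column, the lane IDs, and the `group_lanes` helper are hypothetical names, not part of any Terra table or API.

```python
from collections import defaultdict

# Hypothetical sample table: each row is one lane of sequencing.
lane_rows = [
    {"lane_id": "lane_1", "parent_sample": "mouse_A"},
    {"lane_id": "lane_2", "parent_sample": "mouse_A"},
    {"lane_id": "lane_3", "parent_sample": "mouse_B"},
]


def group_lanes(rows):
    """Collect lane IDs into one set per parent sample."""
    sets = defaultdict(list)
    for row in rows:
        sets[row["parent_sample"]].append(row["lane_id"])
    return dict(sets)


lane_sets = group_lanes(lane_rows)
for set_name, lanes in lane_sets.items():
    # One workflow run per set; each run produces a single output.
    print(set_name, lanes)
```

Each key in `lane_sets` corresponds to one row of the set table, and each list holds the lanes that one workflow run would process together.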

Why use a set for the root entity type?

To process all the lanes in a single workflow submission and get a single output, as opposed to running an independent workflow for each lane of sequencing
To write the single output representing multiple lanes of sequencing back to the set table, to better organize the outputs for each sample

Formatting requirements

When a workflow takes an array of files as input, the root entity type might be a _set table (for example, sample_set), but the data files are in the single entity table (i.e., the sample table).

In this case, you use the format this.samples.attribute-name.

Note: In the nested format, an extra s is always appended to the entity (table) name. For example, the sample table becomes this.samples.
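One way to read the this.samples.attribute-name format is as a two-step lookup: start at the set row, follow its members into the single entity table, and collect the named attribute into an array. The Python sketch below mimics that lookup under stated assumptions. The tables, the `fastq` attribute, and the `resolve` helper are hypothetical; this is not how Terra evaluates expressions internally.

```python
# Hypothetical single entity table: one row per lane, with a file attribute.
sample_table = {
    "lane_1": {"fastq": "gs://example-bucket/lane_1.fastq.gz"},
    "lane_2": {"fastq": "gs://example-bucket/lane_2.fastq.gz"},
    "lane_3": {"fastq": "gs://example-bucket/lane_3.fastq.gz"},
}

# Hypothetical set table: each row lists its members in a "samples" column.
sample_set_table = {"mouse_A": {"samples": ["lane_1", "lane_2"]}}


def resolve(set_id, expression):
    """Resolve 'this.<members>.<attribute>' into a list of attribute values."""
    parts = expression.split(".")
    if len(parts) != 3 or parts[0] != "this":
        raise ValueError("expected the form this.<members>.<attribute>")
    _, member_column, attribute = parts
    member_ids = sample_set_table[set_id][member_column]
    return [sample_table[member_id][attribute] for member_id in member_ids]


fastqs = resolve("mouse_A", "this.samples.fastq")
print(fastqs)
```

The result is a single array with one value per set member, which is the shape a workflow with an array input expects.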

Another option: Arrays in an entity table

You can use an array in an entity table for workflows that take arrays as input. In the example below, the sample table includes five mouse sample files. Each data file is a different flow lane from the sequencer, but they're all from the same mouse sample.

The two screenshots below show two different ways to provide the same five data files as input to a workflow that takes an array.

Array in the sample table

Screenshot: Mouse FASTQ lanes as an array in a combined sample table

The same array in a sample set table

See How to upload an array of files to a table for more details of how to create arrays inside entity tables.
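As a rough sketch of the array-in-a-cell option, the Python below builds one tab-separated row whose second cell holds a list of files written in JSON array syntax. The `entity:sample_id` header, the `fastq_lanes` column name, the JSON cell syntax, and the file paths are assumptions made for illustration; the article referenced above describes the format Terra actually accepts.

```python
import json


def array_row(sample_id, lane_files):
    """Return one tab-separated row: the sample ID, then a JSON array cell."""
    return "\t".join([sample_id, json.dumps(lane_files)])


# Assumed header layout and hypothetical column name.
header = "\t".join(["entity:sample_id", "fastq_lanes"])

# Hypothetical lane files that all belong to the same mouse sample.
row = array_row(
    "mouse_A",
    [
        "gs://example-bucket/lane_1.fastq.gz",
        "gs://example-bucket/lane_2.fastq.gz",
    ],
)
print(header)
print(row)
```

With this layout, the workflow's root entity would be the sample table itself, and the array input would point at the single cell rather than at a set.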