When and how to use a set table for a workflow

Liz Kiernan

Set tables are a useful way to organize data when 1) you want to group specific input data for (repeat) analysis or 2) your analysis requires multiple input files to produce a single output. Learn below when to use a set table for a workflow and how to set up your workflow in each case.

How do you know when to use a set table in a workflow setup?

The answer depends on your analysis and on the kind of data your workflow expects. How you set up (configure) your workflow in Terra depends on which of the following use cases it fits.

1. When running a workflow on a particular subgroup (set) of single entities

In this case, the workflow can run on a single entity. When it runs on more than one entity, it generates one output file for each entity. Running on a set makes it easy to run the workflow (or downstream workflows) on the same subset of the data table again and again. If you run on a set of entities, Terra runs several jobs in parallel, one for each entity in the set. See Analyzing many single entities in parallel below.

2. When a workflow takes in many data files (an array) to generate a single output

In this case, the set acts as a proxy for an array. Note that it is also possible to include an array in a single cell of a data table. But if your data is already in a table where each row has one of the files that need to be grouped, a set lets you group those rows as they are. See Workflows that have to run on sets below.

Option 1: Analyzing many single entities in parallel 

In this case, your workflow will run on a single sample, but you want to analyze many at once. 

Data-Quickstart_Part3_Sets-of-single-entities-as-input.png

Reasons to analyze many samples at once

  • To group samples that share certain characteristics (e.g., same species, same developmental age, or same sequencing method)
  • To test the workflow (run many times) on a small group of the same samples

Why use a set as input rather than running on many individual entities?

The samples are independent, and each sample has its own unique output (see illustration above). However, they share the same workflow setup or reference files, so it's easier to set up the workflow if you group them.

  • You set up your workflow only once.
  • You don't have to select each sample manually - row by row - each time.

The same workflow will run in parallel as many times as you have samples in the set. 
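As a rough illustration of this fan-out, the sketch below turns one set into one job per member. It is not how Terra schedules work internally; `plan_jobs`, the set name, and the sample IDs are all hypothetical.

```python
# Hypothetical sample_set table: set name -> IDs of the member samples.
sample_sets = {
    "mouse_liver_batch1": ["sample_01", "sample_02", "sample_03"],
}


def plan_jobs(set_name, sets):
    """Return one job description per entity in the set (illustrative only)."""
    return [
        {"root_entity": "sample", "entity_id": sample_id}
        for sample_id in sets[set_name]
    ]


jobs = plan_jobs("mouse_liver_batch1", sample_sets)
print(len(jobs))  # 3 jobs, one per sample in the set
```

Note that the root entity of every job is the single sample, not the set. The set only decides which samples get a job.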

How to set up a sample_set of single samples to run in parallel

1. Start in the Workflows page.

2. Select the workflow you want to run.

3. In Step 1, choose the input (root entity) table from the drop-down menu.

Note: The root entity should be the single entity, not the entity_set (even if you already defined a set to run on).

Root-entity-type_Specimen_Screen_shot.png

4. In Step 2, click the blue Select Data button.

5. In the data selection form, choose specific entities to process (Terra will create a set for you) or choose an existing set of entities (if you already created an entity_set table).

  • For example, if your workflow runs on one specimen and you don't already have a group to run on, you can choose specific specimens from the specimens table and Terra automatically generates a set for you. This screenshot is taken from the Data Tables Quickstart.

    You select the subset you want and name the set something meaningful for you in the Select Data screen (circled in the screenshot below).

    Select-data_Create-a-new-specimen-set_Screen_shot.png

  • If you already have a specimen_set table, you can select existing sets of specimens in workflow setup Step 2 and run the same workflow on all the samples in the set at once.

    Select-data_Choose-specific-sets-to-process_Screen_shot.png

    Using a set table helps you keep track of the sets you run and lets you easily rerun an analysis on the exact same samples, so you avoid manually selecting your samples each time.

6. Click the blue OK button (bottom right) to save your selection.

7. Click the blue Save button.

8. Click the blue Run Analysis button to launch your workflow.

Creating set tables on the fly for workflow analysis

You can manually create set tables using a spreadsheet editor (learn more in How to add a Table to a Terra workspace). Or you can create a set table on the fly as you set up and run your workflow.
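If you build the set table yourself, a short script can stand in for the spreadsheet editor. The sketch below assumes Terra's load-file convention for set tables: a tab-separated file whose header is `membership:sample_set_id` followed by the member entity type, with one row per set member. Please confirm the header format against the article referenced above. The set name, sample IDs, and file name are made up for illustration.

```python
import csv

# Hypothetical set name and member IDs.
# The member IDs must already exist in the sample table.
set_name = "liver_batch1"
sample_ids = ["sample_01", "sample_02", "sample_03"]


def membership_rows(set_id, members):
    """Build the header row plus one (set, member) row per entity."""
    rows = [["membership:sample_set_id", "sample"]]
    rows.extend([set_id, member] for member in members)
    return rows


rows = membership_rows(set_name, sample_ids)

# Write a tab-separated file that could be uploaded to the workspace.
with open("sample_set_membership.tsv", "w", newline="") as handle:
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
```

Each member appears on its own row, so a set of three samples produces three data rows that share the same set ID.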

To learn how to generate a set automatically to run a workflow on a subset of data, see the Data Tables QuickStart Part 3, or watch this video.

Option 2: Workflows that have to run on sets (array inputs) 

Some workflows take in multiple sample inputs and write a single output for the whole set. For example, the Optimus pipeline takes in multiple lanes of a sample but outputs one file that corresponds to the sample, not the individual read lanes. Another example, familiar to cancer researchers, is the CNV_Somatic_Panel_workflow, which generates a single Panel of Normals (PoN) from a list of normal samples. The PoN is used in variant calling to filter out systematic errors that occur when reads are processed.

Your workflow runs only on multiple samples, and generates a single output that maps to the group, not the individual samples. 

Data-Quickstart_Part4_Sets-as-input-arrays.png

In this case, the samples are connected (dependent) in sets; each set has its own output. It's not possible to generate output from a single sample.

Example: Processing multiple lanes of sequencing as a set

Let's say you have a sample data table where each row represents a lane of sequencing for a given sample, and you might have multiple sample types in the table. To process these data, you need to select all the lanes of sequencing that belong to only one sample. After processing, you generate one output, a file that represents the whole sample, not just the individual lanes of sequencing. 
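To make the grouping step concrete, here is a minimal sketch that collects lane rows into one set per sample. The column name `sample_name` and the lane IDs are assumptions for illustration; your table will have its own column names.

```python
from collections import defaultdict

# Hypothetical lane-level rows: each row is one lane of sequencing.
lanes = [
    {"lane_id": "mouseA_L001", "sample_name": "mouseA"},
    {"lane_id": "mouseA_L002", "sample_name": "mouseA"},
    {"lane_id": "mouseB_L001", "sample_name": "mouseB"},
]


def group_lanes(rows):
    """Map each sample name to the list of lane IDs that belong to it."""
    sets = defaultdict(list)
    for row in rows:
        sets[row["sample_name"]].append(row["lane_id"])
    return dict(sets)


lane_sets = group_lanes(lanes)
print(lane_sets["mouseA"])  # ['mouseA_L001', 'mouseA_L002']
```

Each key in `lane_sets` corresponds to one row of a set table, and each set produces one output for the whole sample.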

Why use a set for the root entity type?

  • To process all the lanes in a single workflow submission, as opposed to running an independent workflow for each lane of sequencing 
  • To write the single output representing multiple lanes of sequencing back to the set table, to better organize the outputs for each sample

Formatting requirements

When a workflow takes an array of files as input, the root entity type might be a _set table (for example, sample_set), but the data files are in the single entity table (i.e., the sample table).

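In this case, you reference the files in the workflow inputs with the format this.samples.attribute-name, where attribute-name is the column in the sample table that holds the files.

Note: The extra s is always appended to the entity (table) name in the nested format (for example, sample becomes samples).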
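One way to picture the nested format is as a two-step lookup: first find the members of the set, then collect one attribute from each member. The sketch below mimics that lookup with plain dictionaries. It is only a mental model, not Terra's implementation, and the table contents, bucket paths, and `resolve` helper are hypothetical.

```python
# Hypothetical single-entity table: one row per lane, with a "fastq" column.
sample_table = {
    "lane_1": {"fastq": "gs://example-bucket/lane_1.fastq.gz"},
    "lane_2": {"fastq": "gs://example-bucket/lane_2.fastq.gz"},
}

# Hypothetical set table: the "samples" column lists the member rows.
sample_set_table = {
    "mouseA": {"samples": ["lane_1", "lane_2"]},
}


def resolve(set_id, attribute):
    """Mimic this.samples.<attribute> for one row of the set table."""
    members = sample_set_table[set_id]["samples"]
    return [sample_table[member][attribute] for member in members]


# Equivalent in spirit to this.samples.fastq with sample_set as the root entity.
fastqs = resolve("mouseA", "fastq")
```

The result is a list with one file per member of the set, which is the array the workflow expects as input.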

Another option: Arrays in an entity table

You can use an array in an entity table for workflows that take arrays as input. In the example below, the sample table includes five mouse sample files. Each data file comes from a different flow cell lane of the sequencer, but they're all from the same mouse sample.

Mouse-FASTQ-lanes-in-sample-table_Screen_shot.png

The two screenshots below show two different ways to provide the same five data files as input to a workflow that takes an array.

Array in the sample table

Mouse-FASTQ-lanes-array-in-combined-sample-table_Screen_shot.png

The same array in a sample set table

Mouse-FASTQ-lanes-in-sample-set-table_Screen_shot.png

See How to upload an array of files to a table for more details of how to create arrays inside entity tables. 
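For a rough idea of what an array cell can look like in a load file, the sketch below writes five file paths into a single cell as a JSON-style list. This assumes that arrays are entered as bracketed, comma-separated lists of quoted values; check the article referenced above for the exact syntax. The sample ID, column name, and bucket paths are hypothetical.

```python
import json

# Hypothetical sample ID and five lane files from the same mouse sample.
sample_id = "mouse_sample_1"
fastq_lanes = [
    f"gs://example-bucket/mouse_L00{lane}.fastq.gz" for lane in range(1, 6)
]

# One header line and one data row; the array sits in a single cell.
header = "\t".join(["entity:sample_id", "fastq_lanes"])
row = "\t".join([sample_id, json.dumps(fastq_lanes)])
tsv_text = header + "\n" + row + "\n"

print(tsv_text)
```

With this layout, a single row of the sample table carries all five files, so the workflow can use the sample table as its root entity without a separate set table.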
