Data Tables Quickstart Part 3: Understanding sets of data

Allie Hajian

What if you need to run a workflow analysis on the same group of single entities again and again? Selecting the exact same entities every time you run a workflow by clicking each row in a table - one at a time - will get tedious very quickly. And, if your dataset is big, you're likely to make mistakes when trying to choose the exact same rows every time. 

Thankfully, Terra has a built-in way to group single entities together in sets. In this part, we'll introduce the idea of sets and work through two ways to create set tables, which make it easy to run a workflow on the same subset of data again and again.

Overview: Running a workflow on groups of single entities

You may have hundreds of samples in your data table, but want to do some testing with just a few samples. The workflow outputs a data file for each input. You want to use the same three samples every time, but they're sprinkled throughout the sample table. 


Problem: Opening the table and selecting the right three samples each time you do a test run is tedious at best and prone to errors at worst.

Solution: Define a set of the samples you want and run the workflow on the set. For WDLs that accept single entities as input, using a set as input means the workflow runs multiple times in parallel - once on each entity in the set. The outputs are distinct output files - one for each entity - written to the original table.
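To picture this fan-out, here is a minimal Python sketch. It only illustrates the idea and is not how Terra is implemented: the specimen IDs, the `input_file` and `output_file` column names, and the `run_workflow` function are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical entity table: one row per specimen, keyed by specimen ID.
specimen_table = {
    "specimen_1": {"input_file": "specimen_1.fastq"},
    "specimen_2": {"input_file": "specimen_2.fastq"},
    "specimen_3": {"input_file": "specimen_3.fastq"},
    "specimen_4": {"input_file": "specimen_4.fastq"},
}

# A set only lists member IDs; it holds no data files itself.
test_set = ["specimen_1", "specimen_3", "specimen_4"]


def run_workflow(input_file):
    """Stand-in for a single-entity workflow: one input file in, one output name out."""
    return input_file.replace(".fastq", ".out.txt")


def run_on_set(table, member_ids):
    """Run the workflow once per set member and write each output to that member's row."""
    inputs = [table[member_id]["input_file"] for member_id in member_ids]
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(run_workflow, inputs))
    for member_id, output in zip(member_ids, outputs):
        table[member_id]["output_file"] = output


run_on_set(specimen_table, test_set)
```

After the call, only the three rows named in the set gain an `output_file` value; the row for `specimen_2` is untouched.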


1. Create a set on the fly: Run a workflow on data subset

Terra will automatically generate a set when you run a workflow on more than one entity, which makes it easy to run on the same subset again (and again and again) without having to carefully choose the correct rows of data. Let's see how that works.

1.1. Go to the Workflows page and select the 1-Single-Input-Workflow WDL. 

1.2. For Step 1, select the root entity type "specimen" from the dropdown. Note that you're still running the workflow on a single entity, but running it many times (in parallel) on a set of entities. 

1.3. For Step 2, click the "Select Data" button to go to the workflow configuration form. This is where you will select the data that will form your set.  

1.4. In the "Select Data" form:
       1. Select the "Choose specific rows to process" radio button
       2. Select two or three specimens that you want to group together 
       3. Name the set (at the bottom of the form)
       4. Save the selection by clicking "OK" 

1.5. Run the workflow by selecting the "Run Analysis" button and following the prompts.

Examine the generated "specimen_set" table

Once your specimens have been processed successfully, take a look again at the workspace Data page.

You'll notice an additional table, "specimen_set". The name tells you it's a table of sets of specimens. If you click to expand it, you'll see it includes one set (with whatever name you gave it) of however many specimens you chose.

Click on the "# entities" link in the Specimens column to see the IDs of the specimens you analyzed together.

Where's the output data?

The specimen_set table only includes the name of the set (in the ID column) and the IDs of the specimens in the set. The input data files are in the specimen table (corresponding to the root entity type).

The generated output data are also in the table corresponding to the root entity type (the specimen table).

2. Make a table for a set of data from scratch 

As with other tables, you'll use a spreadsheet editor to make an entity _set table for any entity in Terra.

The load file that defines a set has two columns.
1. The ID column (with a unique name for each set)
2. The entity column, with the unique ID of every entity in the set.

There is a separate row for each entity in the set.

To start, open a blank sheet or page in your spreadsheet editor.

Step 1. Fill in the header row

1.1. Fill in the ID (first column) 
The first column is the ID column, which defines the unique name of the set.

The format is "membership:your-entity-name_set_id".

The "membership:" prefix and the "_set_id" suffix are required exactly as typed. The entity name is the type of entity you are grouping together. For the Data-Tables-Quickstart it will be specimen, but if your workflow processes samples, it would be sample.

Terra requires a particular format for the set ID column header: membership:your-entity-name_set_id

Notice that the entity name in this header must match the header of the second column.

1.2. Fill in the Entity column
The second column is the entity you're grouping into sets. The header must match the entity type of the table your workflow will take its single inputs from (specimen in this example).

What the header row will look like: After filling in the headers, the first row of your specimen_set TSV file will contain membership:specimen_set_id in the first column and specimen in the second.
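As a plain-text sketch of that header row, the short Python snippet below builds the two header cells from an entity name. The function name `set_header` is made up for illustration, and it assumes the entity type is `specimen`, to match the specimen_set table name.

```python
def set_header(entity_name):
    """Return the two header cells for a set load file of the given entity type."""
    return [f"membership:{entity_name}_set_id", entity_name]


header = set_header("specimen")
# The two cells are separated by a tab in the saved TSV file.
print("\t".join(header))
```

The same function gives `membership:sample_set_id` and `sample` if your workflow processes samples instead.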

Step 2. Fill in the entity_set rows

Next, you'll add information about the sets - the set names and what unique entities are in each set. Each entity in the set has its own row in the load file.

2.1. Fill in the set name. In this example, we will make a set called `human_v3`.  

2.2. Fill in the entity_id (from the primary data table), starting with the first specimen, `pbmc_human_v3_lane1`.

2.3. Repeat for each member of each set (e.g., the second row will be `human_v3` and `pbmc_human_v3_lane2`).

The spreadsheet (one set with two members) will then have a header row followed by two rows, one for each member of the set.
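If you would rather generate the load file than type it into a spreadsheet, the set from Steps 2.1 to 2.3 can be written out with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `build_set_tsv` is hypothetical, and the entity type is assumed to be `specimen`.

```python
import csv
import io

# One row per set member: (set name, entity ID).
rows = [
    ("human_v3", "pbmc_human_v3_lane1"),
    ("human_v3", "pbmc_human_v3_lane2"),
]


def build_set_tsv(entity_name, rows):
    """Return tab-delimited text for a set load file: header row, then one row per member."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([f"membership:{entity_name}_set_id", entity_name])
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = build_set_tsv("specimen", rows)
print(tsv_text, end="")
```

Writing `tsv_text` to a file with a name of your choice gives you the same tab-delimited text you would get by saving from a spreadsheet editor.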

Step 3. Save and upload

3.1. Once you have finished filling in the spreadsheet, save the file as "tab-delimited text".

3.2. Upload to your workspace by clicking the "+" icon at the top of the left column in the Data page.

What to expect: You should see a new row in the "specimen_set" table that corresponds to the load file (tsv) you just made and uploaded.

Adding additional sets

Note that you can add additional sets of different groups of specimens by adding to the same spreadsheet load file (TSV).

For example, let's say you wanted to add a set of all the human specimens. You would add one row per specimen to the spreadsheet, each with the new set name ("human_all") in the first column.

After uploading the updated TSV file to the workspace, the specimen_set table will include an additional set, "human_all".
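One way to keep several sets in a single load file is to describe each set as a list of members and flatten the lists into rows. The sketch below assumes the `specimen` entity type. The membership of "human_all" is not spelled out in this article, so the third ID in that set is a placeholder, not a real specimen from the workspace.

```python
sets = {
    "human_v3": ["pbmc_human_v3_lane1", "pbmc_human_v3_lane2"],
    # Hypothetical membership: the two specimens above plus one placeholder ID.
    "human_all": ["pbmc_human_v3_lane1", "pbmc_human_v3_lane2", "example_human_specimen"],
}


def membership_rows(sets):
    """Flatten {set_name: [entity_id, ...]} into one (set_name, entity_id) row per member."""
    return [
        (set_name, entity_id)
        for set_name, members in sets.items()
        for entity_id in members
    ]


lines = ["membership:specimen_set_id\tspecimen"]
lines += [f"{set_name}\t{entity_id}" for set_name, entity_id in membership_rows(sets)]
print("\n".join(lines))
```

Note that the same specimen can appear in more than one set; it simply gets one row per set it belongs to.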

Required TSV upload order (for data tables that reference attributes in other tables)

If there is a reference from entity B to entity A, then you must upload entity A first. In other words, Terra must already know of any entities referenced in additional columns.

For example, because the specimen_set table references the specimens, you would upload the specimen load file before the specimen_set TSV. Note that in this exercise, the specimen table already existed.
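Before uploading, you can check locally that every set member already appears in the entity load file. The following Python sketch is one way to do that, not a Terra feature. It assumes the entity load file has the entity ID in its first column under a header such as `entity:specimen_id`; the `input_file` column and the function name `missing_members` are hypothetical.

```python
import csv
import io


def missing_members(entity_tsv, set_tsv):
    """Return set member IDs that do not appear in the first column of the entity load file."""
    entity_rows = csv.reader(io.StringIO(entity_tsv), delimiter="\t")
    next(entity_rows)  # skip the header row
    known_ids = {row[0] for row in entity_rows if row}

    set_rows = csv.reader(io.StringIO(set_tsv), delimiter="\t")
    next(set_rows)  # skip the header row
    members = {row[1] for row in set_rows if len(row) > 1}
    return sorted(members - known_ids)


# Hypothetical entity load file that contains only one of the two specimens.
entity_tsv = "entity:specimen_id\tinput_file\npbmc_human_v3_lane1\tlane1.fastq\n"
set_tsv = (
    "membership:specimen_set_id\tspecimen\n"
    "human_v3\tpbmc_human_v3_lane1\n"
    "human_v3\tpbmc_human_v3_lane2\n"
)

missing = missing_members(entity_tsv, set_tsv)
print(missing)  # lists pbmc_human_v3_lane2, which is not in the entity file
```

An empty list means every entity the set refers to is present in the entity load file, so uploading that file first and the set file second should satisfy the ordering requirement.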

3. Set up and run a workflow on a set of single entities

Think of your set as a means of grouping individual inputs. The specimen_set table doesn't actually include the specimen data files the workflow uses as input. The workflow runs exactly like it did on a single sample entity, but on all the members of the set simultaneously.

Given this, how do you tell the workflow where to go for the data? We'll demonstrate by running the 1-Single-Input-Workflow on the set you just created. 

How to configure the workflow (step-by-step)

Start from the Workflows page and select `1-Single-Input-Workflow`.

3.1. Set the root entity type

The workflow still takes single entities as input and processes them in parallel, so the root entity type is still `specimen`.

Formatting when using the data table for inputs and outputs: Note that the formatting for writing the input and output to the data table is exactly the same as in the single-entity case.

What to expect

Once you launch your workflow, you will see the status of all the specimens you ran in your set. When they are done, you should see a number of satisfying green checks.

And when you go back to the Data page, you should see that each specimen that was in your set has an additional column that includes links to output data.

Congratulations! You've completed Part 3 of the Data Tables Quickstart!



1 comment

  • Comment author
    Yossi Farjoun

    I tried creating a set of sample_sets using this method (with

    membership:sample_set_set_id sample_set

    as column headers)

    and it kept telling me that I need to have "sample" in the second column. I can still create a sample_set_set by running a workflow on a collection of sample_sets, but I wanted to find a way to do this without having to manually select the sets. 

