Data Tables Quickstart Part 3: Understanding sets of data

Allie Hajian

In Parts 1 and 2, you've practiced using tables to run a workflow on single entities. What if you need to run a workflow analysis on the same group of single entities again and again? Selecting the exact same group every time you run a workflow by clicking each row in a table - one at a time - will get tedious very quickly. And if your dataset is big, you're likely to make mistakes when trying to choose the exact same subset every time. 

Thankfully, Terra has a built-in way to group single entities together - sets. In this part, we'll introduce the idea of sets, and work through two ways to create tables of sets to group data and make it easy to run together. 

Overview: Running a workflow on groups of single entities

Say you have hundreds of samples in your data table, but you want to do some testing with just three samples. You want to use the same three samples every time, and they're sprinkled throughout the sample table. The output will be a single data file for each input: 

Data-QuickStart_Part3_Many-single-inputs.png

Problem: Opening the table and selecting the right three samples each time you do a test run is tedious at best and prone to errors at worst.

Solution: Defining a set of the samples you want and running the workflow on the set is much more elegant. For WDLs that accept single entities as input, using a set as input just means the workflow runs multiple times in parallel - once on each entity in the set. The result is a distinct output file for each entity, written to the original entity table:

Data-QuicStart_Part3_single_samples_input_as_sets.png
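
As a rough mental model (not Terra's actual implementation), running a single-entity workflow on a set behaves like mapping one function over every member of the set in parallel. The sketch below uses only Python's standard library; `run_workflow`, the sample IDs, and the output file names are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(entity_id: str) -> str:
    """Hypothetical stand-in for a workflow that takes ONE entity as input."""
    return f"{entity_id}.output.txt"  # one output file per input entity


def run_on_set(member_ids: list[str]) -> dict[str, str]:
    """Run the single-entity workflow once per set member, in parallel."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as member_ids
        outputs = list(pool.map(run_workflow, member_ids))
    return dict(zip(member_ids, outputs))


# Hypothetical IDs for the three test samples
test_set = ["sample_007", "sample_112", "sample_350"]
results = run_on_set(test_set)
```

The key point is that the set only decides which entities are processed; each entity still gets its own run and its own output.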

Exercise 3.1. Create a set the easy way - run a workflow on a data subset

Terra will automatically generate a set when you run a workflow on more than one entity. Let's see how that works.

Step-by-step instructions

1. Go to the Workflows page and select the 1-Single-Input-Workflow WDL. 

2. In Step 1, select the root entity type "specimen" from the dropdown. Note that the workflow still takes a single entity as input; it simply runs many times (in parallel), once for each entity in the set.

3. In Step 2, click the "Select Data" button to go to the workflow configuration form. This is where you will select the data that will form your set.  

4. In the "Select Data" form:
        4.1. Choose the "Choose specific rows to process" radio button
        4.2. Select two or three specimens that you want to group together
        4.3. Name the set (at the bottom of the form)
        4.4. Save the selection by clicking "OK"

Data-QuickStart-Part3_Select-specimens-to-process_Screen_Shot.png

5. Run the workflow by selecting the "Run Analysis" button and following the prompts

Examine the generated "specimen_set" table

Once your specimens have been processed successfully, take a look again at the workspace Data page.

You'll notice an additional table, `specimen_set`. The name tells you it's a table of sets of specimens. If you click to expand it, you'll see it includes one set (with whatever name you gave it) containing however many specimens you chose. Clicking on the "Specimens" column, you will see the IDs of the specimens you selected to run together:
Data-QuickStart-Part3_Generated-set-table.png

Where's the output data?

The specimen_set table only includes the name of the set (in the ID column) and the specimen_IDs of the samples in the set.

The generated output data are actually in the table corresponding to the root entity type, which is the specimen table:
Data-QuickStart-Part3_Output-files_Screen_Shot.png
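
To make the relationship between the two tables concrete, here is a minimal sketch that models them as Python dictionaries. The set name, specimen IDs, and the `output_file` column are all hypothetical; the point is only that the set table stores membership, while outputs are looked up in the specimen table.

```python
# Hypothetical contents of the two tables after a successful run.
# The set table holds only the set ID and the IDs of its members.
specimen_set = {
    "my_test_set": ["specimen_A", "specimen_B"],
}

# The root entity table holds the per-specimen output columns.
specimen = {
    "specimen_A": {"output_file": "specimen_A.out"},
    "specimen_B": {"output_file": "specimen_B.out"},
    "specimen_C": {},  # not in the set, so it has no output yet
}


def outputs_for_set(set_id: str) -> dict:
    """Follow the set's member IDs into the specimen table to find outputs."""
    return {
        member_id: specimen[member_id].get("output_file")
        for member_id in specimen_set[set_id]
    }


found = outputs_for_set("my_test_set")
```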

Exercise 3.2. Make a table for a set of data from scratch 

As with other tables, it's possible to use a spreadsheet editor to make an entity_set table for any entity in Terra. The load file that defines a set has two columns: the ID column (with the unique name of each set) and the entity column (with the unique ID of every entity in the set). There is a separate row for each entity in the set.

1. To start, open a blank sheet or page in your spreadsheet editor

2. Fill in the header row

ID (first column) 
The first column is the ID column, which holds the unique name of each set. The header format is "membership:your-entity-name_set_id". The parts "membership:" and "_set_id" are required exactly as typed. The entity name is the name of the entity you are grouping together. For the Data-Tables-Quickstart it will be specimen (giving the header "membership:specimen_set_id"), but if your workflow processes samples, it would be sample.

Terra requires a particular format for the set_ID column header!

membership:your-entity-name_set_id

Note that the second column header must be the entity name (i.e. it must match the entity name used in the first column header).


Entity column
The second column is the entity you're grouping into sets. The header must match the first column header of the table your workflow will take its single inputs from.

What the header row will look like
After filling in the headers, your specimen_set tsv file will look like this:

Screen_Shot_2021-03-29_at_2.22.04_PM.png

3. Fill in the entity_set rows

Next you'll add information about the sets - the set names and which entities are in each set. Each entity in a set has its own row in the load (tsv) file. Let's first make a set that includes the human v3 specimens (i.e. `pbmc_human_v3_lane1` and `pbmc_human_v3_lane2`) and call it `humans_v3`. The spreadsheet would look like this:

Screen_Shot_2021-03-29_at_2.23.59_PM.png

4. Save the file as "tab-delimited text" once you have finished filling in the spreadsheet.

5. Upload the file to your workspace by clicking the "+" icon at the top of the left column in the Data page.

What to expect

 

You should see a new row in the "specimen_set" table that corresponds to the load file (tsv) you just made and uploaded (see screenshot below).

Data-QuickStart-Part3_New-set-added.png
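
If you would rather generate the load file with a script than a spreadsheet editor, the steps above can be sketched in Python as follows. This is a minimal sketch, assuming the entity name is `specimen`; the function name `write_set_load_file` and the file name `specimen_set.tsv` are hypothetical choices for illustration.

```python
import csv


def write_set_load_file(path, entity_name, sets):
    """Write a tab-delimited set load file with one row per (set, member) pair.

    `sets` maps each set name to the list of entity IDs it contains.
    """
    # Header format described above: membership:<entity-name>_set_id, <entity-name>
    header = [f"membership:{entity_name}_set_id", entity_name]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for set_id, member_ids in sets.items():
            for member_id in member_ids:
                writer.writerow([set_id, member_id])


# The humans_v3 set from step 3
write_set_load_file(
    "specimen_set.tsv",  # hypothetical file name
    "specimen",
    {"humans_v3": ["pbmc_human_v3_lane1", "pbmc_human_v3_lane2"]},
)
```

The resulting file can then be uploaded from the Data page in the same way as a file saved from a spreadsheet editor.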

Tips on adding additional sets

Note that you can add additional sets of different groups of specimens by adding to the same spreadsheet load file (tsv).

For example, let's say we wanted to add a set of all the human specimens. The spreadsheet would look like this:

Data-QuickStart-Part3_Specimen-set-in-spreadsheet_Screen_shot.png

After uploading the tsv file above to the workspace, the specimen_set table will look like the screenshot below (note that it now includes an additional set, "human_all"):
Data-QuickStart-Part3_Three-specimen_sets.png
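
One way to see how many rows in a load file collapse into one row per set in the table is to group the rows by set name. The sketch below does this in Python and also checks the first header against the required format. The function name `group_sets` is hypothetical, and the third member listed for `human_all` is a placeholder ID, not one taken from the Quickstart workspace.

```python
import csv
import io
from collections import defaultdict


def group_sets(tsv_text):
    """Group the rows of a set load file into {set name: [member IDs]}."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    first = header[0]
    if not (first.startswith("membership:") and first.endswith("_set_id")):
        raise ValueError(f"unexpected first column header: {first!r}")
    sets = defaultdict(list)
    for row in reader:
        if not row:  # skip blank lines
            continue
        set_id, member_id = row
        sets[set_id].append(member_id)
    return dict(sets)


example = (
    "membership:specimen_set_id\tspecimen\n"
    "humans_v3\tpbmc_human_v3_lane1\n"
    "humans_v3\tpbmc_human_v3_lane2\n"
    "human_all\tpbmc_human_v3_lane1\n"
    "human_all\tpbmc_human_v3_lane2\n"
    "human_all\tplaceholder_human_specimen\n"  # hypothetical member ID
)
grouped = group_sets(example)
```

Note that the same specimen can appear in more than one set; each membership is simply another row in the file.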

Required tsv upload order

 

(for data tables that reference attributes in other tables)

If there is a reference from entity B to entity A, then you must upload entity A first. In other words, Terra must already know of any entities in additional columns.

For example, because the specimen_set references the specimens, you would upload the specimens tsv file before the specimen_set tsv file. Note that in this exercise, the specimens table already existed.
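
The upload-order rule is a dependency ordering: a table must be uploaded after every table it references. As an illustration only (this is not a Terra feature), the order can be worked out with the `graphlib` module in Python's standard library (Python 3.9+). The `references` mapping below is a hand-written assumption describing this exercise's two tables.

```python
from graphlib import TopologicalSorter

# Map each table to the set of tables it references.
# specimen_set refers to specimen IDs, so specimen must exist first.
references = {
    "specimen": set(),
    "specimen_set": {"specimen"},
}

# static_order() yields referenced tables before the tables that reference them.
upload_order = list(TopologicalSorter(references).static_order())
```

Here `upload_order` places `specimen` before `specimen_set`, matching the rule above.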


Exercise 3.3. Set up and run a workflow on a set of single entities

Think of your set as a means of grouping individual inputs. The workflow runs exactly like it did on a single sample entity, but on all the members of the set simultaneously. 

The specimen_set table doesn't actually include the specimen data files the workflow uses as input. How do you tell the workflow where to go for the data? We'll demonstrate by running the 1-Single-Input-Workflow on the set you just created.

How to configure the workflow (step-by-step)

Start from the Workflow page and select `1-Single-Input-Workflow` 

Step 1 - Set the `root entity type`

The workflow takes single entities as input and processes them in parallel, so the root entity type is still `specimens`.
Data-QuickStart-Part3_Run-on-set-configure-step-1.png

Then click on the LAUNCH button to submit your workflow to the cloud.
 
Using the data table for inputs and outputs 

Note that the formatting for writing the input and output to the data table is exactly the same as in the single-entity case (screenshot below).

Data-QuickStart-Part3_WDL_configuration_Screen_shot.png


What to expect

Once you launch your workflow, you will see the status of all the specimens you ran in your set. When they are done, you'll get a number of satisfying green checks:
Data-QuickStart-Part3_Successful-set-run_Screen_Shot.png

And when you go back to the Data page, you will see that each specimen that was in your set has additional columns of links to output data:
Data-QuickStart-Part3_Specimen-table-after-successful-set-run_Screen_Shot.png

Congrats!
You've completed Part 3 of the Data Tables Quickstart

 

Next up - Part 4: WDLs that take sets (arrays) of entities as input
