Data Tables QuickStart Part 3: Understanding sets of data
In Parts 1 and 2, you practiced using tables to run a workflow on single entities. What if you need to run a workflow analysis on the same group of single entities again and again? Selecting the exact same group every time you run a workflow by clicking each row in a table - one at a time - will get tedious very quickly. And if your dataset is big, you're likely to make mistakes when trying to choose the exact same subset every time.
Thankfully, Terra has a built-in way to group single entities together - sets. In this part, we'll introduce the idea of sets, and work through two ways to create tables of sets to group data and make it easy to run together.
Contents
Overview: Running a workflow on groups of single entities
Exercise 3.1. Create a set of data the easy way: run a workflow on a subset
Exercise 3.2. Make a set from scratch in a spreadsheet editor and upload
Exercise 3.3. Set up and run a workflow on a set of single entities
Overview: Running a workflow on groups of single entities
Say you have hundreds of samples in your data table, but you want to do some testing with just three samples. You want to use the same three samples every time, and they're sprinkled throughout the sample table. The output will be a single data file for each input:
Opening the table and selecting the right three samples each time you do a test run is tedious at best and prone to errors at worst. Running on a set of exactly the samples you want is a much more elegant solution. For WDLs that accept single entities as input, using sets as input just means running the workflow multiple times independently (i.e. in parallel) on each entity in the set. The output will be a distinct output file for each entity, written to the original entity table:
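As a rough mental model only (this is not how Terra is implemented), running a single-entity workflow on a set behaves like mapping one function over every member of the set, with each call independent of the others. In the Python sketch below, `run_workflow` and the specimen IDs are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(specimen_id: str) -> str:
    """Hypothetical stand-in for a workflow that takes ONE entity as input."""
    return f"{specimen_id}.output.txt"


def run_on_set(members: list[str]) -> dict[str, str]:
    """Run the single-entity workflow once per set member, in parallel."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of the inputs.
        outputs = list(pool.map(run_workflow, members))
    # One distinct output per entity, keyed by the entity's own ID,
    # mirroring how outputs are written to the entity table, not the set.
    return dict(zip(members, outputs))


if __name__ == "__main__":
    test_set = ["specimen_003", "specimen_117", "specimen_542"]  # made-up IDs
    for specimen_id, output in run_on_set(test_set).items():
        print(specimen_id, "->", output)
```

The set only decides which entities are processed; the workflow itself still sees one entity at a time.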
Exercise 3.1. Create a set the easy way - run a workflow on a data subset
If you choose a subset of input data and run a workflow on it, Terra will automatically generate a set of the entities you ran on. Let's see how that works.
Step-by-step instructions
2. In Step 1, select the root entity type "specimen" from the dropdown. Note that you're still running the workflow on a single entity, but running it many times (in parallel) on a set of entities.
3. In Step 2, click the "Select Data" button to go to the workflow configuration form. This is where you will select the data that will form your set.
4. In the "Select Data" form
4.1. Choose the "Choose specific rows to process" radio button
4.2. Select two or three specimens that you want to group together
4.3. Name the set (at the bottom of the form)
4.4. Save the selection by clicking "OK"
5. Run the workflow by selecting the "Run Analysis" button and following the prompts
Examine the generated "specimen_set" table
Once your specimens have been processed successfully, take a look again at the workspace Data page.
You'll notice an additional table, `specimen_set`. The name tells you it's a table of sets of specimens. If you click to expand it, you'll see it includes one set (with whatever name you gave it) of however many specimens you chose. Clicking on the "Specimens" column, you will see the sample IDs of the samples you selected to run together:
Where's the output data?
You'll see that the specimen_set table only includes the name of the set (in the ID column) and the specimen_IDs of the samples in the set.
The generated output data are actually in the table corresponding to the root entity type, which is the specimen table:
Exercise 3.2. Make a set in a table from scratch
As with other tables, it's possible to use a spreadsheet editor to make an entity_set table for any entity in Terra. The load file that defines a set has two columns: the ID column (with a unique name for each set) and the entity column (with the unique ID of every entity in the set). There is a separate row for each entity in the set.
1. To start, open a blank sheet or page in your spreadsheet editor
2. Fill in the header row
The first column is the ID column, the unique name of the set. The format is "membership:your-entity-name_set_id". The "membership:" prefix and the "_set_id" suffix are required exactly as typed. The entity-name is the name of whatever entity you are grouping together. For the Data-QuickStart it will be specimen, but if your workflow processes samples, it could be sample.
Notice that the second column must be
Entity column
The second column is the entity you're grouping into sets. The header must match the first column header of the table your workflow will take its single inputs from.
What the header row will look like
After filling in the headers, your specimen_set tsv file will look like this:
3. Fill in the entity_set rows
4. Once you have finished filling in the spreadsheet, save the file as "tab-delimited text".
5. Upload to your workspace by clicking the "+" icon at the top of the left column in the Data page.
You should see a new row in the "specimen_set" table that corresponds to the load file.
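If you would rather generate the load file than type it by hand, the same two-column layout can be sketched in Python with the standard `csv` module. This is a minimal illustration under assumptions: the function name, the set name, and the specimen IDs are hypothetical, and the entity column header is passed in as a parameter because it has to match whatever your own entity table requires.

```python
import csv
import io


def build_set_load_file(entity: str, entity_column: str,
                        sets: dict[str, list[str]]) -> str:
    """Return tab-delimited text with one row per (set, member) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header row: the set ID column, then the entity column.
    writer.writerow([f"membership:{entity}_set_id", entity_column])
    for set_name, members in sets.items():
        for member_id in members:
            # A separate row for each entity in the set.
            writer.writerow([set_name, member_id])
    return buffer.getvalue()


if __name__ == "__main__":
    text = build_set_load_file(
        entity="specimen",
        entity_column="specimen",  # assumed header; adjust to your table
        sets={"my_test_set": ["specimen_003", "specimen_117"]},  # made-up IDs
    )
    with open("specimen_set.tsv", "w", newline="") as handle:
        handle.write(text)
```

Writing through `csv.writer` with a tab delimiter avoids stray spaces or commas that a hand-edited file can pick up.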
Tips on adding additional sets
For example, let's say we wanted to add a set of all the human specimens. The spreadsheet would look like this:
After uploading the tsv file above to the workspace, the specimen_set table will look like the screenshot below:
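To see how several sets in one load file end up as one row per set, here is a small Python sketch that groups the membership rows by set name. It only models the grouping; the function name, set names, and IDs are hypothetical, and nothing here calls Terra.

```python
import csv
import io
from collections import defaultdict


def group_members_by_set(load_file_text: str) -> dict[str, list[str]]:
    """Collapse (set, member) rows into one list of members per set."""
    reader = csv.reader(io.StringIO(load_file_text), delimiter="\t")
    next(reader, None)  # skip the header row, if there is one
    sets: dict[str, list[str]] = defaultdict(list)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete lines
        set_name, member_id = row[0], row[1]
        sets[set_name].append(member_id)
    return dict(sets)


if __name__ == "__main__":
    # Made-up example: an existing test set plus a new set of human specimens.
    example = (
        "membership:specimen_set_id\tspecimen\n"
        "my_test_set\tspecimen_003\n"
        "my_test_set\tspecimen_117\n"
        "human_specimens\tspecimen_003\n"
        "human_specimens\tspecimen_542\n"
    )
    for set_name, members in group_members_by_set(example).items():
        print(set_name, members)
```

Note that the same entity can belong to more than one set, since each row only records one membership.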
(For data tables that reference attributes in other tables) If there is a reference from entity B to entity A, then you must upload entity A first. For example, because the specimen_set references the specimens, you would upload the specimen table before the specimen_set table.
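Because a set can only reference entities that already exist, it can help to check a set definition against the entity IDs before uploading. The following Python sketch is one hypothetical way to do that check locally; the function name and all IDs are made up for illustration.

```python
def find_missing_members(entity_ids: set[str],
                         sets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return, per set, the member IDs that are absent from the entity table."""
    missing: dict[str, list[str]] = {}
    for set_name, members in sets.items():
        absent = [member for member in members if member not in entity_ids]
        if absent:
            missing[set_name] = absent
    return missing


if __name__ == "__main__":
    specimen_ids = {"specimen_003", "specimen_117"}  # IDs already uploaded
    planned_sets = {"my_test_set": ["specimen_003", "specimen_999"]}
    problems = find_missing_members(specimen_ids, planned_sets)
    if problems:
        print("Upload these entities first:", problems)
    else:
        print("All set members exist in the entity table.")
```

An empty result means every member of every set is already present in the entity table.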
Exercise 3.3. Set up and run a workflow on a set of single entities
Think of your set as a means of grouping individual inputs. The workflow runs exactly like it did on a single sample entity, but on all the members of the set simultaneously.
The specimen_set table doesn't actually include the specimen data files the workflow uses as input. How do you tell the workflow where to go for the data? We'll demonstrate by running the 1-Single-Input-Workflow on the set you just created.
How to configure the workflow (step-by-step)
Step 1 - Set the `root entity type`
The workflow is taking single entities as input and processing them in parallel, so the root entity type is still `specimen`.
Note that the formatting for writing the input and output to the data table is exactly the same as in the single-entity case (screenshot below).
What to expect
Once you launch your workflow, you will see the status of all the specimens you ran in your set. When they are done, you'll get a number of satisfying green checks:
And when you go back to the Data page, you will see that each specimen that was in your set has additional columns of links to output data:
✔ Congrats!
Next up - Part 4: WDLs that take sets (arrays) of entities as input