Managing data with workspace tables
Workspace data tables (in the Data tab) can help organize and keep track of all project data, no matter where in the cloud they are. This article explains what workspace data tables are and how you can run a workflow analysis on individual samples, groups of samples, or arrays of samples in a table.
Contents
Understanding where your data are... and are not
Why use workspace data tables?
What does a table look like? What does it contain?
"Entities" and entity tables, explained
Using sets and arrays of data in a workflow analysis
Special sets - Tumor-normal sample pairs
Next step! Practice using and creating sets
Watch an introductory video on data tables here
Understanding where your data are... and are not
One advantage of working in the cloud is you're not limited to analyzing data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to a workspace bucket or external bucket, or that's available in Terra's Data Library, or numerous other data repositories. Even better, you can analyze data from many different sources in a single, more robust analysis.
Data files are not actually "in" the table (or even in the workspace, really...). Tables can include links to the physical locations of the data in the cloud, and keep associated data organized together. This saves storage costs and eliminates copying errors.
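To make the "links, not files" idea concrete, here is a minimal Python sketch. The rows, bucket names, and the `bucket_of` helper are all hypothetical; the point is only that each cell holds a cloud path (a string), so one table can reference data stored in different buckets.

```python
# Hypothetical rows: each cell stores a cloud path (a string), not the file itself.
sample_table = [
    {"sample_id": "NA12878", "cram": "gs://example-bucket/NA12878.cram"},
    {"sample_id": "NA12891", "cram": "gs://another-bucket/data/NA12891.cram"},
]


def bucket_of(uri: str) -> str:
    """Return the bucket name from a gs:// path."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// path: {uri}")
    return uri[len(prefix):].split("/", 1)[0]


# One table, two different storage locations; nothing was copied.
buckets = {row["sample_id"]: bucket_of(row["cram"]) for row in sample_table}
print(buckets)
```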
Why use workspace data tables?
Data tables help organize large numbers of samples
The payoff of investing time to set up data tables
Tables do take time to set up. But once they are set up, you won't have to keep track of data (original data files and analysis outputs) manually. This built-in organization can be especially useful as studies and analyses get more complex. Tables can include as much information as you need in additional columns. For example, as you run a workflow analysis, you can add output files, keeping original and generated data together in a single row for each unique sample. Columns can hold:
- Links to genomic data (FASTQ, CRAM, BAM, VCF, GVCF files, for example)
- Participant or other ID to associate samples and other data - such as phenotypic data
- Study particulars such as collection dates or techniques
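As a rough illustration of keeping original and generated data in one row, the Python sketch below adds an output column to a sample row. The column names, paths, and the `record_output` helper are hypothetical; in practice you would typically let the workflow write its outputs back to the table rather than do this by hand.

```python
def record_output(row: dict, column: str, uri: str) -> dict:
    """Return a copy of the row with one more column holding an output link."""
    updated = dict(row)
    updated[column] = uri
    return updated


# Original data for one sample (hypothetical path).
row = {"sample_id": "sample_01", "cram": "gs://example-bucket/sample_01.cram"}

# After an analysis, the generated file is linked from the same row.
row = record_output(row, "gvcf", "gs://example-bucket/outputs/sample_01.g.vcf.gz")
print(sorted(row))
```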
Data tables help keep track of generated data from a workflow analysis
Workflow outputs in the Google bucket file folder (random strings of folders)
Workflow outputs in the data table (clear associations)

Data tables make automation easier (one workflow setup, no matter how many input files)
For example, in the screenshot below, see how selecting the "Run with inputs from the data table" option (1) allows you to run on all 2504 samples in parallel automatically (2):
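In code terms, "one workflow setup, many rows" can be sketched as follows. A single input mapping, written in the `this.<column>` style used to point workflow inputs at table columns, is resolved once per row. The table contents, input names, and the `resolve_inputs` helper are hypothetical; this is an illustration of the idea, not Terra's implementation.

```python
def resolve_inputs(config: dict, row: dict) -> dict:
    """Replace each 'this.<column>' expression with the value from one row."""
    resolved = {}
    for input_name, expression in config.items():
        if expression.startswith("this."):
            resolved[input_name] = row[expression[len("this."):]]
        else:
            resolved[input_name] = expression  # literal value, used as-is
    return resolved


# One configuration, written once.
config = {"input_bam": "this.bam", "input_bai": "this.bam_index", "ref_name": "hg38"}

# Hypothetical table rows; a real table could have thousands.
table = [
    {"sample_id": "s1", "bam": "gs://example-bucket/s1.bam", "bam_index": "gs://example-bucket/s1.bai"},
    {"sample_id": "s2", "bam": "gs://example-bucket/s2.bam", "bam_index": "gs://example-bucket/s2.bai"},
]

# One set of resolved inputs per row, each of which could run independently.
jobs = [resolve_inputs(config, row) for row in table]
print(len(jobs))
```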
What does a table look like? What does it contain?
Tables are basically spreadsheets built into your workspace, so a table looks a lot like a spreadsheet. Each table is identified by its "entity" (smallest thing, or piece of input data it contains); each row corresponds to one distinct entity, and each column is a different type of information about that entity.
According to the dictionary, an "entity" is "a thing with distinct and independent existence." In Terra, entities are pieces of information - almost like variables - used as input for a workflow analysis.
An entity is the type of primary data stored in a data table. It's also the name of the table in the workspace Data page. You can have tables of sample data (a "sample" entity table) or tissues (a "tissue" table), or unicorns (a "unicorn" table), for example.
Example: sample data in a sample table

This sample table includes genomic data (BAM and BAM index files) of various samples. Note that the first column is each sample's unique ID and the fourth column is the participant ID, also found in the participant table.
A table can contain as much information as you need, and has at least two columns: an ID column and a single data column. Data tables aren't limited to data inputs for your workflows!
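A minimal two-column table can be sketched as a tab-separated file built in Python. The sample names and bucket paths are made up, and the `entity:sample_id` header follows the convention used in Terra's TSV load files; treat the exact header as an assumption and check it against a .tsv downloaded from an existing workspace.

```python
import csv
import io

# Hypothetical samples: one ID column plus a single data column.
rows = [
    ("sample_01", "gs://example-bucket/sample_01.bam"),
    ("sample_02", "gs://example-bucket/sample_02.bam"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["entity:sample_id", "bam"])  # assumed header convention
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text)
```

More columns (metadata, output files) can be added by extending the header row and each data row in step.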
Example 1: Genomic data table
The table can include additional columns to organize additional data associated with the sample - for example, additional metadata (such as the data type, or when and how the data were collected).
Use a shared attribute (such as the unique participant_id) to associate a participant's genomic data (in a sample or specimen table, for example) and their phenotypic data.
Example 2: Phenotypic data table
Use a shared attribute (such as the participant_id) to associate a participant's phenotypic data (in the participant table below, for example) and genomic data (in the sample table).
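The association through a shared participant_id can be sketched in Python as a simple lookup. The participant and sample records below, and the `with_phenotype` helper, are hypothetical and only illustrate how a shared ID lets the two tables be read together.

```python
# Hypothetical participant table, keyed by participant_id (phenotypic data).
participants = {
    "P001": {"age": 54, "sex": "F"},
    "P002": {"age": 61, "sex": "M"},
}

# Hypothetical sample table (genomic data); each row names its participant.
samples = [
    {"sample_id": "S1", "participant_id": "P001", "bam": "gs://example-bucket/S1.bam"},
    {"sample_id": "S2", "participant_id": "P002", "bam": "gs://example-bucket/S2.bam"},
]


def with_phenotype(sample: dict, participants: dict) -> dict:
    """Combine one sample row with its participant's phenotypic data."""
    merged = dict(sample)
    merged.update(participants[sample["participant_id"]])
    return merged


joined = [with_phenotype(sample, participants) for sample in samples]
print(joined[0])
```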
Sets and arrays of data in a workflow analysis
In addition to tables of single entities, you can keep track of groups of entities in tables. Set tables have a predefined format and relationship. For example, a sample_set table (entity) is a table of named sets of particular samples. The screenshot below shows what a sample_set table looks like.
Notice that the sample_set table only includes the set names (SampleSet1 and SampleSet2) and which samples are in each set. The sample table includes the actual data. You must have a table of samples before you can have a sample_set table. Thus both the sample_set and sample tables must exist in the workspace.
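The relationship between the two tables can be sketched in Python: the set table holds only names and memberships, and every member must already exist in the sample table. The sample IDs, paths, and the `build_sets` helper are hypothetical.

```python
# Hypothetical sample table: this is where the actual data links live.
samples = {
    "sample_01": {"bam": "gs://example-bucket/sample_01.bam"},
    "sample_02": {"bam": "gs://example-bucket/sample_02.bam"},
    "sample_03": {"bam": "gs://example-bucket/sample_03.bam"},
}

# Hypothetical membership rows: (set name, sample ID). No data paths here.
membership = [
    ("SampleSet1", "sample_01"),
    ("SampleSet1", "sample_02"),
    ("SampleSet2", "sample_03"),
]


def build_sets(membership, samples):
    """Group sample IDs by set name, rejecting IDs missing from the sample table."""
    sets = {}
    for set_name, sample_id in membership:
        if sample_id not in samples:
            raise KeyError(f"{sample_id} is not in the sample table")
        sets.setdefault(set_name, []).append(sample_id)
    return sets


sample_sets = build_sets(membership, samples)
print(sample_sets)
```

The check inside `build_sets` mirrors the rule above: a set can only refer to samples that are already in the sample table.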
Analyzing groups of single samples with sample sets tables
To learn how to create a sample set for this use-case see the Data Tables QuickStart Part 3: Understanding sets of data
To learn how to configure workflows for this use-case see this article
Analyzing array inputs from a data table
To learn how to create a data table for this use-case see the Data Tables QuickStart Part 4
To see how to configure and run a workflow that accepts arrays as input, see this article
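An array-valued cell can be pictured as a bracketed, comma-separated list of items. The Python sketch below turns a delimited string into that form before a table is uploaded. The `to_array_cell` helper is hypothetical, and whether the table labels the result as an array type on upload is not something this sketch establishes.

```python
import json


def to_array_cell(delimited: str, sep: str = ",") -> str:
    """Turn 'a,b,c' into the bracketed text '["a", "b", "c"]'."""
    items = [item.strip() for item in delimited.split(sep) if item.strip()]
    return json.dumps(items)


cell = to_array_cell("item1,item2,item3")
print(cell)
```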
Special data tables: pairs of tumor-normal samples
A pairs table is a predefined data table tailored for somatic workflows, which have a specific way of handling paired tumor and normal samples taken from the same patient. Somatic workflows that require pairs of tumor and normal samples accept data in pairs tables by default.
Tumor-normal pairs in a data table

The sample (i.e. genomic) data files are in the sample data table:
Using the default, predefined entities - participant, sample, and pair - allows Terra to associate the data correctly for somatic workflows.
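How the three entities fit together can be sketched in Python: each pair row names one participant and points at that participant's tumor and normal rows in the sample table. The sample IDs, the `type` field, the `make_pairs` helper, and the column names `case_sample` and `control_sample` are assumptions made for illustration; check them against the pairs table in a somatic workspace before relying on them.

```python
# Hypothetical sample table: each sample belongs to one participant.
samples = {
    "P001_T": {"participant": "P001", "type": "tumor"},
    "P001_N": {"participant": "P001", "type": "normal"},
    "P002_T": {"participant": "P002", "type": "tumor"},
}


def make_pairs(samples: dict) -> list:
    """Build one pair row per participant that has both a tumor and a normal sample."""
    by_participant = {}
    for sample_id, info in samples.items():
        by_participant.setdefault(info["participant"], {})[info["type"]] = sample_id

    pairs = []
    for participant, group in sorted(by_participant.items()):
        if "tumor" in group and "normal" in group:
            pairs.append({
                "pair_id": f"{participant}_pair",
                "participant": participant,
                "case_sample": group["tumor"],      # assumed column name
                "control_sample": group["normal"],  # assumed column name
            })
    return pairs


pairs = make_pairs(samples)
print(pairs)
```

Participant P002 has no normal sample in this made-up data, so no pair row is created for it.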
To learn how to create a sample set for this use-case see the Data Tables QuickStart Part 3: Understanding sets of data
To learn how to configure workflows for this use-case see this article
- Learn how to modify, delete, and create data tables
- Hands-on practice using and creating data tables
- Learn to use data tables for input to a workflow
- Practice using and creating sets
- Learn how to set up your workflow to run on input data in a table
Comments
Is there an explanation of how to set up a workspace for cancer genomics analysis when the BAM.BAI files are already uploaded to a Google bucket? Is the order of tables still important?
Hi Pazpolak,
You would set up the workspace in much the same way, adding a column to the .tsv for the BAM.BAI with the associated paths. For an example, download the .tsv from this Featured Workspace here: https://app.terra.bio/#workspaces/help-gatk/Somatic-CNVs-GATK4/data
The order of table upload is still important if you would like them linked/nested. I recommend reviewing the contents of this article Understanding Entity Types for more details.
If you have any further questions, please let us know.
Kind regards,
Jason
Is there any way to automatically parse a .tsv cell that contains a delimiter into an Array? For example, turn text ["item1,item2,item3"] into an Array[String] = ["item1", "item2", "item3"]? I realize I can change the type once the .tsv text is uploaded, but then the text becomes one item and I have to do a lot of horizontal scrolling to cut and paste into new slots. It would be more convenient to have a feature like this at upload time.
James Gatter - I don't think there is a way in the UI. But take a look at the Data-Tables-QuickStart and see if it leads you in a fruitful direction (https://support.terra.bio/hc/en-us/articles/360047611871 - making sets in data tables and https://support.terra.bio/hc/en-us/articles/360047621171 running workflows that take arrays as input).
And if that doesn't work for you, you can always file a feature request at https://support.terra.bio/hc/en-us/community/topics/360000500452-Feature-Requests.
Thanks Allie! I found that even if the Arrays aren't marked as Array types in the data table, Cromwell will still accept them so long as they are surrounded by square brackets and delimited by commas. Not an issue after all. I might still put it in for a feature request since it would make visualizing large arrays in Terra nicer.
James Gatter I'm glad it worked out for you. Definitely submit the feature request - we're always looking for the best ways to make visualizing things in Terra more intuitive. Happy analyzing in the cloud on Terra!