Workspace data tables (in the Data tab) can help organize and keep track of all project data, no matter where in the cloud the data live. This article explains what workspace data tables are and how to run a workflow analysis on individual samples, groups of samples, or arrays of samples in a table.
Watch an introductory video on data tables here
Understanding where your data are... and are not
When working in the cloud, you're not limited to analyzing data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to a workspace bucket or an external bucket, data available in Terra's Data Library, or data in numerous other repositories. Even better, you can analyze data from many different sources in a single, more robust analysis.
Why use workspace data tables?
A Terra workspace includes built-in spreadsheet-like "tables" that can help with organization as well as with scaling and automating your analysis.
Data "in" a workspace table can be anywhere in the cloud! Data files are not actually "in" the table (or even in the workspace, really...). Tables can include links to the physical locations of the data in the cloud, and keep associated data organized together.
Save storage costs and eliminate copying errors
Being able to use data that's physically located in an external bucket can be especially nice when working with large data files stored in a public bucket, as you do not have to pay to store the original data. By sharing rather than copying the data, you also reduce the risk of copying errors.
Data tables help organize large numbers of data files (e.g. samples)
Imagine trying to keep track of hundreds or thousands of original data files in different buckets, each with its own non-human-readable bucket or DRS URI link. Then imagine keeping the data generated during your analysis associated with the right original data. Tables are designed to help you keep all the data associated with a particular "entity" - whether a sample or participant - together.
The payoff of investing time to set up data tables
Tables do take time to set up. But once set up, you won't have to worry about keeping track of data (original data files and analysis outputs) manually. This built-in organization can be especially useful as studies and analyses get more complex. Tables can include as much information as you need in additional columns. For example, as you do a workflow analysis, you can add output files, keeping original and generated data together in a single row for each unique sample.
- Links to genomic data (FASTQ, CRAM, BAM, VCF, or GVCF files, for example)
- Participant or other ID to associate samples and other data - such as phenotypic data
- Study particulars such as collection dates or techniques
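Columns like these map directly onto a load file: a tab-delimited TSV you can upload on the Data tab. Below is a minimal sketch in Python. The column names, sample IDs, and bucket paths are illustrative, and the `entity:sample_id` header prefix follows Terra's load-file convention for entity tables.

```python
import csv
import io

# Build a minimal "sample" table load file (TSV) in memory.
# All names and paths below are illustrative, not from a real project.
rows = [
    ["entity:sample_id", "participant", "collection_date", "cram_path"],
    ["sample_001", "participant_A", "2021-06-01",
     "gs://my-example-bucket/sample_001.cram"],
    ["sample_002", "participant_B", "2021-06-02",
     "gs://my-example-bucket/sample_002.cram"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
tsv = buffer.getvalue()
print(tsv)
```

Saving this text to a file (e.g. `sample_table.tsv`) and uploading it on the Data tab would create a sample table with one row per unique sample.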
Data tables help keep track of generated data from a workflow analysis
If you've ever run a workflow, you know that the generated data is stored by default in the Workspace bucket, in folders whose names correspond to the workflow submission ID.
If you don't set up your workflow to write outputs to the data table, you will have to navigate down four levels of automatically assigned folders to reach your output file.
Workflow outputs in the Google bucket file folder (random strings of folders)
Contrast this to the same output file in the data table. Note that the link to the generated data is in the same row in the table as the primary data, and associated with a unique collaborator ID.
Workflow outputs in the data table (clear associations)
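Conceptually, writing outputs to the table means attaching a new column value to the row of the entity that produced it. A toy sketch of this idea, with table rows modeled as dicts keyed by entity ID (all IDs and paths are hypothetical):

```python
# Conceptual sketch: a workflow output is recorded in the same row as the
# input it was generated from. Rows are modeled as dicts keyed by the
# entity ID; names and paths are illustrative only.
table = {
    "sample_001": {"cram_path": "gs://my-example-bucket/sample_001.cram"},
    "sample_002": {"cram_path": "gs://my-example-bucket/sample_002.cram"},
}

def record_output(table, sample_id, column, value):
    """Attach a generated file to the row of the entity that produced it."""
    table[sample_id][column] = value

record_output(table, "sample_001", "output_vcf",
              "gs://my-example-bucket/submissions/abc123/sample_001.vcf")
```

The generated file stays findable because it lives in the same row as the primary data it came from, rather than in an anonymous submission folder.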
Data tables make it easier to automate and scale an analysis
When running a workflow analysis, you can manually put in direct paths for the input data or other attributes in the WDL. But it's not a system that works well if you have more than a handful of files. Because workflows can be configured to read and write directly from a data table, workspace data tables can save time and headache in the long run and enable automation of back-to-back workflows.
For example, in the screenshot below, see how selecting the "Run with inputs from the data table" option (1) allows you to run on all 2504 samples in parallel automatically (2):
What does a table look like? What does it contain?
Tables are basically spreadsheets built into your workspace, so a table looks a lot like a spreadsheet. Each table is identified by its "entity" (smallest thing, or piece of input data it contains); each row corresponds to one distinct entity, and each column is a different type of information about that entity.
What's an "entity"? A piece or kind of data
According to the dictionary, an "entity" is "a thing with distinct and independent existence." In Terra, entities are pieces of information - almost like variables - used as input for a workflow analysis.
An entity is the type of primary data stored in a data table. It's also the name of the table in the workspace Data page. You can have tables of sample data (a "sample" entity table) or tissues (a "tissue" table), or unicorns (a "unicorn" table), for example.
Example: sample data in a sample table
This sample table includes genomic data (BAM and BAM index files) of various samples. Note that the first column is each sample's unique ID and the fourth column is the participant ID, also found in the participant table.
What's the minimum and maximum information in a workspace table?
As much information as you need, and at least two columns: an ID column and a single data column.
A participant table is the exception to this, since participant tables can include one column only for participant IDs. You can include additional columns (workflows can add generated files from an analysis, for example), and the data table will keep it all organized in one place.
Data tables aren't limited to data inputs for your workflows!
They are flexible, intended to help organize any information you might need for your study. Additional table columns work much like columns in a spreadsheet: the column header describes what information is in the column, and the cells keep track of the information. Terra accepts any kind of data table, so you can keep track of whatever "entities" you need. The two examples below are fairly common, but not the only, types of tables. You can use the step-by-step instructions further down to create your own table.
Example 1: Genomic data table
A table of genomic data includes (at minimum)
1) the unique ID for each distinct sample or specimen
2) a link to the data file in a Google bucket
(e.g. the "cram_path" column below is a link to a CRAM file in a Google bucket)
The table can include additional columns to organize additional data associated with the sample - for example, additional metadata (such as the data type, or when and how the data were collected).
Example 2: Phenotypic data table
You can often find phenotypic data such as lab results, demographics, and medical records data in a participant table or subject table. The table must include, at minimum, the participant ID (first column) and one type of data. But it can include as many or as few columns of additional data as you need.
How to associate data in different tables
Use a shared attribute (such as the participant_id) to associate a participant's phenotypic data (in the participant table, for example) and genomic data (in the sample table).
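The shared-attribute idea can be sketched in a few lines of Python: phenotypic rows and sample rows are merged on the participant ID. All table contents here are made up for illustration.

```python
# Sketch: associate rows from a hypothetical participant table (phenotypic
# data) with rows from a sample table, using the shared participant ID.
participants = {
    "participant_A": {"age": 43, "bmi": 27.1},
    "participant_B": {"age": 51, "bmi": 24.3},
}
samples = [
    {"sample_id": "sample_001", "participant": "participant_A",
     "cram_path": "gs://my-example-bucket/sample_001.cram"},
    {"sample_id": "sample_002", "participant": "participant_B",
     "cram_path": "gs://my-example-bucket/sample_002.cram"},
]

# Merge each sample row with the phenotypic attributes of its participant.
merged = [{**s, **participants[s["participant"]]} for s in samples]
```

Each merged row now carries both the genomic link and the phenotypic data, keyed by the same participant, which is exactly what the shared attribute buys you.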
Sets and arrays of data in a workflow analysis
In addition to tables of single entities, you can keep track of groups of entities in tables. Set tables have a predefined format and relationship. For example, a sample_set table (entity) is a table of named sets of particular samples. The screenshot below shows what a sample_set table looks like.
Notice that the sample_set table only includes the set names (SampleSet1 and SampleSet2) and which samples are in each set. The sample table includes the actual data. You must have a table of samples before you can have a sample_set table. Thus both the sample_set and sample tables must exist in the workspace.
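A set is described by a membership-style load file that lists which samples belong to which set. The sketch below builds one as a TSV; the set and sample names are made up, and the `membership:sample_set_id` header prefix follows Terra's load-file convention for set tables.

```python
import csv
import io

# Build a sample_set membership load file (TSV). Each row assigns one
# existing sample to a named set; set and sample names are illustrative.
memberships = [
    ["membership:sample_set_id", "sample"],
    ["SampleSet1", "sample_001"],
    ["SampleSet1", "sample_002"],
    ["SampleSet2", "sample_003"],
]

buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(memberships)
tsv = buffer.getvalue()
print(tsv)
```

Note that the membership file carries no data columns at all: the actual data stay in the sample table, which is why the sample table must exist before the set can be loaded.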
Analyzing groups of single samples with sample_set tables
If your workflow analyzes single entities, you can run on the same group of n particular entities again and again using an entity_set table. The workflows will run n times, in parallel, and will output n output files to the appropriate row in the data table.
Practice using sets to analyze groups of single entities
To learn how to create a sample set for this use-case see the Data Tables QuickStart Part 3: Understanding sets of data.
To learn how to configure workflows for this use-case see this article.
Analyzing array inputs from a data table
Practice using arrays as workflow input
To learn how to create a data table for this use-case see the Data Tables QuickStart Part 4.
To see how to configure and run a workflow that accepts arrays as input, see this article.
Special data tables: pairs of tumor-normal samples
A pairs table is a predefined data table tailored for somatic workflows, which have a specific way of handling paired tumor and normal samples taken from the same patient. Somatic workflows that require pairs of tumor and normal samples accept data in pairs tables by default.
Tumor-normal pairs in a data table
You can specify the control and case samples for a particular participant (HCC1141 below) in a pairs data table.
Using the default, predefined entities - participant, sample, and pair - allows Terra to associate the data correctly for somatic workflows.
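As a rough sketch, a pairs load file might look like the TSV below. The `case_sample` and `control_sample` column names are assumptions for illustration (only the participant ID, HCC1141, comes from the example above), while the `entity:pair_id` header prefix follows the same load-file convention as other entity tables.

```python
import csv
import io

# Sketch of a "pair" table load file (TSV). Each row names one tumor
# (case) / normal (control) pair of samples for a participant. Column
# names other than the header prefix are illustrative assumptions.
pairs = [
    ["entity:pair_id", "participant", "case_sample", "control_sample"],
    ["HCC1141_pair", "HCC1141", "HCC1141_tumor", "HCC1141_normal"],
]

buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(pairs)
tsv = buffer.getvalue()
print(tsv)
```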
Practice using tumor-normal pairs as workflow input
To learn how to create a sample set for this use-case see the Data Tables QuickStart Part 3: Understanding sets of data.
To learn how to configure workflows for this use-case see this article.
Additional resources and next steps
Learn how to modify, delete, and create data tables
See this article.
Hands-on practice using and creating data tables
Try the Data QuickStart workspace.
Learn to use data tables for input to a workflow
See How to set up a workflow analysis.