Managing data with workspace tables

Allie Hajian

Workspace data tables (in the Data tab) help you organize and keep track of all your project data, no matter where in the cloud the files live. This article explains what workspace data tables are and how to run a workflow analysis on individual samples, groups of samples, or arrays of samples in a table.

Contents

Understanding where your data are... and are not
Why use workspace data tables?
What does a table look like? What does it contain?
   "Entities" and entity tables, explained
Sets and arrays of data in a workflow analysis
Special data tables: pairs of tumor-normal samples
Additional resources and next steps


Watch an introductory video on data tables here


Understanding where your data are... and are not

One advantage of working in the cloud is that you're not limited to analyzing data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to a workspace or external bucket, data available in Terra's Data Library, or data in numerous other repositories. Even better, you can analyze data from many different sources in a single, more robust analysis.

Data-QuickStart_Part1_Workspace_in_Cloud.png

G0_tip-icon.png


Data "in" a workspace table can be anywhere in the cloud!

 

Data files are not actually "in" the table (or even in the workspace, really...). Tables can include links to the physical locations of the data in the cloud, and keep associated data organized together. 

Save storage costs and eliminate copying errors
Using data that physically lives in an external bucket is especially handy when working with large data files stored in a public bucket, since you don't have to pay to store the original data. Sharing, rather than copying, the data also reduces copying errors.

 

Why use workspace data tables?

Data tables help organize large numbers of samples

Imagine trying to keep track of hundreds or thousands of original data files in different buckets, each with its own non-human-readable bucket path or DRS URI. Then imagine keeping the data generated during your analysis associated with the right original data. Tables are designed to keep all the data associated with a particular "entity" - whether a sample or a participant - together.

The payoff of investing time to set up data tables

Tables do take time to set up. But once they are set up, you won't have to keep track of data (original data files and analysis outputs) manually. This built-in organization can be especially useful as studies and analyses get more complex. Tables can include as much information as you need in additional columns. For example, as you run a workflow analysis, you can add output files, keeping original and generated data together in a single row for each unique sample (see the columns listed below and the example load file that follows):

Understanding-entities_Sample-table_Screen_shot.png

  • Links to genomic data (FASTQ, CRAM, BAM, VCF, or gVCF files, for example) 
  • Participant or other ID to associate samples and other data - such as phenotypic data 
  • Study particulars such as collection dates or techniques 
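
Below is a minimal sketch of what such a table might look like when uploaded as a tab-separated load file. Only the entity:sample_id header format is required by Terra; the other column names, IDs, and file paths here are hypothetical placeholders (and a column like output_vcf would typically be added by a workflow rather than uploaded by hand):

   entity:sample_id   participant   cram_path                     collection_date   output_vcf
   NA12878            P-001         gs://my-bucket/NA12878.cram   2021-06-14        gs://my-workspace-bucket/NA12878.vcf.gz
   NA12879            P-002         gs://my-bucket/NA12879.cram   2021-06-15        gs://my-workspace-bucket/NA12879.vcf.gz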

Data tables help keep track of generated data from a workflow analysis

If you've ever run a workflow, you know that the generated data is stored by default in the workspace bucket. If you don't set up your workflow to read from and write to the data table, you will have to dig down through four levels of automatically assigned folders to get to your output file (top screenshot). Contrast this with the same output file in the data table (bottom screenshot): it sits in the same row as the primary data, associated with a unique collaborator ID.
 

Workflow outputs in the workspace Google bucket (nested folders with random-string names)
Managing-data-with-tables_Generated-data-in-bucket_Screen_shot.png 

Workflow outputs in the data table (clear associations)
Managing-data-with-tables_Generated-data_Screen_shot.png

Data tables make automation easier (one workflow setup, no matter how many input files)

When running a workflow analysis, you can manually enter direct paths for the input data or other attributes in the workflow's inputs. But that approach doesn't work well if you have more than a handful of files. Because workflows can be configured to read from and write directly to a data table, workspace data tables save time and headaches in the long run and enable automation of back-to-back workflows.

For example, in the screenshot below, see how selecting the "Run with inputs from the data table" option (1) allows you to run on all 2504 samples in parallel automatically (2):
Managing-data-with-tables_Automating-workflows.png
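
As a rough sketch of how this looks in a workflow configuration, inputs are read from the selected table rows with "this." expressions and outputs are written back the same way. The workflow and attribute names below are hypothetical placeholders; workspace.ref_fasta shows how a workspace-level attribute can also be referenced:

   MyWorkflow.input_cram:  this.cram_path        (reads the cram_path column of each selected row)
   MyWorkflow.ref_fasta:   workspace.ref_fasta   (reads a workspace-level data attribute)
   MyWorkflow.output_vcf:  this.output_vcf       (writes each output file's path back to that row)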


What does a table look like? What does it contain?

Tables are basically spreadsheets built into your workspace, so a table looks a lot like a spreadsheet. Each table is identified by its "entity" (the smallest unit, or piece of input data, it contains); each row corresponds to one distinct entity, and each column holds a different type of information about that entity.

G0_tip-icon.png


What's an "entity"? A piece or kind of data

  According to the dictionary, an "entity" is "a thing with distinct and independent existence."  In Terra, entities are pieces of information - almost like variables - used as input for a workflow analysis.

An entity is the type of primary data stored in a data table. It's also the name of the table in the workspace Data page. You can have tables of sample data (a "sample" entity table) or tissues (a "tissue" table), or unicorns (a "unicorn" table), for example. 


Example: sample data in a sample table
Unerstanding-entities_Sample-entities-table_Screen_shot.png
This sample table includes genomic data (BAM and BAM index files) of various samples.  Note that the first column is each sample's unique ID and the fourth column is the participant ID, also found in the participant table. 

G0_tip-icon.png


What's the minimum and maximum information in a workspace table? 

 

As much information as you need, and at least two columns: an ID column and a single data column.
A participant table is the exception, since it can include just one column, the participant ID. You can include additional columns (workflows can add generated files from an analysis, for example), and the data table will keep them all organized in one place. 

Data-QuickStart_Part2_New-table.png
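
For reference, a bare-minimum load file (with hypothetical names and path) could be as small as this tab-separated sketch:

   entity:sample_id   bam_path
   NA12878            gs://my-bucket/NA12878.bam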

Data tables aren't limited to data inputs for your workflows!
They are flexible, intended to help organize any information you might need for your study. Additional table columns work much like columns in a spreadsheet: the column header describes what information is in the column, and the cells hold that information. Terra accepts any kind of data table, so you can keep track of whatever "entities" you need. The two examples below are fairly common, but not the only, types of tables. You can use the step-by-step instructions linked further down to create your own table. 

Example 1: Genomic data table

A table of genomic data includes (at minimum): 1) the unique ID for each distinct sample or specimen, and 2) a link to the data file in a Google bucket (for example, the "cram_path" column below links to a CRAM file in a Google bucket).

Data-QuickStart_Part1_sample-table.png
The table can include additional columns to organize additional data associated with the sample - for example, additional metadata (such as the data type, or when and how the data were collected).

G0_tip-icon.png


How to associate data in different tables

  Use a shared attribute (such as the unique participant_id) to associate a participant's genomic data (in a sample or specimen table, for example) and their phenotypic data.

Example 2: Phenotypic data table

Include phenotypic data such as lab results, demographics, and medical records data in a participant table. The table must include, at minimum, the participant ID (first column), but can include as many or as few columns of additional data as you need. 

Data-Quickstart_Part1_participant-table.png
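
A hypothetical participant load file might look like the tab-separated sketch below; only the entity:participant_id header is required, and the phenotypic columns are placeholders for whatever fields your study uses:

   entity:participant_id   age   sex   bmi
   P-001                   42    F     23.1
   P-002                   57    M     27.8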

G0_tip-icon.png


How to associate data in different tables 

  Use a shared attribute (such as the participant_id) to associate a participant's phenotypic data (in the participant table above, for example) and genomic data (in the sample table).


Sets and arrays of data in a workflow analysis

In addition to tables of single entities, you can keep track of groups of entities in set tables. Set tables have a predefined format and a defined relationship to the table of their member entities. For example, a sample_set table is a table of named sets of samples (each row is one sample_set entity). The screenshot below shows what a sample_set table looks like. 

Managing-Data_Sample-set_Screen_shot.png

Notice that the sample_set table only includes the set names (SampleSet1 and SampleSet2) and which samples are in each set. The sample table includes the actual data. You must have a table of samples before you can have a sample_set table, so both the sample_set and sample tables must exist in the workspace.
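
A sample_set like the one above is typically defined with a separate membership load file, with one row per set/member pair. A tab-separated sketch, assuming the member sample IDs (placeholders here) already exist in the sample table:

   membership:sample_set_id   sample
   SampleSet1                 sample_01
   SampleSet1                 sample_02
   SampleSet2                 sample_03
   SampleSet2                 sample_04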

Analyzing groups of single samples with sample_set tables

If your workflow analyzes single entities, you can run on the same group of n particular entities again and again using an entity_set table. The workflow will run n times, in parallel, and write each of the n output files to that entity's row in the data table.

Data-QuickStart_Part3_single_samples_input_as_sets.png
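
One common way to set this up (a sketch only; the QuickStart linked below walks through the details): keep the workflow's root entity as the single entity (sample, for example), point its inputs at table columns, and launch on a set, expanding it to its members:

   Root entity type:   sample
   Example input:      MyWorkflow.input_cram = this.cram_path
   Launch on:          a sample_set, expanded with the expression this.samples

Terra then submits one workflow per member sample, writing each output back to that sample's row.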

 

G0_tip-icon.png


Practice using sets to analyze groups of single entities

 

To learn how to create a sample set for this use case, see the Data Tables QuickStart Part 3: Understanding sets of data

To learn how to configure workflows for this use case, see this article

Analyzing array inputs from a data table

If your workflow takes an array as input, you can use an entity_set (a row in an entity_set table) as the input to the workflow.

Data-QuickStart_Part4_Array-inputs-diagram.png
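
A sketch of what this can look like, assuming the sample table has a cram_path column (both names are placeholders): with the set as the root entity, an input expression that walks from the set to its members gathers an entire column into one array.

   Root entity type:   sample_set
   Array input:        MyWorkflow.input_crams = this.samples.cram_path   (an array of every member's CRAM path)

A single workflow then runs once on the whole array, rather than once per sample.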

G0_tip-icon.png


Practice using arrays as workflow input

 

To learn how to create a data table for this use case, see the Data Tables QuickStart Part 4

To see how to configure and run a workflow that accepts arrays as input, see this article


Special data tables: pairs of tumor-normal samples

A pairs table is a predefined data table tailored for somatic workflows, which have a specific way of handling paired tumor and normal samples taken from the same patient. Somatic workflows that require pairs of tumor and normal samples accept data in pairs tables by default.

Tumor-normal pairs in a data table

You can specify the control and case samples for a particular participant (HCC1141 below) in a pairs data table:
Managing-data-tables_Pair-table_Screen_shot.png
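
As a sketch, a pair load file typically references the case (tumor) and control (normal) samples by their sample IDs, plus the participant they come from; the exact column names your workflow expects can vary, and the sample IDs below are placeholders:

   entity:pair_id   case_sample     control_sample   participant
   HCC1141_pair     HCC1141_tumor   HCC1141_normal   HCC1141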

The sample (i.e. genomic) data files are in the sample data table:
Managing-data-tables_Samples-in-pairs-table_Screen_shot.png

Using the default, predefined entities - participant, sample, and pair - allows Terra to associate the data correctly for somatic workflows.

G0_tip-icon.png


Practice using tumor-normal pairs as workflow input 

 

To learn how to create a sample set for this use case, see the Data Tables QuickStart Part 3: Understanding sets of data

To learn how to configure workflows for this use case, see this article

 

G0_tip-icon.png


Additional resources and next steps 

 

Learn how to modify, delete, and create data tables 
See this article

Hands-on practice using and creating data tables
Try the Data QuickStart workspace

Learn to use data tables for input to a workflow
See this article.

Practice using and creating sets
See the Data-Tables-QuickStart tutorial workspace or these step-by-step instructions.

Learn how to set up your workflow to run on input data in a table
See this article
