Managing data with tables

Allie Hajian

Workspace data tables (in the Terra Data page) can help you organize and keep track of all your project data, no matter where in the cloud the data are stored. This article explains what workspace data tables are and how they can help with your downstream analysis.

Watch an introductory video on data tables here

Understanding where your data are... and are not

When working in the cloud, you're not limited to analyzing data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to a workspace bucket or external bucket, or that's available in Terra's Data Library or numerous other data repositories. Even better, you can analyze data from many different sources in a single, more robust analysis.  

Data-QuickStart_Part1_Workspace_in_Cloud.png

Why use data tables?

 A Terra workspace includes built-in spreadsheet-like "tables" that can help organize data stored in different cloud locations and enable you to easily scale and automate your analysis.

Data "in" a workspace table can be anywhere in the cloud! Data files are not actually "in" the table (or even in the workspace, really...). Tables can include links to the physical locations of the data in the cloud, and keep associated data organized together. This saves storage costs and eliminates copying errors.
Using data that physically reside in an external bucket is especially useful for large data files stored in a public bucket, because you do not have to pay to store the original data.

Organize large amounts of data (e.g., samples)

Imagine trying to keep track of hundreds or thousands of original data files in different buckets, each with its own non-human-readable bucket path or DRS URI. Then imagine keeping the data generated during your analysis associated with the right original data. Tables are designed to help you keep all the data associated with a particular "entity" (whether a sample or a participant) together.

The payoff of investing time to set up data tables

Tables do take time to set up. But once they are set up, you won't have to keep track of data (original data files and analysis outputs) manually. This built-in organization can be especially useful as studies and analyses get more complex. Tables can include as much information as you need in additional columns. For example, as you run a workflow analysis, you can add output files, keeping original and generated data together in a single row for each unique sample.

Understanding-entities_Sample-table_Screen_shot.png
- Links to genomic data (FASTQ, CRAM, BAM, VCF, GVCF files, for example)
- Participant or other ID to associate samples and other data, such as phenotypic data
- Study particulars such as collection dates or techniques 

Keep track of data files generated from a workflow analysis

If you've ever run a workflow, you know that the generated data is stored by default in the Workspace bucket, in folders whose names correspond to the workflow submission ID.

If you set up your workflow to write to the data table, you won't have to search through different cloud directories to find the files you need. 

Automate and scale an analysis 

When running a WDL workflow analysis in Terra, you can read inputs directly from a data table, allowing you to iterate seamlessly through multiple samples (the whole data table if you want!). By reading inputs from the data table and writing outputs back to it, you can chain WDL workflows together without needing to manually set up your analyses each time (turning workflows into whole pipelines).
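As a rough mental model of the read-from-table, write-back-to-table loop described above, here is a minimal Python sketch. It is not Terra's implementation: the `this.<column>` expression style is assumed here for illustration, and `resolve_inputs`, `fake_align`, the column names, and the bucket paths are all hypothetical.

```python
def resolve_inputs(row, input_expressions):
    """Replace each 'this.<column>' expression with the value from one table row."""
    resolved = {}
    for input_name, expression in input_expressions.items():
        if expression.startswith("this."):
            column = expression[len("this."):]
            resolved[input_name] = row[column]
        else:
            # Anything else is treated as a literal value.
            resolved[input_name] = expression
    return resolved


def fake_align(inputs):
    """Hypothetical stand-in for a workflow run; returns a made-up output path."""
    return {"aligned_bam": inputs["fastq"].replace(".fastq", ".bam")}


# A tiny in-memory "sample table": one row per sample ID (hypothetical data).
table = {
    "sample_01": {"fastq": "gs://example-bucket/sample_01.fastq"},
    "sample_02": {"fastq": "gs://example-bucket/sample_02.fastq"},
}

# Step 1: read inputs from each row, then write outputs back to the same row.
step1_expressions = {"fastq": "this.fastq"}
for sample_id, row in table.items():
    outputs = fake_align(resolve_inputs(row, step1_expressions))
    row.update(outputs)

# Step 2 can now read the column that step 1 wrote, which is how steps chain.
step2_expressions = {"bam": "this.aligned_bam"}
step2_inputs = resolve_inputs(table["sample_01"], step2_expressions)
print(step2_inputs)
```

Because each output lands in the row it came from, a later step only needs to name the column; nobody has to look up file paths by hand between steps.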

What does a table look like? What does it contain?

Tables are basically spreadsheets built into your workspace, so a table looks a lot like a spreadsheet. Each table is identified by its "entity" (smallest thing, or piece of input data it contains); each row corresponds to one distinct entity; each column is a different piece of information (metadata) about that entity.

What's an "entity"? A piece or kind of data

According to the dictionary, an "entity" is "a thing with distinct and independent existence." In Terra, entities are pieces of information - almost like variables - used as input for a workflow analysis.

An entity is the type of primary data stored in a data table. It's also the name of the table in the workspace Data page. You can have tables of sample data (a "sample" entity table) or tissues (a "tissue" table) - any table name you want.

Example: sample data in a sample table
Unerstanding-entities_Sample-entities-table_Screen_shot.png
This sample table includes links to genomic files (BAM and BAM index files) of various samples.  Note that the first column is each sample's unique ID and the fourth column is the participant ID, also found in the participant table. 
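A table like this one can be written out as a tab-separated file. The sketch below builds one with Python's standard csv module. It assumes the load-file convention in which the ID column header has the form `entity:sample_id`; the column names, sample IDs, and bucket paths are hypothetical.

```python
import csv
import io

# One dict per row; the first column is the unique sample ID and the
# fourth column links each sample to a participant (hypothetical values).
rows = [
    {
        "entity:sample_id": "sample_01",
        "bam": "gs://example-bucket/sample_01.bam",
        "bam_index": "gs://example-bucket/sample_01.bai",
        "participant": "participant_A",
    },
    {
        "entity:sample_id": "sample_02",
        "bam": "gs://example-bucket/sample_02.bam",
        "bam_index": "gs://example-bucket/sample_02.bai",
        "participant": "participant_B",
    },
]


def to_tsv(table_rows):
    """Serialize a list of row dicts to tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(table_rows[0].keys()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(table_rows)
    return buffer.getvalue()


tsv_text = to_tsv(rows)
print(tsv_text)
```

Note that the cells hold links to the BAM files, not the files themselves, which matches how the table above is described.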

Keeping track of data files in the cloud

Remember that you don't have to import data files to your workspace Google bucket to use them. The whole point of data tables is that you work with files wherever they are in the cloud. This is why you'll often find data tables with links to data files. If you click on one of the links, you can see the File Details including the cloud path for where the file lives (see image below). Sometimes the cloud path is a Google bucket URL that starts with "gs://" or sometimes it's a cloud-agnostic URL, like a DRS URI that starts with "drs://". 

Screen_Shot_2022-02-07_at_12.47.04_PM.png
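If you handle these links in your own scripts, the prefix tells you which kind of cloud path you have. Here is a minimal sketch using Python's standard library; the function name and example paths are hypothetical, and it only recognizes the two prefixes mentioned above.

```python
from urllib.parse import urlparse


def describe_cloud_path(path):
    """Classify a cloud path by its URL scheme ('gs' or 'drs')."""
    scheme = urlparse(path).scheme
    if scheme == "gs":
        return "Google bucket URL"
    if scheme == "drs":
        return "DRS URI"
    return "unrecognized"


# Hypothetical paths, for illustration only.
example_paths = [
    "gs://example-bucket/samples/sample_01.bam",
    "drs://example.org/abc123",
    "sample_01.bam",
]

for path in example_paths:
    print(path, "->", describe_cloud_path(path))
```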

What's the minimum and maximum information in a workspace table?

As much information as you need, and at least two columns: an ID column and a single data column.

Data tables aren't limited to data inputs for your workflows!
Tables are flexible, intended to help organize any information you might need for your study. Additional table columns work much like columns in a spreadsheet. The column header describes what information is in the column, and cells keep track of the information. Terra accepts data tables of (almost) any size and entity type, to keep track of whatever "entities" you need.

Dedicated sections for different data types

As you think about different data processing steps, like genome alignment, variant calling, expression analyses, etc., you may realize there are multiple data files you'll need to turn your raw data into meaningful output. Maybe you'll need some reference files like FASTAs, dictionary files, and indices. Maybe you'll need lists of cell barcodes or Unique Molecular Identifiers (UMIs). Additionally, you'll need your actual sample files, like FASTQs containing genomic reads or VCFs containing variant calls.

Whatever analysis files you need, the Terra Data page has three different sections dedicated to organizing your reference and sample data: Tables, Reference Data, and Other Data.

Screen_Shot_2022-01-26_at_11.02.15_AM.png

The Tables section is where you can create custom data tables representing your different samples, participants, specimens, files, or whatever entity you choose. You can also copy existing data tables from other Terra workspaces into this section or export a data table containing a custom cohort's metadata from one of the repositories in the Terra Data Library (to learn more about this, read the Overview: How to add a Table to a Terra workspace).

The Reference Data section allows you to include a preloaded human reference data table for either B37 or Hg38. These tables list all the files you need to perform most genomic analyses for human data. The reference files are actually hosted in the Broad's public Google bucket for human reference files. 

What if you're using alternative references, such as mouse, or using other workspace-level metadata? These types of files and metadata can be organized in the Other Data section in the Workspace Data table. Read more about modifying Workspace Data tables in Creating Workspace Data tables.

Customizing tables for your analysis: entities, sets, and pairs 

Whether you're creating a table from scratch or importing a table from an existing workspace or repository, it's important to think about how you want to analyze the data downstream. Terra allows you to set up WDL workflows so that they pull inputs from your data tables and similarly write data back to a data table. Whether you choose to do one or both of these will affect the kinds of tables you'll need and how you'll organize them. 

Entities and sets

Overall, there are two main types of data tables in Terra: entity tables and set tables.

An entity table lists the individual pieces of data that you want to analyze (samples, files, participants, specimens, etc.), one per row, whereas a set table groups together different entities from your entity table.

You'll learn more about using set tables in Overview: How to add a Table to a Terra workspace.
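To make the entity/set distinction concrete, the sketch below groups rows of a small sample table into named sets. It is only an illustration of the idea, not Terra's set-table format: the `build_sets` helper, the `tissue` column, and the set naming scheme are hypothetical.

```python
from collections import defaultdict

# Entity table: one row per sample (hypothetical IDs and attributes).
samples = {
    "sample_01": {"participant": "participant_A", "tissue": "blood"},
    "sample_02": {"participant": "participant_B", "tissue": "blood"},
    "sample_03": {"participant": "participant_C", "tissue": "liver"},
}


def build_sets(entities, column):
    """Group entity IDs into sets keyed by the value of one column."""
    sets = defaultdict(list)
    for entity_id, attributes in entities.items():
        set_id = f"{attributes[column]}_samples"
        sets[set_id].append(entity_id)
    return dict(sets)


# Set table: each set ID maps to the sample IDs that belong to it.
sample_sets = build_sets(samples, "tissue")

# Flattened view: one (set ID, member ID) pair per row.
membership_rows = [
    (set_id, sample_id)
    for set_id, members in sample_sets.items()
    for sample_id in members
]
print(membership_rows)
```

The set table stores only IDs; the data for each member still live in the entity table's rows.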

Pair tables for tumor-normal analysis

While entity tables and set tables are the two main types of tables in Terra, there is one exception. Pair tables are a specific type of data table used in cancer research, where somatic analysis requires samples corresponding to both tumor and normal tissue. Learn more about these tables in Adding pair tables to a workspace for tumor-normal analysis.

Programmatically making tables

You can automate the process of making and modifying tables using a special API called FISS. Learn more in Managing data and automating workflows with the FISS API.
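As a hedged sketch of what this could look like in Python: the helper below builds tab-separated table text locally, and a second function hands it to FISS. It assumes the `firecloud` package is installed and that `firecloud.api.upload_entities(namespace, workspace, entity_data)` is available and returns an HTTP response; check the FISS documentation to confirm the current signature. The helper names, the `entity:<type>_id` header convention, and all values are assumptions for illustration.

```python
def build_entity_tsv(entity_type, rows):
    """Build tab-separated text with an 'entity:<type>_id' ID column header."""
    columns = [name for name in rows[0] if name != "id"]
    lines = ["\t".join([f"entity:{entity_type}_id", *columns])]
    for row in rows:
        lines.append("\t".join([row["id"], *(str(row[name]) for name in columns)]))
    return "\n".join(lines) + "\n"


def upload_table(namespace, workspace, tsv_text):
    """Send table text to a workspace through FISS (assumed API, see note above)."""
    # Imported lazily so build_entity_tsv works even without FISS installed.
    from firecloud import api as fapi

    response = fapi.upload_entities(namespace, workspace, tsv_text)
    response.raise_for_status()  # surface HTTP errors instead of ignoring them
    return response


# Hypothetical rows; nothing is uploaded unless upload_table is called.
sample_rows = [
    {"id": "sample_01", "bam": "gs://example-bucket/sample_01.bam"},
    {"id": "sample_02", "bam": "gs://example-bucket/sample_02.bam"},
]
sample_tsv = build_entity_tsv("sample", sample_rows)
print(sample_tsv)
```

To upload, you would call something like `upload_table("your-billing-project", "your-workspace", sample_tsv)` with your own project and workspace names.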

Next steps

Now that you know a little more about tables in Terra, you're ready to learn how to populate a workspace with your own data table. Get started by reading How to add a Table to a Terra workspace, which walks you through creating both entity and set tables.
