Overview: How to manage data in the cloud with data tables

Workspace data tables (in the Data tab) can help organize and keep track of all project data, no matter where in the cloud they are. This article explains what workspace data tables are and how you can run a workflow analysis on individual samples, groups of samples, or arrays of samples in a table.

Watch an introductory video on data tables here

Understanding where your data are... and are not

When working in the cloud, you're not limited to analyzing data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to a workspace bucket or external bucket, or that's available in Terra's Data Library or numerous other data repositories. Even better, you can analyze data from many different sources in a single, more robust analysis.

Why use data tables?

A Terra workspace includes built-in spreadsheet-like "tables" that can help organize data stored in different cloud locations and enable you to easily scale and automate your analysis.

Data "in" a workspace table can be anywhere in the cloud! Data files are not actually "in" the table (or even in the workspace, really...). Tables hold links to the physical locations of the data in the cloud and keep associated data organized together.
Using data that are physically located in an external bucket is especially helpful when working with large data files stored in a public bucket, as you do not have to pay to store the original data. By sharing rather than copying the data, you also reduce copying errors.

Organize large amounts of data (e.g., samples)

Imagine trying to keep track of hundreds or thousands of original data files in different buckets, each with its own non-human-readable bucket or DRS URI link. Then imagine keeping the data generated during your analysis associated with the right original data. Tables are designed to help you keep all the data associated with a particular "entity" - whether a sample or participant - together.

The payoff of investing time to set up data tables

Tables do take time to set up. But once they are set up, you won't have to keep track of data (original data files and analysis outputs) manually. This built-in organization can be especially useful as studies and analyses get more complex. For example, as you run a workflow analysis, you can add output files, keeping original and generated data together in a single row for each unique sample. Tables can include as much information as you need in additional columns, such as:

- Links to genomic data (FASTQ, CRAM, BAM, VCF, GVCF files, for example)
- Participant or other ID to associate samples and other data - such as phenotypic data
- Study particulars such as collection dates or techniques

Keep track of generated data from a workflow analysis

If you've ever run a workflow, you know that the generated data are stored by default in the workspace bucket, in folders whose names correspond to the workflow submission ID.

If you set up your workflow to write to the data table, you won't have to search through different cloud directories to find the files you need.

Automate and scale an analysis

When running a WDL workflow analysis in Terra, you can read inputs directly from a data table, allowing you to iterate seamlessly through multiple samples (the whole data table if you want!). By reading inputs from the data table and writing outputs back to it, you can chain WDL workflows together without needing to manually set up your analyses each time (turning workflows into whole pipelines).
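
To make the idea of reading inputs from a table and writing outputs back to it concrete, here is a small conceptual sketch in Python. It is not how Terra is implemented; it only mimics the behavior described above, where the workflow runs once per row and each input is looked up in that row. The `this.<column>` notation is modeled on the attribute-style references used when configuring workflow inputs, and the input name `bam`, the output column `aligned_bam`, and the `fake_align` function are hypothetical names chosen for illustration.

```python
def resolve_inputs(row, input_config):
    """Replace each 'this.<column>' reference with the value from one table row."""
    resolved = {}
    for input_name, expression in input_config.items():
        if expression.startswith("this."):
            resolved[input_name] = row[expression[len("this."):]]
        else:
            resolved[input_name] = expression  # literal value, used as-is
    return resolved


def run_on_table(table, input_config, workflow, output_column):
    """Run `workflow` once per row and write each result back to that row."""
    for row in table:
        row[output_column] = workflow(**resolve_inputs(row, input_config))
    return table


# Hypothetical stand-in for a real workflow: derives an output path from the input.
def fake_align(bam):
    return bam.replace(".bam", ".aligned.bam")


table = [
    {"sample_id": "s1", "BAM_file": "gs://your-bucket-name/blood_sample_P1.bam"},
    {"sample_id": "s2", "BAM_file": "gs://your-bucket-name/spit_sample_P1.bam"},
]
run_on_table(table, {"bam": "this.BAM_file"}, fake_align, "aligned_bam")
```

After the call, every row carries both its original `BAM_file` link and the generated `aligned_bam` link, so a second workflow could read `this.aligned_bam` as its input. That is the chaining pattern described above.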

What does a table look like? What does it contain?

Tables are basically spreadsheets built into your workspace, so a table looks a lot like a spreadsheet. Each table is identified by its "entity" (smallest thing, or piece of input data it contains); each row corresponds to one distinct entity; each column is a different piece of information (metadata) about that entity. You can create a table in Terra by generating a TSV (spreadsheet) file locally and uploading it to your workspace. See How to create a table with a template for details.

Example sample TSV in a spreadsheet editor

sample_id	BAM_file	subject_id
89ryqiuhfo7ybiifn50	gs://your-bucket-name/blood_sample_P1.bam	NA10296
ncif71f1bfj4fbfihfhb	gs://your-bucket-name/spit_sample_P1.bam	NA10296
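
A file like the one above can also be generated with a short script rather than typed by hand. The sketch below writes the same two rows to a tab-separated file using only Python's standard library. The output file name `sample.tsv` is a placeholder, and the `entity:` prefix on the ID column header is an assumption about the header format Terra's TSV upload expects; check the template article for the exact requirements.

```python
import csv
import io

# Rows copied from the example table above.
ROWS = [
    {"sample_id": "89ryqiuhfo7ybiifn50",
     "BAM_file": "gs://your-bucket-name/blood_sample_P1.bam",
     "subject_id": "NA10296"},
    {"sample_id": "ncif71f1bfj4fbfihfhb",
     "BAM_file": "gs://your-bucket-name/spit_sample_P1.bam",
     "subject_id": "NA10296"},
]


def rows_to_tsv(rows, id_column, id_header=None):
    """Return the rows as TSV text with the ID column first.

    id_header optionally renames the ID column in the header line
    (for example "entity:sample_id", an assumed Terra convention).
    """
    other_columns = [c for c in rows[0] if c != id_column]
    columns = [id_column] + other_columns
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([id_header or id_column] + other_columns)
    for row in rows:
        writer.writerow([row[c] for c in columns])
    return buffer.getvalue()


if __name__ == "__main__":
    tsv_text = rows_to_tsv(ROWS, "sample_id", id_header="entity:sample_id")
    with open("sample.tsv", "w", newline="") as handle:
        handle.write(tsv_text)
```

Each dictionary becomes one row (one entity), and each key becomes one column of metadata about that entity.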

What's an "entity"? A piece or kind of data

According to the dictionary, an "entity" is "a thing with distinct and independent existence." In Terra, entities are pieces of information - almost like variables - used as input for a workflow analysis.

An entity is the type of primary data stored in a data table. It's also the name of the table in the workspace Data page. You can have tables of sample data (a "sample" entity table) or tissues (a "tissue" table) - any table name you want.

Example: sample data in a sample table
[Screenshot: sample entity table]
This sample table includes genomic data (BAM and BAM index files) of various samples. Note that the first column is each sample's unique ID and the fourth column is the participant ID, also found in the participant table.

What's the minimum and maximum information in a workspace table?

A table can hold as much information as you need, but it must have at least two columns: an ID column and a single data column.

Data tables aren't limited to data inputs for your workflows!
Tables are flexible, intended to help organize any information you might need for your study. Additional table columns work much like columns in a spreadsheet. The column header describes what information is in the column, and cells keep track of the information. Terra accepts data tables of (almost) any size and entity type, so you can keep track of whatever "entities" you need. The two examples below are fairly common types of tables, but not the only ones. You can use the step-by-step instructions further down to create your own table.

Dedicated sections for different data types

As you think about different data processing steps, like genome alignment, variant calling, expression analyses, etc., you may realize there are multiple data files you'll need to turn your raw data into meaningful output. Maybe you'll need some reference files like FASTAs, dictionary files, and indices. Maybe you'll need lists of cell barcodes or Unique Molecular Identifiers (UMIs). Additionally, you'll need your actual sample files, like FASTQs containing genomic reads or VCFs containing variant calls.

Whatever analysis files you need, the Terra data page has three different sections dedicated to organizing your reference and sample data: Tables, Reference Data, and Other Data.

The Tables section is where you can create custom data tables representing your different samples, participants, specimens, files, or whatever entity you choose. You can also copy existing data tables from other Terra workspaces into this section or export a data table containing a custom cohort's metadata from one of the repositories in the Terra Data Library (to learn more about this, read the XX article).

The Reference Data section allows you to generate a preloaded human reference data table for either B37 or Hg38. These tables list all the files you need to perform most genomic analyses for human data. These files are actually hosted in the Broad's public Google bucket for human reference files.

What if you're using alternative references, such as mouse, or using other workspace-level metadata? These types of files and metadata can be organized in the Other Data section in the Workspace Data table. Read more about modifying Workspace Data tables in article XX.

Customizing tables for your analysis: entities, sets, and pairs

Whether you're creating a table from scratch or importing a table from an existing workspace or repository, it's important to think about how you want to analyze the data downstream. Terra allows you to set up WDL workflows so that they pull inputs from your data tables and similarly write data back to a data table. Whether you choose to do one or both of these will affect the kinds of tables you'll need and how you'll organize them.

Entities and sets

Overall, there are two main types of data tables in Terra: entity tables and set tables.

An entity table contains the pieces of data that you want to analyze (samples, files, participants, specimens, etc.), one per row, whereas a set table groups together different entities from your entity table.

You'll learn more about using set tables in the XX article (Creating tables from scratch).
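
As a rough illustration of how a set relates to an entity table, the sketch below groups sample IDs by a shared column and prints one line per (set, member) pair. The grouping by `subject_id` is only an example, and the header names `membership:sample_set_id` and `sample` are assumptions about the set-table TSV format rather than something this article specifies.

```python
from collections import defaultdict


def group_into_sets(rows, id_column, group_column):
    """Map each value of group_column to the list of entity IDs that share it."""
    sets = defaultdict(list)
    for row in rows:
        sets[row[group_column]].append(row[id_column])
    return dict(sets)


def membership_tsv(sets, set_header, member_header):
    """Return TSV text with one line per (set, member) pair."""
    lines = [f"{set_header}\t{member_header}"]
    for set_id, members in sets.items():
        lines.extend(f"{set_id}\t{member}" for member in members)
    return "\n".join(lines) + "\n"


# Entity rows taken from the example sample table earlier in this article.
samples = [
    {"sample_id": "89ryqiuhfo7ybiifn50", "subject_id": "NA10296"},
    {"sample_id": "ncif71f1bfj4fbfihfhb", "subject_id": "NA10296"},
]
sets = group_into_sets(samples, "sample_id", "subject_id")
print(membership_tsv(sets, "membership:sample_set_id", "sample"))
```

The set table stores only IDs that point back to rows in the entity table; the data links themselves stay in the entity table.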

Pair tables for tumor-normal analysis

While entity tables and set tables are the two main types of tables in Terra, there's of course an exception to every rule. Pair tables are a specific type of data table used in cancer research, where somatic analysis requires samples corresponding to both tumor and normal tissue.

Programmatically making tables

You can automate the process of making and modifying tables using a special API called FISS. Learn more in How to Manage data with the FISS API.
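
As a minimal sketch of what this can look like, the function below uploads a local TSV file as a data table using the FISS Python package (`firecloud`). It assumes the package is installed, that you are already authenticated, and that your installed version exposes `upload_entities_tsv` with a `model` argument; treat the call as an assumption and confirm it against the FISS documentation. The function name `upload_table` and its parameters are hypothetical.

```python
def upload_table(namespace, workspace, tsv_path):
    """Upload a local TSV file to a workspace as a data table via FISS.

    namespace: the workspace's billing project
    workspace: the workspace name
    tsv_path:  path to a local TSV file
    """
    # Imported lazily so this file can be loaded even where FISS is not installed.
    from firecloud import api as fapi

    # Assumption: model="flexible" allows arbitrary entity types.
    response = fapi.upload_entities_tsv(namespace, workspace, tsv_path, model="flexible")
    if not response.ok:
        raise RuntimeError(f"Upload failed ({response.status_code}): {response.text}")
    return response
```

A call might look like `upload_table("your-billing-project", "your-workspace", "sample.tsv")`, where all three values are placeholders.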

Next steps

Now that you know a little more about data tables in Terra, you're ready to learn how to actually populate a workspace with your own data table. Get started by reading the XX article, which will walk you through creating both entity and set tables.