Workspace data tables are like integrated spreadsheets in the Data page that help to organize and keep track of all project data, no matter where in the cloud data files are stored. This article will help explain what workspace data tables are and how they can help streamline your analysis.
Watch an introductory video on data tables here
Understanding where your data are... and are not
When working in the cloud, you're not limited to analyzing data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to workspace storage (i.e., a Google bucket) or external storage, or on data available in Terra's Data Library or numerous other data repositories.
Save storage costs and eliminate copying errors with data in the cloud
Being able to use data stored in an external bucket that someone else pays for and maintains can be especially nice when working with large data files, as you do not have to pay to store the original data. Sharing data, rather than copying it, also reduces copying errors.
Data in a workspace table isn't in the table, it's in the cloud! The table holds metadata, such as links to the physical locations of the data in the cloud. Using data tables allows you to keep all the associated metadata organized and together.
Why use data tables?
Managing data in a cloud-native world
The vast amounts of data you can access in the cloud offer exciting new opportunities for discovery. But large datasets can be overwhelming. A Terra workspace includes built-in spreadsheet-like "tables" to help.
1. Organize large amounts of data entities
Imagine trying to keep track of hundreds or thousands of original data files in different cloud locations, each with its own non-human-readable URL or DRS URI link. It's a Big Data nightmare.
Using, not copying and storing, data files in the cloud
As long as they exist somewhere in the cloud, you don't have to import data files to your workspace storage (i.e. Google bucket) to analyze them with a workflow in Terra. Data tables let you "store" and organize data files in your workspace - no matter where they are in the cloud. Terra will localize the files in the VM that runs your workflow for you.
Tables reference data files in the cloud with metadata links
This is why you'll often find data tables with links to data files. Clicking on a link exposes the File Details pane with the URL where the file lives (see image below). The cloud path can be a Google bucket URL that starts with "gs://" or a cloud-agnostic URL, like a DRS URI that starts with "drs://".
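As a rough illustration of the two kinds of cloud paths described above, the Python sketch below labels a link by its URL scheme. The function name classify_cloud_path and the example paths are hypothetical, made up for this illustration; this is not how Terra itself handles links.

```python
from urllib.parse import urlparse


def classify_cloud_path(path: str) -> str:
    """Return a rough label for a cloud path based on its URL scheme."""
    scheme = urlparse(path).scheme.lower()
    if scheme == "gs":
        return "Google bucket URL"
    if scheme == "drs":
        return "DRS URI"
    return "other"


# Hypothetical example paths (not real files).
examples = [
    "gs://example-bucket/specimens/specimen_001_r1.fastq.gz",
    "drs://example.org/abc123",
]
for path in examples:
    print(path, "->", classify_cloud_path(path))
```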
Integrated tables are designed to help
You can organize and associate data in tables in a way that makes sense to you: with separate tables for participants, or samples, or subjects, and even nested tables (pairs and sets). Like spreadsheets, you can search and edit and manipulate tables right in Terra. You can add as many rows of data or columns of metadata as you need, which lets you keep all the data associated with a particular "entity" - whether a sample or participant - together, including data generated from a workflow analysis.
Examples of data you can keep in a data table
- Links to genomic data (FASTQ, CRAM, BAM, VCF, GVCF files, for example)
- Participant or other ID to associate samples and other data - such as phenotypic data
- Study particulars such as collection dates or techniques
2. Keep track of generated data files (workflow analysis)
If you've ever run a workflow, you know that the generated data is stored by default in workspace storage (i.e. Google bucket), in folders whose names correspond to the workflow submission ID. With long, non-human-friendly directories, it could be challenging to keep the data generated during an analysis associated with the right original data.
Output file in workspace storage
The directory tree includes several directories with long alphanumeric IDs (circled above)
If you set up your workflow to write to the data table, you won't have to search through layers of non-human-friendly cloud directories to find the files you need.
Output file in data table
Output file metadata (URL in workspace storage) is associated with the input file in the data table
3. Automate and scale a workflow analysis
When running a WDL workflow analysis in Terra, you can read inputs directly from a data table, allowing you to iterate seamlessly through multiple samples (the whole data table if you want!). You can save particular subsets of samples as sets, allowing you to run a workflow on the same subset without having to configure it manually each time.
By reading inputs from the data table and writing outputs back to it, you can chain WDL workflows together without needing to manually set up your analyses each time (turning workflows into whole pipelines).
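The read-from-table, write-back-to-table pattern described above can be pictured with a small Python model. This is a conceptual sketch only, not Terra's implementation: the table contents, the column names, the fake_align stand-in workflow, and the "this.<column>" expression form used here are all assumptions made for illustration.

```python
# Toy data table: entity ID -> columns (all values hypothetical).
specimen_table = {
    "specimen_001": {"r1_fastq": "gs://example-bucket/specimen_001_r1.fastq.gz"},
    "specimen_002": {"r1_fastq": "gs://example-bucket/specimen_002_r1.fastq.gz"},
}


def column_name(expression: str) -> str:
    """Extract the column from an assumed 'this.<column>' expression."""
    prefix = "this."
    if not expression.startswith(prefix):
        raise ValueError(f"unsupported expression: {expression}")
    return expression[len(prefix):]


def fake_align(fastq_path: str) -> str:
    """Stand-in for a workflow: derives an output path from the input path."""
    return fastq_path.replace("_r1.fastq.gz", ".aligned.bam")


def run_on_table(table: dict, input_expr: str, output_expr: str) -> None:
    """Run the stand-in workflow once per row and write each output back."""
    in_col = column_name(input_expr)
    out_col = column_name(output_expr)
    for row in table.values():
        row[out_col] = fake_align(row[in_col])


run_on_table(specimen_table, "this.r1_fastq", "this.aligned_bam")
for entity_id, row in specimen_table.items():
    print(entity_id, row["aligned_bam"])
```

Because each output lands in a new column of the same row, a second stand-in workflow could take "this.aligned_bam" as its input, which is the chaining idea in miniature.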
Combining data from different sources in a single table for analysis can yield better statistics and more robust results.
The payoff of investing time to set up data tables
Tables do take time to set up. But once set up, they will help you:
- Organize large amounts of data from different cloud locations
- Track generated data
- Scale and automate a workflow analysis
This built-in organization can be especially useful as studies and analyses get more complex. You won't have to worry about keeping track of data (original data files and analysis outputs) manually.
What does a table look like? What does it contain?
Tables are basically spreadsheets built into your workspace, so a table looks a lot like a spreadsheet.
- Each table is identified by its entity (the smallest thing, or piece of input data, it contains)
- Each row corresponds to one distinct entity
- Each column is a different piece of information (metadata) about that entity
What's an "entity"? A piece or kind of data
According to the dictionary, an "entity" is "a thing with distinct and independent existence." In Terra, entities are pieces of information - almost like variables - used as input for a workflow analysis.
A table's root entity is the type of primary data stored in the table
It's also the name of the table in the workspace Data page. You can have tables of sample data (a "sample" entity table) or tissues (a "tissue" table) - any table name you want.
Example: specimen data in a specimen table
This specimen table includes links to genomic files of various specimens in the r1_fastq column. Note that the first column is each specimen's unique ID and the second column is the participant ID, from the participant table.
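To make the layout concrete, here is a minimal Python sketch that writes a table like this one as tab-separated text. It assumes the load-file convention in which the first column header is "entity:" followed by the table name and "_id"; check the Terra documentation on adding tables before relying on that. The specimen IDs, participant IDs, and file paths are hypothetical.

```python
import csv
import io

# Hypothetical rows: one dict per specimen (one row per entity).
rows = [
    {
        "specimen_id": "specimen_001",
        "participant_id": "participant_A",
        "r1_fastq": "gs://example-bucket/specimen_001_r1.fastq.gz",
    },
    {
        "specimen_id": "specimen_002",
        "participant_id": "participant_B",
        "r1_fastq": "gs://example-bucket/specimen_002_r1.fastq.gz",
    },
]


def to_load_file(entity_type: str, rows: list) -> str:
    """Serialize rows as TSV with an assumed 'entity:<type>_id' first column."""
    id_key = f"{entity_type}_id"
    other_cols = [col for col in rows[0] if col != id_key]
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([f"entity:{id_key}"] + other_cols)
    for row in rows:
        writer.writerow([row[id_key]] + [row[col] for col in other_cols])
    return buf.getvalue()


print(to_load_file("specimen", rows))
```

The ID column comes first and every other column is metadata about that specimen, matching the row and column roles described above.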
How much data/metadata can you include in your table? As much as you need!
You can add rows (additional entities) or additional columns in Terra. For example, as you do a workflow analysis, you can set it up to add output files in additional columns, keeping original and generated data all together in a single row for each unique sample.
What's the minimum and maximum information in a workspace table?
Minimum: At least two columns: an ID column and a single data column.
Maximum: As much information as you need. Terra accepts (almost) any size table.
Data tables aren't limited to data inputs for your workflows
Tables are flexible, intended to help organize any information (metadata) you might need for your study. Additional table columns work much like columns in a spreadsheet. Column headers describe what metadata is in each column, and cells keep track of the information.
Dedicated sections for different data types
As you think about different data processing steps, like genome alignment, variant calling, expression analyses, etc., you may realize there are multiple data files you'll need to turn your raw data into meaningful output. Maybe you'll need some reference files like FASTAs, dictionary files, and indices. Maybe you'll need lists of cell barcodes or Unique Molecular Identifiers (UMIs). Additionally, you'll need your actual sample files, like FASTQs containing genomic reads or VCFs containing variant calls.
Whatever analysis files you need, the Terra data page has three different sections dedicated to organizing your reference and sample data: Tables, Reference Data, and Other Data.
Input data tables
The Tables section is for input data tables such as samples, participants, specimens, or whatever entity you choose. You can copy data tables from other Terra workspaces into this section, or export a table containing metadata for a custom cohort from one of the repositories in the Terra Data Library (to learn more about this, read the Overview: How to add a Table to a Terra workspace).
Preloaded human genomic references
The Reference Data section allows you to include a preloaded human genomic reference data table for either B37 or Hg38. The reference files are hosted in the Broad's public Google bucket for human reference files.
Workspace-wide reference files
Using alternate references, such as mouse? Do you have other workspace-level metadata, such as Docker images? Reference files and other data that you'll use across many analyses in the workspace can be organized in the Other Data section in the Workspace Data table.
To learn more, see Creating Workspace Data tables.
Customizing your data tables: entities, sets, and pairs
Whether you're creating a table from scratch or importing a table from an existing workspace or repository, what data you have, how it's currently organized, and how you plan to analyze it downstream will all impact the type and formatting of data tables you will need to set up.
For example, Terra allows you to set up WDL workflows to pull inputs from and write generated data to a data table. You'll want to make sure the primary root table includes the right input, whether it's single entities or arrays of entities.
Below are examples of custom tables and when you might use them.
Entities and sets
There are two primary types of data tables in Terra: entity tables and set tables.
An entity table contains the individual pieces of data that you want to analyze (samples, files, participants, specimens, etc.), one per row. A set table groups together different entities from your entity table.
When to use an entity table
- When you can run your workflow on single entities (e.g. samples)
- When your data is logically organized by single entities. Note that you can include an array in a cell, if you have multiple data files that are the same kind of metadata and are all associated with a single entity.
When to use a set table
- When you want to analyze the same subset of entities again and again
- When your workflow requires many data files to generate a single output
Learn more about using set tables in When to use a set table for a workflow.
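The relationship between an entity table and a set table can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Terra's data model: the table contents, the "pilot_batch" set, the "specimens" member column, and the gather helper are all hypothetical.

```python
# Toy entity table: one row per specimen (values hypothetical).
specimen_table = {
    "specimen_001": {"r1_fastq": "gs://example-bucket/specimen_001_r1.fastq.gz"},
    "specimen_002": {"r1_fastq": "gs://example-bucket/specimen_002_r1.fastq.gz"},
    "specimen_003": {"r1_fastq": "gs://example-bucket/specimen_003_r1.fastq.gz"},
}

# Toy set table: each row names a group of entity IDs from the table above.
specimen_set_table = {
    "pilot_batch": {"specimens": ["specimen_001", "specimen_003"]},
}


def gather(set_row: dict, member_column: str, entity_table: dict, data_column: str) -> list:
    """Collect one column from every member of a set, in member order."""
    return [entity_table[entity_id][data_column] for entity_id in set_row[member_column]]


# One set row expands to an array of files: many inputs for a single output.
fastqs = gather(specimen_set_table["pilot_batch"], "specimens", specimen_table, "r1_fastq")
print(fastqs)
```

The set row stores only the member IDs, so the same subset can be reused across runs without copying any of the underlying rows.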
Pair tables for tumor-normal analysis
Pair tables are a predefined type of data table used in cancer research, where somatic analysis requires samples corresponding to both tumor and normal tissue. Learn more about working with and creating these tables in Adding pair tables to a workspace for tumor-normal analysis.
Next steps and additional resources
Now that you have an overview of tables in Terra, you're ready to learn how to populate a workspace with your own data table. Get started by reading How to add a Table to a Terra workspace, which will walk you through creating both entity and set tables.
For some guided exercises to help you understand data tables (how to create them, how to import them, and how to modify them), try the Data Tables Quickstart.
Making tables with scripts (programmatically)
You can automate the process of making and modifying tables using a special API called FISS. Learn more in Managing data and automating workflows with the FISS API.
- Overview: How to add a Table to a Terra workspace
- Modifying and editing a data table
- Creating Workspace Data tables
- Adding pair tables to a workspace for tumor-normal analysis
- Managing data and automating workflows with the FISS API