Workspace data tables are like integrated spreadsheets that help you organize and keep track of all project data, no matter where in the cloud the data files are physically stored. This article explains what workspace data tables are and how they can help streamline your analysis.
Watch an introductory video on data tables here
Understanding where your data are... and are not
When working in the cloud, you're not limited to analyzing data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to workspace storage (i.e., a Google bucket) or external storage, or on data that's available in the Terra Data Library or numerous other data repositories.
Save storage costs and eliminate copying errors with data in the cloud
Using data stored in an external bucket that someone else pays for and maintains is especially nice when working with large data files, as you do not have to pay to store the original data. Sharing data, rather than copying it, also reduces copying errors.
Data in a workspace table isn't in the table; it's in the cloud! The table holds metadata, such as links to the physical locations of the data in the cloud. Using data tables allows you to keep all the associated metadata organized and together.
Why use data tables?
Managing data in a cloud-native world
The vast amounts of data you can access in the cloud offer exciting new opportunities for discovery. But large datasets can be overwhelming. A Terra workspace includes built-in spreadsheet-like "tables" to help.
The payoff of investing time to set up data tables
Tables do take time to set up. But once set up, they will help
- Organize large amounts of data from different cloud locations
- Track generated data
- Scale and automate a workflow analysis
This built-in organization is especially useful as studies and analyses become more complex. You won't have to worry about keeping track of data (original data files and analysis outputs) manually.
1. Organize large amounts of data entities
Imagine trying to keep track of hundreds or thousands of original data files in different cloud locations, each with its own non-human-readable URL or DRS URI link. It's a Big Data nightmare.
Using, not copying and storing, data files in the cloud
As long as they exist somewhere in the cloud, you don't have to import data files to your workspace storage (i.e., Google bucket) to analyze them with a workflow in Terra. Data tables let you "store" and organize data files in your workspace, no matter where they are in the cloud. Terra will localize the files for you in the VM that runs your workflow.
Tables reference data files in the cloud with metadata links
This is why you'll often find data tables with links to data files. Clicking on a link exposes the File Details pane with the URL where the file lives (see image below). The cloud path can be a Google bucket URL that starts with "gs://" or a cloud-agnostic URL, like a DRS URI that starts with "drs://".
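As a rough illustration of the two kinds of cloud paths described above, the Python sketch below labels a path by its URL scheme. The function name `describe_cloud_path` and the example paths are hypothetical and are not part of Terra.

```python
from urllib.parse import urlsplit


def describe_cloud_path(path: str) -> str:
    """Return a short label for a cloud path based on its URL scheme."""
    scheme = urlsplit(path).scheme.lower()
    if scheme == "gs":
        return "Google bucket URL"
    if scheme == "drs":
        return "DRS URI"
    return "unrecognized"


# Hypothetical example paths (not real files).
print(describe_cloud_path("gs://example-bucket/sample1.cram"))
print(describe_cloud_path("drs://example.org/abc123"))
```

Either kind of path is just a string in a table cell; the file itself stays wherever it lives in the cloud.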
Integrated tables are designed to help
You can organize and associate data in tables in a way that makes sense to you: with separate tables for participants, or samples, or subjects, and even nested tables (pairs and set tables as well as arrays within an entity table). Like spreadsheets, you can search and edit and manipulate tables right in Terra. You can add as many rows of data or columns of metadata as you need, which lets you keep all the data associated with a particular "entity" - whether a sample or participant - together, including data generated from a workflow analysis.
Examples of data you can keep in a data table
- Links to genomic data (FASTQ, CRAM, BAM, VCF, GVCF files, for example)
- Arrays of genomic data files (such as VCF files for each chromosome in a sample)
- Participant or other ID to associate samples and other data (such as phenotypic or clinical data)
- Study particulars such as collection dates or techniques
2. Keep track of generated data files (workflow analysis)
If you've ever run a workflow, you know that the generated data is stored by default in workspace storage (i.e., Google bucket), in folders whose names correspond to the workflow submission ID. With long, non-human-friendly directories, it can be challenging to keep the data generated during an analysis associated with the original data.
Output file in workspace storage
The directory tree includes several directories with long alphanumeric IDs (circled above)
If you set up your workflow to write to the data table, you won't have to search through layers of non-human-friendly cloud directories to find the files you need.
Output file in data table
Output file metadata (URL in workspace storage) is associated with the input file in the data table
3. Automate and scale a workflow analysis
When running a WDL workflow analysis in Terra, reading inputs directly from a data table allows you to
- Iterate seamlessly through multiple samples (the whole data table if you want!)
- Analyze particular subsets of data without having to configure manually each time (Terra saves the subset as a set, allowing you to run a workflow on the same subset)
- Chain WDL workflows together into whole pipelines by writing outputs back to the table, without needing to manually set up your analyses each time
- Combine data from different sources in a single table for analysis to yield better statistics and more robust results
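The idea behind the points above can be sketched in a few lines of Python. This is only a conceptual model, not how Terra is implemented: a list of dictionaries stands in for a data table, `fake_align` is a placeholder for a real WDL workflow, and all IDs, column names, and paths are hypothetical.

```python
from typing import Callable

# A tiny in-memory stand-in for a data table: one dict per row (entity).
# IDs and paths are hypothetical.
sample_table = [
    {"sample_id": "sample_1", "r1_fastq": "gs://example-bucket/sample_1_R1.fastq.gz"},
    {"sample_id": "sample_2", "r1_fastq": "gs://example-bucket/sample_2_R1.fastq.gz"},
]


def fake_align(fastq_path: str) -> str:
    """Placeholder for a workflow; returns the path an output file might get."""
    return fastq_path.replace("_R1.fastq.gz", ".bam")


def run_on_table(table: list[dict], workflow: Callable[[str], str],
                 input_column: str, output_column: str) -> None:
    """Run `workflow` once per row and write each result back to that row."""
    for row in table:
        row[output_column] = workflow(row[input_column])


run_on_table(sample_table, fake_align, "r1_fastq", "aligned_bam")
for row in sample_table:
    print(row["sample_id"], row["aligned_bam"])
```

Because each output lands in a new column of the same row, a second step could read `aligned_bam` as its input column, which is the sense in which writing outputs back to the table chains workflows together.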
What does a table look like? What does it contain?
Tables are basically spreadsheets built into your workspace, so a table looks a lot like a spreadsheet.
- Each table is identified by its entity (the smallest thing, or piece of input data it contains)
- Each row corresponds to one distinct entity
- Each column is a different piece of information (metadata) about that entity
What's an "entity"? A piece or kind of data
According to the dictionary, an "entity" is "a thing with distinct and independent existence." In Terra, entities are pieces of information - almost like variables - used as input for a workflow analysis.
A table's root entity is the type of primary data stored in the table
It's also the name of the table in the workspace Data page. You can have tables of sample data (a "sample" entity table) or tissues (a "tissue" table) - any table name you want.
Example: specimen data in a specimen table
This specimen table includes links to genomic files of various specimens in the r1_fastq column. Note: The first column is each specimen's unique ID and the second column is the participant ID, from the participant table.
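A table like the one described above can be written out as tab-separated text. The sketch below is a minimal example with hypothetical IDs and paths. The `entity:specimen_id` header is an assumption about the load-file convention for marking the ID column; check How to add a Table to a Terra workspace for the exact format before relying on it.

```python
import csv
import io

# Hypothetical rows mirroring the columns described above.
rows = [
    {"specimen_id": "specimen_1", "participant": "participant_1",
     "r1_fastq": "gs://example-bucket/specimen_1_R1.fastq.gz"},
    {"specimen_id": "specimen_2", "participant": "participant_2",
     "r1_fastq": "gs://example-bucket/specimen_2_R1.fastq.gz"},
]


def specimen_table_tsv(rows: list[dict]) -> str:
    """Serialize rows to tab-separated text, ID column first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Assumed header convention: "entity:<table name>_id" marks the ID column.
    writer.writerow(["entity:specimen_id", "participant", "r1_fastq"])
    for row in rows:
        writer.writerow([row["specimen_id"], row["participant"], row["r1_fastq"]])
    return buffer.getvalue()


print(specimen_table_tsv(rows))
```

Each output line is one entity (row), and each tab-separated field is one piece of metadata about it (column).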
How much data/metadata can you include in your table? As much as you need!
You can add rows (additional entities) or additional columns in Terra. For example, as you do a workflow analysis, you can set it up to add output files in additional columns, keeping original and generated data all together in a single row for each unique sample.
What's the minimum and maximum size of information in a workspace table?
Minimum: at least two columns: an ID column and a single data column.
Maximum: As much information as you need. Terra accepts (almost) any size table.
Data tables aren't limited to data inputs for your workflows
Tables are flexible, intended to help organize any information (metadata) you might need for your study. Additional table columns work much like columns in a spreadsheet. Column headers describe what metadata are in each column, and cells keep track of the information.
Dedicated sections for different data types
As you think about different data processing steps, like genome alignment, variant calling, expression analyses, etc., you may realize there are multiple data files you'll need to turn your raw data into meaningful output. Maybe you'll need some reference files like FASTAs, dictionary files, and indices. Maybe you'll need lists of cell barcodes or Unique Molecular Identifiers (UMIs). Additionally, you'll need your actual sample files, like FASTQs containing genomic reads or VCFs containing variant calls.
Whatever analysis files you need, the Terra data page has three different sections dedicated to organizing your reference and sample data: Tables, Reference Data, and Other Data.
Input data tables
The Tables section is for input data tables such as samples, participants, specimens, or whatever entity you choose. You can copy data tables from other Terra workspaces into this section, or export a table containing metadata for a custom cohort from one of the repositories in the Terra Data Library (to learn more about this, read the Overview: How to add a Table to a Terra workspace).
The Reference Data section lets you include preloaded references, including a human genomic reference data table for either B37 or Hg38. The reference files are hosted in the Broad's public Google bucket for human reference files.
Add reference files by clicking the Import Data button (top left on the Data page) and selecting Add reference data.
Workspace-wide reference files
Using alternate references not offered in the References table? Do you have other workspace-level metadata, such as Docker images? Files and other data that you'll use across many analyses in the workspace can be organized in the Other Data section in the Workspace Data table.
To learn more, see Creating Workspace Data tables.
Customizing your data tables: entities, sets, and pairs
Whether you're creating a table from scratch or importing a table from an existing workspace or repository, what data you have, how they're currently organized, and how you plan to analyze them downstream will all impact the type and formatting of data tables you will need to set up.
For example, Terra allows you to set up WDL workflows to pull inputs from and write generated data to a data table. You'll want to make sure the primary root table includes the right input, whether it's single entities or arrays of entities.
Below are examples of custom tables and when you might use them.
Entities and sets
There are two primary types of data tables in Terra: entity tables and set tables.
An entity table contains the individual pieces of data that you want to analyze (samples, files, participants, specimens, etc.), one per row. A set table groups together different entities from your entity table.
When to use an entity table
- When you can run your workflow on single entities (e.g., samples)
- When your data are logically organized by single entities. Note: You can include an array in a cell if you have multiple data files of the same kind that are all associated with a single entity.
When to use a set table
- When you want to analyze the same subset of entities again and again
- When your workflow requires many data files to generate a single output
Learn more about using set tables in When to use a set table for a workflow.
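To make the distinction concrete, the sketch below models a set table as named groups of IDs drawn from an entity table. Plain Python structures stand in for Terra tables, and the column names, IDs, and the `build_sets` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical entity table: one row per sample.
samples = [
    {"sample_id": "sample_1", "tissue": "blood"},
    {"sample_id": "sample_2", "tissue": "liver"},
    {"sample_id": "sample_3", "tissue": "blood"},
]


def build_sets(rows: list[dict], group_by: str) -> dict[str, list[str]]:
    """Group sample IDs into named sets by the value in one column."""
    sets: dict[str, list[str]] = defaultdict(list)
    for row in rows:
        sets[f"{row[group_by]}_set"].append(row["sample_id"])
    return dict(sets)


sample_sets = build_sets(samples, "tissue")
for set_id, members in sample_sets.items():
    print(set_id, members)
```

A workflow that runs on single entities would take one row of `samples` at a time, while a workflow that needs many files to produce one output would take one entry of `sample_sets` at a time.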
Pair tables for tumor-normal analysis
Pair tables are a predefined type of data table used in cancer research, where somatic analysis requires samples corresponding to both tumor and normal tissue. Learn more about working with and creating these tables in Adding pair tables to a workspace for tumor-normal analysis.
Next steps and additional resources
Now that you have an overview of tables in Terra, you're ready to learn how to populate a workspace with your own data table. Get started by reading How to add a Table to a Terra workspace, which will walk you through creating both entity and set tables.
For some guided exercises to help you understand data tables (how to create them, how to import them, and how to modify them), try the Data Tables Quickstart.
Making tables with scripts (programmatically)
You can automate the process of making and modifying tables using a special API called FISS. Learn more in Managing data and automating workflows with the FISS API.
- Overview: How to add a Table to a Terra workspace
- Modifying and editing a data table
- Creating Workspace Data tables
- Adding pair tables to a workspace for tumor-normal analysis
- Managing data and automating workflows with the FISS API