Managing data with tables

Workspace data tables are like integrated spreadsheets that help you organize and keep track of all project data, no matter where in the cloud the data files are physically stored. This article explains what workspace data tables are and how they can help streamline your analysis.

Hands-on practice

For some guided exercises to help you understand data tables (how to create them, how to import them, and how to modify them), try the Terra (GCP) Quickstart 1: Data tables.

Watch an introductory video on data tables here

Understanding where your data are... and are not

When working in the cloud, you're not limited to analyzing data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to a workspace storage (i.e., Google) bucket or to external storage, or on data that's available in the Terra Data Library or in numerous other data repositories.

Save storage costs and eliminate copying errors with data in the cloud

Using data stored in an external bucket that someone else pays for and maintains is especially nice when working with large data files, as you do not have to pay to store the original data. Sharing data, rather than copying it, also reduces copying errors.

Data in a workspace table isn't in the table, it's in the cloud! The table holds metadata - such as links to the physical locations of the data in the cloud. Using data tables allows you to keep all the associated metadata organized and together.

Why use data tables?

Managing data in a cloud-native world

The vast amounts of data you can access in the cloud offer exciting new opportunities for discovery. But large datasets can be overwhelming. A Terra workspace includes built-in spreadsheet-like "tables" to help.

The payoff of investing time to set up data tables

Tables do take time to set up. But once set up, they will help you:
- Organize large amounts of data from different cloud locations
- Track generated data
- Scale and automate a workflow analysis

This built-in organization is especially useful as studies and analyses become more complex. You won't have to worry about keeping track of data (original data files and analysis outputs) manually.

1. Organize large amounts of data entities

Imagine trying to keep track of hundreds or thousands of original data files in different cloud locations, each with its own non-human-readable URL or DRS URI link. It's a Big Data nightmare.

Using, not copying and storing, data files in the cloud

As long as they exist somewhere in the cloud, you don't have to import data files to your workspace storage (i.e., Google bucket) to analyze them with a workflow in Terra. Data tables let you "store" and organize data files in your workspace - no matter where they are in the cloud. Terra will localize the files for you on the VM that runs your workflow.

Tables reference data files in the cloud with metadata links

This is why you'll often find data tables with links to data files. Clicking on a link exposes the File Details pane with the URL where the file lives (see image below). The cloud path can be a Google bucket URL that starts with "gs://" or a cloud-agnostic URL, like a DRS URI that starts with "drs://".
[Screenshot: File Details pane showing the URL where the file lives]
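As a rough illustration of the two kinds of cloud paths described above, here is a minimal Python sketch that reports which kind of path a table cell holds. The function name and the example paths are hypothetical, and nothing here calls Terra itself.

```python
from urllib.parse import urlparse

# URI schemes mentioned in this article, with a short description of each.
KNOWN_SCHEMES = {
    "gs": "Google Cloud Storage bucket path",
    "drs": "DRS URI (cloud-agnostic)",
}

def describe_cloud_path(path):
    """Return a short description of the kind of cloud path stored in a table cell."""
    scheme = urlparse(path).scheme.lower()
    return KNOWN_SCHEMES.get(scheme, "not a recognized cloud path")

# Hypothetical example values, not real files.
print(describe_cloud_path("gs://example-bucket/sample_001.cram"))
print(describe_cloud_path("drs://example.org/abc123"))
```

Anything without one of these two prefixes (a plain number or a collection date, for instance) is ordinary metadata rather than a link to a file.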

Integrated tables are designed to help

You can organize and associate data in tables in a way that makes sense to you: with separate tables for participants, or samples, or subjects, and even nested tables (pairs and set tables as well as arrays within an entity table). As with spreadsheets, you can search, edit, and manipulate tables right in Terra. You can add as many rows of data or columns of metadata as you need, which lets you keep all the data associated with a particular "entity" - whether a sample or participant - together, including data generated from a workflow analysis.

Examples of data you can keep in a data table

- Links to genomic data (FASTQ, CRAM, BAM, VCF, GVCF files, for example)
- Arrays of genomic data files (such as VCF files for each chromosome in a sample)
- Participant or other ID to associate samples and other data (such as phenotypic or clinical data)
- Study particulars such as collection dates or techniques
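To make the list above concrete, here is a small Python sketch of what a single row might hold. Every ID, column name, and bucket path is made up for illustration; a real table uses whatever columns you define.

```python
# One row of a hypothetical "sample" table, shown as a plain Python dict.
sample_row = {
    "sample_id": "sample_001",                      # unique ID (first column)
    "participant_id": "participant_A",              # associates the sample with a participant
    "cram": "gs://example-bucket/sample_001.cram",  # link to one genomic data file
    "chromosome_vcfs": [                            # array of files held in a single cell
        "gs://example-bucket/sample_001.chr1.vcf.gz",
        "gs://example-bucket/sample_001.chr2.vcf.gz",
    ],
    "collection_date": "2021-03-15",                # study particulars
}

def file_links(row):
    """Collect every cloud file link in a row, flattening array-valued cells."""
    links = []
    for value in row.values():
        candidates = value if isinstance(value, list) else [value]
        links.extend(v for v in candidates if isinstance(v, str) and "://" in v)
    return links

print(file_links(sample_row))
```

Note that the row stores only links and metadata. The files themselves stay wherever they live in the cloud.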

2. Keep track of generated data files (workflow analysis)

If you've ever run a workflow, you know that the generated data is stored by default in workspace storage (i.e., Google bucket), in folders whose names correspond to the workflow submission ID. With long, non-human-friendly directories, it can be challenging to keep the data generated during an analysis associated with the original data.

Output file in workspace storage

The directory tree includes several directories with long alphanumeric IDs (circled above)

If you set up your workflow to write to the data table, you won't have to search through layers of non-human-friendly cloud directories to find the files you need.

Output file in data table

Output file metadata (URL in workspace storage) is associated with the input file in the data table
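The idea of associating an output with its input can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table is modeled as a dict keyed by sample ID, and the column name `aligned_bam` and all paths are hypothetical. In Terra itself, the write-back is set up in the workflow configuration rather than in your own code.

```python
def record_output(table, entity_id, column, output_url):
    """Store a workflow output URL in the row of the entity it was generated from."""
    table[entity_id][column] = output_url

# Hypothetical table keyed by sample ID; each row starts with its input file only.
samples = {
    "sample_001": {"r1_fastq": "gs://example-bucket/sample_001_R1.fastq.gz"},
}

# The output lives in a long, hard-to-read directory in workspace storage,
# but its URL is recorded right next to the input that produced it.
record_output(
    samples,
    "sample_001",
    "aligned_bam",
    "gs://example-workspace-bucket/submissions/1234abcd/align/sample_001.bam",
)

print(samples["sample_001"])
```

After the call, the row for sample_001 contains both the original input link and the generated output link, so there is no need to browse the submission directories.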

3. Automate and scale a workflow analysis

When running a WDL workflow analysis in Terra, reading inputs directly from a data table allows you to:

- Iterate seamlessly through multiple samples (the whole data table if you want!)
- Analyze particular subsets of data without having to configure them manually each time (Terra saves the subset as a set, allowing you to rerun a workflow on the same subset)
- Chain WDL workflows together into whole pipelines by writing each workflow's outputs back to the table, without needing to manually set up your analyses each time
- Combine data from different sources in a single table for analysis to yield better statistics and more robust results
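Here is a minimal Python sketch of the first two points: running the same step over every row, or over a saved subset. The `fake_alignment` function is a stand-in for a real workflow, and the table, set name, and paths are all hypothetical.

```python
def run_on_entities(table, entity_ids, workflow):
    """Run `workflow` once per selected row and collect the outputs by entity ID."""
    return {entity_id: workflow(table[entity_id]) for entity_id in entity_ids}

def fake_alignment(row):
    """Stand-in for a workflow: derive an output file name from the input file name."""
    return row["r1_fastq"].replace("_R1.fastq.gz", ".bam")

# Hypothetical sample table keyed by sample ID.
samples = {
    "sample_001": {"r1_fastq": "gs://example-bucket/sample_001_R1.fastq.gz"},
    "sample_002": {"r1_fastq": "gs://example-bucket/sample_002_R1.fastq.gz"},
    "sample_003": {"r1_fastq": "gs://example-bucket/sample_003_R1.fastq.gz"},
}

# A saved subset, similar in spirit to a row in a set table.
sample_sets = {"pilot_batch": ["sample_001", "sample_003"]}

all_outputs = run_on_entities(samples, samples.keys(), fake_alignment)
pilot_outputs = run_on_entities(samples, sample_sets["pilot_batch"], fake_alignment)

print(all_outputs)
print(pilot_outputs)
```

Because the subset is stored by name, rerunning on `pilot_batch` later selects exactly the same samples without any manual configuration.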

What does a table look like? What does it contain?

Tables are basically spreadsheets built into your workspace, so a table looks a lot like a spreadsheet.

- Each table is identified by its entity (the smallest thing, or piece of input data it contains)
- Each row corresponds to one distinct entity
- Each column is a different piece of information (metadata) about that entity

What's an "entity"? A piece or kind of data

According to the dictionary, an "entity" is "a thing with distinct and independent existence." In Terra, entities are pieces of information - almost like variables - used as input for a workflow analysis.

A table's root entity is the type of primary data stored in the table
It's also the name of the table in the workspace Data page. You can have tables of sample data (a "sample" entity table) or tissues (a "tissue" table) - any table name you want.

Example: specimen data in a specimen table

[Screenshot: specimen table from the Data Tables Quickstart, Part 1]
This specimen table includes links to genomic files of various specimens in the r1_fastq column. Note: The first column is each specimen's unique ID and the second column is the participant ID, from the participant table.
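The structure described above (root entity, one row per entity, one column per piece of metadata) can be sketched in Python by reading a small tab-separated table. This assumes the load-file convention in which the first column header has the form `entity:<name>_id`; the specimen IDs and paths are invented for the example.

```python
import csv
import io

# A tiny, made-up specimen table in tab-separated form.
SPECIMEN_TSV = (
    "entity:specimen_id\tparticipant_id\tr1_fastq\n"
    "specimen_01\tparticipant_A\tgs://example-bucket/specimen_01_R1.fastq.gz\n"
    "specimen_02\tparticipant_B\tgs://example-bucket/specimen_02_R1.fastq.gz\n"
)

def read_table(tsv_text):
    """Return (root entity name, rows keyed by entity ID) for a tab-separated table."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    # Assumed convention: the first header cell looks like "entity:<name>_id".
    root_entity = header[0].removeprefix("entity:").removesuffix("_id")
    rows = {line[0]: dict(zip(header[1:], line[1:])) for line in reader}
    return root_entity, rows

root_entity, rows = read_table(SPECIMEN_TSV)
print(root_entity)
print(rows["specimen_01"]["r1_fastq"])
```

The sketch requires Python 3.9 or later because of `str.removeprefix` and `str.removesuffix`.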

How much data/metadata can you include in your table?

Terra workspace data tables are highly capable and work well for the vast majority of research analyses. There are no fixed limits on a table's number of rows or columns.

Naming and length conventions

Primary key of data table row (values in the first column): alphanumeric characters, underscores, dashes, and periods. Max 254 characters.
Data table name: alphanumeric characters, underscores, and dashes. Max 254 characters.
Data table column names: alphanumeric characters, underscores, dashes, and periods. Max 254 characters.
Data table column values: dependent on data type:
- Number values are represented as FLOAT and have a precision limitation of 16 decimals. Numbers with more than 16 decimals are rounded by approximation. For more information see the MySQL Documentation.
- String values have a maximum of 65,535 bytes.
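If you generate tables with a script, it can help to check names against these conventions before uploading. The sketch below is one possible check, not Terra's own validation. It assumes "alphanumeric" means ASCII letters and digits and that string sizes are measured as UTF-8 bytes; both are assumptions, and the function names are hypothetical.

```python
import re

MAX_NAME_LENGTH = 254      # limit for row IDs, table names, and column names
MAX_STRING_BYTES = 65_535  # limit for string values

# Row IDs and column names: alphanumerics, underscores, dashes, and periods.
WITH_PERIODS = re.compile(r"[A-Za-z0-9_.\-]+")
# Table names: alphanumerics, underscores, and dashes (no periods).
WITHOUT_PERIODS = re.compile(r"[A-Za-z0-9_\-]+")

def is_valid_row_id_or_column(name):
    """Check a primary-key value or a column name."""
    return len(name) <= MAX_NAME_LENGTH and WITH_PERIODS.fullmatch(name) is not None

def is_valid_table_name(name):
    """Check a data table name."""
    return len(name) <= MAX_NAME_LENGTH and WITHOUT_PERIODS.fullmatch(name) is not None

def is_valid_string_value(value):
    """Check a string cell value against the byte limit (UTF-8 assumed)."""
    return len(value.encode("utf-8")) <= MAX_STRING_BYTES

print(is_valid_table_name("sample_set-1"))    # periods are not allowed in table names
print(is_valid_row_id_or_column("r1.fastq"))  # periods are allowed here
```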

Overall size guidance

A table's overall size is based on the number of rows multiplied by the number of columns and the size of the table's values. Tables often contain 200,000+ rows and 100+ columns. As the table size increases well beyond these values, the Terra UI performance and usability may decline, and some operations may not be completed within the allotted timeout.
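For a quick sense of scale, the short Python sketch below compares a planned table with the 200,000-row by 100-column figure quoted above. It counts cells only and ignores the size of the values, so treat it as a rough first estimate rather than a limit enforced by Terra.

```python
# Reference point taken from the guidance above: 200,000 rows x 100 columns.
REFERENCE_ROWS = 200_000
REFERENCE_COLUMNS = 100

def cell_count(rows, columns):
    """Number of cells in a table with the given shape."""
    return rows * columns

def size_relative_to_reference(rows, columns):
    """How many times larger (in cells) a table is than the reference shape."""
    return cell_count(rows, columns) / cell_count(REFERENCE_ROWS, REFERENCE_COLUMNS)

print(cell_count(REFERENCE_ROWS, REFERENCE_COLUMNS))  # 20000000
print(size_relative_to_reference(1_000_000, 250))     # 12.5
```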

If you have questions regarding the size of your data and the capability of Terra workspace data tables, please contact Terra Support for confirmation or guidance. We are committed to users successfully performing large-scale analyses in Terra.

Data tables aren't limited to data inputs for your workflows

Tables are flexible, intended to help organize any information (metadata) you might need for your study. For example, as you do a workflow analysis, you can set it up to add output files in additional columns, keeping original and generated data all together in a single row for each unique sample. Additional table columns work much like columns in a spreadsheet. Column headers describe what metadata are in each column, and cells keep track of the information.

Dedicated sections for different data types

As you think about different data processing steps, like genome alignment, variant calling, expression analyses, etc., you may realize there are multiple data files you'll need to turn your raw data into meaningful output. Maybe you'll need some reference files like FASTAs, dictionary files, and indices. Maybe you'll need lists of cell barcodes or Unique Molecular Identifiers (UMIs). Additionally, you'll need your actual sample files, like FASTQs containing genomic reads or VCFs containing variant calls.

Whatever analysis files you need, the Terra data page has three different sections dedicated to organizing your reference and sample data: Tables, Reference Data, and Other Data.
[Screenshot: dedicated table types on the Data page]

Input data tables

The Tables section is for input data tables such as samples, participants, specimens, or whatever entity you choose. You can copy data tables from other Terra workspaces into this section, or export a table containing metadata for a custom cohort from one of the repositories in the Terra Data Library (to learn more about this, read the Overview: How to add a Table to a Terra workspace).

Preloaded references

The Reference Data section lets you add preloaded references, including a human genomic reference data table for either B37 or Hg38. The reference files are hosted in the Broad's public Google bucket for human reference files.

Add reference files by clicking the Import Data button (top left on the Data page) and selecting Add reference data.
[Screenshot: reference data in the Reference Data section]

Workspace-wide reference files

Using alternate references not offered in the References table? Do you have other workspace-level metadata, such as Docker images? Files and other data that you'll use across many analyses in the workspace can be organized in the Other Data section in the Workspace Data table.

To learn more, see Creating Workspace Data tables.
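The difference between per-row data and workspace-wide data can be sketched as a small lookup in Python. This assumes the `this.<column>` and `workspace.<key>` prefixes that Terra workflow configurations use for inputs, as far as we understand them; the resolver function, keys, and paths below are hypothetical and are not part of Terra.

```python
def resolve_input(expression, row, workspace_data):
    """Look up 'this.<column>' in a table row and 'workspace.<key>' in workspace data."""
    prefix, _, name = expression.partition(".")
    if prefix == "this":
        return row[name]
    if prefix == "workspace":
        return workspace_data[name]
    raise ValueError(f"unsupported input expression: {expression!r}")

# Workspace-level values shared by every analysis in the workspace (made-up examples).
workspace_data = {
    "ref_fasta": "gs://example-references/custom_reference.fasta",
    "docker_image": "example/aligner:1.0",
}

# One row of a hypothetical sample table.
row = {"r1_fastq": "gs://example-bucket/sample_001_R1.fastq.gz"}

print(resolve_input("this.r1_fastq", row, workspace_data))
print(resolve_input("workspace.ref_fasta", row, workspace_data))
```

The row value changes from sample to sample, while the workspace value stays the same for every run, which is why shared references and Docker images fit naturally in workspace data.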

Customizing your data tables: entities, sets, and pairs

Whether you're creating a table from scratch or importing a table from an existing workspace or repository, what data you have, how they're currently organized, and how you plan to analyze them downstream will all impact the type and formatting of data tables you will need to set up.

For example, Terra allows you to set up WDL workflows to pull inputs from and write generated data to a data table. You'll want to make sure the primary root table includes the right input, whether it's single entities or arrays of entities.

Below are examples of custom tables and when you might use them.

Entities and sets

There are two primary types of data tables in Terra: entity tables and set tables.

An entity table contains the individual pieces of data that you want to analyze (samples, files, participants, specimens, etc.), one per row. A set table groups together different entities from your entity table.

When to use an entity table

- When you can run your workflow on single entities (e.g., samples)
- When your data are logically organized by single entities. Note: You can include an array in a cell, if you have multiple data files that are the same kind of metadata and are all associated with a single entity.

When to use a set table

- When you want to analyze the same subset of entities again and again
- When your workflow requires many data files to generate a single output

Learn more about using set tables in When to use a set table for a workflow.
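As an illustration of how a set groups existing entities, the Python sketch below writes a membership file with one line per set-member pair. The header form `membership:<set>_id` followed by the member entity name is our understanding of the load-file convention and should be treated as an assumption; check the linked article for the exact format. The set and sample names are invented.

```python
def membership_tsv(set_entity, member_entity, sets):
    """Build tab-separated membership text: one line per (set, member) pair."""
    # Assumed header convention: "membership:<set entity>_id" then the member entity name.
    lines = [f"membership:{set_entity}_id\t{member_entity}"]
    for set_id, member_ids in sets.items():
        for member_id in member_ids:
            lines.append(f"{set_id}\t{member_id}")
    return "\n".join(lines) + "\n"

# A hypothetical set that reuses samples already present in the sample table.
sample_sets = {"pilot_batch": ["sample_001", "sample_003"]}

print(membership_tsv("sample_set", "sample", sample_sets))
```

The set table holds only references to rows in the entity table, so the sample metadata is never duplicated.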

Pair tables for tumor-normal analysis

Pair tables are a predefined type of data table used in cancer research, where somatic analysis requires samples corresponding to both tumor and normal tissue. Learn more about working with and creating these tables in Adding pair tables to a workspace for tumor-normal analysis.
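The pairing idea can be sketched in Python as follows. The column names `case_sample`, `control_sample`, and `participant` reflect our understanding of the pair table layout and are assumptions here, as are the `sample_type` column, the function name, and every ID; see the linked article for the authoritative format.

```python
def make_pairs(samples):
    """Pair each participant's tumor sample with the same participant's normal sample."""
    by_participant = {}
    for sample_id, row in samples.items():
        by_participant.setdefault(row["participant"], {})[row["sample_type"]] = sample_id

    pairs = []
    for participant, by_type in by_participant.items():
        # Only participants with both a tumor and a normal sample yield a pair.
        if "tumor" in by_type and "normal" in by_type:
            pairs.append({
                "pair_id": f"{participant}_pair",
                "case_sample": by_type["tumor"],      # tumor tissue
                "control_sample": by_type["normal"],  # matched normal tissue
                "participant": participant,
            })
    return pairs

# Hypothetical sample table with a made-up "sample_type" column.
samples = {
    "sample_T1": {"participant": "participant_A", "sample_type": "tumor"},
    "sample_N1": {"participant": "participant_A", "sample_type": "normal"},
    "sample_T2": {"participant": "participant_B", "sample_type": "tumor"},
}

print(make_pairs(samples))
```

In this example participant_B has no normal sample, so only participant_A produces a pair.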

Next steps and additional resources

Now that you have an overview of tables in Terra, you're ready to learn how to populate a workspace with your own data table. Get started by reading How to add a Table to a Terra workspace, which will walk you through creating both entity and set tables.

Hands-on practice

For some guided exercises to help you understand data tables (how to create them, how to import them, and how to modify them), try the Terra (GCP) Quickstart 1: Data tables.

Making tables with scripts (programmatically)

You can automate the process of making and modifying tables using a special API called FISS. Learn more in Managing data and automating workflows with the FISS API.
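As a starting point for scripting, the Python sketch below builds the tab-separated text for a table and hands it to an upload function that you supply. FISS is distributed as the `firecloud` Python package, but this sketch deliberately does not call it: the `uploader` argument is a stand-in, and you should check the FISS documentation for the actual upload function and its signature. The `entity:<name>_id` header is the assumed load-file convention, and all IDs and paths are made up.

```python
import csv
import io

def rows_to_tsv(entity_type, rows):
    """Serialize rows (dicts sharing the same keys) into tab-separated load-file text."""
    id_column = f"{entity_type}_id"
    other_columns = [column for column in rows[0] if column != id_column]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Assumed convention: the ID column header is prefixed with "entity:".
    writer.writerow([f"entity:{id_column}"] + other_columns)
    for row in rows:
        writer.writerow([row[id_column]] + [row[column] for column in other_columns])
    return buffer.getvalue()

def upload_table(tsv_text, uploader):
    """Pass the TSV text to a caller-supplied upload function (for example, a FISS call)."""
    return uploader(tsv_text)

rows = [
    {"sample_id": "sample_001", "r1_fastq": "gs://example-bucket/sample_001_R1.fastq.gz"},
]
tsv_text = rows_to_tsv("sample", rows)

# Stand-in uploader that just records what would have been sent.
sent = []
upload_table(tsv_text, sent.append)
print(sent[0])
```

Keeping the upload step behind a function argument makes the table-building code easy to test without network access or credentials.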

Additional reading

Comments

8 comments

Pazpolak
- October 20, 2019 23:56
Is there an explanation of how to set up a workspace for cancer genomics analysis when the BAM.BAI files are already uploaded to a Google bucket? Is the order of tables still important?

Jason Cerrato
- March 17, 2020 15:29
Hi Pazpolak,

You would set up the workspace in much the same way, adding a column to the .tsv for the BAM.BAI with the associated paths. For an example, download the .tsv from this Featured Workspace here: https://app.terra.bio/#workspaces/help-gatk/Somatic-CNVs-GATK4/data

The order of table upload is still important if you would like them linked/nested. I recommend reviewing the contents of this article Understanding Entity Types for more details.

If you have any further questions, please let us know.

Kind regards,

Jason

James Gatter
- Edited October 28, 2020 20:49
Is there any way to automatically parse a .tsv cell that contains a delimiter into an Array? For example turn text ["item1,item2,item3"] into an Array[String] = ["item1", "item2", "item3"]? I realize I can change the type once the .tsv text is uploaded but then the text becomes one item and I have to do a lot of horizontal scrolling to cut and paste to new slots. It would be more convenient to have a feature like this at upload time.

Allie Hajian
- October 28, 2020 21:12
James Gatter - I don't think there is a way in the UI. But take a look at the Data-Tables-QuickStart and see if it leads you in a fruitful direction (https://support.terra.bio/hc/en-us/articles/360047611871 - making sets in data tables and https://support.terra.bio/hc/en-us/articles/360047621171 running workflows that take arrays as input).

And if that doesn't work for you, you can file a feature request at https://support.terra.bio/hc/en-us/community/topics/360000500452-Feature-Requests.

James Gatter
- October 28, 2020 21:31
Thanks Allie! I found that even if the Arrays aren't marked as Array types in the data table, Cromwell will still accept them so long as they are surrounded by square brackets and delimited by commas. Not an issue after all. I might still put it in for a feature request since it would make visualizing large arrays in Terra nicer.

Allie Hajian
- October 29, 2020 13:20
James Gatter I'm glad it worked out for you. Definitely submit the feature request - we're always looking for the best ways to make visualizing things in Terra more intuitive. Happy analyzing in the cloud on Terra!

Michael Love
- March 09, 2026 15:13
This link didn't work for me above:

"... try the T101 Data Tables Quickstart (click for guide)."

Jason Cerrato
- March 16, 2026 17:52
Hey Michael Love,

We've fixed the link! Thanks for pointing that out.
