Workspace data tables (in the Data tab) are a convenient way to reference and organize attributes from different sources, including output files from previous analyses. You can use data tables to store links to data files, list files, arrays, variable names, participant names, phenotype data - really any information you might have once kept in a spreadsheet. You can populate tables directly in the Terra interface or add new ones by uploading a tab-delimited file. This article covers what data tables are and how to format and work with them in Terra.
- Understanding where your data are... and are not
- Why use workspace data tables?
- Data table structure and format
- File format requirements (sets)
- How to edit data table entries directly in Terra (small numbers of inputs)
- How to add (or delete) columns and rows in a table
- How to add a data table to the workspace (large numbers of inputs)
Watch an introductory video on data tables here:
1. Understanding where your data are... and are not
The diagram below shows how data can exist in the cloud - in the Workspace storage bucket, a data library bucket or other storage - but be separate from your workspace. Links in tables can be used to connect the files in the cloud to the rest of your workspace. For example, you can configure a workflow to reference a link to input data from a workspace table.
You will know your data are NOT linked to a workspace data table if the Tables column is empty. Note that data can be in your workspace bucket but not in a table (see screenshot below). In that case, workflows cannot take the data as input from a table; they will only find the data if you supply the full path to each file as input.
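The distinction between files that are in a bucket and files that are linked from a table can be sketched in a few lines of Python. This is only an illustration: the bucket name, file paths and the `unlinked_files` helper are hypothetical, and in practice the list of bucket files would come from whatever tool you use to list your storage.

```python
def unlinked_files(bucket_paths, table_rows):
    """Return bucket paths that are not referenced by any table cell."""
    linked = {value for row in table_rows for value in row.values()}
    return [path for path in bucket_paths if path not in linked]


# Hypothetical bucket contents and table rows, for illustration only.
bucket_paths = [
    "gs://example-bucket/sample1.cram",
    "gs://example-bucket/sample2.cram",
]
table_rows = [
    {"entity:sample_id": "sample1", "cram": "gs://example-bucket/sample1.cram"},
]

# Files listed here are in the bucket but not linked from the table.
print(unlinked_files(bucket_paths, table_rows))
```

Any file this reports would need its full path entered as a workflow input, or a new row added to the table.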
2. Why use workspace data tables?
They help organize large numbers of samples
Tables can contain all you need to keep track of data, including intermediate outputs: what types (or "entities") of attributes you are working with, where the data are, and how the entities relate to each other. This built-in organization can be useful as studies get more complex: if you have many samples from one participant, and perhaps many patients in a study, for example.
They make automation easier
When running a workflow analysis, you can manually enter the complete path to each input file or other attribute in the WDL, but this approach does not work well if you have more than a handful of files. Keeping all attributes in a workspace data table can save time and headaches in the long run. Tables also enable automation of back-to-back pipelines that are configured to read from and write to the table.
How to include additional information in a data table
Data tables aren't limited to data inputs for your workflows. They are flexible, intended to help organize all the relevant information you might need in the course of your study.
Other useful information a data table might include:
- Phenotype data (for an epidemiologic study)
- Links to other genomic data
- The location on a chromosome (GWAS)
Additional table columns work much like columns in a spreadsheet. The column header describes what goes in the column, and the cells keep track of the information you need. In the screenshot below, we've included a column labeled "Participant" that links each sample to a particular individual in a study:
3. Data table structure and format
The data table has two parts: 1) column headers that identify what's in that column, and 2) rows of attributes or metadata. Each row corresponds to a different entity (a sample, or a participant, or a lane, for example).
Note: Your data table can contain as much information as you need, but at minimum it needs two columns: an id column and a data column (containing links to input data files). You can include additional columns (to reference phenotype information, for example), and the data table will keep it all organized in one place. You can also configure workflows to write links to output files to the workspace table, which is useful for downstream analysis.
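To make the two-part structure concrete, here is a minimal Python sketch that reads a small tab-delimited table into headers and rows. The sample ids, bucket paths and the extra `participant` column are made-up values for illustration, not data from a real workspace.

```python
import csv
import io

# A hypothetical table: an id column, a data column, and one extra column.
tsv_text = (
    "entity:sample_id\tcram\tparticipant\n"
    "sample1\tgs://example-bucket/sample1.cram\tparticipant1\n"
    "sample2\tgs://example-bucket/sample2.cram\tparticipant1\n"
)

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
headers = reader.fieldnames  # part 1: the column headers
rows = list(reader)          # part 2: one dict per entity (row)

for row in rows:
    print(row["entity:sample_id"], "->", row["cram"])
```

Each row is one entity, and each key in the row is a column header, which mirrors how the table appears in the Data tab.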
4. File format requirements (sets)
Note that the data table also supports set entities, which are lists of the basic entity type, for example:
- Participant Set
- Sample Set
- Pair Set
In set tables, each line lists the membership of a non-set entity (e.g., participant) in a set (e.g., participant set). The first column contains the identifier of the set entity and the second column contains a key referencing a member of that set. For example, a table for a participant set looks like this:
Note that multiple rows in a set table may have the same set entity id (e.g. TCGA_COAD).
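The one-row-per-member layout can be illustrated with a short Python sketch that groups membership rows by set. The header names (`membership:participant_set_id` and `participant`), the member ids and the second set name are assumptions made for this example; only the set id TCGA_COAD comes from the text above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical set table: the set id repeats once per member.
membership_tsv = (
    "membership:participant_set_id\tparticipant\n"
    "TCGA_COAD\tparticipant1\n"
    "TCGA_COAD\tparticipant2\n"
    "other_set\tparticipant3\n"
)


def group_members(text):
    """Map each set id to the list of member ids, in file order."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row
    sets = defaultdict(list)
    for set_id, member_id in reader:
        sets[set_id].append(member_id)
    return dict(sets)


print(group_members(membership_tsv))
```

Because each line records a single membership, a set with many members simply appears on many lines.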
5. How to modify table entries or delete rows (directly in Terra)
If you only need to change a handful of entries, you can do so directly in the Terra interface, starting in the Data tab.
To edit individual table cells
If your workspace already includes a data table with at least the number of rows you need, you can edit individual cells by clicking on the pencil icon in the cell you want to change:
To delete table rows
To delete table rows in the interface, check the box to the left of the row(s) you want to delete, click the three vertical dots at the top (next to the Copy Page to Clipboard button) and select "Delete".
6. How to add rows or columns (using tsv files)
If you need to add rows, or add or delete columns in a workspace table, or don't want to manually type into the interface, you will need to modify the table as a tab-separated values file outside of Terra. You can download and modify the existing table, or create one from scratch.
Add table rows, or add or delete table columns
To add rows, or add or delete table columns in a table that already exists:
- First download the table to your local machine
To download all rows:
Click the Download all Rows button at the top of the table
To download some rows:
a. Select the rows to download
b. Click the three vertical dots to the right of the Download all Rows button and select "Download as tsv"
- Add rows, or add or delete columns using your favorite spreadsheet editor
- Save the file in "tab-delimited values" format
- Click the blue "+" at the top right of the TABLES column (in rectangle in screenshot below) and follow instructions to upload the table. Note that you will get a warning about overwriting the existing table. That is OK.
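If you prefer a script to a spreadsheet editor for the editing step, the same change can be sketched in Python. This is a rough example, not a Terra tool: the `add_column` helper, the column name and the values are hypothetical, and the downloaded table is represented as a string to keep the example self-contained.

```python
import csv
import io


def add_column(tsv_text, column_name, values_by_id):
    """Return the table text with one extra column appended.

    values_by_id maps the value in the first (id) column to the new cell;
    rows without an entry get an empty cell.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    id_column = reader.fieldnames[0]
    fieldnames = reader.fieldnames + [column_name]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[column_name] = values_by_id.get(row[id_column], "")
        writer.writerow(row)
    return out.getvalue()


# Hypothetical contents of a downloaded table.
downloaded = "entity:sample_id\tcram\nsample1\tgs://example-bucket/sample1.cram\n"
updated = add_column(downloaded, "participant", {"sample1": "participant1"})
print(updated)
```

The resulting text can be saved as a tab-delimited file and uploaded in the same way as a file edited by hand.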
Create a table (tsv file) from scratch
1. Make a "load" file (i.e. spreadsheet)
You may find it easiest to use a spreadsheet editor to generate the file to upload. It keeps the values in columns instead of all jammed together in one line.
The most basic file will look like this:
- Each row is one entity (sample, or participant, for example)
- Each column is a particular variable (could be sample data in a Google bucket, phenotype data, participant ID)
- Each cell contains metadata such as the sample ID, the file path to data in a Google bucket, or phenotype data
The first column header must have the format `entity:your_entity_id`
(Note that this first header must end with "_id")
You can call your_entity whatever you want. For example, valid first column headers would be `entity:sample_id` (for a sample table) or `entity:lane_number_id` (for a table of lane numbers).
The next column header would be the input variable name, such as `cram`
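A quick way to catch a malformed first header before uploading is a small check like the Python sketch below. The pattern encodes only the `entity:` prefix and `_id` suffix described above; the set of characters allowed in the entity name is an assumption for this example, not a documented rule.

```python
import re

# Assumed pattern: "entity:" + a name of letters, digits, "_" or "-" + "_id".
FIRST_HEADER = re.compile(r"^entity:[A-Za-z0-9_-]+_id$")


def is_valid_first_header(header):
    """Check that a header looks like entity:your_entity_id."""
    return FIRST_HEADER.match(header) is not None


for header in ["entity:sample_id", "entity:lane_number_id", "sample_id", "entity:sample"]:
    print(header, is_valid_first_header(header))
```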
2. Save file in "tab-delimited format" on your local machine
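As an alternative to exporting from a spreadsheet editor, a load file can be written directly with Python's `csv` module. This is a minimal sketch: the file name, sample ids and bucket paths are placeholders, and the file is written to a temporary directory only so the example runs anywhere.

```python
import csv
import os
import tempfile

# Hypothetical table contents: header row first, then one row per entity.
rows = [
    ["entity:sample_id", "cram"],
    ["sample1", "gs://example-bucket/sample1.cram"],
    ["sample2", "gs://example-bucket/sample2.cram"],
]

load_path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
with open(load_path, "w", newline="") as handle:
    # delimiter="\t" produces tab-separated rather than comma-separated values.
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)

print("Wrote load file to", load_path)
```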
3. Upload to the workspace Data tab
- Click on the "+" sign in the blue circle at the top of the left TABLES column
(in the orange rectangle in the screenshot below)
- Select either the "upload" or "paste" option
Once you've uploaded a load file to your workspace, you should see your data right away in the Data tab. Click on the entity link to expand the table, which will look like this: