Workspace data tables (in the Data tab) are a convenient way to reference and organize attributes from different sources, including output files from previous analyses. You can use data tables to store links to data files, list files, arrays, variable names, participant names, phenotype data - really any information you might once have kept in a spreadsheet. You can populate tables directly in the Terra interface or add new ones by uploading a tab-delimited file. This article covers what data tables are and how to format and work with them in Terra.
- Understanding where your data are... and are not
- Why use workspace data tables?
- Data table structure and format
- File format requirements (sets)
- How to edit data table entries directly in Terra (small numbers of inputs)
- How to add a data table to the workspace (large numbers of inputs)
1. Understanding where your data are... and are not
The diagram below shows how data can exist in the cloud - in the Workspace storage bucket, a data library bucket or other storage - but be separate from your workspace. Metadata links in tables can be used to connect the files in the cloud to the rest of your workspace.
You will know your data are NOT linked to the workspace data tables if the Tables column is empty (this can be true even if the data are in your dedicated workspace bucket!):
2. Why use workspace data tables?
They help organize large numbers of samples
Tables can contain everything you need to keep track of data, including intermediate outputs: what types (or "entities") of attributes you are working with, where the data are, and how the entities relate to each other. This built-in organization becomes more useful as studies get more complex: for example, when you have many samples from one participant and many participants in a study.
They make automation easier
When running a workflow analysis, you can manually enter complete direct paths for the input data or other attributes in the WDL, but that approach doesn't work well if you have more than a handful of files. Keeping all attributes in a workspace data table can save time and headaches in the long run. Data tables also enable automation of back-to-back pipelines configured to read from and write to the table.
Side note on including additional information in a data table
Data tables aren't limited to data inputs for your workflows. They are flexible, intended to help organize all the relevant information you might need in the course of your study.
Other useful information a data table might include:
- Phenotype data (for an epidemiologic study)
- Links to other genomic data
- The location on a chromosome (GWAS)
Additional table columns work much like columns added to a spreadsheet. The column header describes what goes in the column, and the cells keep track of the information you need. In the screenshot below, we've included a column labeled "Participant" that links each sample to a particular individual in a study:
3. Data table structure and format
The data table has two parts: column headers that identify what's in that column, and rows of attributes or metadata. Each row corresponds to a different entity (a sample, or a participant, or a lane, for example).
Note: Your data table can contain as much information as you need, but at minimum it needs two columns: an ID column and a data column (containing links to the input data files). You can include additional columns (to reference phenotype information, for example), and the data table will keep it all organized in one place. You can also configure workflows to write links to output files to the workspace table, which is useful for downstream analysis.
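As a rough illustration of the structure described above, a table can be thought of as a list of rows, each keyed by the column headers. The sketch below is not Terra code and does not talk to a workspace; the column names `sample_id`, `cram`, and `participant`, the helper function names, and the `gs://example-bucket/...` paths are all made up for illustration.

```python
# Hypothetical in-memory model of a data table: one dict per entity (row),
# keyed by column header. All IDs and bucket paths below are invented.
table = [
    {
        "sample_id": "sample_01",
        "cram": "gs://example-bucket/sample_01.cram",
        "participant": "participant_A",
    },
    {
        "sample_id": "sample_02",
        "cram": "gs://example-bucket/sample_02.cram",
        "participant": "participant_A",
    },
]


def has_minimum_columns(rows, id_column):
    """Return True if every row has the ID column plus at least one other column."""
    return all(id_column in row and len(row) >= 2 for row in rows)


def samples_for(rows, participant):
    """List the sample IDs whose 'participant' column matches the given value."""
    return [row["sample_id"] for row in rows if row.get("participant") == participant]


print(has_minimum_columns(table, "sample_id"))
print(samples_for(table, "participant_A"))
```

The extra `participant` column is what lets several samples be traced back to one individual, which is the same role the "Participant" column plays in a workspace table.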
4. File format requirements (sets)
Note that the data table also supports set entities, which are lists of the basic entity type, for example:
- Participant Set
- Sample Set
- Pair Set
In set load files, each line lists the membership of a non-set entity (e.g., participant) in a set (e.g., participant set). The first column contains the identifier of the set entity and the second column contains a key referencing a member of that set. For example, a load file for a participant set looks like this:
Note that multiple rows in a set load file may have the same set entity id (e.g. TCGA_COAD).
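The two-column layout described above can be sketched in Python as follows. This is a minimal illustration, not an official Terra tool: the function name is hypothetical, the participant IDs are invented, and the header convention used here (`membership:participant_set_id` followed by `participant`) is an assumption that should be checked against the current Terra documentation before uploading.

```python
import csv
import io


def build_set_load_file(set_type, member_type, memberships):
    """Return tab-delimited text with one (set ID, member ID) pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Assumed header convention; verify against the Terra documentation.
    writer.writerow([f"membership:{set_type}_id", member_type])
    for set_id, member_id in memberships:
        writer.writerow([set_id, member_id])
    return buffer.getvalue()


# Hypothetical participant IDs. The set ID TCGA_COAD repeats once per member.
memberships = [
    ("TCGA_COAD", "participant_001"),
    ("TCGA_COAD", "participant_002"),
    ("TCGA_COAD", "participant_003"),
]

set_tsv = build_set_load_file("participant_set", "participant", memberships)
print(set_tsv)
```

Each output line pairs the set ID with one member, so a set with three members takes three rows that share the same first column.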
5. How to edit data table entries directly in Terra
Edit individual cells
If your workspace already includes a data table with at least the number of rows you need, you can edit individual cells by clicking on the pencil icon in the cell you want to change:
6. How to add a data table to the workspace
If you need to add rows to your table, or don't want to type in attributes manually, you will need to upload a tab-delimited file to generate a table.
6.1. Make a "load" file of metadata
Helpful hint: You may find it easiest to use a spreadsheet editor to generate the file to upload. It keeps the values in columns instead of all jammed together in one line.
The most basic file will look like this:
- Each row is one entity (sample, or participant, for example)
- Each column is a particular variable (could be sample data in a Google bucket, phenotype data, participant ID)
- Each cell contains metadata such as the sample ID, the file path to data in a Google bucket, or phenotype data
The first column header has to have the format `entity:your_entity_id`
(Note that this first column header must end with `_id`)
You can call your_entity whatever you want. For example, valid first column headers would be `entity:sample_id` or `entity:lane_number_id`.
The next column header would be the input variable name, such as `cram`.
6.2. Save the file in tab-delimited format on your local machine
6.3. Upload to the workspace Data tab
- Click on the "+" sign in the blue circle at the top of the left TABLES column
- Select either the "upload" or "paste" option
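If you would rather script the load file than build it in a spreadsheet editor before uploading, the format described above can be produced with Python's standard `csv` module. This is a minimal sketch, not an official Terra tool: the function names, the file name `sample_load_file.tsv`, the sample IDs, and the `gs://example-bucket/...` paths are all hypothetical.

```python
import csv


def check_first_header(header):
    """Raise ValueError unless the header looks like entity:<name>_id."""
    if not (header.startswith("entity:") and header.endswith("_id")):
        raise ValueError(
            f"first column header must look like entity:<name>_id, got {header!r}"
        )
    if len(header) <= len("entity:") + len("_id"):
        raise ValueError("the entity name between 'entity:' and '_id' is empty")


def write_load_file(path, headers, rows):
    """Write headers and rows to a tab-delimited file after checking the first header."""
    check_first_header(headers[0])
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(headers)
        writer.writerows(rows)


# Hypothetical entities: one row per sample, with a made-up bucket path.
headers = ["entity:sample_id", "cram"]
rows = [
    ["sample_01", "gs://example-bucket/sample_01.cram"],
    ["sample_02", "gs://example-bucket/sample_02.cram"],
]
write_load_file("sample_load_file.tsv", headers, rows)
```

The resulting file can then be uploaded with the "upload" option, or its contents copied into the "paste" option.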
Once you've uploaded a load file to your workspace, you should see your data right away in the Data tab. Click on the entity link to expand the table, which will look like this: