Workspace data tables work like spreadsheets built into your workspace. They can store primary data, such as clinical or phenotypic data, and help you organize and keep track of large data files, including inputs for workflow analyses. This article walks through managing workspace data with data tables.
Where is your data in Terra (blob storage container versus data table)?
Where your data resides in the cloud and how it’s integrated with your workspace depends on what kind of data it is. There are different storage locations for unstructured data files and tabular data.
Unstructured data files are stored in the workspace cloud storage (blob storage container)
- Large genomic files (e.g., CRAM, BAM, or FASTQ files) or small files (e.g., TSV or CSV files) that you upload to your workspace cloud storage
- Data generated by a workflow (such as VCFs produced when calling variants). Find these files under "submissions" in your workspace files (on the left side of the Data page).
Tabular data (workspace data tables)
Anything that can go in a spreadsheet can be stored in a data table in your Terra workspace. Data tables are stored in a private relational database in the Azure cloud. You own the database infrastructure, which gives you maximum control over where this data is kept. See Data tables: Additional resources for more details on data tables and the Workspace Data Services infrastructure that powers them.
Examples of data typically stored in a data table
- Administrative data (Sample ID, Project ID, Accession numbers)
- Primary data (clinical or demographic data)
- Metadata (URIs of large files stored in workspace or external cloud storage, dates of sample collection, etc.)
- Anything else traditionally kept in CSV or TSV format
What's in a data table?
Data tables are tabular storage built into your workspace. A data table looks and behaves much like a spreadsheet.
- Each table is identified by its name, which indicates the type of record stored in the table (for example, participant or sample).
- Each row corresponds to one distinct record (e.g., one participant or one sample). The first column of the table holds the record ID, which must be unique across all rows.
- Each column is a different piece of information (primary data or metadata) about that record.
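Because the record ID in the first column must be unique, it can help to check a table for repeated IDs before uploading it. The following is a minimal sketch, not part of Terra itself: the function name `find_duplicate_ids`, the column names, and the sample values are all made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical table contents; column names and values are illustrative only.
TSV_TEXT = (
    "participant_id\tage\tsex\n"
    "P001\t34\tF\n"
    "P002\t51\tM\n"
    "P003\t47\tF\n"
)


def find_duplicate_ids(tsv_text):
    """Return record IDs (first-column values) that appear on more than one row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row, which holds the column names
    ids = [row[0] for row in reader if row]  # the first column is the record ID
    counts = Counter(ids)
    return sorted(record_id for record_id, n in counts.items() if n > 1)


duplicates = find_duplicate_ids(TSV_TEXT)
print("duplicate IDs:", duplicates)
```

An empty result means every row has its own record ID. Any ID listed in the result appears on more than one row and would need to be resolved first.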
Example of a data table with demographic data
Creating data tables to store your data in Terra
To add a data table to your workspace, follow these three steps.
1. Define your data model (optional). This is important if you have complex data in more than one table.
2. Generate a TSV in a spreadsheet editor and store it locally.
3. Upload the TSV to your Terra workspace.
For step-by-step instructions, see How to create a data table from scratch.
Note that if you already have a TSV formatted for Terra on Azure, you can skip to step 3.
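If your records already live in a script or notebook, the TSV from step 2 can also be generated programmatically instead of in a spreadsheet editor. The sketch below uses only Python's standard `csv` module. The helper `write_tsv`, the column names, and the record values are hypothetical, and the exact header format Terra expects is described in the how-to linked above rather than here.

```python
import csv
import os
import tempfile

# Hypothetical records; "sample_id" is an assumed name for the record ID column.
records = [
    {"participant_id": "P001", "sample_id": "S1", "collection_date": "2023-01-15"},
    {"participant_id": "P002", "sample_id": "S2", "collection_date": "2023-02-03"},
]


def write_tsv(rows, path, id_column):
    """Write rows to a tab-separated file with the record ID as the first column."""
    # Put the ID column first, then keep the remaining columns in their original order.
    columns = [id_column] + [name for name in rows[0] if name != id_column]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=columns, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
    return columns


# The file name is arbitrary in this sketch.
tsv_path = os.path.join(tempfile.gettempdir(), "sample.tsv")
columns = write_tsv(records, tsv_path, "sample_id")
print("wrote", tsv_path, "with columns", columns)
```

The resulting file can then be uploaded to the workspace as in step 3.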
Multi-cloud Terra (Azure versus GCP)
If you're moving from Terra on Google, note that you can't directly access data hosted in Google Cloud when using Terra on Azure.
To copy data to Azure cloud storage, follow the instructions in the bring your own data tutorial. Note that copying data from Google to Azure cloud storage will incur egress costs (see Azure’s bandwidth egress pricing).
Terminology: Azure versus GCP
|Term|Azure|GCP|
|---|---|---|
|Cloud storage|Blob storage container|Google bucket|
|Standard storage class|Hot (default)|Standard (default)|
|Controlled access mechanism|Shared Access Signature (SAS)|Signed URL|
|Requester Pays support|No|Yes|
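A Shared Access Signature is carried in the query string of a blob URL, and its `se` parameter holds the expiry time in UTC. As a rough illustration of what this access mechanism looks like in practice, the sketch below reads that expiry from a URL. The account name, container, blob path, and signature are placeholders, and the helper `sas_expiry` is hypothetical rather than part of any Terra or Azure library.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# Placeholder SAS URL; the account, container, blob, and signature are not real.
EXAMPLE_URL = (
    "https://exampleaccount.blob.core.windows.net/container/sample.cram"
    "?sv=2021-08-06&se=2030-01-01T00%3A00%3A00Z&sr=b&sp=r&sig=REDACTED"
)


def sas_expiry(url):
    """Return the SAS expiry ("se" parameter) as an aware UTC datetime, or None."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("se")
    if not values:
        return None  # no expiry parameter, so this is probably not a SAS URL
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    expiry = datetime.fromisoformat(values[0].replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # date-only values are UTC
    return expiry


expiry = sas_expiry(EXAMPLE_URL)
print("expires:", expiry)
print("expired now:", expiry < datetime.now(timezone.utc))
```

Once the expiry time has passed, the URL no longer grants access and a new SAS is needed.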