Managing data with data tables (Terra on Azure)

Allie Cliffe
  • Updated

If you're interested in using Terra on Azure, please email terra-enterprise@broadinstitute.org.

Workspace data tables are like an integrated database that can store primary data like clinical or phenotypic data and help you organize and keep track of large data files including inputs for workflow analyses. This article walks through workspace data management using data tables.

Where is your data in Terra (blob storage container versus database tables)?

Where your data resides in the cloud and how it’s integrated with your workspace depends on what kind of data it is. There are different storage locations for unstructured data files and tabular data

Unstructured data files  are stored in the workspace cloud storage (blob storage container)

  • Large genomic files (i.e., CRAM, BAM, FASTQ files)
  • Small files (TSV or CSV) that you upload to your workspace cloud storage 
  • Data generated in a workflow (such as VCFs generated when calling variants) are stored in workspace cloud storage. Find them under "submissions" in your workspace files (on the left side of the Data page). 

Tabular data is stored in a relational database (tables)

  • Administrative data (Sample ID, Project ID, Accession numbers)
  • Primary data (clinical or demographic data)
  • Metadata (URIs of large files stored in workspace or external cloud storage, dates of sample collection, etc.)
  • Anything else traditionally kept in CSV or TSV format

Anything that can be in a spreadsheet can be stored in a data table in your Terra workspace. Data tables are stored in a private relational database in Azure cloud. The database infrastructure is owned by you, which gives you maximum control over where this data is kept. See Data tables: Additional resources for more details on data tables and the Workspace Data Services infrastructure that powers them. 

Database versus spreadsheetExcel is a spreadsheet program; historically, data tables on GCP behave more like a spreadsheet. Spreadsheets are common and easy to use, but limit the scale and complexity of data you store.

Why use databases for tables?
Data is growing large and more complex. Databases provide powerful tools for storing, managing, searching, analyzing large structured datasets

Editing a spreadsheet versus a database
In a spreadsheet, you can edit a specific cell, which is flexible in terms of datatype.
In a database, data is structured, with one datatype per column; and relational, so you can link your data, allowing more powerful search. You need to be careful when editing a TSV to upload to Terra on Azure to make sure the value for each row fits the expected data type for that column. 

What's in a data table? Your workspace database

Tables are a relational database built into your workspace, and a data table looks and behaves a lot like a spreadsheet with some important caveats.

  • Each table is identified by its name, which indicates the type of data or record stored in the table. For example, participant or sample.
  • Each row corresponds to one distinct record (e.g., participant or sample), with a unique ID that identifies the individual record in that row in the table. The first column of the table is the record ID, which must be unique for all rows. 
  • Each column is a different piece of information (primary data or metadata) about that record.
  • Each column has a specified datatype (see options below). Each field in the column must conform to the same datatype. 

Supported datatypes

Columns of data in data tables in Terra on Azure are automatically assigned one of the supported data types below.

Defining the datatype

Terra will interpret the datatype when you upload a TSV to your workspace. For example, if the cells in a column have the format "https://<file-diectory>/<file-name>", Terra will assume the column includes links to files in cloud storage. The table cell will be clickable. Note that this can cause problems if a cell in one row does not seem to have the same datatype as the rest. It's important to ensure consistency in your TSVs. 

  • Data type

    Definition

    Format in WDS

    Array format

    Record ID  The primary key of the record. Used to retrieve and update records.  You may specify a column using “sys_name” as the header; otherwise, WDS will default to using the first column of the TSV. All cells in this column must be unique. Set table
    String

    A sequence of characters. 

    The most commonly used data type to store text. 

    To explicitly create a string in WDS with a value that would otherwise resolve to a relation, number, boolean, date, or datetime, wrap your string in double quotes. [foo,bar,"\"baz\" is the best"]
    Number An integer Unquoted numbers [2,4,6]
    Float A number that has a decimal place 0.2 [0.2, 0.3, 0.4]
    Boolean Binary values (e.g., true or false) WDS can also interpret different casings such as "TRUE", "fALSE", "True" "[true, false, true]"
    Date A date value in ISO-8601 format

    YYYY-MM-DD

    For instance: 2011-12-03

    [YYYY-MM-DD, YYYY-MM-DD,
    YYYY-MM-DD]

    Datetime A combined date and time value in ISO-8601 format

    YYYY-MM-DDTHH:MM:SS

    For instance: 2011-12-03T10:15:30

    [YYYY-MM-DDTHH:MM:SS, YYYY-MM-DDTHH:MM:SS]
    Relation To relate one record to another

    terra-wds:/{table-name}/{row-id}

    For example: terra-wds:/sample/sample1

    Arrays of relations must all relate to the same record type.

    For example,
    ["terra-wds:/type/1", "terra-wds:/type/2"]

    Array A list, or collection of similar types of data, in a specific order.

    Represented as an array using JSON syntax. That is, include all array values as a comma-delimited list inside square brackets [].

    Values inside an array can be any of the datatypes above: relation, number, boolean, date, datetime, or string.

    n/a

Example of a data table with demographic data

Terra-on-Azure_Screenshot-of-data-table.png

Creating data tables to store your data in Terra

To add a data table to your workspace you'll need to follow three steps.

1. Define your data model (optional). This is important if you have complex data in more than one table. 

2. Generate a TSV in a spreadsheet editor and store it locally. 

3. Upload the TSV to your Terra workspace. 

For step-by-step instructions, see How to create a data table from scratch.

Note that if you already have a TSV formatted for Terra on Azure, you can skip to step 3.

Multi-cloud Terra (Azure versus GCP)

If you're moving from Terra on Google, it is important to note that you won’t be able to directly access data hosted in Google Cloud when using Terra on Azure.

To copy data to Azure cloud storage, follow instructions in the bring your own data tutorial.

Note that copying data from Google to Azure cloud storage will incur egress costsFor more egress cost details, see Azure’s bandwidth egress pricing).

Terminology: Azure versus GCP

Feature

Microsoft/Azure

Google Cloud

Cloud storage Blob storage container Google bucket
Objects (files) Blob Object
Standard storage class Hot (default) Standard (default)
Controlled access mechanism Shared Access Signature (SAS) Signed URL
Open access Public Public
Requester Pays support No Yes

Was this article helpful?

0 out of 0 found this helpful

Comments

0 comments

Please sign in to leave a comment.