Terra Data Repository: Overview

Anton Kovalsky

This article gives an overview of the structure and function of the Terra Data Repository. The first section gives some background on the motivation behind the Data Repo's development. The second section outlines the components of the Data Repo. The third section summarizes the Data Repo's access controls.

What is the Data Repo, and why does it exist?

The Terra Data Repository (TDR for short) is a data repository designed for flexibly sharing access to different cross sections of data. It gives data custodians granular control to build focused data sets out of larger ones and share them with specific users on a case-by-case basis.

Support for Complex Schemas

Historically, work in genomics has been file-centric, making end-to-end workflows relatively inflexible. Input files were turned into intermediary files, which were turned into output files whose analytical value was limited in part by the inflexibility of the workflows. Each file had a specific purpose and function, and the analytical scope of the output was consequently rigidly defined. Since each step of the workflow produced computer-readable files rather than human-readable data, most of the analytical value was in the output at the very end of the pipeline. Modifying pipelines to integrate clinically useful input and output mechanisms at intermediate points of the workflows carried enormous overhead.

Such file-centric pipelining remains an important analysis mode, especially for large projects and standardized uses. There is also an increasing need for more modular and nimble data storage that can integrate with rich phenotypic data, electronic health records, epigenomics data, and other non-file data. For example, the Human Cell Atlas includes a combination of phenotypic and genomic data that can be chopped up in different ways. Creating a data storage solution that allows for more fine-tuned and granular organization of data would make a huge difference for researchers and clinicians alike. 

Our goal is to have a single collection of data that can be sliced and diced in different ways to allow access to different overlapping subsets without making extra copies. We want the Terra Data Repository to be extremely versatile, while being efficient in terms of how much storage it requires, to minimize the cost.

Repository Structure

The Data Repository metadata describes the structure of the primary data objects and access control information. 

Dataset | Data Snapshot | Asset | Schema


Dataset

A dataset is a container holding a set of related data

Each piece of data (primary data) is owned by exactly one dataset. A dataset defines the layout (schema) of the data it holds. The data layout of the dataset is stored in the Repository Metadata.

The Terra Data Repository will support many datasets, managed by different people and containing different kinds of data. For example, one dataset might have a participant-sample schema primarily focused on sequencer files while another might have a schema primarily focused on electronic health records.

Asset

An asset is a set of data that always goes together

For example, one biosample might include a BAM file, some sequencing quality metrics, information about tissue type, and information about the donor. The Terra Data Repo enables the creation of asset specifications that describe different views of what “goes together.”

Assets can be used to identify which data to include in data Snapshots

Different and overlapping asset specifications are permitted. For example, one might select a set of sample-centric assets for research purposes by organizing an asset primarily around sample ID and including associated files and patient health records. Alternatively, one might select a set of patient-centric assets for clinical purposes by organizing an asset around patient ID and including the patient's associated samples and files.

The ability to use the same base data to create datasets with different styles of tabular organization is a key goal of the Terra Data Repository. We want to empower users to create complex interrelated tables that meet their needs.
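As a rough illustration of overlapping asset specifications, the sketch below builds a sample-centric and a patient-centric view from the same base rows. It is not TDR code: the table contents, field names, and the two helper functions are all hypothetical, chosen only to mirror the example in the text.

```python
# Hypothetical in-memory stand-ins for two tables in one dataset.
subjects = {"P1": {"age": 54}, "P2": {"age": 61}}
samples = [
    {"sample_id": "S1", "subject_id": "P1", "tissue_type": "blood", "bam_file": "S1.bam"},
    {"sample_id": "S2", "subject_id": "P1", "tissue_type": "liver", "bam_file": "S2.bam"},
    {"sample_id": "S3", "subject_id": "P2", "tissue_type": "blood", "bam_file": "S3.bam"},
]


def sample_centric(sample_id):
    """One asset per sample: the sample row plus its donor's record."""
    sample = next(s for s in samples if s["sample_id"] == sample_id)
    return {"sample": sample, "subject": subjects[sample["subject_id"]]}


def patient_centric(subject_id):
    """One asset per patient: the patient record plus all of their samples."""
    return {
        "subject": subjects[subject_id],
        "samples": [s for s in samples if s["subject_id"] == subject_id],
    }


research_asset = sample_centric("S2")
clinical_asset = patient_centric("P1")
```

Both views reference the same underlying rows, so neither requires a second copy of the base data.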

Data Snapshot

A data snapshot is a slice of a single dataset or a view of all or part of one or more studies

For example, a Snapshot could be the data for samples funded by one organization, or the subset of individuals matching specific criteria not common to everyone in their cohort. The Snapshot is the key element for most users.

The data snapshot is a unit of access control management

Access is granted to a data snapshot and applies to all data in the snapshot's view. Data snapshots are immutable: they provide an unchanging view of the base data, and later changes to the base data are not visible in the snapshot. Immutability allows analysis to be reproducible over time. There may be exceptions; for example, revocation of consent might require removal of data. Whether that happens depends on the dataset owner's policy: TDR enables the operation but does not require it.
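One way to picture an immutable snapshot that does not copy the base data is sketched below. This is an assumption-laden toy model, not a description of how TDR is implemented: the `Dataset` and `Snapshot` classes, the append-only write log, and the method names are all hypothetical.

```python
class Dataset:
    """Toy append-only store: each write adds a row version, nothing is overwritten."""

    def __init__(self):
        self._log = []  # (row_id, values) tuples in write order

    def write(self, row_id, values):
        self._log.append((row_id, dict(values)))

    def as_of(self, position):
        """Rebuild the table as it looked after the first `position` writes."""
        view = {}
        for row_id, values in self._log[:position]:
            view[row_id] = values
        return view

    def current(self):
        return self.as_of(len(self._log))

    def snapshot(self, row_ids):
        """Record which rows are in view and the log position, without copying data."""
        return Snapshot(self, len(self._log), frozenset(row_ids))


class Snapshot:
    def __init__(self, dataset, position, row_ids):
        self._dataset = dataset
        self._position = position
        self._row_ids = row_ids

    def read(self):
        view = self._dataset.as_of(self._position)
        # Return copies so a reader cannot alter the base data through the snapshot.
        return {rid: dict(view[rid]) for rid in self._row_ids if rid in view}


ds = Dataset()
ds.write("S1", {"tissue_type": "blood"})
ds.write("S2", {"tissue_type": "liver"})
snap = ds.snapshot(["S1"])
ds.write("S1", {"tissue_type": "plasma"})  # later change to the base data
```

Here `snap.read()` still reports the value of `S1` from the moment the snapshot was taken, while `ds.current()` reflects the later write.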

Schema

The schema represents the organization of a dataset's primary data and metadata. The schema (data model) consists of these components:

  • Entities
    The primary object a table contains, identified by a unique key (e.g., a “subject” entity for phenotypic data or a “sample” entity for genomic data). Each row in the table is a distinct entity identified by an ID key.
  • Attributes/properties
    The columns in a database table (e.g., phenotypic data such as demographics or lab results, or genomic data metadata)
  • Associations
    The unique identifiers that link data between tables (e.g., a subject_id column in the sample table that links samples with the subject)
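The three components above can be sketched as plain data structures. The following is a minimal illustration only, not the TDR schema format: the class names, the `validate` helper, and every column name other than `subject_id` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Association:
    """A column in one table that points at the ID key of another table."""
    from_table: str
    from_column: str
    to_table: str


@dataclass
class Table:
    """One entity type: each row is a distinct entity identified by id_key."""
    name: str
    id_key: str
    attributes: list = field(default_factory=list)


@dataclass
class Schema:
    tables: dict
    associations: list

    def validate(self):
        """Check that every association refers to a known table and column."""
        for a in self.associations:
            if a.from_table not in self.tables or a.to_table not in self.tables:
                raise ValueError(f"unknown table in association: {a}")
            if a.from_column not in self.tables[a.from_table].attributes:
                raise ValueError(f"unknown column in association: {a}")


schema = Schema(
    tables={
        "subject": Table("subject", id_key="subject_id", attributes=["age", "sex"]),
        "sample": Table(
            "sample",
            id_key="sample_id",
            attributes=["subject_id", "tissue_type", "bam_file"],
        ),
    },
    # The subject_id column in the sample table links each sample to its subject.
    associations=[Association("sample", "subject_id", "subject")],
)
schema.validate()
```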

Repository Permission Model

The roles available within the Data Repo, and the specific permissions afforded to each role, are described below.

Roles

Admin

The Admin role is an owner of the Data Repository. This role is assigned to technically trained individuals on the Data Repo development team to help maintain the service. For example, an Admin can assign Steward roles to other users if the original owners are no longer available.

Steward

A Steward, or Data Owner, is the person who created the dataset. While they are ultimately liable for the data, they can assign the hands-on data management to another person by assigning the Custodian role.

Custodian

The Custodian role is defined on a dataset. Someone may be the Custodian for one or more studies. A Custodian is responsible for creating data snapshots over datasets and controlling access to those snapshots.

Snapshot Creator

The Snapshot Creator role is defined on a dataset. Users with this role can read dataset data and create new Snapshots.

Reader

The Reader role is defined on a snapshot. Readers may be assigned by a dataset Custodian to get read-access to the snapshot data.

Discoverer

A Discoverer is someone using the repository to find data snapshots for analysis. They cannot read snapshot data unless they are given the Reader role.

Owner

The Owner role is assigned to the creator of a spend profile, which is a billing account used to fund data storage and querying. They can update, delete, or share this profile with other users.

User

The User role is defined on a spend profile. Users may link this spend profile to a dataset or snapshot that they create. They can also assign this role to other individuals.
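To make the idea of roles defined on a specific resource concrete, here is a small hedged sketch of a permission check. It encodes only what the role descriptions above state explicitly (a Reader can read snapshot data; a Discoverer can find snapshots but not read them; a Snapshot Creator can read dataset data and create snapshots). The action names, the `grants` table, and the `can` function are hypothetical and are not the Data Repo's actual permission names or API; the real role assignments may grant more than is shown here.

```python
from enum import Enum


class Role(Enum):
    SNAPSHOT_CREATOR = "snapshot_creator"
    READER = "reader"
    DISCOVERER = "discoverer"


# Illustrative action names, taken from the prose role descriptions only.
ALLOWED = {
    Role.SNAPSHOT_CREATOR: {"read_dataset_data", "create_snapshot"},
    Role.READER: {"read_snapshot_data"},
    Role.DISCOVERER: {"discover_snapshot"},
}

# Roles are granted per resource: (user, resource id) -> set of roles.
grants = {
    ("alice@example.org", "snapshot-1"): {Role.READER},
    ("bob@example.org", "snapshot-1"): {Role.DISCOVERER},
    ("carol@example.org", "dataset-1"): {Role.SNAPSHOT_CREATOR},
}


def can(user, resource, action):
    """Return True if any role the user holds on this resource allows the action."""
    roles = grants.get((user, resource), set())
    return any(action in ALLOWED[role] for role in roles)
```

With these grants, `can("bob@example.org", "snapshot-1", "read_snapshot_data")` is false until Bob is also given the Reader role on that snapshot.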

Permissions

We define permissions for fairly fine-grained operations in the repository. The table below summarizes the permissions assigned to the various roles.

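Object | Permission Name | Admin | Steward | Custodian | Snapshot Creator | Reader | Discoverer | Owner | User
Repository | create_dataset | | | | | | | |
Repository | list_jobs | | | | | | | |
Repository | delete_jobs | | | | | | | |
Repository | delete | | | | | | | |
Repository | configure | | | | | | | |
Repository | share_policy::admin | | | | | | | |
Repository | share_policy::steward | | | | | | | |
Repository | read_policy::steward | | | | | | | |
Repository | read_policies | | | | | | | |
Repository | alter_policies | | | | | | | |
Spend Profile | update_metadata | | | | | | | |
Spend Profile | update_billing_account | | | | | | | |
Spend Profile | delete | | | | | | | |
Spend Profile | link | | | | | | | |
Spend Profile | read_policies | | | | | | | |
Spend Profile | share_policy::owner | | | | | | | |
Spend Profile | share_policy::user | | | | | | | |
Spend Profile | alter_policies | | | | | | | |
Dataset | read_dataset | | | | | | | |
Dataset | delete | | | | | | | |
Dataset | manage_schema | | | | | | | |
Dataset | read_data | | | | | | | |
Dataset | ingest_data | | | | | | | |
Dataset | soft_delete | | | | | | | |
Dataset | hard_delete | | | | | | | |
Dataset | link_snapshot | | | | | | | |
Dataset | unlink_snapshot | | | | | | | |
Dataset | list_snapshots | | | | | | | |
Dataset | share_policy::steward | | | | | | | |
Dataset | share_policy::custodian | | | | | | | |
Dataset | share_policy::ingester | | | | | | | |
Dataset | read_policies | | | | | | | |
Dataset | alter_policies | | | | | | | |
Snapshot | delete | | | | | | | |
Snapshot | update_snapshot | | | | | | | |
Snapshot | read_data | | | | | | | |
Snapshot | discover_data | | | | | | | |
Snapshot | share_policy::steward | | | | | | | |
Snapshot | share_policy::custodian | | | | | | | |
Snapshot | share_policy::reader | | | | | | | |
Snapshot | share_policy::discoverer | | | | | | | |
Snapshot | read_policy::custodian | | | | | | | |
Snapshot | read_policy::steward | | | | | | | |
Snapshot | read_policy::discoverer | | | | | | | |
Snapshot | read_policies | | | | | | | |
Snapshot | alter_policies | | | | | | | |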
