Terra Data Repository: Overview

Anton Kovalsky

How can the Terra Data Repository help you? If you're a dataset owner, it can help get your dataset into the hands of researchers. If you're a researcher, TDR can make it easier to find and access the data you want. This article addresses these questions with an overview of the structure and function of the Terra Data Repository and a detailed list of Data Repo's access controls. 

What is the Data Repo, and why does it exist?

The Terra Data Repository (TDR for short) is a data repository designed to reduce friction for dataset owners who deliver, and researchers who use, ever-larger datasets in the cloud.

Data owner friction points and TDR solutions

  • Versatility
    Fine-tuned and granular organization structures allow for building focused datasets out of larger ones and sharing them with specific users on a case-by-case basis. Use the same base data to create datasets with different tabular organization.
  • Cost-saving
    A single dataset can be sliced and diced in different ways to allow access to different overlapping subsets without making and storing extra copies.
  • Versioning and consent withdrawal 
    Raw datasets can be continually updated and delivered as new versions/releases via immutable snapshots.

Researcher friction points and TDR solutions

  • Discovery (to be able to easily find all the relevant data you're authorized to use)
    TDR includes faceted, indexed search across datasets.
  • Scientific reproducibility (even as datasets are modified or expanded)
    Data snapshots are fixed, allowing analysis to be reproducible over time.
  • Access control (strict control of restricted data, but easy to share with authorized colleagues)
    Data is easily subset to let you share unique views with distinct collaborators; integration with DUOS enforces authorization controls.

Support for complex schemas / Competing goals

TDR is built to support complex, interrelated tables that meet both data owner and researcher needs. In doing so, it balances two competing goals:

  • Accept/store as much data as possible
    • Large, computer-readable genomic files (BAMs, CRAMs, VCFs)
    • Phenotypic data
    • Electronic health records
    • Epigenomics data
    • Other non-file data
  • Maximize data findability and usefulness, and facilitate cross-study analysis
    • Modular and nimble data storage
    • Fine-tuned and granular data organization

The first goal requires a very flexible data model (schema). The second places some constraints on the data model. The underlying structure of TDR is designed to address both goals.


Dataset | Data Snapshot | Asset | Schema

Dataset

A set of related data. The Terra Data Repository will support many datasets, managed by different people and containing different kinds of data. For example, one dataset might have a participant-sample schema primarily focused on sequencer files while another might have a schema primarily focused on electronic health records. Each piece of (primary) data is owned by exactly one dataset.

The TDR dataset schema (data model) is flexible, to enable custodians to store and organize almost any existing dataset. 

Asset

A set of data and metadata that always goes together and to which a custodian has given access. For example, one asset might include a BAM file, some sequencing quality metrics, information about tissue type, and information about the donor for a particular biosample.

The Terra Data Repo allows data owners to have fine-grained control over access by creating assets that describe different views of what “goes together” for different people. When you create an asset, you specify which table and which columns of metadata are available to which individuals.

TDR allows different and overlapping asset specifications

  • Sample-centric assets for research purposes (assets organized primarily by sample ID, including associated files and patient health records).
  • Patient-centric assets for clinical purposes (assets organized around patient ID, including their associated samples and files).
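
To illustrate how an asset pairs a table with a chosen set of columns, here is a minimal Python sketch. The `AssetSpec` class, its fields, the `project_row` helper, and the sample row are all hypothetical names and values made up for this example; they are not part of the TDR API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetSpec:
    """Hypothetical asset description: a table plus the columns it exposes."""
    name: str
    table: str
    columns: tuple


def project_row(asset: AssetSpec, row: dict) -> dict:
    """Return only the columns of `row` that the asset exposes."""
    return {col: row[col] for col in asset.columns if col in row}


# One made-up row from a "sample" table.
sample_row = {
    "sample_id": "S1",
    "subject_id": "P1",
    "bam_uri": "gs://example-bucket/S1.bam",
    "tissue_type": "blood",
}

# Two overlapping asset specifications over the same base data.
research_asset = AssetSpec("sample_centric", "sample", ("sample_id", "bam_uri", "tissue_type"))
clinical_asset = AssetSpec("patient_centric", "sample", ("subject_id", "sample_id"))

research_view = project_row(research_asset, sample_row)
clinical_view = project_row(clinical_asset, sample_row)
```

Both views are derived from the same row, so neither requires a second copy of the underlying data; they differ only in which columns they expose.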

Data Snapshot

A slice of a single dataset or a view of all or part of one or more studies. For example, a Snapshot could be the data for samples funded by one organization, or the subset of individuals matching specific criteria not common to everyone in their cohort. The Snapshot is the element that most users will interact with (that is, analyze).

Assets can be used to identify which data to include in data Snapshots. The data snapshot is a unit of access control management.

Access is granted to a data snapshot and applies to all data visible in that snapshot. Data snapshots are immutable: they provide an unchanging view of the base data, and later changes to the base data are not visible in the snapshot. This immutability allows analysis to be reproducible over time. There may be exceptions; for example, revocation of consent might require removal of data. Whether to remove data in such cases depends on the dataset owner’s policy: TDR enables the operation but does not require it.
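
The immutability property can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how TDR is implemented: `create_snapshot` is a hypothetical helper, the rows are made up, and the sketch copies rows for simplicity even though TDR itself avoids storing extra copies.

```python
import copy
from types import MappingProxyType


def create_snapshot(dataset_rows, predicate):
    """Freeze the rows matching `predicate` into a read-only view.

    Illustration only: rows are deep-copied here to keep the sketch short.
    """
    frozen = (
        MappingProxyType(copy.deepcopy(row))
        for row in dataset_rows
        if predicate(row)
    )
    return tuple(frozen)


# Made-up base data for a dataset.
dataset = [
    {"sample_id": "S1", "funder": "org_a"},
    {"sample_id": "S2", "funder": "org_b"},
]

# A snapshot of the samples funded by one organization.
snapshot = create_snapshot(dataset, lambda row: row["funder"] == "org_a")

# Later changes to the base data are not visible through the snapshot.
dataset[0]["funder"] = "org_c"
dataset.append({"sample_id": "S3", "funder": "org_a"})

snapshot_ids = [row["sample_id"] for row in snapshot]
snapshot_funder = snapshot[0]["funder"]
```

After the base data changes, `snapshot_ids` still contains only `"S1"` and `snapshot_funder` is still `"org_a"`, so an analysis run against the snapshot sees the same input every time.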

Schema (data model)

Represents the organization of a dataset's primary data and metadata in interconnected tables (TSVs or CSVs).

The schema (data model) consists of

  • Entities
    The primary object the table contains, identified by a unique key (e.g. a “subject” entity for phenotypic data or a “sample” entity for genomic data). Each row in the table is a distinct entity identified by an ID key.
  • Attributes/properties
    The columns in a database table (e.g. phenotypic data like demographics or lab results, or genomic metadata like the URI of the files in Google Cloud Storage)
  • Associations
    The unique identifiers that link data between tables (e.g. a subject_id column in the sample table that links samples to the subject and any data in the subject table)
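
As a rough illustration of these three pieces, the Python sketch below builds two small tables and follows the `subject_id` association from samples to subjects. The table contents, column names other than `subject_id`, and the `join_samples_to_subjects` helper are assumptions invented for this example.

```python
# Hypothetical "subject" table: each row is one entity keyed by subject_id;
# age and sex are attributes.
subjects = [
    {"subject_id": "P1", "age": 54, "sex": "F"},
    {"subject_id": "P2", "age": 61, "sex": "M"},
]

# Hypothetical "sample" table: sample_id is the entity key, cram_uri is an
# attribute, and subject_id is the association back to the subject table.
samples = [
    {"sample_id": "S1", "subject_id": "P1", "cram_uri": "gs://example-bucket/S1.cram"},
    {"sample_id": "S2", "subject_id": "P1", "cram_uri": "gs://example-bucket/S2.cram"},
    {"sample_id": "S3", "subject_id": "P2", "cram_uri": "gs://example-bucket/S3.cram"},
]


def join_samples_to_subjects(samples, subjects):
    """Attach each sample's subject attributes by following subject_id."""
    subjects_by_id = {subject["subject_id"]: subject for subject in subjects}
    return [
        {**subjects_by_id[sample["subject_id"]], **sample}
        for sample in samples
    ]


joined = join_samples_to_subjects(samples, subjects)
```

Each row of `joined` carries the sample's own attributes together with the attributes of the subject it links to, which is the kind of cross-table view the associations make possible.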

Repository Permission Model

The roles available within the Data Repo, and the specific permissions afforded to each role, are described below.

Roles

Admin

The Admin role is an owner of the Data Repository. This role is assigned to technically trained individuals on the Data Repo development team to help maintain the service. For example, an Admin can assign Steward roles to other users if the original owners are no longer available.

Steward

A Steward, or Data Owner, is the person who created the dataset. While they are ultimately liable for the data, they can assign the hands-on data management to another person by assigning the Custodian role.

Custodian

The Custodian role is defined on a dataset. Someone may be the Custodian for one or more studies. A Custodian is responsible for creating data snapshots over datasets and controlling access to those snapshots.

Snapshot Creator

The Snapshot Creator role is defined on a dataset. Users with this role can read dataset data and create new Snapshots.

Reader

The Reader role is defined on a snapshot. Readers may be assigned by a dataset Custodian to get read access to the snapshot data.

Discoverer

A Discoverer is someone using the repository to find data snapshots for analysis. They cannot read snapshot data unless they are given the Reader role.

Owner

The Owner role is assigned to the creator of a spend profile, which is a billing account used to fund data storage and querying. They can update, delete, or share this profile with other users.

User

The User role is defined on a spend profile. Users may link this spend profile to a dataset or snapshot that they create. They can also assign this role to other individuals.
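
A role-based check of this kind can be sketched as a lookup from roles to permission names. The mapping below is an assumption for illustration: it covers only the Reader and Discoverer roles, it is inferred from the role descriptions above rather than taken from TDR, and the `can` helper is a hypothetical name.

```python
# Illustrative subset only, inferred from the role descriptions:
# a Reader can read snapshot data; a Discoverer can only find snapshots.
SNAPSHOT_ROLE_PERMISSIONS = {
    "reader": {"read_data"},
    "discoverer": {"discover_data"},
}


def can(user_roles, permission, role_permissions=SNAPSHOT_ROLE_PERMISSIONS):
    """Return True if any of the user's roles grants `permission`."""
    return any(
        permission in role_permissions.get(role, set())
        for role in user_roles
    )


# A Discoverer alone cannot read snapshot data...
discoverer_can_read = can({"discoverer"}, "read_data")

# ...but can once the Reader role is also granted.
reader_can_read = can({"discoverer", "reader"}, "read_data")
```

Keeping the mapping as data rather than code makes it straightforward to add or remove a permission for a role without touching the check itself.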

Permissions

We define permissions for fairly fine-grained operations in the repository. The table below summarizes the permissions assigned to the various roles.

Object | Permission Name | Admin | Steward | Custodian | Snapshot Creator | Reader | Discoverer | Owner | User

Repository | create_dataset
Repository | list_jobs
Repository | delete_jobs
Repository | delete
Repository | configure
Repository | share_policy::admin
Repository | share_policy::steward
Repository | read_policy::steward
Repository | read_policies
Repository | alter_policies
Spend Profile | update_metadata
Spend Profile | update_billing_account
Spend Profile | delete
Spend Profile | link
Spend Profile | read_policies
Spend Profile | share_policy::owner
Spend Profile | share_policy::user
Spend Profile | alter_policies
Dataset | read_dataset
Dataset | delete
Dataset | manage_schema
Dataset | read_data
Dataset | ingest_data
Dataset | soft_delete
Dataset | hard_delete
Dataset | link_snapshot
Dataset | unlink_snapshot
Dataset | list_snapshots
Dataset | share_policy::steward
Dataset | share_policy::custodian
Dataset | share_policy::ingester
Dataset | read_policies
Dataset | alter_policies
Snapshot | delete
Snapshot | update_snapshot
Snapshot | read_data
Snapshot | discover_data
Snapshot | share_policy::steward
Snapshot | share_policy::custodian
Snapshot | share_policy::reader
Snapshot | share_policy::discoverer
Snapshot | read_policy::custodian
Snapshot | read_policy::steward
Snapshot | read_policy::discoverer
Snapshot | read_policies
Snapshot | alter_policies
