Terra Data Repository: Overview

Anton Kovalsky

What is the Terra Data Repository and how can it help you? If you're a researcher, TDR can make it easier to find and access the data you want. If you're a dataset owner, it can help you get data into the hands of researchers. This article addresses these questions with an overview of the structure and function of the Terra Data Repository. The first section gives some background on the motivation behind the Data Repo's development. The second section outlines the components of the Data Repo. The third section summarizes the Data Repo's access controls.

What is the Data Repo, and why does it exist?

The Terra Data Repository (or TDR for short) is a data repo designed to address friction points for researchers and dataset owners using and delivering ever-larger datasets in the cloud. 

Benefits to researchers

  • Discovery: Easily find all the relevant data you're authorized to use with faceted, indexed search across datasets.
  • Scientific reproducibility: Data snapshots are fixed, allowing analysis to be reproducible over time.
  • Access control: Subset data to share unique views with distinct collaborators; integration with DUOS.

Benefits to data owners

  • Versatility: Fine-tuned and granular organization structures allow for building out focused datasets from larger ones, and sharing them with specific users on a case-by-case basis.
  • Cost-saving: A single dataset can be sliced and diced in different ways to allow access to different overlapping subsets without making and storing extra copies.
  • Versioning and consent withdrawal: Raw datasets can be continually updated and delivered as new versions/releases via immutable snapshots.

Support for complex schemas / Balancing competing goals

  • Accept/store as much data as possible
    • Large, computer-readable genomic files (BAMs, CRAMs, VCFs)
    • Phenotypic data
    • Electronic health records
    • Epigenomics data
    • Other non-file data
  • Maximize data findability and usefulness, and facilitate cross-study analysis
    • Modular and nimble data storage
    • Fine-tuned and granular data organization

The first goal requires a very flexible data model (schema). The second places some constraints on the data model. The underlying structure of the TDR is designed to address both challenges.

Dataset | Data Snapshot | Asset | Schema

Dataset

A set of related data. The Terra Data Repository will support many datasets, managed by different people and containing different kinds of data. For example, one dataset might have a participant-sample schema primarily focused on sequencer files while another might have a schema primarily focused on electronic health records. Each piece of (primary) data is owned by exactly one dataset. The schema is flexible, to enable custodians to store and organize almost any existing dataset. 

Asset

A set of data and metadata that always goes together and to which a custodian has given access. For example, one biosample might include a BAM file, some sequencing quality metrics, information about tissue type, and information about the donor. The Terra Data Repo enables data owners to create assets that describe different views of what “goes together” for different people. When you create an asset, you specify which table and which columns of metadata are available to which individuals.

Assets can be used to identify which data to include in data Snapshots.

TDR allows different and overlapping asset specifications

  • Sample-centric assets for research purposes (asset organized primarily by sample ID and including associated files and patient health records).
  • Patient-centric assets for clinical purposes (asset organized around patient ID and including associated samples and files).

The ability to use the same base data to create datasets with different tabular organization is a key goal of the Terra Data Repository. We want to empower users to create complex, interrelated tables to meet their needs.
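To make the idea of overlapping asset specifications concrete, here is a minimal Python sketch. It is not TDR's API: the table names, column names, and the `apply_asset` helper are all hypothetical, chosen only to show how two asset specifications can expose different columns of the same base data without copying or changing that data.

```python
# Hypothetical base data for one dataset: two tables keyed by ID columns.
base_data = {
    "patient": [
        {"patient_id": "P1", "diagnosis": "T2D", "zip_code": "02139"},
    ],
    "sample": [
        {"sample_id": "SM1", "patient_id": "P1", "tissue": "blood",
         "bam_uri": "gs://example-bucket/SM1.bam"},
    ],
}

# Two hypothetical asset specifications: table name -> columns to expose.
sample_centric_asset = {
    "sample": ["sample_id", "tissue", "bam_uri"],
    "patient": ["patient_id", "diagnosis"],
}
patient_centric_asset = {
    "patient": ["patient_id", "diagnosis", "zip_code"],
    "sample": ["sample_id", "patient_id"],
}


def apply_asset(data, asset):
    """Return a view of `data` restricted to the tables and columns in `asset`."""
    return {
        table: [{col: row[col] for col in columns} for row in data[table]]
        for table, columns in asset.items()
    }


research_view = apply_asset(base_data, sample_centric_asset)
clinical_view = apply_asset(base_data, patient_centric_asset)
```

Both views are derived from the same rows, so the research view can omit `zip_code` while the clinical view keeps it, and the base data stays untouched.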

Data Snapshot

A slice of a single dataset or a view of all or part of one or more studies. For example, a Snapshot could be the data for samples funded by one organization, or the subset of individuals matching specific criteria not shared by everyone in their cohort. The Snapshot is the element that most users will interact with (i.e., analyze).

The data snapshot is a unit of access control management.

Access is granted to a data snapshot and applies to all data in view of the data snapshot. Data snapshots are immutable: they provide an unchanging view of the base data; changes to the base data are not visible in the data snapshot view. Making data snapshots immutable allows analysis to be reproducible over time. There may be exceptions; for example, revocation of consent might require removal of data. However, that is based on the dataset owner’s policy. The operation is enabled by TDR, but is not required.
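The immutability described above can be illustrated with a small conceptual model in Python. This is only a sketch under stated assumptions, not a description of how TDR is implemented: the `Dataset` and `Snapshot` classes, the append-only row versions, and all method names are hypothetical. The point is that a snapshot can pin the state of the data at creation time by reference, so later updates to the base data do not show up in it.

```python
class Dataset:
    """Hypothetical dataset that keeps every version of each row (append-only)."""

    def __init__(self):
        self._versions = {}  # row_id -> list of row versions, oldest first

    def upsert(self, row_id, row):
        # Updating a row appends a new version instead of overwriting the old one.
        self._versions.setdefault(row_id, []).append(dict(row))

    def latest(self):
        return {row_id: versions[-1] for row_id, versions in self._versions.items()}

    def resolve(self, row_id, version):
        return dict(self._versions[row_id][version])

    def create_snapshot(self, row_ids):
        # Pin each selected row to the version that is current right now.
        refs = {row_id: len(self._versions[row_id]) - 1 for row_id in row_ids}
        return Snapshot(self, refs)


class Snapshot:
    """Hypothetical fixed view: references to specific row versions, no copies."""

    def __init__(self, dataset, refs):
        self._dataset = dataset
        self._refs = dict(refs)

    def rows(self):
        return {
            row_id: self._dataset.resolve(row_id, version)
            for row_id, version in self._refs.items()
        }


dataset = Dataset()
dataset.upsert("SM1", {"tissue": "blood"})
dataset.upsert("SM2", {"tissue": "liver"})
snapshot = dataset.create_snapshot(["SM1"])

# A later change to the base data is not visible through the snapshot.
dataset.upsert("SM1", {"tissue": "plasma"})
```

In this model, an analysis that reads `snapshot.rows()` sees the same values every time, which is the property that makes results reproducible.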

Schema (data model)

Represents the organization of a dataset's primary data and metadata in interconnected tables (TSVs or CSVs).

The schema (data model) consists of

  • Entities
    The primary object the table contains, with a unique key (e.g., a “subject” entity for phenotypic data or a “sample” entity for genomic data). Each row in the table is a distinct entity identified by an ID key.
  • Attributes/properties
    The columns in a database table (e.g., phenotypic data like demographics or lab results, or genomic metadata like the URI of the files in Google Cloud Storage).
  • Associations
    The unique identifiers that link data between tables (e.g., a subject_id column in the sample table that links samples to the subject and any data in the subject table).
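The three schema components can be sketched with two small tables in Python. The table contents, column names other than `subject_id`, and the `join_samples_to_subjects` helper are illustrative assumptions, not part of TDR.

```python
# Entities: each row is one entity with a unique ID key.
# Attributes: the remaining columns (age, sex, bam_uri).
subject_table = [
    {"subject_id": "S1", "age": 54, "sex": "F"},
    {"subject_id": "S2", "age": 61, "sex": "M"},
]
sample_table = [
    # Association: subject_id links each sample row to a subject row.
    {"sample_id": "SM1", "subject_id": "S1", "bam_uri": "gs://example-bucket/SM1.bam"},
    {"sample_id": "SM2", "subject_id": "S1", "bam_uri": "gs://example-bucket/SM2.bam"},
    {"sample_id": "SM3", "subject_id": "S2", "bam_uri": "gs://example-bucket/SM3.bam"},
]


def join_samples_to_subjects(samples, subjects):
    """Follow the subject_id association from each sample to its subject."""
    subjects_by_id = {row["subject_id"]: row for row in subjects}
    return [{**sample, **subjects_by_id[sample["subject_id"]]} for sample in samples]


joined_rows = join_samples_to_subjects(sample_table, subject_table)
```

Each joined row carries the sample's attributes together with the attributes of the subject it came from, which is what the association column makes possible.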

Repository Permission Model

The roles available within the Data Repo, and the specific permissions afforded to each role, are described below.

Roles

Admin

The Admin role is an owner of the Data Repository. This role is assigned to technically trained individuals on the Data Repo development team to help maintain the service. For example, an Admin can assign Steward roles to other users if the original owners are no longer available.

Steward

A Steward, or Data Owner, is the person who created the dataset. While they are ultimately liable for the data, they can assign the hands-on data management to another person by assigning the Custodian role.

Custodian

The Custodian role is defined on a dataset. Someone may be the Custodian for one or more studies. A Custodian is responsible for creating data snapshots over datasets and controlling access to those snapshots.

Snapshot Creator

The Snapshot Creator role is defined on a dataset. Users with this role can read dataset data and create new Snapshots.

Reader

The Reader role is defined on a snapshot. Readers may be assigned by a dataset Custodian to get read-access to the snapshot data.

Discoverer

A Discoverer is someone using the repository to find data snapshots for analysis. They cannot read snapshot data unless they are given the Reader role.

Owner

The Owner role is assigned to the creator of a spend profile, which is a billing account used to fund data storage and querying. They can update, delete, or share this profile with other users.

User

The User role is defined on a spend profile. Users may link this spend profile to a dataset or snapshot that they create. They can also assign this role to other individuals.
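As a rough illustration of how roles defined on a resource translate into allowed actions, the Python sketch below models only the Reader and Discoverer descriptions above. The grant sets are inferred from that prose and are an assumption, as are the snapshot ID, the email addresses, and the `is_allowed` helper; none of this is TDR's actual policy engine.

```python
# Assumed grants, inferred from the role descriptions only:
# a Reader can read snapshot data, a Discoverer can only find the snapshot.
ROLE_GRANTS = {
    "reader": {"read_data"},
    "discoverer": {"discover_data"},
}

# Hypothetical role assignments, defined per snapshot.
snapshot_policies = {
    "snapshot-1": {
        "alice@example.org": {"reader"},
        "bob@example.org": {"discoverer"},
    },
}


def is_allowed(snapshot_id, user, action):
    """Return True if any role the user holds on the snapshot grants the action."""
    roles = snapshot_policies.get(snapshot_id, {}).get(user, set())
    return any(action in ROLE_GRANTS.get(role, set()) for role in roles)
```

Under these assumptions, `is_allowed("snapshot-1", "bob@example.org", "read_data")` is false until Bob is also given the Reader role on that snapshot.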

Permissions

We define permissions for fairly fine-grained operations in the repository. The table below summarizes the permissions assigned to the various roles.

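Object | Permission Name | Admin | Steward | Custodian | Snapshot Creator | Reader | Discoverer | Owner | User
Repository | create_dataset | | | | | | | |
Repository | list_jobs | | | | | | | |
Repository | delete_jobs | | | | | | | |
Repository | delete | | | | | | | |
Repository | configure | | | | | | | |
Repository | share_policy::admin | | | | | | | |
Repository | share_policy::steward | | | | | | | |
Repository | read_policy::steward | | | | | | | |
Repository | read_policies | | | | | | | |
Repository | alter_policies | | | | | | | |
Spend Profile | update_metadata | | | | | | | |
Spend Profile | update_billing_account | | | | | | | |
Spend Profile | delete | | | | | | | |
Spend Profile | link | | | | | | | |
Spend Profile | read_policies | | | | | | | |
Spend Profile | share_policy::owner | | | | | | | |
Spend Profile | share_policy::user | | | | | | | |
Spend Profile | alter_policies | | | | | | | |
Dataset | read_dataset | | | | | | | |
Dataset | delete | | | | | | | |
Dataset | manage_schema | | | | | | | |
Dataset | read_data | | | | | | | |
Dataset | ingest_data | | | | | | | |
Dataset | soft_delete | | | | | | | |
Dataset | hard_delete | | | | | | | |
Dataset | link_snapshot | | | | | | | |
Dataset | unlink_snapshot | | | | | | | |
Dataset | list_snapshots | | | | | | | |
Dataset | share_policy::steward | | | | | | | |
Dataset | share_policy::custodian | | | | | | | |
Dataset | share_policy::ingester | | | | | | | |
Dataset | read_policies | | | | | | | |
Dataset | alter_policies | | | | | | | |
Snapshot | delete | | | | | | | |
Snapshot | update_snapshot | | | | | | | |
Snapshot | read_data | | | | | | | |
Snapshot | discover_data | | | | | | | |
Snapshot | share_policy::steward | | | | | | | |
Snapshot | share_policy::custodian | | | | | | | |
Snapshot | share_policy::reader | | | | | | | |
Snapshot | share_policy::discoverer | | | | | | | |
Snapshot | read_policy::custodian | | | | | | | |
Snapshot | read_policy::steward | | | | | | | |
Snapshot | read_policy::discoverer | | | | | | | |
Snapshot | read_policies | | | | | | | |
Snapshot | alter_policies | | | | | | | |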
