Terra Data Repository (TDR): Overview

Leyla Tarhan

If you're interested in using Terra on Azure, please email terra-enterprise@broadinstitute.org.

If you're a dataset owner or steward, TDR can help get your dataset into the hands of researchers. If you're a researcher, it can make it easier to find and access the data you want. This article gives an overview of the structure and function of the Terra Data Repository (TDR), outlines how data custodians, data stewards, and researchers can get started, and lists the Data Repo's access controls in detail.

What is the Data Repo, and why does it exist?

The Terra Data Repository (TDR) is a platform designed to make it easier for dataset owners to share large datasets and for researchers to access them. It addresses the challenges researchers and data custodians face when using and delivering extensive datasets in the cloud.

Dataset owner friction points and TDR solutions

  • Versatility
    Dataset owners can create fine-tuned organization structures and focused dataset subsets (snapshots) from larger ones. Snapshots of the same base dataset can be shared with specific users on a case-by-case basis. 
  • Cost-saving 
    By slicing and dicing a single TDR dataset in various ways, users can access different overlapping subsets without creating and storing copies of the data.
  • Versioning and consent withdrawal 
    Data curators can continuously update raw datasets and deliver them as new versions/releases through immutable snapshots. Researchers can access the most up-to-date data while preserving the ability to reproduce previous analyses.
  • Streamlined access control 
    Share restricted data and data subsets broadly with authorized users. Integration with the Data Use Oversight System (DUOS) enforces proper authorization controls while streamlining the data request process.

Researcher friction points and TDR solutions

  • Discovery 
    Find the relevant data you are authorized to use with faceted, indexed search capabilities across datasets.
  • Scientific reproducibility
    Data snapshots in TDR are fixed, so your analysis remains reproducible, even as datasets are modified or expanded.
  • Access control 
    Restricted data, including data subsets, is strictly controlled but easy to share with authorized colleagues. Integration with the Data Use Oversight System (DUOS) enforces proper authorization controls while streamlining the data request process.

Support for complex schemas / Competing goals

The underlying structure of TDR is designed to achieve two competing goals: accepting and storing as much data as possible while maximizing data findability and usefulness.

Accept/store as much data as possible

  • Large computer-readable genomic files (BAMs, CRAMs, VCFs)
  • Phenotypic data
  • Electronic health records
  • Epigenomic data
  • Other non-file data

Maximize data findability and usefulness, and facilitate cross-study analysis

  • Modular and nimble data storage
  • Preferred data schema (TDR-specific data model)
  • Fine-tuned and granular data organization

The first goal requires a very flexible data model (schema) to accommodate diverse data types. The second places some constraints on the data model. The underlying structure of the TDR is designed to address both goals.

Definitions

Diagram schematizing an example dataset and its assets, queries, and snapshots. The dataset is illustrated with a black-and-white table with 5 labeled columns. Blue boxes highlight individual columns to illustrate the dataset's assets. Yellow boxes highlight individual rows to illustrate individual queries. Green boxes highlight individual cells to illustrate the values in an example snapshot of the dataset.

Dataset

A set of related data. Custodians can store and organize almost any existing dataset in TDR. Datasets and snapshots can be managed by different people and contain different kinds of data. For example, one dataset might have a participant-sample schema primarily focused on sequencer files, while another might have a schema primarily focused on electronic health records. Each piece of (primary) data is owned by exactly one dataset. 

Asset

A set of data and metadata that always goes together and to which a custodian has given access. For example, one asset might include a BAM file, some sequencing quality metrics, information about tissue type, and information about the donor for a particular biosample.

Data owners have fine-grained control over access by creating assets that describe different views of what “goes together” for different people. When you create an asset, you specify which table and which columns of metadata from the whole dataset are available to which individuals.

Examples of different and overlapping asset specifications

  • Sample-centric assets for research purposes
    An asset is organized primarily by sample ID and includes associated files and patient health records.
  • Patient-centric assets for clinical purposes
    An asset is organized around patient ID and includes their associated samples and files.
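
As a rough illustration of the idea that an asset names which table and which columns go together, the following Python sketch applies two overlapping asset specifications to a toy in-memory dataset. The `AssetSpec` class, the `apply_asset` helper, and every table, column, and path name are assumptions made for illustration; none of this is TDR's API or its actual asset format.

```python
from dataclasses import dataclass

# Toy dataset: table name -> list of rows. All names and values are made up.
DATASET = {
    "sample": [
        {"sample_id": "s1", "subject_id": "p1",
         "bam_path": "gs://example-bucket/s1.bam", "tissue": "blood"},
        {"sample_id": "s2", "subject_id": "p2",
         "bam_path": "gs://example-bucket/s2.bam", "tissue": "liver"},
    ],
}


@dataclass(frozen=True)
class AssetSpec:
    """Hypothetical asset: a named view over one table and some of its columns."""
    name: str
    table: str
    columns: tuple


def apply_asset(dataset, asset):
    """Return the rows of the asset's table, keeping only the asset's columns."""
    return [{col: row[col] for col in asset.columns} for row in dataset[asset.table]]


# Two overlapping views over the same underlying table.
files_view = AssetSpec("sample_files", "sample", ("sample_id", "bam_path"))
tissue_view = AssetSpec("sample_tissue", "sample", ("sample_id", "tissue"))

print(apply_asset(DATASET, files_view))
print(apply_asset(DATASET, tissue_view))
```

Both views read from the same rows, so neither one requires a second copy of the underlying table.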

Data Snapshot

A slice of a single dataset or a view of all or part of one or more studies. For example, a snapshot could be the data for samples funded by one organization or the subset of individuals matching specific criteria not common to everyone in their cohort. The snapshot is the element that most users will interact with (that is, the data researchers will analyze).

Assets can be used to identify which data to include in data snapshots. The data snapshot is also the unit of access control management.

Access is granted to a data snapshot and applies to all data visible through that snapshot. Data snapshots are immutable: they provide an unchanging view of the base data, and changes to the base data are not visible in the data snapshot view. Making data snapshots immutable allows analysis to be reproducible over time. There may be exceptions. For example, revocation of consent might require removing data, depending on the dataset owner’s policy. TDR supports this operation but does not require it.
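
To make the immutability property concrete, here is a minimal Python sketch in which a snapshot is a frozen selection of rows, so later edits to the base data do not show up in it. The `create_snapshot` helper and the row fields are hypothetical. The sketch copies rows purely for simplicity; it says nothing about how TDR actually stores or serves snapshots.

```python
import copy
from types import MappingProxyType


def create_snapshot(rows, predicate):
    """Freeze the rows matching `predicate` into a read-only tuple of mappings."""
    selected = [copy.deepcopy(row) for row in rows if predicate(row)]
    return tuple(MappingProxyType(row) for row in selected)


def funded_by_org_a(row):
    return row["funder"] == "org_a"


# Made-up base data for one dataset table.
base_rows = [
    {"sample_id": "s1", "funder": "org_a", "status": "sequenced"},
    {"sample_id": "s2", "funder": "org_b", "status": "sequenced"},
]

snapshot_v1 = create_snapshot(base_rows, funded_by_org_a)

# The base data keeps changing after the first snapshot is taken.
base_rows[0]["status"] = "re-sequenced"
base_rows.append({"sample_id": "s3", "funder": "org_a", "status": "sequenced"})

# A new release is a new snapshot; the old one is left untouched.
snapshot_v2 = create_snapshot(base_rows, funded_by_org_a)

print(len(snapshot_v1), snapshot_v1[0]["status"])
print(len(snapshot_v2), snapshot_v2[0]["status"])
```

An analysis pinned to `snapshot_v1` keeps seeing the same rows and values, while `snapshot_v2` reflects the updated base data.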

Schema (data model)

Represents the organization of a dataset's primary data and metadata in interconnected tables (TSVs or CSVs).

The schema (data model) consists of

  • Entities
    The primary object a table contains, identified by a unique key (e.g., a “subject” entity for phenotypic data or a “sample” entity for genomic data). Each row in the table is a distinct entity identified by an ID key.
  • Attributes/properties
    The columns in a database table (e.g., phenotypic data such as demographics or lab results, or genomic metadata such as the URI of the files in Google Cloud Storage)
  • Associations
    The unique identifiers that link data between tables (e.g., a subject_id column in the sample table that links each sample to its subject and to any data in the subject table)
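
The three parts of the data model can be pictured with a small Python sketch: two toy tables whose rows are entities, whose keys are attributes, and whose shared `subject_id` column is the association. The table contents, the `cram_path` column, and the `join_samples_to_subjects` helper are illustrative assumptions, not TDR's schema format.

```python
# Entities: each row is one subject, identified by subject_id.
subject_table = [
    {"subject_id": "p1", "age": 54, "sex": "F"},
    {"subject_id": "p2", "age": 61, "sex": "M"},
]

# Entities: each row is one sample, identified by sample_id.
# The subject_id column is the association back to the subject table.
sample_table = [
    {"sample_id": "s1", "subject_id": "p1", "cram_path": "gs://example-bucket/s1.cram"},
    {"sample_id": "s2", "subject_id": "p1", "cram_path": "gs://example-bucket/s2.cram"},
    {"sample_id": "s3", "subject_id": "p2", "cram_path": "gs://example-bucket/s3.cram"},
]


def join_samples_to_subjects(samples, subjects):
    """Follow the subject_id association to attach subject attributes to each sample."""
    subjects_by_id = {row["subject_id"]: row for row in subjects}
    return [{**subjects_by_id[sample["subject_id"]], **sample} for sample in samples]


for row in join_samples_to_subjects(sample_table, subject_table):
    print(row)
```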

Next steps: Ingesting data/using the TDR

Additional documentation covers step-by-step instructions for data custodians to ingest data into TDR and for data users to create and use data snapshots. Many steps involve using Swagger APIs. Note that the steps must be followed in order, and you must be authenticated in Swagger at every step.

Data custodians process overview

  1. Create a TDR Billing Profile (Azure)
  2. Define your dataset schema
  3. Create a dataset in TDR (note that you can also create a TDR dataset with APIs)
  4. Ingest data
  5. Create dataset assets
  6. Create snapshots
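
Because the custodian steps above have to happen in order, one way to picture the process is as a checklist that rejects out-of-order steps. The following Python sketch is only a bookkeeping aid under that assumption. The step labels and the `CustodianChecklist` class are hypothetical names; the sketch does not call TDR or its Swagger APIs.

```python
# Hypothetical labels for the six custodian steps, in the order listed above.
CUSTODIAN_STEPS = (
    "create_billing_profile",
    "define_schema",
    "create_dataset",
    "ingest_data",
    "create_assets",
    "create_snapshots",
)


class CustodianChecklist:
    """Tracks progress and rejects steps attempted out of order."""

    def __init__(self):
        self.completed = []

    def complete(self, step):
        if len(self.completed) == len(CUSTODIAN_STEPS):
            raise ValueError("all steps are already complete")
        expected = CUSTODIAN_STEPS[len(self.completed)]
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        self.completed.append(step)


checklist = CustodianChecklist()
checklist.complete("create_billing_profile")
checklist.complete("define_schema")

try:
    checklist.complete("ingest_data")  # skips "create_dataset"
except ValueError as err:
    print(err)
```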

Data users (researchers) process overview

  1. Export a TDR snapshot (coming soon!)
  2. Use TDR snapshots with workflows (coming soon!)

Repository Permission Model (a deeper dive)

The available roles within the Data Repo and the specific permissions afforded to each role are below.

Admin

The Admin role is an owner of the Data Repository. This role is for technically trained members of the Data Repo development team who help maintain the service. For example, an Admin can assign Steward roles to other users if the original owners are no longer available.

Steward

A Steward, or Data Owner, is the person who created the dataset. While ultimately liable for the data, they can assign the hands-on data management to another person by assigning the Custodian role.

Custodian

The Custodian role is defined on a dataset. Someone may be the Custodian for one or more studies. A Custodian is responsible for creating data snapshots over datasets and controlling access to those snapshots.

Snapshot Creator

The Snapshot Creator role is defined on a dataset. Users with this role can read dataset data and create new Snapshots.

Reader

The Reader role is defined on a snapshot. Readers may be assigned by a dataset Custodian to get read access to the snapshot data.

Discoverer

A Discoverer is someone using TDR to find data snapshots for analysis. They cannot read snapshot data unless they are given the Reader role.

Owner

The Owner is the creator of a spend profile, a billing account used to fund data storage and querying. They can update, delete, or share this profile with other users.

User

The User role is defined on a spend profile. Users may link this spend profile to a dataset or snapshot that they create. They can also assign this role to other individuals.
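
The difference between the Reader and Discoverer roles described above can be sketched as a role-to-permission lookup. In this minimal Python example, the two-entry mapping is an assumption inferred only from those role descriptions (a Reader can read snapshot data; a Discoverer can find snapshots but not read them). It is not the full permission matrix, and the `is_allowed` helper is hypothetical.

```python
# Assumed, partial mapping based only on the role descriptions above.
# The real assignments for every role are defined by the repository itself.
ROLE_PERMISSIONS = {
    "reader": frozenset({"read_data"}),
    "discoverer": frozenset({"discover_data"}),
}


def is_allowed(roles, permission):
    """Return True if any of the user's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, frozenset())
        for role in roles
    )


# A Discoverer can find a snapshot but cannot read its data...
print(is_allowed({"discoverer"}, "discover_data"))
print(is_allowed({"discoverer"}, "read_data"))

# ...until they are also given the Reader role.
print(is_allowed({"discoverer", "reader"}, "read_data"))
```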

Permissions

Permissions allow fairly fine-grained control over operations in the repository. The tables below list the permissions assigned to the various roles for the repository, spend profile, dataset, and snapshot.

  • Repository permissions
    Object | Permission Name | Admin | Steward | Custodian | Snapshot Creator | Reader | Discoverer | Owner | User
    Repository | create_dataset
    Repository | list_jobs
    Repository | delete_jobs
    Repository | delete
    Repository | configure
    Repository | share_policy::admin
    Repository | share_policy::steward
    Repository | read_policy::steward
    Repository | read_policies
    Repository | alter_policies
  • Spend Profile permissions
    Object | Permission Name | Admin | Steward | Custodian | Snapshot Creator | Reader | Discoverer | Owner | User
    Spend Profile | update_metadata
    Spend Profile | update_billing_account
    Spend Profile | delete
    Spend Profile | link
    Spend Profile | read_policies
    Spend Profile | share_policy::owner
    Spend Profile | share_policy::user
    Spend Profile | alter_policies
  • Dataset permissions
    Object | Permission Name | Admin | Steward | Custodian | Snapshot Creator | Reader | Discoverer | Owner | User
    Dataset | read_dataset
    Dataset | delete
    Dataset | manage_schema
    Dataset | read_data
    Dataset | ingest_data
    Dataset | soft_delete
    Dataset | hard_delete
    Dataset | link_snapshot
    Dataset | unlink_snapshot
    Dataset | list_snapshots
    Dataset | share_policy::steward
    Dataset | share_policy::custodian
    Dataset | share_policy::ingester
    Dataset | read_policies
    Dataset | alter_policies
  • Snapshot permissions
    Object | Permission Name | Admin | Steward | Custodian | Snapshot Creator | Reader | Discoverer | Owner | User
    Snapshot | update_snapshot
    Snapshot | read_data
    Snapshot | discover_data
    Snapshot | share_policy::steward
    Snapshot | share_policy::custodian
    Snapshot | share_policy::reader
    Snapshot | share_policy::discoverer
    Snapshot | read_policy::custodian
    Snapshot | read_policy::steward
    Snapshot | read_policy::discoverer
    Snapshot | read_policies
    Snapshot | alter_policies
