Terra Data Repository (TDR): Overview

If you're a dataset owner or steward, TDR can help get your dataset into the hands of researchers. If you're a researcher, it can make it easier to find and access the data you want. This article gives an overview of the structure and function of the Terra Data Repository, outlines how data custodians, data stewards, and researchers can get started using TDR, and lists the Data Repo's access controls in detail.

What is the Data Repo, and why does it exist?

The Terra Data Repository (TDR) is a platform designed to make it easier for dataset owners to share - and researchers to access - large datasets. It addresses the challenges researchers and data custodians face using and delivering extensive datasets in the cloud.

Dataset owners' friction points and TDR solutions

Versatility
Dataset owners can create fine-tuned organization structures and focused dataset subsets (snapshots) from larger ones. Snapshots of the same base dataset can be shared with specific users on a case-by-case basis.
Cost-saving
By slicing and dicing a single TDR dataset in various ways, users can access different overlapping subsets without creating and storing copies of the data.
Versioning and consent withdrawal
Data curators can continuously update raw datasets and deliver them as new versions/releases through immutable snapshots. Researchers can access the most up-to-date data while preserving the ability to reproduce previous analyses.
Streamlined access control
Share restricted data and data subsets broadly with authorized users. Integration with the Data Use Oversight System (DUOS) enforces proper authorization controls while streamlining the data request process.

Researcher friction points and TDR solutions

Discovery
Find the relevant data you are authorized to use with faceted, indexed search capabilities across datasets.
Scientific reproducibility
Data snapshots in TDR are fixed, so your analysis remains reproducible, even as datasets are modified or expanded.
Access control
Restricted data - including data subsets - is strictly controlled but easy to share with authorized colleagues. Integration with the Data Use Oversight System (DUOS) enforces proper authorization controls while streamlining the data request process.

Support for complex schemas / Competing goals

The underlying structure of TDR is designed to achieve two competing goals: accepting and storing as much data as possible while maximizing data findability and usefulness.

Accept/store as much data as possible

Large computer-readable genomic files (BAMs, CRAMs, VCFs)
Phenotypic data
Electronic health records
Epigenomic data
Other non-file data

Maximize data findability and usefulness, and facilitate cross-study analysis

Modular and nimble data storage
Preferred data schema (TDR-specific data model)
Fine-tuned and granular data organization

The first goal requires a very flexible data model (schema) to accommodate diverse data types. The second places some constraints on the data model. The underlying structure of the TDR is designed to address both goals.

Definitions

Dataset

A set of related data. Custodians can store and organize almost any existing dataset in TDR. Datasets and snapshots can be managed by different people and contain different kinds of data. For example, one dataset might have a participant-sample schema primarily focused on sequencer files, while another might have a schema primarily focused on electronic health records. Each piece of (primary) data is owned by exactly one dataset.

Asset

A set of data and metadata that always goes together and to which a custodian has given access. For example, one asset might include a BAM file, some sequencing quality metrics, information about tissue type, and information about the donor for a particular biosample.

Data owners have fine-grained control over access by creating assets that describe different views of what “goes together” for different people. When you create an asset, you specify which table and which columns of metadata from the whole dataset are available to which individuals.

Examples of different and overlapping asset specifications

Sample-centric assets for research purposes
An asset is organized primarily by sample ID and includes associated files and patient health records.
Patient-centric assets for clinical purposes
An asset is organized around patient ID and includes their associated samples and files.
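
As a rough illustration of the idea (not TDR's actual asset format), an asset can be thought of as a named choice of table and visible columns. The sketch below is a simplification: the names `AssetSpec` and `project_rows`, the column names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetSpec:
    """Hypothetical asset: a table plus the columns it exposes."""
    name: str
    table: str
    columns: tuple


def project_rows(rows, asset):
    """Return, for each row, only the columns the asset exposes."""
    return [{col: row[col] for col in asset.columns} for row in rows]


# Made-up rows from a 'sample' table, for illustration only.
sample_rows = [
    {"sample_id": "S1", "subject_id": "P1", "tissue": "blood",
     "bam_uri": "gs://example-bucket/S1.bam"},
    {"sample_id": "S2", "subject_id": "P2", "tissue": "liver",
     "bam_uri": "gs://example-bucket/S2.bam"},
]

# Two overlapping views of the same underlying rows.
research_asset = AssetSpec("sample_centric", "sample",
                           ("sample_id", "tissue", "bam_uri"))
clinical_asset = AssetSpec("patient_centric", "sample",
                           ("subject_id", "sample_id"))

research_view = project_rows(sample_rows, research_asset)
clinical_view = project_rows(sample_rows, clinical_asset)
```

Both views draw on the same rows, so neither requires a second stored copy of the base table.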

Data Snapshot

A slice of a single dataset or a view of all or part of one or more studies. For example, a Snapshot could be the data for samples funded by one organization or the subset of individuals matching specific criteria not common to everyone in their cohort. The Snapshot is the element that most users will interact with (i.e., the data that researchers will analyze).

Assets can be used to identify which data to include in data snapshots. The data snapshot is a unit of access control management.

Access is granted to a data snapshot and applies to all data visible in that snapshot. Data snapshots are immutable: they provide an unchanging view of the base data, and later changes to the base data are not visible in the snapshot. Making data snapshots immutable allows analysis to be reproducible over time. There may be exceptions. For example, revocation of consent might require removing data, depending on the dataset owner’s policy. TDR enables this operation but does not require it.
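
The immutability described above can be pictured with a small in-memory model. This is only a sketch under simplifying assumptions: `create_snapshot` is a hypothetical helper, the rows are made up, and the deep copy stands in for immutability here (TDR itself exposes snapshots as views rather than stored copies of the data).

```python
import copy
from types import MappingProxyType


def create_snapshot(dataset_rows, row_ids):
    """Freeze the selected rows so later dataset edits do not leak in."""
    selected = {rid: copy.deepcopy(dataset_rows[rid]) for rid in row_ids}
    # MappingProxyType makes the top-level mapping read-only.
    return MappingProxyType(selected)


# Made-up base dataset keyed by sample ID.
dataset = {
    "S1": {"tissue": "blood"},
    "S2": {"tissue": "liver"},
}
snapshot_v1 = create_snapshot(dataset, ["S1", "S2"])

# The base dataset keeps evolving after the first release...
dataset["S1"]["tissue"] = "plasma"
dataset["S3"] = {"tissue": "skin"}

# ...and a new release is delivered as a second snapshot.
snapshot_v2 = create_snapshot(dataset, ["S1", "S2", "S3"])
```

An analysis pinned to `snapshot_v1` keeps seeing the original values, while `snapshot_v2` reflects the updated dataset.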

Schema (data model)

Represents the organization of a dataset's primary data and metadata in interconnected tables (TSVs or CSVs).

The schema (data model) consists of

Entities
The primary object a table contains, identified by a unique key (e.g., a “subject” entity for phenotypic data or a “sample” entity for genomic data). Each row in the table is a distinct entity identified by an ID key.
Attributes/properties
The columns in a database table (e.g., phenotypic data such as demographics or lab results, or genomic metadata such as the URIs of files in Google Cloud Storage)
Associations
The unique identifiers that link data between tables (e.g., a subject_id column in the sample table that links each sample to its subject and to any data in the subject table)
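
To make the three parts concrete, here is a minimal sketch of a two-table schema with one association, plus a check that every key and association endpoint refers to a defined column. The dictionary layout, the `validate_schema` helper, and the column names are assumptions for illustration; they are not the schema format the TDR API expects.

```python
# Hypothetical two-table schema; the layout and names are illustrative only.
schema = {
    "tables": {
        # Entity: subject, keyed by subject_id; other columns are attributes.
        "subject": {"key": "subject_id",
                    "columns": ["subject_id", "age", "sex"]},
        # Entity: sample, keyed by sample_id.
        "sample": {"key": "sample_id",
                   "columns": ["sample_id", "subject_id", "bam_uri"]},
    },
    # Association: sample.subject_id links each sample to its subject.
    "associations": [
        {"from": ("sample", "subject_id"), "to": ("subject", "subject_id")},
    ],
}


def validate_schema(schema):
    """Return a list of problems: undefined keys or association endpoints."""
    problems = []
    tables = schema["tables"]
    for name, table in tables.items():
        if table["key"] not in table["columns"]:
            problems.append(f"{name}: key column '{table['key']}' is not defined")
    for assoc in schema["associations"]:
        for end in ("from", "to"):
            table_name, column = assoc[end]
            if table_name not in tables:
                problems.append(f"association {end}: unknown table '{table_name}'")
            elif column not in tables[table_name]["columns"]:
                problems.append(
                    f"association {end}: unknown column '{table_name}.{column}'")
    return problems


problems = validate_schema(schema)
```

An empty `problems` list means every association points at a real table and column in this toy model.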

Next steps: Ingesting data/using the TDR

Additional documentation covers step-by-step instructions for data custodians to ingest data into TDR and for data users to create and use data snapshots. Many steps involve using Swagger APIs. Note that the steps must be followed in the correct order, and you must be authenticated in Swagger at every step.

Data custodians process overview

Create a TDR Billing Profile (GCP) or Create a TDR Billing Profile (Azure)
Define your dataset schema
Create a TDR dataset in TDR (note that you can also Create a TDR dataset with APIs)
Ingest data
Create dataset assets
Create snapshots
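
Because the steps above must run in order, it can help to track progress explicitly. The following is a minimal sketch of such a check; the step labels and the `next_step` helper are hypothetical shorthand for the list above, not names of TDR API operations.

```python
# Hypothetical labels mirroring the list above; not TDR API operation names.
CUSTODIAN_STEPS = (
    "create_billing_profile",
    "define_schema",
    "create_dataset",
    "ingest_data",
    "create_assets",
    "create_snapshots",
)


def next_step(completed):
    """Return the next step to run, or None when every step is done.

    Raises ValueError if `completed` is not a prefix of the expected order.
    """
    completed = tuple(completed)
    if completed != CUSTODIAN_STEPS[:len(completed)]:
        raise ValueError(f"steps completed out of order: {completed}")
    if len(completed) == len(CUSTODIAN_STEPS):
        return None
    return CUSTODIAN_STEPS[len(completed)]
```

For example, `next_step(["create_billing_profile", "define_schema"])` returns `"create_dataset"`, while passing `["ingest_data"]` alone raises an error because earlier steps are missing.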

Data users (researchers) process overview

Repository Permission Model (a deeper dive)

The available roles within the Data Repo and the specific permissions afforded to each role are below.

Admin

The Admin role is an owner of the Data Repository. This role is for technically trained members of the Data Repo development team who help maintain the service. For example, an Admin can assign Steward roles to other users if the original owners are no longer available.

Steward

A Steward, or Data Owner, is the person who created the dataset. While ultimately liable for the data, they can assign the hands-on data management to another person by assigning the Custodian role.

Custodian

The Custodian role is defined on a dataset. Someone may be the Custodian for one or more studies. A Custodian is responsible for creating data snapshots over datasets and controlling access to those snapshots.

Snapshot Creator

The Snapshot Creator role is defined on a dataset. Users with this role can read dataset data and create new Snapshots.

Reader

The Reader role is defined on a snapshot. Readers may be assigned by a dataset Custodian to get read access to the snapshot data.

Discoverer

A Discoverer is someone using TDR to find data snapshots for analysis. They cannot read snapshot data unless they are given the Reader role.

Owner

The Owner is the creator of a spend profile, a billing account used to fund data storage and querying. They can update, delete, or share this profile with other users.

User

The User role is defined on a spend profile. Users may link this spend profile to a dataset or snapshot that they create. They can also assign this role to other individuals.

Permissions

Permissions allow fairly fine-grained operations in the repository. The tables below list the permissions assigned to each role for the repository, spend profile, dataset, and snapshot.

Object	Permission Name	Admin	Steward	Custodian	Snapshot Creator	Reader	Discoverer	Owner	User
Repository	create_dataset		✅
Repository	list_jobs	✅	✅
Repository	delete_jobs	✅	✅
Repository	delete	✅
Repository	configure	✅
Repository	share_policy::admin	✅
Repository	share_policy::steward	✅	✅
Repository	read_policy::steward		✅
Repository	read_policies	✅
Repository	alter_policies	✅

Object	Permission Name	Admin	Steward	Custodian	Snapshot Creator	Reader	Discoverer	Owner	User
Spend Profile	update_metadata							✅
Spend Profile	update_billing_account							✅
Spend Profile	delete							✅
Spend Profile	link							✅	✅
Spend Profile	read_policies	✅						✅
Spend Profile	share_policy::owner	✅						✅
Spend Profile	share_policy::user							✅	✅
Spend Profile	alter_policies	✅

Object	Permission Name	Admin	Steward	Custodian	Snapshot Creator	Reader	Discoverer	Owner	User
Dataset	read_dataset		✅	✅
Dataset	delete		✅
Dataset	manage_schema		✅	✅
Dataset	read_data		✅	✅	✅
Dataset	ingest_data		✅	✅
Dataset	soft_delete		✅	✅
Dataset	hard_delete		✅	✅
Dataset	link_snapshot		✅	✅	✅
Dataset	unlink_snapshot		✅	✅
Dataset	list_snapshots		✅	✅
Dataset	share_policy::steward	✅	✅
Dataset	share_policy::custodian		✅
Dataset	share_policy::ingester		✅
Dataset	read_policies	✅	✅	✅	✅
Dataset	alter_policies	✅

Object	Permission Name	Admin	Steward	Custodian	Snapshot Creator	Reader	Discoverer	Owner	User
Snapshot	update_snapshot		✅	✅
Snapshot	read_data		✅	✅		✅	✅
Snapshot	discover_data		✅	✅		✅	✅
Snapshot	share_policy::steward	✅	✅
Snapshot	share_policy::custodian		✅
Snapshot	share_policy::reader		✅	✅
Snapshot	share_policy::discoverer		✅	✅
Snapshot	read_policy::custodian					✅
Snapshot	read_policy::steward					✅
Snapshot	read_policy::discoverer					✅	✅
Snapshot	read_policies	✅	✅	✅
Snapshot	alter_policies	✅
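
One way to read the tables above is as a lookup from permission to the set of roles that hold it. The sketch below transcribes a few rows of the Dataset table into a plain dictionary. The `DATASET_PERMISSIONS` name and the `is_allowed` helper are hypothetical and only model the table; they do not call or reflect any TDR API.

```python
# A few rows copied from the Dataset table above (permission -> roles).
DATASET_PERMISSIONS = {
    "delete": {"Steward"},
    "ingest_data": {"Steward", "Custodian"},
    "read_data": {"Steward", "Custodian", "Snapshot Creator"},
    "link_snapshot": {"Steward", "Custodian", "Snapshot Creator"},
    "alter_policies": {"Admin"},
}


def is_allowed(role, permission):
    """Return True if the role holds the permission in this toy table."""
    return role in DATASET_PERMISSIONS.get(permission, set())


can_read = is_allowed("Snapshot Creator", "read_data")
can_ingest = is_allowed("Snapshot Creator", "ingest_data")
```

Here a Snapshot Creator can read dataset data but cannot ingest it, matching the corresponding table rows.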
