If you're a dataset owner or steward, the Terra Data Repository (TDR) can help get your dataset into the hands of researchers. If you're a researcher, it can make it easier to find and access the data you want. This article gives an overview of TDR's structure and function, outlines how data custodians, data stewards, and researchers can get started, and lists the Data Repo's access controls in detail.
What is the Data Repo, and why does it exist?
The Terra Data Repository (TDR) is a platform designed to make it easier for dataset owners to share large datasets and for researchers to access them. It addresses the challenges researchers and data custodians face when using and delivering extensive datasets in the cloud.
Dataset owner's friction points and TDR solutions
- Versatility: Dataset owners can create fine-tuned organization structures and focused dataset subsets (snapshots) from larger ones. Snapshots of the same base dataset can be shared with specific users on a case-by-case basis.
- Cost-saving: By slicing and dicing a single TDR dataset in various ways, users can access different overlapping subsets without creating and storing copies of the data.
- Versioning and consent withdrawal: Data curators can continuously update raw datasets and deliver them as new versions/releases through immutable snapshots. Researchers can access the most up-to-date data while preserving the ability to reproduce previous analyses.
- Streamlined access control: Share restricted data and data subsets broadly with authorized users. Integration with the Data Use Oversight System (DUOS) enforces proper authorization controls while streamlining the data request process.
Researcher friction points and TDR solutions
- Discovery: Find the relevant data you are authorized to use with faceted, indexed search capabilities across datasets.
- Scientific reproducibility: Data snapshots in TDR are fixed, so your analysis remains reproducible, even as datasets are modified or expanded.
- Access control: Restricted data, including data subsets, is strictly controlled but easy to share with authorized colleagues. Integration with the Data Use Oversight System (DUOS) enforces proper authorization controls while streamlining the data request process.
Support for complex schemas / Competing goals
The underlying structure of TDR is designed to achieve two competing goals: accepting and storing as much data as possible while maximizing data findability and usefulness.
Accept/store as much data as possible
- Large computer-readable genomic files (BAMs, CRAMs, VCFs)
- Phenotypic data
- Electronic health records
- Epigenomic data
- Other non-file data
Maximize data findability and usefulness, and facilitate cross-study analysis
- Modular and nimble data storage
- Preferred data schema (TDR-specific data model)
- Fine-tuned and granular data organization
The first goal requires a very flexible data model (schema) to accommodate diverse data types. The second places some constraints on the data model. The underlying structure of the TDR is designed to address both goals.
Definitions
Dataset | Data Snapshot | Asset | Schema
Dataset
A set of related data. Custodians can store and organize almost any existing dataset in TDR. Datasets and snapshots can be managed by different people and contain different kinds of data. For example, one dataset might have a participant-sample schema primarily focused on sequencer files, while another might have a schema primarily focused on electronic health records. Each piece of (primary) data is owned by exactly one dataset.
Asset
A set of data and metadata that always goes together and to which a custodian has given access. For example, one asset might include a BAM file, some sequencing quality metrics, information about tissue type, and information about the donor for a particular biosample.
Data owners have fine-grained control over access by creating assets that describe different views of what “goes together” for different people. When you create an asset, you specify which table and which columns of metadata from the whole dataset are available to which individuals.
Examples of different and overlapping asset specifications
- Sample-centric assets for research purposes: An asset is organized primarily by sample ID and includes associated files and patient health records.
- Patient-centric assets for clinical purposes: An asset is organized around patient ID and includes their associated samples and files.
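As a rough illustration of the two asset examples above, the sketch below models a dataset as in-memory tables and applies two overlapping asset specifications to it. Everything here is hypothetical and for illustration only: the `AssetSpec` class, the `apply_asset` helper, and the table and column names are assumptions, not TDR's API or data model.

```python
from dataclasses import dataclass

# Hypothetical in-memory dataset: table name -> list of rows.
DATASET = {
    "sample": [
        {"sample_id": "s1", "subject_id": "p1", "tissue": "blood",
         "bam_path": "gs://example-bucket/s1.bam"},
        {"sample_id": "s2", "subject_id": "p1", "tissue": "skin",
         "bam_path": "gs://example-bucket/s2.bam"},
    ],
    "subject": [
        {"subject_id": "p1", "age": 54, "diagnosis": "T2D"},
    ],
}


@dataclass(frozen=True)
class AssetSpec:
    """Hypothetical asset: a root table plus the columns exposed per table."""
    name: str
    root_table: str
    columns: dict  # table name -> list of visible column names


def apply_asset(dataset, asset):
    """Return only the tables and columns that the asset exposes."""
    view = {}
    for table, cols in asset.columns.items():
        view[table] = [{c: row[c] for c in cols} for row in dataset[table]]
    return view


# Sample-centric asset: organized by sample, with limited subject metadata.
sample_centric = AssetSpec(
    name="sample_centric",
    root_table="sample",
    columns={
        "sample": ["sample_id", "tissue", "bam_path"],
        "subject": ["subject_id", "diagnosis"],
    },
)

# Patient-centric asset: organized by subject, with links to their samples.
patient_centric = AssetSpec(
    name="patient_centric",
    root_table="subject",
    columns={
        "subject": ["subject_id", "age", "diagnosis"],
        "sample": ["sample_id", "subject_id"],
    },
)

research_view = apply_asset(DATASET, sample_centric)
clinical_view = apply_asset(DATASET, patient_centric)
```

Both views draw on the same underlying rows, so the two assets overlap without the data being duplicated in the base dataset.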
Data Snapshot
A slice of a single dataset or a view of all or part of one or more studies. For example, a data snapshot could be the data for samples funded by one organization, or the subset of individuals who match specific criteria that not everyone in their cohort shares. The data snapshot is the element that most users will interact with; it is what researchers analyze.
Assets can be used to identify which data to include in data snapshots. The data snapshot is a unit of access control management.
Access is granted to a data snapshot and applies to all data visible in that snapshot. Data snapshots are immutable: they provide an unchanging view of the base data, so later changes to the base data are not visible in the snapshot. Immutability keeps analyses reproducible over time. There may be exceptions. For example, a revocation of consent might require removing data, depending on the dataset owner’s policy. TDR supports this operation but does not require it.
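The immutability described above can be pictured with a small in-memory analogy. This is only a sketch of the behavior, not how TDR implements snapshots; the `create_snapshot` helper and the sample rows are hypothetical.

```python
import copy
from types import MappingProxyType


def create_snapshot(rows, predicate):
    """Copy the matching rows and wrap each copy in a read-only mapping."""
    selected = [copy.deepcopy(row) for row in rows if predicate(row)]
    return tuple(MappingProxyType(row) for row in selected)


# Hypothetical base dataset table.
samples = [
    {"sample_id": "s1", "funder": "org_a", "coverage": 30},
    {"sample_id": "s2", "funder": "org_b", "coverage": 28},
]

# First release: samples funded by one organization.
snapshot_v1 = create_snapshot(samples, lambda r: r["funder"] == "org_a")

# The curator later updates and extends the base data.
samples[0]["coverage"] = 35
samples.append({"sample_id": "s3", "funder": "org_a", "coverage": 31})

# Second release: built from the updated base data.
snapshot_v2 = create_snapshot(samples, lambda r: r["funder"] == "org_a")
```

In this sketch `snapshot_v1` still reflects the data as it was when the snapshot was created, while `snapshot_v2` picks up the curator's changes, which mirrors delivering updates as new releases rather than editing an existing one.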
Schema (data model)
Represents the organization of a dataset's primary data and metadata in interconnected tables (TSVs or CSVs).
The schema (data model) consists of:
- Entities: The primary object that a table contains, identified by a unique key (e.g., a “subject” entity for phenotypic data or a “sample” entity for genomic data). Each row in the table is a distinct entity identified by an ID key.
- Attributes/properties: The columns in a database table (e.g., phenotypic data such as demographics or lab results, or genomic metadata such as the URI of the files in Google Cloud Storage).
- Associations: The unique identifiers that link data between tables (e.g., a subject_id column in the sample table that links each sample to its subject and to any data in the subject table).
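To make entities, attributes, and associations concrete, here is a minimal sketch that uses two small in-memory tables. The table contents, column names other than subject_id, and the `join_samples_to_subjects` helper are illustrative assumptions and are not TDR's schema format.

```python
# Entities: each table holds one kind of entity, keyed by a unique ID column.
# Attributes: the remaining columns (demographics, file URIs, and so on).
subject_table = [
    {"subject_id": "p1", "sex": "F", "age": 54},
    {"subject_id": "p2", "sex": "M", "age": 61},
]
sample_table = [
    {"sample_id": "s1", "subject_id": "p1", "cram_uri": "gs://example-bucket/s1.cram"},
    {"sample_id": "s2", "subject_id": "p1", "cram_uri": "gs://example-bucket/s2.cram"},
    {"sample_id": "s3", "subject_id": "p2", "cram_uri": "gs://example-bucket/s3.cram"},
]


def join_samples_to_subjects(samples, subjects):
    """Follow the subject_id association from each sample to its subject row."""
    subjects_by_id = {s["subject_id"]: s for s in subjects}
    joined = []
    for sample in samples:
        subject = subjects_by_id[sample["subject_id"]]
        # Merge the sample's attributes with its subject's attributes.
        joined.append({**sample, **subject})
    return joined


rows = join_samples_to_subjects(sample_table, subject_table)
```

Each row in `rows` carries the sample's own attributes plus those of the subject it is linked to, which is the kind of cross-table lookup an association column makes possible.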
Next steps: Ingesting data/using the TDR
Additional documentation covers step-by-step instructions for data custodians to ingest data into TDR and for data users to create and use data snapshots. Many steps involve using Swagger APIs. Note that you must follow the steps in the correct order and authenticate in Swagger at every step.
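If you script API calls instead of clicking through Swagger, the equivalent of authenticating at every step is attaching credentials to every request. The sketch below only builds a request and does not send it. It assumes bearer-token authentication, and the base URL, path, and token value are placeholders, not real TDR endpoints.

```python
import urllib.request

# Hypothetical values: replace with your deployment's base URL and a real token.
BASE_URL = "https://tdr.example.org"
ACCESS_TOKEN = "replace-with-access-token"


def build_request(path, token=ACCESS_TOKEN, method="GET"):
    """Build (but do not send) a request that carries a bearer token."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )


# Hypothetical path, for illustration only.
req = build_request("/api/example/datasets")
```

Routing every call through one helper like this makes it harder to forget the credentials on a later step.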
Data custodians process overview
- Create a TDR Billing Profile (GCP) or Create a TDR Billing Profile (Azure)
- Define your dataset schema
- Create a TDR dataset in TDR (note that you can also Create a TDR dataset with APIs)
- Ingest data
- Create dataset assets
- Create snapshots
Data users (researchers) process overview
Repository Permission Model (a deeper dive)
The roles available within the Data Repo, and the specific permissions afforded to each role, are described below.
Admin
The Admin role is an owner of the Data Repository. This role is for technically trained members of the Data Repo development team who help maintain the service. For example, an Admin can assign Steward roles to other users if the original owners are no longer available.
Steward
A Steward, or Data Owner, is the person who created the dataset. While ultimately liable for the data, a Steward can delegate hands-on data management to another person by assigning that person the Custodian role.
Custodian
The Custodian role is defined on a dataset. Someone may be the Custodian for one or more studies. A Custodian is responsible for creating data snapshots over datasets and controlling access to those snapshots.
Snapshot Creator
The Snapshot Creator role is defined on a dataset. Users with this role can read dataset data and create new Snapshots.
Reader
The Reader role is defined on a snapshot. A dataset Custodian can assign the Reader role to give a user read access to the snapshot data.
Discoverer
A Discoverer is someone using TDR to find data snapshots for analysis. They cannot read snapshot data unless they are given the Reader role.
Owner
The Owner is the creator of a spend profile, a billing account used to fund data storage and querying. They can update, delete, or share this profile with other users.
User
The User role is defined on a spend profile. Users may link this spend profile to a dataset or snapshot that they create. They can also assign this role to other individuals.
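The relationship between roles and what they allow can be sketched as a simple lookup. The mapping below is an illustrative subset inferred from the role descriptions above (Snapshot Creator, Reader, and Discoverer only); it is an assumption for demonstration, not the complete permission set, and `is_allowed` is a hypothetical helper.

```python
# Illustrative subset of role grants, inferred from the role descriptions.
# This is NOT the complete set of permissions for these roles.
ROLE_GRANTS = {
    "snapshot_creator": {("dataset", "read_data")},
    "reader": {("snapshot", "read_data")},
    "discoverer": {("snapshot", "discover_data")},
}


def is_allowed(roles, object_type, permission):
    """Return True if any of the user's roles grants the permission."""
    return any(
        (object_type, permission) in ROLE_GRANTS.get(role, set())
        for role in roles
    )


# A Discoverer can find a snapshot but cannot read its data...
can_discover = is_allowed({"discoverer"}, "snapshot", "discover_data")
can_read = is_allowed({"discoverer"}, "snapshot", "read_data")

# ...until the Reader role is added.
can_read_as_reader = is_allowed({"discoverer", "reader"}, "snapshot", "read_data")
```

Because permissions are attached to roles and roles are attached to a specific dataset, snapshot, or spend profile, granting someone access amounts to adding them to the right role on the right object.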
Permissions
Permissions control fairly fine-grained operations in the repository. The tables below list the permissions assigned to each role for the repository, spend profile, dataset, and snapshot.
-
Object
Permission Name
Admin
Steward
Custodian
Snapshot Creator
Reader
Discoverer
Owner
User
Repository
create_dataset
✅
Repository
list_jobs
✅
✅
Repository
delete_jobs
✅
✅
Repository
delete
✅
Repository
configure
✅
Repository
share_policy::admin
✅
Repository
share_policy::steward
✅
✅
Repository
read_policy::steward
✅
Repository
read_policies
✅
Repository
alter_policies
✅
-
Object
Permission Name
Admin
Steward
Custodian
Snapshot Creator
Reader
Discoverer
Owner
User
Spend Profile
update_metadata
✅
Spend Profile
update_billing_account
✅
Spend Profile
delete
✅
Spend Profile
link
✅
✅
Spend Profile
read_policies
✅
✅
Spend Profile
share_policy::owner
✅
✅
Spend Profile
share_policy::user
✅
✅
Spend Profile
alter_policies
✅
-
Object
Permission Name
Admin
Steward
Custodian
Snapshot Creator
Reader
Discoverer
Owner
User
Dataset
read_dataset
✅
✅
Dataset
delete
✅
Dataset
manage_schema
✅
✅
Dataset
read_data
✅
✅
✅
Dataset
ingest_data
✅
✅
Dataset
soft_delete
✅
✅
Dataset
hard_delete
✅
✅
Dataset
link_snapshot
✅
✅
✅
Dataset
unlink_snapshot
✅
✅
Dataset
list_snapshots
✅
✅
Dataset
share_policy::steward
✅
✅
Dataset
share_policy::custodian
✅
Dataset
share_policy::ingester
✅
Dataset
read_policies
✅
✅
✅
✅
Dataset
alter_policies
✅
-
Object
Permission Name
Admin
Steward
Custodian
Snapshot Creator
Reader
Discoverer
Owner
User
Snapshot
update_snapshot
✅
✅
Snapshot
read_data
✅
✅
✅
✅
Snapshot
discover_data
✅
✅
✅
✅
Snapshot
share_policy::steward
✅
✅
Snapshot
share_policy::custodian
✅
Snapshot
share_policy::reader
✅
✅
Snapshot
share_policy::discoverer
✅
✅
Snapshot
read_policy::custodian
✅
Snapshot
read_policy::steward
✅
Snapshot
read_policy::discoverer
✅
✅
Snapshot
read_policies
✅
✅
✅
Snapshot
alter_policies
✅