How can the Terra Data Repository help you? If you're a dataset owner, it can help get your dataset into the hands of researchers. If you're a researcher, TDR can make it easier to find and access the data you want. This article addresses these questions with an overview of the structure and function of the Terra Data Repository and a detailed list of Data Repo's access controls.
What is the Data Repo, and why does it exist?
The Terra Data Repository (TDR for short) is a data repository designed to reduce friction for dataset owners delivering, and researchers using, ever-larger datasets in the cloud.
Data owner friction points and TDR solutions
- Versatility
Fine-tuned and granular organization structures allow for building out focused datasets from larger ones, and sharing with specific users on a case-by-case basis. Use the same base data to create datasets with different tabular organization.
- Cost-saving
A single dataset can be sliced and diced in different ways to allow access to different overlapping subsets without making and storing extra copies.
- Versioning and consent withdrawal
Raw datasets can be continually updated and delivered as new versions/releases via immutable snapshots.
Researcher friction points and TDR solutions
- Discovery (to be able to easily find all the relevant data you're authorized to use)
TDR includes faceted, indexed search across datasets.
- Scientific reproducibility (even as datasets are modified or expanded)
Data snapshots are fixed, allowing analysis to be reproducible over time.
- Access control (strict control of restricted data, but easy to share with authorized colleagues)
Data is easily subset to let you share unique views with distinct collaborators; integration with DUOS enforces authorization controls.
Support for complex schemas / Competing goals
TDR is built to create complex interrelated tables to meet data owner and researcher needs.
- Accept/store as much data as possible
  - Large computer readable genomic files (BAMs, CRAMs, VCFs)
  - Phenotypic data
  - Electronic health records
  - Epigenomics data
  - Other non-file data
- Maximize data findability and usefulness, and facilitate cross-study analysis
  - Modular and nimble data storage
  - Fine-tuned and granular data organization
The first goal requires a very flexible data model (schema). The second places some constraints on the data model. The underlying structure of the TDR is designed to address both goals.
Dataset | Data Snapshot | Asset | Schema
Dataset
A set of related data. The Terra Data Repository will support many datasets, managed by different people and containing different kinds of data. For example, one dataset might have a participant-sample schema primarily focused on sequencer files while another might have a schema primarily focused on electronic health records. Each piece of (primary) data is owned by exactly one dataset.
The TDR dataset schema (data model) is flexible, to enable custodians to store and organize almost any existing dataset.
Asset
A set of data and metadata that always goes together and to which a custodian has given access. For example, one asset might include a BAM file, some sequencing quality metrics, information about tissue type, and information about the donor for a particular biosample.
The Terra Data Repo allows data owners to have fine-grained control over access by creating assets that describe different views of what “goes together” for different people. When you create an asset, you specify which table and which columns of metadata are available to which individuals.
TDR allows different and overlapping asset specifications
- Sample-centric assets for research purposes (asset organized primarily by sample ID and including associated files and patient health records).
- Patient-centric assets for clinical purposes (asset organized around patient ID and including associated samples and files).
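As a rough illustration of how two overlapping asset specifications can expose different views of the same base data, the sketch below applies a hypothetical asset (a mapping from table name to visible columns) to small in-memory tables. The table names, column names, values, and the `apply_asset` helper are all assumptions made for this example; they are not part of the TDR API.

```python
from typing import Dict, List

Row = Dict[str, str]
Tables = Dict[str, List[Row]]
AssetSpec = Dict[str, List[str]]  # table name -> columns the asset exposes

# Hypothetical dataset with two tables (all names and values are illustrative).
dataset: Tables = {
    "sample": [
        {"sample_id": "S1", "subject_id": "P1",
         "bam_uri": "gs://example-bucket/S1.bam", "tissue": "blood"},
        {"sample_id": "S2", "subject_id": "P2",
         "bam_uri": "gs://example-bucket/S2.bam", "tissue": "liver"},
    ],
    "subject": [
        {"subject_id": "P1", "age": "54", "diagnosis": "T2D"},
        {"subject_id": "P2", "age": "61", "diagnosis": "none"},
    ],
}


def apply_asset(tables: Tables, asset: AssetSpec) -> Tables:
    """Return only the tables and columns named in the asset specification."""
    return {
        table: [{col: row[col] for col in columns} for row in tables[table]]
        for table, columns in asset.items()
    }


# Two overlapping asset specifications over the same base data.
sample_centric: AssetSpec = {
    "sample": ["sample_id", "bam_uri", "tissue"],
    "subject": ["subject_id", "age"],
}
patient_centric: AssetSpec = {
    "subject": ["subject_id", "age", "diagnosis"],
    "sample": ["sample_id", "subject_id"],
}

research_view = apply_asset(dataset, sample_centric)
clinical_view = apply_asset(dataset, patient_centric)
```

Neither view copies or alters the base tables; each one simply names the columns that "go together" for a given audience.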
Data Snapshot
A slice of a single dataset or a view of all or part of one or more studies. For example, a Snapshot could be the data for samples funded by one organization, or the subset of individuals matching specific criteria not shared by everyone in their cohort. The Snapshot is the element that most users will interact with (i.e. analyze).
Assets can be used to identify which data to include in data Snapshots. The data snapshot is a unit of access control management.
Access is granted to a data snapshot and applies to all data in view of the data snapshot. Data snapshots are immutable: they provide an unchanging view of the base data; changes to the base data are not visible in the data snapshot view. Making data snapshots immutable allows analysis to be reproducible over time. There may be exceptions; for example, revocation of consent might require removal of data. However, that is based on the dataset owner’s policy. The operation is enabled by TDR, but is not required.
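The immutability described above can be modeled in a few lines. This is only a behavioral sketch, not a description of how TDR implements snapshots: the `create_snapshot` helper, the row fields, and the use of deep copies wrapped in read-only mappings are all assumptions chosen for illustration.

```python
import copy
from types import MappingProxyType

# Hypothetical base data for a dataset (field names are illustrative).
base = [
    {"sample_id": "S1", "funder": "OrgA", "tissue": "blood"},
    {"sample_id": "S2", "funder": "OrgB", "tissue": "liver"},
]


def create_snapshot(base_rows, include):
    """Freeze the rows selected by `include` into a read-only, unchanging view."""
    return tuple(
        MappingProxyType(copy.deepcopy(row))  # read-only wrapper around a private copy
        for row in base_rows
        if include(row)
    )


# Snapshot of the samples funded by one organization.
snapshot = create_snapshot(base, lambda row: row["funder"] == "OrgA")

# Later changes to the base data are not visible in the snapshot.
base[0]["tissue"] = "plasma"
base.append({"sample_id": "S3", "funder": "OrgA", "tissue": "skin"})
```

After the base data changes, the snapshot still holds the single original row for `S1`, which is what lets an analysis be rerun against the same inputs.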
Schema (data model)
Represents the organization of a dataset's primary data and metadata in interconnected tables (TSVs or CSVs).
The schema (data model) consists of
- Entities
The primary object the table contains, with a unique key (e.g. a “subject” entity for phenotypic data or a “sample” entity for genomic data). Each row in the table is a distinct entity identified by an ID key.
- Attributes/properties
The columns in a database table (e.g. phenotypic data such as demographics or lab results, or genomic metadata such as the URI of the files in Google Cloud Storage).
- Associations
The unique identifiers that link data between tables (e.g. a subject_id column in the sample table that links samples to the subject and any data in the subject table).
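To make the three schema components concrete, the sketch below models a "subject" table and a "sample" table as lists of rows and follows the `subject_id` association from each sample to its subject. Apart from `subject_id`, which the text above uses as its example, the column names, values, and the `samples_with_subject` helper are hypothetical.

```python
# Entities: each row is one entity, identified by its ID key.
# Attributes: the remaining columns of each row.
subject_table = [
    {"subject_id": "P1", "sex": "F", "hba1c": "6.9"},
    {"subject_id": "P2", "sex": "M", "hba1c": "5.4"},
]
sample_table = [
    # Association: subject_id links each sample to a row in the subject table.
    {"sample_id": "S1", "subject_id": "P1", "cram_uri": "gs://example-bucket/S1.cram"},
    {"sample_id": "S2", "subject_id": "P1", "cram_uri": "gs://example-bucket/S2.cram"},
    {"sample_id": "S3", "subject_id": "P2", "cram_uri": "gs://example-bucket/S3.cram"},
]


def samples_with_subject(samples, subjects):
    """Follow the subject_id association to attach subject attributes to each sample."""
    subjects_by_id = {subject["subject_id"]: subject for subject in subjects}
    return [{**subjects_by_id[sample["subject_id"]], **sample} for sample in samples]


joined = samples_with_subject(sample_table, subject_table)
```

Because the link is carried by an identifier rather than by duplicated columns, one subject row can serve any number of sample rows.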
Repository Permission Model
The roles available within the Data Repo, and the specific permissions afforded to each role, are described below.
Roles
Admin
The Admin role is an owner of the Data Repository. This role is assigned to technically trained individuals on the Data Repo development team to help maintain the service. For example, an Admin can assign Steward roles to other users if the original owners are no longer available.
Steward
A Steward, or Data Owner, is the person who created the dataset. While they are ultimately liable for the data, they can assign the hands-on data management to another person by assigning the Custodian role.
Custodian
The Custodian role is defined on a dataset. Someone may be the Custodian for one or more studies. A Custodian is responsible for creating data snapshots over datasets and controlling access to those snapshots.
Snapshot Creator
The Snapshot Creator role is defined on a dataset. Users with this role can read dataset data and create new Snapshots.
Reader
The Reader role is defined on a snapshot. Readers may be assigned by a dataset Custodian to get read-access to the snapshot data.
Discoverer
A Discoverer is someone using the repository to find data snapshots for analysis. They cannot read snapshot data unless they are given the Reader role.
Owner
The Owner role is assigned to the creator of a spend profile, which is a billing account used to fund data storage and querying. They can update, delete, or share this profile with other users.
User
The User role is defined on a spend profile. Users may link this spend profile to a dataset or snapshot that they create. They can also assign this role to other individuals.
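The relationship between the Reader and Discoverer roles described above can be sketched as a simple lookup. This is a deliberately reduced model: the role-to-permission mapping covers only those two roles as the prose describes them, and the `has_permission` helper, the snapshot ID, and the email address are hypothetical. It does not reproduce the full permission model or any TDR API.

```python
from typing import Dict, Set, Tuple

# Simplified, assumed mapping based only on the role descriptions above.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "reader": {"read_data"},
    "discoverer": {"discover_data"},
}

# Roles are defined per snapshot: (snapshot ID, user) -> roles held.
Assignments = Dict[Tuple[str, str], Set[str]]


def has_permission(assignments: Assignments, snapshot_id: str,
                   user: str, permission: str) -> bool:
    """Return True if any role the user holds on the snapshot grants the permission."""
    roles = assignments.get((snapshot_id, user), set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


assignments: Assignments = {("snapshot-1", "alice@example.org"): {"discoverer"}}

# A Discoverer can find the snapshot but cannot read its data...
can_discover = has_permission(assignments, "snapshot-1", "alice@example.org", "discover_data")
can_read_before = has_permission(assignments, "snapshot-1", "alice@example.org", "read_data")

# ...until the Reader role is also granted on that snapshot.
assignments[("snapshot-1", "alice@example.org")].add("reader")
can_read_after = has_permission(assignments, "snapshot-1", "alice@example.org", "read_data")
```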
Permissions
We define permissions for fairly fine-grained operations in the repository. The table below summarizes the permissions assigned to the various roles.
Object |
Permission Name |
Admin |
Steward |
Custodian |
Snapshot Creator |
Reader |
Discoverer |
Owner |
User |
Repository |
create_dataset |
✅ |
|||||||
Repository |
list_jobs |
✅ |
✅ |
||||||
Repository |
delete_jobs |
✅ |
✅ |
||||||
Repository |
delete |
✅ |
|||||||
Repository |
configure |
✅ |
|||||||
Repository |
share_policy::admin |
✅ |
|||||||
Repository |
share_policy::steward |
✅ |
✅ |
||||||
Repository |
read_policy::steward |
✅ |
|||||||
Repository |
read_policies |
✅ |
|||||||
Repository |
alter_policies |
✅ |
|||||||
Spend Profile |
update_metadata |
✅ |
|||||||
Spend Profile |
update_billing_account |
✅ |
|||||||
Spend Profile |
delete |
✅ |
|||||||
Spend Profile |
link |
✅ |
✅ |
||||||
Spend Profile |
read_policies |
✅ |
✅ |
||||||
Spend Profile |
share_policy::owner |
✅ |
✅ |
||||||
Spend Profile |
share_policy::user |
✅ |
✅ |
||||||
Spend Profile |
alter_policies |
✅ |
|||||||
Dataset |
read_dataset |
✅ |
✅ |
||||||
Dataset |
delete |
✅ |
|||||||
Dataset |
manage_schema |
✅ |
✅ |
||||||
Dataset |
read_data |
✅ |
✅ |
✅ |
|||||
Dataset |
ingest_data |
✅ |
✅ |
||||||
Dataset |
soft_delete |
✅ |
✅ |
||||||
Dataset |
hard_delete |
✅ |
✅ |
||||||
Dataset |
link_snapshot |
✅ |
✅ |
✅ |
|||||
Dataset |
unlink_snapshot |
✅ |
✅ |
||||||
Dataset |
list_snapshots |
✅ |
✅ |
||||||
Dataset |
share_policy::steward |
✅ |
✅ |
||||||
Dataset |
share_policy::custodian |
✅ |
|||||||
Dataset |
share_policy::ingester |
✅ |
|||||||
Dataset |
read_policies |
✅ |
✅ |
✅ |
✅ |
||||
Dataset |
alter_policies |
✅ |
|||||||
Snapshot |
delete |
✅ |
✅ |
||||||
Snapshot |
update_snapshot |
✅ |
✅ |
||||||
Snapshot |
read_data |
✅ |
✅ |
✅ |
✅ |
||||
Snapshot |
discover_data |
✅ |
✅ |
✅ |
✅ |
||||
Snapshot |
share_policy::steward |
✅ |
✅ |
||||||
Snapshot |
share_policy::custodian |
✅ |
|||||||
Snapshot |
share_policy::reader |
✅ |
✅ |
||||||
Snapshot |
share_policy::discoverer |
✅ |
✅ |
||||||
Snapshot |
read_policy::custodian |
✅ |
|||||||
Snapshot |
read_policy::steward |
✅ |
|||||||
Snapshot |
read_policy::discoverer |
✅ |
✅ |
||||||
Snapshot |
read_policies |
✅ |
✅ |
✅ |
|||||
Snapshot |
alter_policies |
✅ |