Understanding TDR dataset schemas and how to define one.
If you already have your dataset schema in JSON format, go straight to Create your TDR dataset schema (Swagger).
What is the dataset schema?
Before you can create a TDR dataset, you need to define the structure of the data you’ll be ingesting by specifying its schema. The schema is similar to a data model but with additional details specific to the data-hosting platform (TDR).
The schema specifies the set of tables that will hold your data as well as the columns in each table. The schema also specifies relationships between tables: for example, a participant_id column in the sample table that corresponds to a row in the participant table.
The schema is a template for the data you’ll ingest later.
[Figure: graphic illustration of a schema. Note that this is not a complete dataset schema.]
The schema’s function is to describe:
- The tables that contain the data (e.g., sample, subject)
- The metadata (columns) in each table (e.g., BAM file, subject_age)
- Any relationships between tables (e.g., which samples correspond to which subjects)
A schema includes:
- Entities (i.e., tables)
  The entity is the primary object the table contains. Each row in the table is a distinct entity with a unique ID key (e.g., a “subject” entity for phenotypic data or a “sample” entity for genomic data).
- Metadata attributes (i.e., column labels)
  Each table column contains a different sort of metadata (e.g., age, ancestry, or lab results in the subject table, or links to genomic data files in cloud storage in the sample table). The datatype (e.g., number, string) is part of the metadata: you specify the datatype when you define a column.
- Associations
  The unique identifiers that link data between tables (e.g., a subject_id column in the sample table that links samples with the subject).
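As a rough illustration of how entities, attributes, and associations fit together, the sketch below builds a small schema as a plain Python dictionary and prints it as JSON. The key names (`tables`, `columns`, `datatype`, `relationships`, `from`, `to`) and the table and column names are assumptions made for this example; check the exact format against the TDR Swagger documentation before using anything like it.

```python
import json

# Hypothetical schema layout. The key names here are illustrative
# assumptions, not a confirmed copy of the TDR schema format.
schema = {
    "tables": [
        {
            # Entity: one row per subject
            "name": "subject",
            "columns": [
                {"name": "subject_id", "datatype": "string"},
                {"name": "subject_age", "datatype": "integer"},
            ],
        },
        {
            # Entity: one row per sample
            "name": "sample",
            "columns": [
                {"name": "sample_id", "datatype": "string"},
                {"name": "subject_id", "datatype": "string"},
                {"name": "bam_file", "datatype": "string"},
            ],
        },
    ],
    "relationships": [
        {
            # Association: sample.subject_id points at subject.subject_id
            "name": "sample_to_subject",
            "from": {"table": "sample", "column": "subject_id"},
            "to": {"table": "subject", "column": "subject_id"},
        }
    ],
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```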
What’s in a table?
- Each table tracks an entity: a particular kind of data or record. Ask yourself: what is the primary thing that this table contains (for example, participant, sample, or subject data)?
- Generally, each row corresponds to one unique entity - a unique participant, or sample, or subject, for example.
- Columns correspond to data or metadata associated with that entity.
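One way to test the "one row per entity" rule before defining a table is to look for ID values that occur more than once. The following is a minimal sketch with made-up rows; the function name `find_duplicate_ids` and the `participant_id` values are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(rows, id_column):
    """Return the ID values that appear in more than one row, sorted."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Made-up example rows: P001 appears twice, so it is not unique.
rows = [
    {"participant_id": "P001", "age": 34},
    {"participant_id": "P002", "age": 51},
    {"participant_id": "P001", "age": 35},
]

duplicates = find_duplicate_ids(rows, "participant_id")
print(duplicates)
```

An empty result suggests each row already represents a distinct entity; any ID that is returned points to rows that may need to be merged or moved to a separate table.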
Define your schema
Think about the data you have and how to organize it in tables. Each table represents one type of entity and may refer to a separate table. For example, a participant or subject table might hold tabular data associated with participants in a study, such as clinical or phenotypic data. A sample table would contain genomic sample data as well as a participant_id or subject_id attribute connecting each sample to its participant.
In almost all cases, you should not have multiple rows in a table corresponding to the same entity.
Example: If a patient table has multiple rows for different visits corresponding to the same patient, it’s usually better to reorganize the table to make each row unique. You could break down the table in this example into a patient table, with one row per patient, and a PatientVisit table, with one row per visit, each with a unique ID. The PatientVisit table could include a patient_id column that references the unique patient IDs in the patient table.
Example table (not recommended)
Patient_id | LabPanel | VisitDate |
G12345 | https://bucket/lab-results2.csv | 01/10/2010 |
G12345 | https://bucket/lab-results1.csv | 01/10/2001 |
Notice there are two rows corresponding to the same patient (patient G12345).
Example table (recommended)
PatientVisit_id | Patient_id | LabPanel | VisitDate |
G12345_01102010 | G12345 | https://bucket/lab-results2.csv | 01/10/2010 |
G12345_01102001 | G12345 | https://bucket/lab-results1.csv | 01/10/2001 |
Notice that changing to a PatientVisit table gives each row a unique ID. The Patient_id column links each visit to that patient's row in a separate patient table.
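The reorganization described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_visits` is hypothetical, and building the visit ID from the patient ID plus the visit date mirrors the example table rather than any TDR requirement.

```python
def split_visits(rows):
    """Split one-row-per-visit data into a patient table and a PatientVisit table."""
    patients = {}
    visits = []
    for row in rows:
        patient_id = row["Patient_id"]
        # Keep exactly one patient row per unique patient ID.
        patients.setdefault(patient_id, {"Patient_id": patient_id})
        # Build the visit ID from the patient ID and the date digits,
        # following the pattern used in the example table.
        date_digits = row["VisitDate"].replace("/", "")
        visits.append(
            {
                "PatientVisit_id": f"{patient_id}_{date_digits}",
                "Patient_id": patient_id,
                "LabPanel": row["LabPanel"],
                "VisitDate": row["VisitDate"],
            }
        )
    return list(patients.values()), visits


rows = [
    {"Patient_id": "G12345", "LabPanel": "https://bucket/lab-results2.csv", "VisitDate": "01/10/2010"},
    {"Patient_id": "G12345", "LabPanel": "https://bucket/lab-results1.csv", "VisitDate": "01/10/2001"},
]

patient_table, visit_table = split_visits(rows)
print(patient_table)
print([visit["PatientVisit_id"] for visit in visit_table])
```

This ID pattern is only unique if a patient has at most one visit per date; data with several visits on the same day would need a different key.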
Making data findable with a standardized schema
The more closely dataset schemas resemble one another, the easier it is for other people to find useful data in the data repo. We recommend using interoperable table and column names where possible.
Create your schema in TDR
Once you have an outline of the tables and their columns and relationships, you can create your dataset schema right in TDR.
You can also use the Swagger API to create TDR dataset schemas. For step-by-step instructions, see Create your TDR dataset schema (Swagger).
Step-by-step instructions
Start by logging in to data.terra.bio.
Step 1: Dataset information
In the intro form, complete the required fields and drop-downs for the dataset.
Required fields
- Name
- Cloud Platform
- Billing Profile
- Region
You can also add a description, designate stewards and custodians, and choose secure monitoring.
Step 2: Build schema and create the dataset
Instructions in the browser walk through dataset schema creation, table-by-table.
2.1. Use the blue buttons to create a table. Repeat for each table in your schema.
2.2. Use the second blue button to add columns to each table. You will select the column name and datatype. Repeat for each attribute (column) in each table.
2.3. Scroll down to the JSON view and copy the entire content (screenshot below). Use this in the next step: Create a dataset schema (Swagger).
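Before pasting the copied JSON into the next step, it can help to confirm that it parses and that every relationship points at a table and column that actually exist. The sketch below assumes the copied JSON uses `tables`, `columns`, and `relationships` keys with `from`/`to` entries; those key names, the sample JSON, and the function name `check_relationships` are assumptions for illustration, so adjust them to match what the JSON view actually shows.

```python
import json


def check_relationships(schema):
    """Return a list of problems found in the schema's relationships."""
    columns_by_table = {
        table["name"]: {col["name"] for col in table.get("columns", [])}
        for table in schema.get("tables", [])
    }
    problems = []
    for rel in schema.get("relationships", []):
        for end in ("from", "to"):
            table = rel[end]["table"]
            column = rel[end]["column"]
            if table not in columns_by_table:
                problems.append(f"{rel['name']}: unknown table {table}")
            elif column not in columns_by_table[table]:
                problems.append(f"{rel['name']}: unknown column {table}.{column}")
    return problems


# Stand-in for the content copied from the JSON view (hypothetical example).
copied_json = """
{
  "tables": [
    {"name": "subject", "columns": [{"name": "subject_id", "datatype": "string"}]},
    {"name": "sample", "columns": [
      {"name": "sample_id", "datatype": "string"},
      {"name": "subject_id", "datatype": "string"}
    ]}
  ],
  "relationships": [
    {"name": "sample_to_subject",
     "from": {"table": "sample", "column": "subject_id"},
     "to": {"table": "subject", "column": "subject_id"}}
  ]
}
"""

schema = json.loads(copied_json)  # raises json.JSONDecodeError if the paste is incomplete
problems = check_relationships(schema)
print(problems or "no problems found")
```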