Overview: Defining your TDR dataset schema

Before creating your dataset in TDR, you will need to define your dataset schema (a TDR-specific data model). This article is an overview of TDR dataset schemas with guidance on how to define one to match what's in your dataset. Once you have a plan for your schema, you can Write a TDR dataset schema.

What is the dataset schema?

A data model explicitly determines the structure of data. It organizes data elements and standardizes how the data elements relate to one another.

Your dataset schema specifies the tables, columns, and relationships that comprise the tabular data in your dataset. It is the physical representation of the dataset’s data model, with additional details specific to the data-hosting platform (TDR). The schema is key for setting up your dataset in the format you want, and updating it later as your data change.

Start by taking stock of the data you have and how it is currently organized. Then consider whether it already fits the requirements for submission to TDR or needs to be reorganized.

The schema specifies:

The tables present in the dataset (such as sample and subject).
The columns within each table (for example, sample_id, bam_file, subject_age) as well as the properties of those columns (such as the expected data type and whether the column holds an array of values).
The relationships, if any, between tables (such as which samples were collected from which subjects) and the properties of those relationships (e.g., directionality).

The schema is a template for the tabular data you’ll ingest later.
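
As a rough illustration, the three parts of a schema listed above (tables, columns, and relationships) can be pictured as one nested structure. The sketch below uses a Python dictionary. The key names (tables, columns, datatype, array_of, relationships, from, to) and the relationship name are assumptions chosen for illustration and may not match the exact JSON that TDR expects; see How to write a TDR dataset schema for the actual format.

```python
import json

# Illustrative sketch only: the key names and datatypes below are assumptions,
# not the authoritative TDR schema format.
schema = {
    "tables": [
        {
            "name": "subject",
            "columns": [
                {"name": "subject_id", "datatype": "string", "array_of": False},
                {"name": "subject_age", "datatype": "integer", "array_of": False},
            ],
        },
        {
            "name": "sample",
            "columns": [
                {"name": "sample_id", "datatype": "string", "array_of": False},
                # Links each sample back to a row in the subject table.
                {"name": "subject_id", "datatype": "string", "array_of": False},
                # Stored here as a plain string path for simplicity.
                {"name": "bam_file", "datatype": "string", "array_of": False},
            ],
        },
    ],
    "relationships": [
        {
            "name": "sample_to_subject",  # hypothetical relationship name
            "from": {"table": "sample", "column": "subject_id"},
            "to": {"table": "subject", "column": "subject_id"},
        }
    ],
}

# List the tables the schema declares, then print the whole structure as JSON.
table_names = [table["name"] for table in schema["tables"]]
print(table_names)
print(json.dumps(schema, indent=2))
```

Writing the plan down in a structure like this, even informally, makes it easier to spot a missing column or a relationship that points at a table you haven't defined.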

Graphic illustration of a schema

Diagram illustrating a TDR schema graphically. Connected boxes stand in for administrative information (the TDR billing project and dataset), and two example data tables: subject (an example of a clinical data table) and sample (an example of a biospecimen data table).
Note that this is not a complete dataset schema.

What’s in a table? See the sample table below

Screenshot of an example TDR data table. The table contains 5 columns: sample_id, subject_id, crai_path, cram_path, and data_type. An orange box highlights one row. A green box highlights the subject_id column. A purple box highlights the cram_path column.

Each row is a single sample with a unique sample ID.
Data in tables can be linked. For example, in this table each sample is associated with its participant (in a separate subject table) by the subject_id (highlighted in green).
Links to object files in cloud storage are examples of metadata that can be kept in the sample table. In this example, the full URI to the sample cram is in the cram_path column (highlighted in purple).

Once data has been ingested, the tables in your dataset will contain the following.

Entities
Entities are specific instances of the object or concept a table contains, most commonly represented as an individual record in the table (such as the row highlighted in orange above). For example, a sample table may contain information about the biological samples sequenced within a dataset. Each record in the table then references a specific entity (in this case, a specific sample), generally identified by a unique ID key. When creating a table, you should ask yourself: What is the primary object or concept I want this table to contain?
Attributes
Attributes are properties that describe the entities found in a table and are represented as columns in the table (such as the columns highlighted in green and purple above). Each attribute contains a different piece of information about the entity, with the column name or label describing what the column contains. In the example above, the attributes of the sample table include “sample_id” (the unique identifier for a sample), “subject_id” (the identifier for the subject the sample was collected from, which may join back to a subject table), “crai_path” and “cram_path” (the Google Cloud Storage paths to the .crai and .cram files associated with the sample, respectively), and “data_type” (the type of sequencing performed on the sample). When adding columns to a table, you should ask yourself: What information do I want to include about the entities in the table?
Relationships
Relationships describe the associations between entities, and can be used to link data between tables (e.g., the subject_id column in the sample table above can be used to link samples back to the subject they were collected from). When creating tables, you should ask yourself: how do these tables relate to one another, and how should a user be expected to navigate between them?
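
To make entities, attributes, and relationships concrete, here is a minimal Python sketch that follows a relationship from the sample table to the subject table through the shared subject_id column. The rows, IDs, bucket path, and the helper name subject_for are all hypothetical and only stand in for real data; this is not how TDR itself resolves relationships.

```python
# Hypothetical rows: each dictionary is one entity, each key is one attribute.
subjects = [
    {"subject_id": "SUBJ_001", "subject_age": 54},
    {"subject_id": "SUBJ_002", "subject_age": 61},
]
samples = [
    {"sample_id": "SAMP_A", "subject_id": "SUBJ_001",
     "cram_path": "gs://example-bucket/SAMP_A.cram", "data_type": "WGS"},
    {"sample_id": "SAMP_B", "subject_id": "SUBJ_002",
     "cram_path": "gs://example-bucket/SAMP_B.cram", "data_type": "WGS"},
    {"sample_id": "SAMP_C", "subject_id": "SUBJ_001",
     "cram_path": "gs://example-bucket/SAMP_C.cram", "data_type": "WGS"},
]

# Index the subject table by its unique ID so each lookup is a single step.
subjects_by_id = {row["subject_id"]: row for row in subjects}


def subject_for(sample):
    """Follow the sample -> subject relationship via the subject_id column."""
    return subjects_by_id[sample["subject_id"]]


for sample in samples:
    subject = subject_for(sample)
    print(sample["sample_id"], "->", subject["subject_id"], "age", subject["subject_age"])
```

The lookup only works because subject_id is unique in the subject table, which is why each table needs a reliable ID key.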

Defining your schema

Think about the data you have and how to organize it in tables based on the questions you asked yourself above. Each table should represent a single type of entity, and the columns in that table should describe that entity.

For example, imagine you have tabular data associated with the participants in a study. You may create a participant or subject table with a row for each participant and a column for each piece of information you want to include. Additionally, let’s say you have tabular data associated with the samples collected from the participants mentioned above. You may create a sample table with a row for each sample and a column for each piece of information you want to include about a sample (for example, this could contain genomic sample information as well as a participant_id or subject_id attribute to connect it back to the participant table).

In almost all cases, there shouldn't be multiple rows corresponding to the same entity.

Example: If a patient table has multiple rows for different visits corresponding to the same patient, it’s usually better to reorganize the table to make each row unique. In this example, you could break down the table into a Patient table, with one row per patient, and a PatientVisit table, where each uniquely specified visit has its own row and a unique ID. The PatientVisit table could include a Patient_id column that references the unique patient IDs in the Patient table.

Example tables (recommended organization)

PatientVisit

PatientVisit_id	Patient_id	Lab_panel	Visit_date
G12345_01102010	G12345	https://bucket/lab-results2.csv	01/10/2010
G12345_01102001	G12345	https://bucket/lab-results1.csv	01/10/2001

Patient

Patient_id	Gender
G12345	female

Note how changing to a PatientVisit table gives each row a unique ID. The Patient_id can be found in a separate, linked table with additional patient information.

Example table (not recommended)

Patient_id	Lab_panel	Visit_date	Gender
G12345	https://bucket/lab-results2.csv	01/10/2010	female
G12345	https://bucket/lab-results1.csv	01/10/2001	female

Note how there are two rows corresponding to the same patient (patient G12345), which prevents the Patient_id from being a true unique ID key for the table.
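
The reorganization described in this section can be sketched in a few lines of Python. The rows below are modeled on the recommended example tables above. The function names and the rule for building PatientVisit_id (patient ID plus the visit date with slashes removed) are assumptions made for this illustration, not a TDR requirement.

```python
# One flat table with a repeated Patient_id, modeled on the example tables above.
flat_rows = [
    {"Patient_id": "G12345", "Lab_panel": "https://bucket/lab-results2.csv",
     "Visit_date": "01/10/2010", "Gender": "female"},
    {"Patient_id": "G12345", "Lab_panel": "https://bucket/lab-results1.csv",
     "Visit_date": "01/10/2001", "Gender": "female"},
]


def has_unique_key(rows, key):
    """Return True if no two rows share the same value in the key column."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))


def split_patient_visits(rows):
    """Split flat rows into a Patient table and a PatientVisit table."""
    patients = {}  # keyed by Patient_id, so each patient is kept only once
    visits = []
    for row in rows:
        patient_id = row["Patient_id"]
        patients[patient_id] = {"Patient_id": patient_id, "Gender": row["Gender"]}
        # Assumed ID rule: patient ID + "_" + visit date without slashes.
        visit_id = patient_id + "_" + row["Visit_date"].replace("/", "")
        visits.append({
            "PatientVisit_id": visit_id,
            "Patient_id": patient_id,
            "Lab_panel": row["Lab_panel"],
            "Visit_date": row["Visit_date"],
        })
    return list(patients.values()), visits


patient_table, visit_table = split_patient_visits(flat_rows)
print(has_unique_key(flat_rows, "Patient_id"))          # flat table: key repeats
print(has_unique_key(patient_table, "Patient_id"))      # one row per patient
print(has_unique_key(visit_table, "PatientVisit_id"))   # one row per visit
```

Running a uniqueness check like has_unique_key on each planned table before you write the schema is a quick way to catch tables that mix more than one type of entity.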

Making data findable with a standardized schema

The more similar schemas are across datasets, the easier it is for other people to find useful data in the data repository. We recommend using interoperable table and column names where possible.

See GA4GH Data Use Ontology.

Next: Write your TDR schema

Once you have an outline of the tables and their columns and relationships, see How to write a TDR dataset schema to learn how to write your schema's JSON, or see Build a schema in TDR to learn how to set up your schema through the TDR web interface.