Defining your TDR dataset schema

Allie Cliffe

Before creating your dataset in TDR, you will need to define your dataset schema (a TDR-specific data model). This article is an overview of TDR dataset schemas with guidance on how to define one to match what's in your dataset. Once you have the schema, you can create a dataset in the UI (including generating a schema JSON). 

If you already have your dataset schema in JSON format and would rather use APIs to create your dataset, you can go straight to Create your TDR dataset schema (Swagger).

What is the dataset schema?

The dataset schema specifies the tables, columns, and relationships that comprise the tabular data in a dataset. It is the physical representation of the dataset’s data model, with additional details specific to the data-hosting platform (TDR).

The schema specifies:

  • The tables present in the dataset (such as sample, subject, etc.)
  • The columns within each table (for example sample_id, bam_file, subject_age) as well as the properties of those columns (the expected data type, such as integer or string, whether the column holds an array of values, etc.)
  • The relationships, if any, between tables (such as which samples were collected from which subjects) and the properties of those relationships (e.g., directionality).

The schema is a template for the tabular data you’ll ingest later.
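As a rough sketch, the three parts listed above (tables, columns, relationships) can be written down as a nested structure before you create the dataset. The field names used here (`tables`, `columns`, `datatype`, `array_of`, `relationships`, `from`, `to`) and the `fileref` type name are assumptions modeled on the general shape of a schema JSON; confirm the exact format against the TDR Swagger documentation. The `check_relationships` helper is hypothetical and only illustrates a sanity check you might run yourself.

```python
import json

# Hypothetical schema outline for a dataset with subject and sample tables.
# Field names and type names are assumptions; verify them against the TDR API docs.
schema = {
    "tables": [
        {
            "name": "subject",
            "columns": [
                {"name": "subject_id", "datatype": "string", "array_of": False},
                {"name": "subject_age", "datatype": "integer", "array_of": False},
            ],
        },
        {
            "name": "sample",
            "columns": [
                {"name": "sample_id", "datatype": "string", "array_of": False},
                {"name": "subject_id", "datatype": "string", "array_of": False},
                # "fileref" is an assumed type name for a column that points at a file.
                {"name": "bam_file", "datatype": "fileref", "array_of": False},
            ],
        },
    ],
    "relationships": [
        {
            "name": "sample_to_subject",
            "from": {"table": "sample", "column": "subject_id"},
            "to": {"table": "subject", "column": "subject_id"},
        }
    ],
}


def check_relationships(schema):
    """Return relationship endpoints that name a table/column not defined in the schema."""
    known = {
        (table["name"], column["name"])
        for table in schema["tables"]
        for column in table["columns"]
    }
    problems = []
    for rel in schema["relationships"]:
        for side in ("from", "to"):
            endpoint = (rel[side]["table"], rel[side]["column"])
            if endpoint not in known:
                problems.append((rel["name"], side, endpoint))
    return problems


schema_json = json.dumps(schema, indent=2)
print(schema_json)
print("Relationship problems:", check_relationships(schema))
```

Writing the outline this way makes it easy to spot a relationship that points at a column you forgot to define before you start building the dataset.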

Graphic illustration of a schema

Note that this is not a complete dataset schema.

What’s in a table? See the sample table below.

Screenshot of an example sample table

Once data has been ingested, the tables in your dataset will contain the following.

  • Entities
    Entities are specific instances of the object or concept a table contains, most commonly represented as an individual record in the table (such as the row circled in orange above). For example, a sample table may contain information about the biological samples sequenced within a dataset. Each record in the table then references a specific entity, in this case, a specific sample, generally identified by a unique ID key. When creating a table, you should ask yourself: What is the primary object or concept I want this table to contain?
  • Attributes
    Attributes are properties that describe the entities found in a table and are represented as columns in the table (such as the columns circled in green and purple above). Each attribute contains a different piece of information about the entity, with the column name or label describing what the column contains. In the example above, the attributes of the sample table include “sample_id” (the unique identifier for a sample), “subject_id” (the identifier for the subject the sample was collected from, which may join back to a subject table), “crai_path” and “cram_path” (the GS paths to the .crai and .cram files associated with the sample, respectively), and “data_type” (the type of sequencing performed on the sample). When adding columns to a table, you should ask yourself: What information do I want to include about the entities in the table?
  • Relationships
    Relationships describe the associations between entities, and can be used to link data between tables (e.g., the subject_id column in the sample table above can be used to link samples back to the subject they were collected from). When creating tables, you should ask yourself: how do these tables relate to one another, and how should a user be expected to navigate between them?
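To make the three ideas concrete, here is a minimal sketch in which each dictionary is an entity (a row), each key is an attribute (a column), and the shared subject_id value is the relationship used to navigate from a sample to its subject. The row values and the `samples_with_subject` helper are made up for illustration and are not part of TDR.

```python
# Illustrative rows; the values are invented for this sketch.
subjects = [
    {"subject_id": "SUBJ1", "subject_age": 34},
    {"subject_id": "SUBJ2", "subject_age": 58},
]
samples = [
    {"sample_id": "SAMP1", "subject_id": "SUBJ1", "data_type": "WGS"},
    {"sample_id": "SAMP2", "subject_id": "SUBJ1", "data_type": "Exome"},
    {"sample_id": "SAMP3", "subject_id": "SUBJ2", "data_type": "WGS"},
]

# Index the subject table by its unique ID so each sample can look up its subject.
subjects_by_id = {row["subject_id"]: row for row in subjects}


def samples_with_subject(samples, subjects_by_id):
    """Attach the matching subject record to each sample via subject_id."""
    joined = []
    for sample in samples:
        # .get() yields None if a sample references a subject that is not in the table.
        subject = subjects_by_id.get(sample["subject_id"])
        joined.append({**sample, "subject": subject})
    return joined


joined = samples_with_subject(samples, subjects_by_id)
for row in joined:
    print(row["sample_id"], "->", row["subject"]["subject_id"])
```

The lookup only works cleanly because subject_id is unique within the subject table, which is why each table should have one row per entity.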

Defining your schema

Think about the data you have and how to organize it in tables based on the questions you asked yourself above. Each table should represent a single type of entity, and the columns in that table should describe that entity.

For example, imagine you have tabular data associated with the participants in a study. You may create a participant or subject table with a row for each participant and a column for each piece of information you want to include. Additionally, let’s say you have tabular data associated with the samples collected from the participants mentioned above. You may create a sample table with a row for each sample and a column for each piece of information you want to include about a sample (for example, this could contain genomic sample information as well as a participant_id or subject_id attribute to connect it back to the participant table). 

In almost all cases, there shouldn't be multiple rows corresponding to the same entity.

Example: If a patient table has multiple rows for different visits corresponding to the same patient, it’s usually better to reorganize the table to make each row unique. In this example, you could break down the table into a Patient table, with one row per patient, and a PatientVisit table, where each uniquely specified visit has its own row and a unique ID. The PatientVisit table could include a patient_id column that references the unique patient IDs in the patient table.

Example tables (recommended organization)

PatientVisit table

PatientVisit_id    Patient_id    Lab_panel                          Visit_date
G12345_01102010    G12345        https://bucket/lab-results2.csv    01/10/2010
G12345_01102001    G12345        https://bucket/lab-results1.csv    01/10/2001

Patient table

Patient_id    Gender
G12345        female

Note how changing to a PatientVisit table gives each row a unique ID. The Patient_id can be found in a separate, linked table with additional patient information.
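If your data starts out as a single flat table with one row per visit, the split into two tables can be sketched as follows. The `split_patient_visits` function is hypothetical, and the ID convention (patient ID, an underscore, then the visit date with slashes removed) is an assumption inferred from the example IDs above; any scheme that yields a unique value per visit would work.

```python
# Flat rows with one row per visit, using the example values from the tables above.
flat_rows = [
    {"Patient_id": "G12345", "Lab_panel": "https://bucket/lab-results2.csv",
     "Visit_date": "01/10/2010", "Gender": "female"},
    {"Patient_id": "G12345", "Lab_panel": "https://bucket/lab-results1.csv",
     "Visit_date": "01/10/2001", "Gender": "female"},
]


def split_patient_visits(rows):
    """Split flat visit rows into a Patient table and a PatientVisit table."""
    patients = {}  # keyed by Patient_id, so each patient is stored only once
    visits = []
    for row in rows:
        patients[row["Patient_id"]] = {
            "Patient_id": row["Patient_id"],
            "Gender": row["Gender"],
        }
        # Assumed ID convention: <Patient_id>_<date digits>, e.g. G12345_01102010.
        visit_id = f'{row["Patient_id"]}_{row["Visit_date"].replace("/", "")}'
        visits.append({
            "PatientVisit_id": visit_id,
            "Patient_id": row["Patient_id"],
            "Lab_panel": row["Lab_panel"],
            "Visit_date": row["Visit_date"],
        })
    return list(patients.values()), visits


patients, visits = split_patient_visits(flat_rows)
print(patients)
for visit in visits:
    print(visit["PatientVisit_id"], visit["Patient_id"], visit["Visit_date"])
```

Patient-level attributes such as Gender end up stored once per patient instead of being repeated on every visit row.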

Example table (not recommended)

Patient_id    Lab_panel                          Visit_date    Gender
G12345        https://bucket/lab-results2.csv    01/10/2010    female
G12345        https://bucket/lab-results1.csv    01/10/2001    female

Note how there are two rows corresponding to the same patient (patient G12345), which prevents the Patient_id from being a true unique ID key for the table.
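A quick way to test whether a column can serve as a unique ID key is to count how often each value appears. This is a minimal sketch; `duplicate_keys` is a hypothetical helper, not a TDR function, and the rows are abbreviated versions of the example tables.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the sorted values of `key` that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# One row per visit, keyed only by patient: the ID repeats.
flat_rows = [
    {"Patient_id": "G12345", "Lab_panel": "https://bucket/lab-results2.csv"},
    {"Patient_id": "G12345", "Lab_panel": "https://bucket/lab-results1.csv"},
]

# One row per visit, keyed by a visit-specific ID: every ID is unique.
visit_rows = [
    {"PatientVisit_id": "G12345_01102010", "Patient_id": "G12345"},
    {"PatientVisit_id": "G12345_01102001", "Patient_id": "G12345"},
]

print(duplicate_keys(flat_rows, "Patient_id"))
print(duplicate_keys(visit_rows, "PatientVisit_id"))
```

An empty result means the column is unique across the rows checked and is a candidate ID key; any values returned point to rows that should be reorganized.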

Making data findable with a standardized schema

The more closely dataset schemas resemble one another, the easier it is for other people to find useful data in the data repo. We recommend using interoperable table and column names where possible.

See GA4GH Data Use Ontology.

Next: Create your dataset in TDR

Once you have an outline of the tables and their columns and relationships, you can create your dataset in TDR. Depending on your comfort level, you can create the dataset through the TDR graphical interface or with API calls (Swagger).
