Before creating your dataset in TDR, you will need to define your dataset schema (a TDR-specific data model). This article is an overview of TDR dataset schemas with guidance on how to define one to match what's in your dataset. Once you have a plan for your schema, see Write a TDR dataset schema.
What is the dataset schema?
The dataset schema specifies the tables, columns, and relationships that comprise the tabular data in a dataset. It is the physical representation of the dataset’s data model, with additional details specific to the data-hosting platform (TDR). The schema is key for setting up your dataset in the format you want, and updating it later as your data change.
The schema specifies:
- The tables present in the dataset (such as sample, subject, etc.)
- The columns within each table (for example sample_id, bam_file, subject_age) as well as the properties of those columns (such as the expected data type, and whether the column holds a single value or an array of values)
- The relationships, if any, between tables (such as which samples were collected from which subjects) and the properties of those relationships (e.g., directionality).
The schema is a template for the tabular data you’ll ingest later.
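As a rough illustration, the three parts of a schema listed above can be modeled with plain Python data classes. This is only a sketch: the class and field names (`Column`, `Table`, `Relationship`, `datatype`, `is_array`) and the data type labels are hypothetical and are not the TDR schema format itself, which is covered in the article on writing a TDR dataset schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    datatype: str           # illustrative label, e.g. "string" or "integer"
    is_array: bool = False  # whether each cell holds a list of values


@dataclass
class Table:
    name: str
    columns: List[Column] = field(default_factory=list)


@dataclass
class Relationship:
    name: str
    from_table: str
    from_column: str
    to_table: str
    to_column: str


# Two tables and one relationship, using the column names mentioned above.
sample = Table("sample", [
    Column("sample_id", "string"),
    Column("subject_id", "string"),
])
subject = Table("subject", [
    Column("subject_id", "string"),
    Column("subject_age", "integer"),
])
sample_to_subject = Relationship(
    "sample_to_subject", "sample", "subject_id", "subject", "subject_id"
)


def check_relationship(rel, tables):
    """Return True if both ends of the relationship name an existing column."""
    columns_by_table = {t.name: {c.name for c in t.columns} for t in tables}
    return (
        rel.from_column in columns_by_table.get(rel.from_table, set())
        and rel.to_column in columns_by_table.get(rel.to_table, set())
    )
```

Sketching the schema this way before writing it out can help catch a relationship that points at a column you forgot to define.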
[Figure: graphic illustration of a schema. Note that this is not a complete dataset schema.]
What’s in a table? See the sample table below
Once data has been ingested, the tables in your dataset will contain the following.
- Entities
  Entities are specific instances of the object or concept a table contains, most commonly represented as an individual record in the table (such as the row circled in orange above). For example, a sample table may contain information about the biological samples sequenced within a dataset. Each record in the table then references a specific entity, in this case, a specific sample, generally identified by a unique ID key. When creating a table, you should ask yourself: What is the primary object or concept I want this table to contain?
- Attributes
  Attributes are properties that describe the entities found in a table and are represented as columns in the table (such as the columns circled in green and purple above). Each attribute contains a different piece of information about the entity, with the column name or label providing a description of what the column contains. In the example above, the attributes of the sample table include: “sample_id” (the unique identifier for a sample), the “subject_id” (the identifier for the subject the sample was collected from, which may join back to a subject table), the “crai_path” and “cram_path” (the GS paths to the .crai and .cram files associated with the sample, respectively), and “data_type” (the type of sequencing performed on the sample). When adding columns to a table, you should ask yourself: What information do I want to include about the entities in the table?
- Relationships
  Relationships describe the associations between entities, and can be used to link data between tables (e.g., the subject_id column in the sample table above can be used to link samples back to the subject they were collected from). When creating tables, you should ask yourself: how do these tables relate to one another, and how should a user be expected to navigate between them?
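To make these three terms concrete, here is a small Python sketch in which each dictionary is an entity (a row), each key is an attribute (a column), and the shared subject_id value acts as the relationship between tables. All IDs and values are made-up placeholders, and the helper functions are hypothetical; this is not how TDR stores or queries data.

```python
# Each dict is one entity (row); each key is an attribute (column).
# All values are made-up placeholders for illustration.
sample_rows = [
    {"sample_id": "S1", "subject_id": "P1", "data_type": "WGS"},
    {"sample_id": "S2", "subject_id": "P1", "data_type": "WGS"},
    {"sample_id": "S3", "subject_id": "P2", "data_type": "Exome"},
]
subject_rows = [
    {"subject_id": "P1"},
    {"subject_id": "P2"},
]


def samples_for_subject(subject_id, samples):
    """Follow the relationship: list the sample IDs collected from one subject."""
    return [row["sample_id"] for row in samples if row["subject_id"] == subject_id]


def orphan_samples(samples, subjects):
    """List sample IDs whose subject_id does not appear in the subject table."""
    known = {row["subject_id"] for row in subjects}
    return [row["sample_id"] for row in samples if row["subject_id"] not in known]


print(samples_for_subject("P1", sample_rows))
print(orphan_samples(sample_rows, subject_rows))
```

A check like `orphan_samples` is one way to confirm, before ingesting, that every value in a linking column has a matching row in the table it points to.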
Defining your schema
Think about the data you have and how to organize it in tables based on the questions you asked yourself above. Each table should represent a single type of entity, and the columns in that table should describe that entity.
For example, imagine you have tabular data associated with the participants in a study. You may create a participant or subject table with a row for each participant and a column for each piece of information you want to include. Additionally, let’s say you have tabular data associated with the samples collected from the participants mentioned above. You may create a sample table with a row for each sample and a column for each piece of information you want to include about a sample (for example, this could contain genomic sample information as well as a participant_id or subject_id attribute to connect it back to the participant table).
In almost all cases, there shouldn't be multiple rows corresponding to the same entity.
Example: If a patient table has multiple rows for different visits corresponding to the same patient, it’s usually better to reorganize the table to make each row unique. In this example, you could break down the table into a Patient table, with one row per patient, and a PatientVisit table, where each uniquely specified visit has its own row and a unique ID. The PatientVisit table could include a patient_id column that references the unique patient IDs in the patient table.
Example tables (recommended organization)
PatientVisit
PatientVisit_id | Patient_id | Lab_panel | Visit_date |
G12345_01102010 | G12345 | https://bucket/lab-results2.csv | 01/10/2010 |
G12345_01102001 | G12345 | https://bucket/lab-results1.csv | 01/10/2001 |
Patient
Patient_id | Gender |
G12345 | female |
Note how splitting visits out into a PatientVisit table gives each row a unique ID. The Patient_id column links each visit to a separate Patient table that holds additional patient information.
Example table (not recommended)
Patient_id | Lab_panel | Visit_date | Gender |
G12345 | https://bucket/lab-results2.csv | 01/10/2010 | female |
G12345 | https://bucket/lab-results1.csv | 01/10/2001 | female |
Note how there are two rows corresponding to the same patient (patient G12345), which prevents the Patient_id from being a true unique ID key for the table.
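The reorganization described in this section can be sketched in a few lines of Python. The example below starts from rows shaped like the not-recommended table and splits them into Patient and PatientVisit rows. The function names are hypothetical, and building the PatientVisit_id from the patient ID plus the visit date digits is an assumption modeled on the example IDs above, not a TDR requirement.

```python
# Rows shaped like the single, not-recommended table.
flat_rows = [
    {"Patient_id": "G12345", "Lab_panel": "https://bucket/lab-results2.csv",
     "Visit_date": "01/10/2010", "Gender": "female"},
    {"Patient_id": "G12345", "Lab_panel": "https://bucket/lab-results1.csv",
     "Visit_date": "01/10/2001", "Gender": "female"},
]


def has_unique_key(rows, key):
    """Return True if no two rows share the same value for `key`."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))


def split_patient_visits(rows):
    """Split flat rows into one-row-per-patient and one-row-per-visit tables."""
    patients = {}
    visits = []
    for row in rows:
        # Patient-level attributes are kept once per patient.
        patients[row["Patient_id"]] = {
            "Patient_id": row["Patient_id"],
            "Gender": row["Gender"],
        }
        # Assumed ID convention: patient ID + "_" + date digits.
        visit_id = row["Patient_id"] + "_" + row["Visit_date"].replace("/", "")
        visits.append({
            "PatientVisit_id": visit_id,
            "Patient_id": row["Patient_id"],
            "Lab_panel": row["Lab_panel"],
            "Visit_date": row["Visit_date"],
        })
    return list(patients.values()), visits


patients, visits = split_patient_visits(flat_rows)
print(has_unique_key(flat_rows, "Patient_id"))    # flat table: key is repeated
print(has_unique_key(patients, "Patient_id"))     # Patient table
print(has_unique_key(visits, "PatientVisit_id"))  # PatientVisit table
```

Running a uniqueness check like `has_unique_key` on each planned table is a quick way to spot tables that still mix more than one type of entity.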
Making data findable with a standardized schema
The more closely dataset schemas resemble one another, the easier it is for other people to find useful data in the data repository. We recommend using interoperable table and column names where possible.
Next: Write your TDR schema
Once you have an outline of the tables and their columns and relationships, see How to write a TDR dataset schema to learn how to write your schema's JSON, or see Build a schema in TDR to learn how to set up your schema through the TDR web interface.