Before submitting data to TDR, you will need to set up and submit your data model, specifying what data you have, how to organize them in tables (TSVs), and how data in tables are connected.
What is a data model?
Think about how to structure your dataset - where you'll store data file objects (genomics and other 'omics files, images, etc.) and how to represent related tabular data (metadata, phenotypic data, etc.) - to best support the expected downstream use of the data.
Your dataset model (also called a schema) will ultimately be a set of tables (TSV or CSV files) that contain or reference all the tabular data and data file objects in your dataset.
Your dataset will likely have several tables, each housing information about a particular entity. For example, the simple data model below includes three tables that store information about donors, biosamples associated with those donors, and files associated with the biosamples.
Data model table details
- Each table row corresponds to a unique entity (donor, biosample, or file) with a unique primary key identifier. For example, each row in the donor table is a particular donor with a unique donor_id (the first column).
- Each table’s columns hold the data and metadata attributes that describe the entity in each row.
- Tables may contain columns with foreign key identifiers to associate or link the table’s data with data in another table (for example, the donor_id column in the biosample table connects biosamples back to the associated data in the donor table).
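The primary-key and foreign-key relationships described above can be sketched in a few lines of Python. This is a minimal illustration, not part of TDR or Terra: the rows, every column name other than donor_id, the example bucket path, and the helper functions duplicate_keys and dangling_references are all hypothetical.

```python
from collections import Counter

# Hypothetical rows for three tables; all values are invented for illustration.
donors = [
    {"donor_id": "donor_1", "sex": "female"},
    {"donor_id": "donor_2", "sex": "male"},
]
biosamples = [
    {"biosample_id": "bs_1", "donor_id": "donor_1"},
    {"biosample_id": "bs_2", "donor_id": "donor_1"},
    {"biosample_id": "bs_3", "donor_id": "donor_2"},
]
files = [
    {"file_id": "file_1", "biosample_id": "bs_1",
     "file_path": "gs://example-bucket/bs_1.cram"},
    {"file_id": "file_2", "biosample_id": "bs_3",
     "file_path": "gs://example-bucket/bs_3.cram"},
]


def duplicate_keys(rows, key):
    """Return primary-key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def dangling_references(child_rows, foreign_key, parent_rows, primary_key):
    """Return foreign-key values that have no matching row in the parent table."""
    parent_ids = {row[primary_key] for row in parent_rows}
    return sorted({row[foreign_key] for row in child_rows} - parent_ids)


# Each table should have unique primary keys ...
assert duplicate_keys(donors, "donor_id") == []
assert duplicate_keys(biosamples, "biosample_id") == []
assert duplicate_keys(files, "file_id") == []

# ... and every foreign key should point at an existing row.
assert dangling_references(biosamples, "donor_id", donors, "donor_id") == []
assert dangling_references(files, "biosample_id", biosamples, "biosample_id") == []
```

Running checks like these before upload is one way to catch duplicate identifiers or links that point at missing rows while the tables are still easy to edit.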
Example tables
How will the tables be used in TDR?
Researchers will generally use data tables as an entry point for analysis, both to navigate the dataset and understand its contents and as inputs to interactive and batch analyses in Terra.
Dataset model considerations
- What structured data do you want to upload into data tables?
Data tables can contain phenotypic or demographic data (such as the donor table in the example above) as well as references to data file objects in cloud storage (such as the file table in the same example).
- How will you handle complex data structures within your dataset?
Tabular data within Terra is expected to be flat and relational. If your dataset uses more complex, hierarchical structures for structured data, you will need to decide how best to handle them (for example, by storing the structured data as data file objects within the dataset, or by embedding some nested objects as arrays or strings within the data tables).
- Data tables don't support complex, nested data structures, but do support arrays.
To format column values as an array within a TSV, use JSON array notation (square brackets around comma-separated items, e.g. ["item 1", "item 2"]), and Terra will recognize it and structure the data accordingly.
- How/where will you reference data file objects ('omics files) in data tables?
Data file objects that are not referenced somewhere in the data tables are not generally accessible for analysis in TDR, and thus will not be ingested into the TDR dataset via the staging workspace.
One option to include data file objects that don't fit neatly into the existing data tables is to create and upload a table containing data file metadata. This can be done programmatically using the CreateWorkspaceFileManifest workflow or handled manually by the user.
- Are there data file objects needed for staging that shouldn’t be included in the TDR dataset and shared with the research community?
To ensure these aren't pulled into the TDR dataset, they should either not be referenced in the data tables or be removed from the staging workspace before the data is pushed into TDR.
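The JSON array notation mentioned above can be produced programmatically rather than typed by hand. The following is a minimal sketch in Python, assuming that no cell contains a tab or newline; the column names, row values, and the helper functions to_tsv_cell and rows_to_tsv are hypothetical and are not part of Terra or TDR.

```python
import json


def to_tsv_cell(value):
    """Render one value as TSV cell text; lists and tuples become JSON arrays."""
    if isinstance(value, (list, tuple)):
        return json.dumps(list(value))
    return "" if value is None else str(value)


def rows_to_tsv(columns, rows):
    """Build TSV text with a header row.

    Assumes cell values contain no tabs or newlines.
    """
    lines = ["\t".join(columns)]
    for row in rows:
        lines.append("\t".join(to_tsv_cell(row.get(col)) for col in columns))
    return "\n".join(lines) + "\n"


# Hypothetical table with one array-valued column.
columns = ["biosample_id", "donor_id", "labels"]
rows = [
    {"biosample_id": "bs_1", "donor_id": "donor_1",
     "labels": ["item 1", "item 2"]},
    {"biosample_id": "bs_2", "donor_id": "donor_1", "labels": []},
]

tsv_text = rows_to_tsv(columns, rows)
print(tsv_text)
```

The cells are joined by hand instead of going through the csv module because a CSV writer with default settings would wrap any cell containing double quotes in additional quoting, which changes the literal text of the array.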