Step 2 - Set up data model

After your dataset has been approved by the AnVIL Data Ingestion Committee (step 1), you will need to set up and submit your data model, which specifies what data you have and how the data are connected. AnVIL recommends starting from the AnVIL core (minimal) findability subset and building a model that fits your dataset.

Prerequisites: Has your data been approved?
This step assumes that the study has been approved by the AnVIL review board to be hosted in AnVIL through dbGaP. See Step 1 - Register Study/Obtain Approvals for more details and step-by-step instructions.

What is a data model?

Before submitting to the Terra Data Repository (TDR), you’ll need to think about how to structure your dataset to best support the expected downstream use of the data: where to store data file objects (genomics files, images, etc.) and how to represent related tabular data (metadata, phenotypic data, etc.).

Your dataset model (also called a schema) will ultimately be a set of tables (TSV or CSV files) that contain or reference all the tabular data and data file objects in your dataset.

Your dataset will likely have several tables, each housing information about a particular entity. For example, the simple data model below includes three tables that store information about donors, biosamples associated with those donors, and files associated with the biosamples.

Data model table details

Each table row corresponds to a unique donor, biosample, or file with a unique primary key identifier. For example, each row in the donor table is a particular donor with a unique donor_id (the first column).
The data and metadata attributes are contained in each table’s columns.
Tables may contain columns with foreign key identifiers to associate or link the table’s data with data in another table (for example, the donor_id column in the biosample table connects biosamples back to the associated data in the donor table).

Diagram of simple AnVIL data model with three tables - donor, biosample, and file - each with several attributes and connected with the foreign key of the subtable
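
To make the primary key and foreign key relationships above concrete, here is a minimal Python sketch of three small tables and two sanity checks. Apart from donor_id, every column name, identifier, and file path is a made-up placeholder, and the helper functions are hypothetical illustrations, not part of any AnVIL or Terra tooling.

```python
import csv
import io

# Hypothetical example tables. Only donor_id comes from the article;
# all other column names and values are illustrative assumptions.
DONOR_TSV = (
    "donor_id\tsex\n"
    "D001\tfemale\n"
    "D002\tmale\n"
)
BIOSAMPLE_TSV = (
    "biosample_id\tdonor_id\ttissue\n"
    "B001\tD001\tblood\n"
    "B002\tD001\tsaliva\n"
    "B003\tD002\tblood\n"
)
FILE_TSV = (
    "file_id\tbiosample_id\tfile_path\n"
    "F001\tB001\tgs://example-bucket/B001.cram\n"
    "F002\tB003\tgs://example-bucket/B003.cram\n"
)


def read_tsv(text):
    """Parse TSV text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def duplicate_keys(rows, key):
    """Return the set of values that appear more than once in a key column."""
    seen, dupes = set(), set()
    for row in rows:
        if row[key] in seen:
            dupes.add(row[key])
        seen.add(row[key])
    return dupes


def unresolved_foreign_keys(child_rows, fk, parent_rows, pk):
    """Return sorted foreign key values that match no row in the parent table."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows if row[fk] not in parent_keys})


donors = read_tsv(DONOR_TSV)
biosamples = read_tsv(BIOSAMPLE_TSV)
files = read_tsv(FILE_TSV)

# Each table's first column should be a unique primary key.
print(duplicate_keys(donors, "donor_id"))
# Each biosample should link back to an existing donor, each file to a biosample.
print(unresolved_foreign_keys(biosamples, "donor_id", donors, "donor_id"))
print(unresolved_foreign_keys(files, "biosample_id", biosamples, "biosample_id"))
```

Running checks like these before submission is one way to catch duplicate identifiers or dangling links early.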

Example tables

How will the tables be used in AnVIL?
Researchers will generally use data tables as an entry point for analysis, both to navigate the dataset and understand its contents (in the AnVIL Data Explorer) and as inputs to interactive and batch analyses (in Terra).

Dataset model considerations

What structured data do you want to upload into data tables?
Data tables can contain phenotypic or demographic data (such as the donor table in the example above) as well as references to data file objects in cloud storage (the file table, in the simple example).
How will you handle complex data structures within your dataset?
Tabular data within Terra is expected to be flat and relational. If your datasets use more complex, hierarchical structures for structured data, you will need to decide how best to handle that (choosing to store the structured data as data file objects within the dataset, embedding some nested objects as arrays or strings within the data tables, etc.).
Data tables don't support complex, nested data structures, but do support arrays.
In order to format column values as an array within a TSV, you can use JSON array notation (square brackets around comma-separated items, e.g. ["item 1", "item 2"]) and Terra will recognize it and structure the data accordingly.
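
As an illustration of the array notation described above, the following Python sketch writes list-valued cells as JSON arrays while building TSV text. The column names and values are hypothetical, and the helpers are assumptions for illustration only; how Terra handles edge cases such as tabs or quotes inside values is not covered here.

```python
import json


def format_cell(value):
    """Render a list or tuple as a JSON array string; other values as plain text."""
    if isinstance(value, (list, tuple)):
        return json.dumps(list(value))
    return str(value)


def to_tsv(rows, columns):
    """Build TSV text from row dictionaries, one line per row plus a header."""
    lines = ["\t".join(columns)]
    for row in rows:
        lines.append("\t".join(format_cell(row[col]) for col in columns))
    return "\n".join(lines) + "\n"


# Hypothetical rows: the "diagnoses" column holds an array per donor.
rows = [
    {"donor_id": "D001", "diagnoses": ["item 1", "item 2"]},
    {"donor_id": "D002", "diagnoses": []},
]

tsv_text = to_tsv(rows, ["donor_id", "diagnoses"])
print(tsv_text)
```

Building the lines by hand, rather than with a CSV writer's default quoting, keeps the square brackets and double quotes of the JSON array unchanged in the output.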
How/where will you reference data file objects in data tables?
Data file objects that are not referenced somewhere in the data tables are not generally accessible for analysis in TDR, and thus will not be ingested into the TDR dataset via the staging workspace.

One option to include data file objects that don't fit neatly into the existing data tables is to create and upload a table containing data file metadata. This can be done programmatically using the CreateWorkspaceFileManifest workflow or handled manually by the user.
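
For the manual route, a file metadata table can be as simple as one row per file. The sketch below is a hypothetical illustration and does not reproduce what the CreateWorkspaceFileManifest workflow does; the column names, the identifier scheme, and the example paths are all assumptions.

```python
import posixpath


def build_file_table(paths):
    """Build one metadata row per file path, sorted for stable identifiers."""
    rows = []
    for index, path in enumerate(sorted(paths), start=1):
        name = posixpath.basename(path)
        rows.append({
            "file_id": f"file_{index:04d}",  # hypothetical identifier scheme
            "file_name": name,
            "file_extension": posixpath.splitext(name)[1],
            "file_path": path,
        })
    return rows


# Hypothetical files that do not fit the donor, biosample, or file tables.
extra_files = [
    "gs://example-bucket/qc/report.html",
    "gs://example-bucket/notes/readme.txt",
]

file_table = build_file_table(extra_files)
for row in file_table:
    print(row)
```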
Are there data file objects needed for staging that shouldn’t be included in the TDR dataset and shared with the research community?
To ensure these aren't pulled into the TDR dataset, they should either not be referenced in the data tables or removed from the staging workspace prior to the push of data into TDR.
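
One way to review which staged files are and are not referenced in the data tables is to compare the two sets. The sketch below assumes you already have a list of staged file paths and the table rows loaded as dictionaries of strings; the function names, the gs:// prefix check, and the sample data are illustrative assumptions rather than AnVIL tooling.

```python
import json


def referenced_paths(tables):
    """Collect every gs:// path found in any cell, including JSON array cells."""
    refs = set()
    for rows in tables.values():
        for row in rows:
            for value in row.values():
                items = [value]
                if value.startswith("["):
                    try:
                        items = json.loads(value)
                    except json.JSONDecodeError:
                        items = [value]
                refs.update(
                    item for item in items
                    if isinstance(item, str) and item.startswith("gs://")
                )
    return refs


def unreferenced_files(staged_paths, tables):
    """Return staged paths that no data table cell refers to, sorted."""
    return sorted(set(staged_paths) - referenced_paths(tables))


# Hypothetical staging contents and tables.
staged = [
    "gs://example-bucket/B001.cram",
    "gs://example-bucket/B001.cram.crai",
    "gs://example-bucket/scratch/tmp.log",
]
tables = {
    "file": [
        {
            "file_id": "F001",
            "file_path": "gs://example-bucket/B001.cram",
            "index_paths": '["gs://example-bucket/B001.cram.crai"]',
        },
    ],
}

print(unreferenced_files(staged, tables))
```

Files in the resulting list are candidates either to be added to a table, if they should be shared, or to be confirmed as intentionally excluded.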

Step-by-step instructions

For more guidance on setting up your data model, see Step 2 in the AnVIL portal.

