In my last blog post, I gave an overview of how Terra's data tables can help you streamline and scale up your data processing operations through the use of a data model that describes your dataset in a structured way.
In this follow-up, I want to touch on some of the thornier questions that tend to come up around data models, including "How do you know what data model to use?" and "What happens when you try to combine datasets that use different data models?" Because, no, of course there isn't a single data model that applies cleanly to all datasets. Where would be the fun in that?
Snark aside, there are good reasons why it's hard to design a "universal" data model, and in fact, we have a team working on that exact problem. I hope to convince one of them to write a guest blog post about their work at some point, but for now, I'll ask you to accept that as a given, so we can focus on dealing with the consequences.
So, how do you know what data model to use?
Well, in many cases, the decision may already have been made for you. Most large datasets that are being made available on the cloud as part of data commons efforts, such as the National Cancer Institute's Genomic Data Commons (NCI GDC) in the USA, are already structured with a particular data model. When you import data from a dataset that is part of the NCI GDC, like The Cancer Genome Atlas (TCGA), into Terra, it will automatically be structured into the corresponding data tables in your workspace.
You'll notice that the data model is reasonably straightforward when it's displayed in the form of data tables in a Terra workspace, as opposed to the graph representation provided in e.g. the GDC data model project documentation, which can be a little intimidating. Ultimately, all data models come down to a set of spreadsheets with some references between them.
If you're coming to Terra with your own data, we generally recommend you adopt the same data model that is used for data commons in your domain, if any exist. That will make it easier for you to analyze your data in combination with data subsets from the data commons, and potentially collaborate with others who are following the same logic (may there be many). If there is nothing already established in your particular domain, and if you can't find something "close enough" that you can minimally adapt for your dataset, you may need to develop something new. Consider partnering with someone who has data modeling experience, as the decisions that you make at that stage may have far-reaching consequences for later data management and analysis.
Speaking of consequences...
What happens when you try to combine datasets that use different data models?
It really depends on how different they are. In the best-case scenario, the only differences are on the margins. Perhaps one data model has an additional entity type that doesn't exist in the other one; for example, the concept of a tumor-normal sample pair exists in a cancer data model but not in a germline data model. Or, one of its entities has a few additional attributes; for example, some unique metadata fields like the name of the technician who ran the experiment. If none of those are critical to the analysis you plan to run on the data and everything else is the same (down to the attribute names, i.e. the names of the columns in each table), you can happily import the two datasets into your workspace. The system will take the union of the two data models and produce tables that contain all the data without any conflicts, as illustrated below.
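To make the "union of two data models" idea concrete, here is a minimal sketch in plain Python. It is not how Terra implements imports; the function name `union_tables`, the column names, and the file paths are all made up for illustration. It only shows the principle: when the shared columns have identical names, the combined table simply gains the extra columns, and rows that lack them are left empty.

```python
def union_tables(*tables):
    """Combine several tables (lists of row dicts) into one.

    The combined table has the union of all column names; a row that
    lacks a given column gets None in that column.
    """
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    combined = [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]
    return columns, combined


# Hypothetical sample tables from two datasets with nearly identical models.
germline_samples = [
    {"sample_id": "G1", "bam": "G1.bam"},
]
cancer_samples = [
    # Same columns, plus one extra metadata attribute.
    {"sample_id": "T1", "bam": "T1.bam", "technician": "A. Smith"},
]

columns, samples = union_tables(germline_samples, cancer_samples)
for row in samples:
    print(row)
```

The germline row ends up with an empty `technician` value, which is harmless as long as your analysis doesn't depend on that attribute.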
In practice, there are usually more differences than that, especially if you're using data generated outside of one of the large data commons initiatives. Why? One of the most common reasons is that different people often choose different names to label what are essentially the same pieces of information. So you may end up needing to combine tables where the same content is stored in columns that are named differently.
For example, what one person labels as "aligned_reads" based on the file content, another might label as "bam" based on the file format -- yet in both cases, they are referring to analysis-ready sequence data that can be used for the same purpose. If you were to combine two such datasets, you would end up with a table of samples where half the samples have the analysis-ready sequence data in a column named "aligned_reads", and the other half have it in a column named "bam". Conversely, you might also run into cases where columns with the same name actually hold different content. Imagine you have a third dataset where the "bam" column holds sequence data that has not been fully processed, while the analysis-ready sequence data is in another column named "final_bam". Combine all three, and the result would be… difficult to work with.
I know, this makes me sad too.
None of that will come as a surprise to anyone who's ever had to combine spreadsheets from different collaborators, of course. Data harmonization -- the process of reconciling differences in how datasets are organized and labeled to make federated analysis possible -- is an old problem. But it becomes that much harder at the very large scales we are now seeing in 'omics domains. So the more we can do to prevent conflicts between data models in the first place, the better; that's why there is such a big push for developing formal standardized data models at the level of the big funders and data generators.
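For a sense of what the column-level part of harmonization looks like in practice, here is a small sketch using the three hypothetical datasets from the "aligned_reads" vs "bam" example above. The per-dataset mapping tables, the `harmonize` helper, and the `unprocessed_reads` name are my own assumptions for illustration, not part of any Terra API. The key point is that a human has to decide, dataset by dataset, what each column really means.

```python
# One mapping per dataset: original column name -> agreed-upon name.
# Writing these mappings is the manual, judgment-heavy part of the work.
COLUMN_MAPS = {
    "dataset_a": {"aligned_reads": "aligned_reads"},
    "dataset_b": {"bam": "aligned_reads"},
    # Here "bam" means something different, so it maps to a different name.
    "dataset_c": {"final_bam": "aligned_reads", "bam": "unprocessed_reads"},
}


def harmonize(rows, column_map):
    """Rename columns according to column_map; unmapped columns are kept."""
    return [
        {column_map.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


# Hypothetical sample tables, one row each.
datasets = {
    "dataset_a": [{"sample_id": "A1", "aligned_reads": "A1.ready.bam"}],
    "dataset_b": [{"sample_id": "B1", "bam": "B1.ready.bam"}],
    "dataset_c": [
        {"sample_id": "C1", "bam": "C1.raw.bam", "final_bam": "C1.ready.bam"}
    ],
}

combined = []
for dataset_name, rows in datasets.items():
    combined.extend(harmonize(rows, COLUMN_MAPS[dataset_name]))

for row in combined:
    print(row)
```

After renaming, every sample has its analysis-ready data in the same `aligned_reads` column, and the unprocessed file from the third dataset is clearly set apart. Note that this sketch assumes no two original columns in a single dataset map to the same new name.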
A new hope for interoperability, and some short-term solutions
And that's the good news: there is a lot of effort being put into solving these problems across the ecosystem. One example is the NIH Cloud Platform Interoperability Effort (NCPI), which involves coordinated work across multiple cloud platforms (including Terra) funded by the NIH to ensure that you will be able to exchange datasets between the platforms and combine them to enable cross-dataset analyses. Part of that work consists of identifying commonalities between different data models -- effectively building bridges between them, to eventually make it possible to combine datasets and perform some harmonization steps on the fly.
In the meantime, there are some mitigation tactics we can use to deal with the compatibility issues we encounter. One such tactic is to use a namespace to identify the data model of origin of data attributes (columns) within a table. This namespace information shows up as a prefix added to column names in the data tables, as shown below.
The goal of the namespace is to allow you to import datasets safely, in the sense that attributes (columns) coming from different models will not be combined, even if their names are the same in the original datasets. So for example, if you combine two datasets, each with a sample table that includes a column named "bam", with namespacing activated, the resulting combined sample table will have two columns named "prefix1:bam" and "prefix2:bam" respectively. At that point, you will need to make some decisions about what to consolidate vs what to keep separate -- and then implement those decisions by modifying the tables accordingly. This is still tedious, but at least you can do it fairly cleanly, rather than having to untangle data that should not have been combined in the first place.
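Here is a rough sketch of the namespacing idea in plain Python, mirroring the "prefix1:bam" / "prefix2:bam" example. This is only an illustration of the principle, not Terra's implementation: the `add_namespace` helper is hypothetical, and leaving the `sample_id` column unprefixed is an assumption I made to keep the rows identifiable.

```python
def add_namespace(rows, prefix, key="sample_id"):
    """Prefix every column name except the row identifier with 'prefix:'."""
    return [
        {
            (name if name == key else f"{prefix}:{name}"): value
            for name, value in row.items()
        }
        for row in rows
    ]


# Two hypothetical datasets that both use a column named "bam",
# possibly with different meanings.
dataset_1 = [{"sample_id": "S1", "bam": "S1.ready.bam"}]
dataset_2 = [{"sample_id": "S2", "bam": "S2.raw.bam"}]

namespaced = add_namespace(dataset_1, "prefix1") + add_namespace(
    dataset_2, "prefix2"
)

# Build the combined table over the union of the namespaced columns.
columns = sorted({name for row in namespaced for name in row})
combined = [{name: row.get(name) for name in columns} for row in namespaced]

print(columns)
for row in combined:
    print(row)
```

Because the two "bam" columns now have distinct names, nothing gets silently merged. Each sample has a value in only one of the two columns, and you can decide afterwards whether they belong together.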
Last week, we introduced the first implementation of namespacing in Terra for data coming in from the Gen3 data commons operated by the University of Chicago, as described in this documentation article. You can see it in action in the Terra virtual workshop that is currently available online on-demand as part of the annual meeting of the American Society of Human Genetics (ASHG), or in the Terra 201 - Gen3 Module tutorial workspace, which includes an overview of the Gen3 data model as well as practical instructions for working with that data.
Our team is still actively working on improving the namespacing functionality; there are some design questions that are still up in the air around how much of this should be done automatically as opposed to being based on user input. We'd love to hear from you if you have strong opinions about this, so don't hesitate to leave a comment below!