Life as a biomedical researcher is a rollercoaster of emotions. One moment you're marvelling at how technology has revolutionized the field -- SO. MUCH. DATA. -- and the next, you're tearing your hair out at how antiquated some of our processes still are -- SO. MUCH. PAPERWORK.
Yes, there is a veritable cornucopia of data out there now, but most of the human data is subject to complex access restrictions that can be a huge pain to deal with. To be clear, I'm not talking about basic information security like checking identity and credentials. I'm referring to restrictions on future use and sharing of the data based on the consent of study participants, which is generally established through study-specific consent forms. These restrictions exist for legitimate reasons, but the way they've traditionally been implemented leaves much to be desired.
Data use restrictions and the data access dating game
Due to the historical lack of standardized consent forms, each study typically uses its own unique language to describe its data use restrictions. There are usually no templates or standards for researchers who want to request access to the data either; they must describe in their own words what kind of research they intend to do with the data, at times without even knowing which data use restrictions apply.
Who, then, is responsible for granting or denying access to the data? That gatekeeping role belongs to a Data Access Committee (DAC), which is tasked with interpreting the data use requests from researchers and evaluating whether they match or conflict with applicable restrictions. Given the lack of consistency in how data use restrictions are formulated across the biomedical ecosystem, it can be challenging for DAC members to make these decisions confidently and consistently, so the process takes time and effort. In addition, there's a lot of variability in the composition and governance of these committees. Some DACs are part of large organizations like NIH and have a formal structure, dedicated staff, and a regular cadence for reviewing access requests, but many within research institutions are simply composed of research faculty members who are expected to perform these duties on top of their full-time jobs.
To be fair, this cumbersome system predates the advent of cloud-based data sharing, and is just as much of an obstacle in the context of traditional data sharing systems. What's new is that cloud-based initiatives are contributing to a sharp increase in the number of researchers who are able to find datasets of interest and request access to data. As a result, DACs are increasingly unable to respond to the onslaught of requests they are receiving. When it comes down to it, human committees are simply not going to be able to scale.
Introducing the GA4GH Data Use Ontology… and using it to automate access
To tackle this challenge, the Global Alliance for Genomics and Health (GA4GH) Data Use and Researcher Identities (DURI) workstream developed an ontological standard for defining data use restrictions. The Data Use Ontology (DUO) provides a controlled vocabulary that data generators can use to formulate data use restrictions and that researchers can use to express their intended purposes when they apply for access.
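To make the idea of a shared controlled vocabulary a bit more concrete, here is a minimal sketch in Python. The four terms are only a small illustrative subset with abbreviations modeled on DUO's shorthand labels (consult the ontology itself for the authoritative list), and the class and function names are hypothetical, not part of DUO or any GA4GH tooling. The point is simply that both sides of the exchange draw from the same closed set of terms, so free-text phrasing is rejected up front.

```python
from dataclasses import dataclass
from enum import Enum


class DataUseTerm(Enum):
    """Illustrative subset of DUO-style terms (assumption: not the full ontology)."""
    GENERAL_RESEARCH_USE = "GRU"
    HEALTH_MEDICAL_BIOMEDICAL = "HMB"
    DISEASE_SPECIFIC = "DS"
    NON_COMMERCIAL_USE_ONLY = "NCU"


def parse_terms(codes):
    """Convert short codes to vocabulary terms; unknown codes raise ValueError."""
    return frozenset(DataUseTerm(code) for code in codes)


@dataclass(frozen=True)
class DataUseRestriction:
    """Hypothetical record a data generator attaches to a dataset."""
    dataset_id: str
    terms: frozenset


@dataclass(frozen=True)
class DataAccessRequest:
    """Hypothetical record a researcher submits, using the same vocabulary."""
    researcher: str
    terms: frozenset


restriction = DataUseRestriction("example-cohort", parse_terms(["HMB", "NCU"]))
request = DataAccessRequest("researcher@example.org", parse_terms(["HMB"]))

# Free-text descriptions are not part of the vocabulary, so they fail fast.
try:
    parse_terms(["some cancer-related work, probably"])
except ValueError as err:
    print(f"rejected: {err}")
```

Because the restriction and the request are built from the same enumeration, comparing them becomes a question of set logic rather than of interpreting prose.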
In addition to solving the challenge of consistent interpretation of data access requests, DUO can also greatly enhance the searchability of access-restricted datasets. By tagging datasets with their usage restrictions, we can enable search algorithms to take into account whether the researcher is likely to be approved to use the data based on their stated research purpose. This empowers researchers to filter out data that would be out of bounds to them anyway, which is a huge time-saver given the increasingly large numbers of available datasets.
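Here is a rough sketch of what purpose-aware filtering could look like. It assumes a simplified three-level hierarchy of primary use categories (general research use is broader than health/medical/biomedical, which is broader than disease-specific) and compares diseases by exact string match; a real system would need to reason over a disease ontology and additional modifiers. The `Tag`, `likely_permitted`, and `search` names and the catalog entries are all made up for illustration.

```python
from typing import NamedTuple, Optional

# Assumption: larger number = broader permitted use.
BREADTH = {"GRU": 3, "HMB": 2, "DS": 1}


class Tag(NamedTuple):
    category: str                   # "GRU", "HMB", or "DS"
    disease: Optional[str] = None   # only meaningful when category == "DS"


def likely_permitted(dataset_tag: Tag, purpose: Tag) -> bool:
    """True if the dataset's tag is at least as broad as the stated purpose."""
    if dataset_tag.category == "DS" and purpose.category == "DS":
        # Simplification: exact match instead of ontology-aware comparison.
        return dataset_tag.disease == purpose.disease
    return BREADTH[dataset_tag.category] >= BREADTH[purpose.category]


def search(catalog: dict, purpose: Tag) -> list:
    """Return the names of datasets the researcher is likely to be approved for."""
    return sorted(name for name, tag in catalog.items()
                  if likely_permitted(tag, purpose))


catalog = {
    "cohort-a": Tag("GRU"),
    "cohort-b": Tag("HMB"),
    "cohort-c": Tag("DS", "diabetes"),
    "cohort-d": Tag("DS", "asthma"),
}

print(search(catalog, Tag("DS", "asthma")))
# prints ['cohort-a', 'cohort-b', 'cohort-d']
```

In this toy catalog, an asthma researcher never sees the diabetes-only cohort, which is exactly the kind of dead end the tagging is meant to spare them.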
Taking this idea one step further, we can move from consistent human review of data access requests to automating the data access decision-making process entirely -- our team built the Data Use Oversight System (DUOS) to do just that. Using DUOS, you can search for datasets with computer-readable data use restrictions, then apply for access to those datasets with a system-generated, computer-readable data access request. Because both the data use restrictions and the data access request are expressed as computer-readable DUO codes, DUOS can algorithmically evaluate whether the researcher should be granted access to the data!
What is DUOS?
- Interfaces to transform data use restrictions and data access requests into machine-readable codes (GA4GH Standard)
- A matching algorithm that checks if data access requests are compatible with data use restrictions
- Interfaces for the Data Access Committee to adjudicate whether structuring and matching have been done appropriately
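To show how those three pieces might fit together, here is a toy sketch: structured records, a matching step that suggests a decision along with its reasons, and a review step in which DAC members confirm or question that suggestion. This is not the actual DUOS algorithm. The two conditions checked (commercial use and ethics approval), the majority-vote rule, and every name in the code are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    DENY = "deny"
    MANUAL_REVIEW = "manual review"


@dataclass(frozen=True)
class Restriction:
    """Hypothetical structured form of a dataset's data use restrictions."""
    commercial_use_allowed: bool
    ethics_approval_required: bool


@dataclass(frozen=True)
class Request:
    """Hypothetical structured form of a data access request."""
    is_commercial: bool
    has_ethics_approval: bool


def match(restriction: Restriction, request: Request):
    """Suggest a decision and list the reasons for any conflict found."""
    reasons = []
    if request.is_commercial and not restriction.commercial_use_allowed:
        reasons.append("commercial use is not permitted")
    if restriction.ethics_approval_required and not request.has_ethics_approval:
        reasons.append("ethics approval is required")
    suggestion = Decision.DENY if reasons else Decision.APPROVE
    return suggestion, reasons


def adjudicate(suggestion: Decision, dac_votes: list) -> Decision:
    """Keep the suggestion if a strict majority of DAC members agree with it.

    Assumption: each vote is True when the member thinks the structuring and
    matching were done appropriately. Without a majority, fall back to manual review.
    """
    if sum(dac_votes) * 2 > len(dac_votes):
        return suggestion
    return Decision.MANUAL_REVIEW


suggestion, reasons = match(
    Restriction(commercial_use_allowed=False, ethics_approval_required=True),
    Request(is_commercial=True, has_ethics_approval=True),
)
print(suggestion, reasons)
print(adjudicate(suggestion, [True, True, False]))
```

Returning the reasons alongside the suggestion is a deliberate choice in this sketch: it gives the committee something concrete to check rather than a bare yes or no.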
For more details:
If you'd like to learn more about this powerful duo of ontology and automation, watch the recording of this past GA4GH webinar, "Automating access to controlled datasets: the GA4GH Data Use Ontology in action", on the GA4GH YouTube channel. I presented our team's DUOS implementation and talked about how researchers can already use it to access real datasets. You'll also hear more about the history of DUO and the work of several other implementers of DUO, including the GEnome Medicine Alliance Japan (GEM Japan), Australian Genomics Health Alliance (AGHA), European Genome/Phenome Archive (EGA), DNAStack, Elixir, and the Wellcome Sanger Institute.