The Data Repository Service (DRS) API is a standardized set of data access methods that is agnostic to cloud infrastructure. Developed by the Global Alliance for Genomics and Health (GA4GH), DRS enables researchers to access data regardless of the underlying architecture of the repository (e.g. Google Cloud, Azure, AWS) in which it is stored. Terra supports accessing data through this GA4GH standard.
Terra uses DRS to enable researchers to access, combine and analyze data across cloud-storage infrastructures. For example, Terra uses DRS URIs when (meta)data is "handed off" from an external data portal to Terra for analysis. In this case, only DRS URIs referencing the associated data files are passed to Terra, instead of the actual data files, which are often large and numerous.
This article defines what DRS Uniform Resource Identifiers (URIs) are, and why they are used in Terra. It also outlines where they are used and how to access the data they represent in the Terra platform.
- What are Data Repository Service URIs?
- How are DRS URIs used in Terra?
- Format of DRS URIs in Terra
- DRS URIs with hostname and data identifier
- DRS URIs with a Data GUID Namespace
- Full standard DRS URLs
- DRS URIs in Terra workspace data tables
- Using DRS URIs as direct workflow inputs
- Using DRS URIs in notebooks
- Troubleshooting DRS URI access in Terra
- Additional DRS resources
What are DRS Uniform Resource Identifiers (URIs)?
Differing formats for identifying data stored on different cloud-based infrastructures make it a challenge to combine data effectively. To address this, DRS provides a generic interface for data repositories that enables access to data in a single, standard way. DRS gives a dataset on any infrastructure a unique ID mapping that allows for flexible retrieval.
The unique mapping is the DRS Uniform Resource Identifier (URI) - a string of characters that uniquely identifies a particular cloud-based resource (similar to URLs) and is agnostic to the cloud infrastructure where it physically exists. DRS URIs allow easy access to data on any cloud-based storage system. With DRS URIs, the ocean of data files becomes an organized file cabinet that enables easy, reliable interoperability between data producers and data consumers, consistent with the FAIR data principles (findable, accessible, interoperable, reusable).
Data in a Google bucket is identified by a string of the format gs://bucket_name/path/to/file.
The same file might have a DRS URI of the format drs://DRS_hostname/data_identifier.
Format of DRS URIs in Terra
Terra supports multiple formats for DRS URIs, described below.
DRS URIs with hostname and data identifier
Consistent with the DRS standard, Terra supports DRS identifiers that use the "drs" scheme followed by a hostname and a data identifier (i.e. drs://DRS_hostname/data_identifier). This compact format omits the "boilerplate" elements of the standard endpoint path.
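To make the omitted "boilerplate" concrete, the sketch below expands a hostname-based DRS URI into the full object URL defined by the GA4GH DRS v1 specification (https://hostname/ga4gh/drs/v1/objects/id). The function name drs_uri_to_https and the host drs.example.org are illustrative assumptions, and this is not necessarily how Terra resolves DRS URIs internally.

```python
from urllib.parse import quote, urlsplit


def drs_uri_to_https(drs_uri: str) -> str:
    """Expand a hostname-based DRS URI into the full DRS v1 object URL.

    Hypothetical helper for illustration; it only handles the
    drs://DRS_hostname/data_identifier form.
    """
    parts = urlsplit(drs_uri)
    if parts.scheme != "drs" or not parts.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parts.path.lstrip("/")
    if not object_id:
        raise ValueError(f"missing data identifier: {drs_uri!r}")
    # The compact form leaves out the fixed /ga4gh/drs/v1/objects/ path.
    # The identifier is percent-encoded so it stays a single path segment.
    return f"https://{parts.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


# Placeholder host and identifier, not a real data object.
print(drs_uri_to_https("drs://drs.example.org/314159"))
```

The full URL is useful mainly for understanding what the compact form stands for; within Terra, the compact drs:// form is the one to use.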
DRS URIs with a Data GUID Namespace
Data Globally Unique Identifiers (GUIDs) provide independence from a specific hostname by using a namespace instead. For more information about Data GUIDs, see dataguids.org.
Full standard DRS URLs
Terra is currently in transition from supporting the GA4GH Data Object Service (a precursor to the now standard GA4GH DRS) to the standard DRS API. Using the full DOS/DRS URIs is not recommended or supported until this transition is complete.
DRS URIs in Terra workspace data tables
Any reference to the physical location of data in the cloud can be in DRS URI format. DRS URIs that reference files may be stored in data tables (in the Data tab) and used as workflow input parameters or as data in an interactive analysis (in a Jupyter notebook). Note that workflows that take their input from data tables, including data identified with a DRS URI, will access and process the data without intervention.
Data tables can contain full paths to data, including DRS URIs:
Clicking on a DRS URI link in a data table will open the File Details dialog, which provides additional information about the file as well as the ability to download the file:
The File Details dialog does not currently support downloading files from requester pays buckets. Downloading from the dialog is also unsupported in some other cases; for example, it does not work with some external authentication and authorization services.
To learn more about how to organize and access data in the cloud using data tables, see this article.
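As a rough illustration of how DRS URIs sit alongside other paths in a data table, the sketch below scans a table exported as tab-separated text and reports every cell that holds a DRS URI. The table layout (a first column of row IDs), the column names, and all values are assumptions made up for the example, and find_drs_cells is a hypothetical helper, not part of Terra.

```python
import csv
import io


def find_drs_cells(tsv_text: str) -> list[tuple[str, str, str]]:
    """Return (row_id, column_name, drs_uri) for every cell holding a DRS URI."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    hits = []
    for row in reader:
        # The first column is assumed to hold the row ID, so skip it.
        for column, cell in zip(header[1:], row[1:]):
            if cell.startswith("drs://"):
                hits.append((row[0], column, cell))
    return hits


# Hypothetical table contents; IDs, column names, and URIs are placeholders.
example_tsv = (
    "entity:sample_id\tcram\tparticipant\n"
    "sample_1\tdrs://drs.example.org/abc123\tp1\n"
    "sample_2\tgs://example-bucket/sample_2.cram\tp2\n"
)

for hit in find_drs_cells(example_tsv):
    print(hit)
```

A table can mix "gs://" and "drs://" paths in the same column; only the latter are reported here.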
Using DRS URIs as direct workflow inputs
DRS URIs may be used as Terra workflow input parameters in two ways: 1) in the data table, or 2) in the workflow inputs configuration. In both cases, the workflows should access and process the data without further intervention.
- Linking to DRS URIs in a workspace data table (the workflow configuration will reference the table with the format "this.object"):
- DRS URIs entered directly as workflow input parameter values:
Using DRS URIs in notebooks
Once you have a DRS URI, a Jupyter notebook can access the file it references in two ways:
- Indirectly, by referencing metadata in a workspace data table
- Explicitly in a Notebook code cell
This is analogous to how notebooks access data in a Google bucket, but some additional code is needed to enable the notebook to access and import files specified with a DRS URI (i.e. paths that begin with "drs://" instead of "gs://").
You can access the code needed to use data identified by DRS URIs in a notebook in a Python library available here: https://github.com/DataBiosphere/terra-notebook-utils (you can find the README document at https://github.com/DataBiosphere/terra-notebook-utils/blob/master/README.md).
This library includes several functions for using data specified with DRS URIs in a notebook:
Fetch a DRS URI from a Terra data table:
drs_url = table.fetch_drs_url("data table name", "file name")
Download a DRS object to your workspace VM file system:
drs.copy_to_local(url: str, filepath: str)
Copy a DRS object to your workspace bucket:
drs.copy(drs_url, "my_key", bucket="bucket name")
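The functions listed above can be combined into a small staging step. The sketch below is a minimal example, assuming that drs.copy_to_local accepts a URI and a file path as shown above and that the library is imported with "from terra_notebook_utils import drs". The helper stage_drs_files and the generated local file names are hypothetical. The copy function is passed in as an argument so the sketch runs without the library installed.

```python
import os
from typing import Callable, Iterable


def stage_drs_files(
    uris: Iterable[str],
    dest_dir: str,
    copy_to_local: Callable[[str, str], None],
) -> dict[str, str]:
    """Copy each drs:// URI into dest_dir and return a {uri: local_path} map."""
    os.makedirs(dest_dir, exist_ok=True)
    staged = {}
    for index, uri in enumerate(uris):
        if not uri.startswith("drs://"):
            continue  # gs:// and other paths are left for other tooling
        # DRS identifiers are opaque, so local names are generated, not derived.
        local_path = os.path.join(dest_dir, f"drs_object_{index}")
        copy_to_local(uri, local_path)
        staged[uri] = local_path
    return staged


# In a Terra notebook (assumed import path and signature):
#   from terra_notebook_utils import drs
#   staged = stage_drs_files(my_uris, "inputs", drs.copy_to_local)
```

Passing the copy function in also makes it easy to swap in drs.copy for bucket-to-bucket transfers, provided its arguments are adapted accordingly.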
Troubleshooting DRS URI access in Terra
If the data referenced by a DRS URI is access-controlled (i.e. not public), access requires successful authentication and authorization.
- Make sure your Terra and NIH accounts are linked
See this article for instructions on linking Terra with external data servers (such as Gen3)
- Verify with the DRS data provider that their DRS service is available and functioning properly.
Additional DRS resources
For more information about the GA4GH Data Repository Service, see:
- DRS in the news
- Current DRS Documentation
- DRS Repository on GitHub