Overview: Interoperable data (GA4GH DRS URIs)

Allie Hajian

This article explains what Data Repository Service Uniform Resource Identifiers (DRS URIs) are, and why and where Terra uses them.

Background (motivation)

Using the GA4GH standard Data Repository Service (DRS) means you can access, combine, and analyze data no matter where it's stored.  For example, Terra uses DRS URIs when (meta)data is "handed off" from an external data portal to Terra for analysis. 

What are DRS URIs?

Each cloud infrastructure uses its own syntax to identify stored data, which makes it hard to combine data across clouds. The Data Repository Service (DRS) API, developed by the Global Alliance for Genomics and Health (GA4GH), is a standardized set of access methods that data repositories use to provide access to data in a single, standard way. DRS URIs let researchers access data regardless of the underlying cloud infrastructure (e.g., Google Cloud, Azure, or AWS).

A unique ID mapping that allows for flexible retrieval

The unique mapping is the DRS Uniform Resource Identifier (URI), a string of characters (similar to a URL) that identifies a particular cloud-based resource and is agnostic to the cloud infrastructure where the resource physically resides.

DRS URIs allow easy access to data on any cloud-based storage system. With DRS URIs, the ocean of data files becomes an organized file cabinet with easy, reliable interoperability between data producers and data consumers, consistent with the FAIR data principles (Findable, Accessible, Interoperable, Reusable). 

Where are DRS URIs in Terra? 

Any link that references where data are physically located in the cloud can be in DRS URI format: in data tables (on the Data page), as workflow input parameters (entered directly or read from a data table), or as data in an interactive analysis (in a Jupyter notebook, Galaxy, or RStudio).

Google bucket URL versus DRS URI

Data in a Google bucket is identified by a string of the format:
gs://sample_bucket/sample_file_name.cram

The same file might have a DRS URI that looks like this:
drs://example.data.service.org:ec5410f1-43df-48a8-8a5d-f2acd4533da7
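
As a rough illustration of the difference, the Python sketch below tells the two kinds of link apart by their scheme. The function names are hypothetical and are not part of Terra or of any GA4GH library; the code only inspects the text before "://".

```python
def storage_scheme(uri: str) -> str:
    """Return the lowercase scheme of a data link, e.g. 'gs' or 'drs'."""
    scheme, separator, rest = uri.partition("://")
    if not separator or not scheme or not rest:
        raise ValueError(f"not a scheme-qualified URI: {uri!r}")
    return scheme.lower()


def is_drs_uri(uri: str) -> bool:
    """True when the link uses the 'drs' scheme rather than a cloud-specific one."""
    return storage_scheme(uri) == "drs"


bucket_link = "gs://sample_bucket/sample_file_name.cram"
drs_link = "drs://example.data.service.org:ec5410f1-43df-48a8-8a5d-f2acd4533da7"

print(storage_scheme(bucket_link), is_drs_uri(bucket_link))
print(storage_scheme(drs_link), is_drs_uri(drs_link))
```

The bucket link names a specific Google bucket, while the DRS link carries only a service name and an identifier, so it says nothing about which cloud holds the file.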

Format of DRS URIs in Terra

DRS URIs in Terra can have different formats (described below).

1. DRS URIs with hostname and data identifier

Consistent with the DRS standard, Terra supports DRS identifiers that consist of only the "drs" scheme, a hostname, and a data identifier (drs://DRS_hostname:data_identifier).

This compact format omits the "boilerplate" elements of the standard endpoint path.
drs://example.data.service.org:ec5410f1-43df-48a8-8a5d-f2acd4533da7
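
To make the omitted "boilerplate" concrete, the sketch below expands a URI of this form into a full HTTPS endpoint. It assumes the object path /ga4gh/drs/v1/objects/ from the GA4GH DRS v1 specification, and it assumes that the part before the colon is a resolvable hostname. The function names are hypothetical, and this is not a description of how Terra itself resolves DRS URIs.

```python
from urllib.parse import quote

# Assumption: object endpoint path defined by the GA4GH DRS v1 specification.
DRS_OBJECT_PATH = "/ga4gh/drs/v1/objects/"


def split_compact_drs_uri(uri: str) -> tuple[str, str]:
    """Split 'drs://<hostname>:<identifier>' into (hostname, identifier)."""
    prefix = "drs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a DRS URI: {uri!r}")
    hostname, separator, identifier = uri[len(prefix):].partition(":")
    if not separator or not hostname or not identifier:
        raise ValueError(f"expected drs://hostname:identifier, got {uri!r}")
    return hostname, identifier


def drs_object_url(uri: str) -> str:
    """Build the full endpoint URL that the compact form leaves implicit."""
    hostname, identifier = split_compact_drs_uri(uri)
    # Percent-encode the identifier so it is safe as a single path segment.
    return f"https://{hostname}{DRS_OBJECT_PATH}{quote(identifier, safe='')}"


print(drs_object_url(
    "drs://example.data.service.org:ec5410f1-43df-48a8-8a5d-f2acd4533da7"
))
```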

2. DRS URIs with a Data GUID Namespace

Data Globally Unique Identifiers (GUIDs) provide independence from a specific hostname by using a namespace instead. 
drs://dg.4503:2802a94d-f540-499f-950a-db3c2a9f2dc4
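
The two formats look alike, so a tool reading DRS URIs may need to tell a hostname from a namespace. The sketch below is one possible heuristic, assuming that Data GUID namespaces always look like "dg." followed by letters or digits (a pattern inferred only from the example above). The function name is hypothetical, and turning a namespace into an actual server requires a separate resolver, which is not shown.

```python
import re

# Assumption: namespaces follow the "dg.<alphanumeric>" shape seen in the example.
_DATA_GUID_NAMESPACE = re.compile(r"dg\.[0-9A-Za-z]+")


def authority_kind(uri: str) -> str:
    """Classify the part between 'drs://' and ':' as 'namespace' or 'hostname'."""
    prefix = "drs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a DRS URI: {uri!r}")
    authority, separator, identifier = uri[len(prefix):].partition(":")
    if not separator or not authority or not identifier:
        raise ValueError(f"expected drs://authority:identifier, got {uri!r}")
    if _DATA_GUID_NAMESPACE.fullmatch(authority):
        return "namespace"
    return "hostname"


print(authority_kind("drs://dg.4503:2802a94d-f540-499f-950a-db3c2a9f2dc4"))
print(authority_kind(
    "drs://example.data.service.org:ec5410f1-43df-48a8-8a5d-f2acd4533da7"
))
```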

3. Full standard DRS URLs

Terra is currently in transition from supporting the GA4GH Data Object Service (a precursor to the now-standard GA4GH DRS) to the standard DRS API. Using the full DOS/DRS URIs is not recommended or supported until this transition is complete. 

DRS URIs in Terra workspace data tables

Workflows that take their input from data tables access and process the data without any intervention, including data identified by a DRS URI.

DRS URI in a data table (example)

DRS-URIs-Overview_Link-to-data-file-in-ga4gh_drs_uri-column_Screenshot.png

Closeup view

DRS-URIs-Overview_Link-to-data-file-in-table_Screenshot.png

Clicking on a DRS URI link in a data table will open the File Details dialog, which provides additional information about the file and options for downloading the file.

Example File Details popup

DRS-URIs-Overview_File-details-popup_Screenshot.png 

Additional DRS URI resources

For more information about the GA4GH Data Repository Service (DRS), see the following resources:

DRS in the news

https://www.ga4gh.org/news/drs-api-enabling-cloud-based-data-access-and-retrieval/

Current DRS Documentation

https://ga4gh.github.io/data-repository-service-schemas/docs/

DRS Repository on GitHub

https://github.com/ga4gh/data-repository-service-schemas/blob/master/README.md


Comments

1 comment

  • Comment author
    STEVEN GILHOOL

    I followed the above instructions for using copy_batch within a jupyter notebook, but it didn't work. I am using TNU version 0.8.2. Instead, copy_batch expected a single argument, "manifest", which is a list of dictionaries with keys named "drs_uri" and "dst" for the DRS uri and destination path, respectively.
