Overview: Interoperable data (GA4GH DRS URIs)

Allie Hajian

This article explains what Data Repository Service Uniform Resource Identifiers (DRS URIs) are, and why and where Terra uses them.

Background (motivation)

The GA4GH Data Repository Service (DRS) standard lets you access, combine, and analyze data no matter where it's stored. For example, Terra uses DRS URIs when (meta)data is "handed off" from an external data portal to Terra for analysis.

What are DRS URIs?

Each cloud infrastructure identifies stored data with its own syntax, which makes it hard to combine data across clouds. The Data Repository Service (DRS) API is a standardized set of access methods that data repositories use to provide access to data in a single, standard way. Developed by the Global Alliance for Genomics and Health (GA4GH), DRS URIs enable researchers to access data regardless of the underlying cloud infrastructure (e.g. Google Cloud, Azure, AWS).

A unique ID mapping that allows for flexible retrieval

The unique mapping is the DRS Uniform Resource Identifier (URI) - a string of characters (similar to URLs) that identifies a particular cloud-based resource and is agnostic to the cloud infrastructure where it physically exists.

DRS URIs allow easy access to data on any cloud-based storage system. With DRS URIs, the ocean of data files becomes an organized file cabinet with easy, reliable interoperability between data producers and data consumers, consistent with the FAIR data principles (Findable, Accessible, Interoperable, Reusable). 

Where are DRS URIs in Terra? 

Any link that points to where data is physically located in the cloud can be in DRS URI format: in data tables (in the Data page), as workflow input parameters (direct links or in a data table), or as data in an interactive analysis (in a Jupyter notebook, Galaxy, or RStudio).

Google bucket URL versus DRS URI

Data in a Google bucket is identified by a string of the format:
gs://sample_bucket/sample_file_name.cram

The same file might have a DRS URI that looks like this:
drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7
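
To make the difference concrete, here is a minimal sketch that splits either kind of link into its parts using only the Python standard library. The function name split_storage_uri and the labels it returns are hypothetical, chosen for illustration; this is not part of Terra.

```python
from urllib.parse import urlparse


def split_storage_uri(uri: str) -> tuple[str, str, str]:
    """Split a gs:// or drs:// link into (kind, location, name).

    For a Google bucket URL, location is the bucket and name is the object path.
    For a DRS URI, location is the DRS hostname (or namespace) and name is the
    data identifier.
    """
    parsed = urlparse(uri)
    name = parsed.path.lstrip("/")
    if parsed.scheme == "gs":
        return ("google-bucket", parsed.netloc, name)
    if parsed.scheme == "drs":
        return ("drs", parsed.netloc, name)
    raise ValueError(f"unsupported scheme in {uri!r}")


print(split_storage_uri("gs://sample_bucket/sample_file_name.cram"))
print(split_storage_uri("drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7"))
```

Note that the bucket URL names a physical location (bucket and file path), while the DRS URI carries only a service and an opaque identifier.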

Format of DRS URIs in Terra

DRS URIs in Terra can have different formats, described below.

1. DRS URIs with hostname and data identifier

Consistent with the DRS standard, Terra supports DRS identifiers that include only the "drs" scheme (i.e. drs://DRS_hostname/data_identifier).

This is a compact format that omits the "boilerplate" elements of the standard endpoint path.
drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7
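
As an illustration of what the compact form leaves out, the sketch below expands a hostname-based DRS URI into the HTTPS object endpoint defined by the GA4GH DRS specification (/ga4gh/drs/v1/objects/ followed by the identifier). The function name drs_to_https is hypothetical. Terra resolves DRS URIs for you, so this is only meant to show the mapping, and it does not apply to namespace-based identifiers.

```python
from urllib.parse import quote, urlparse

# Object endpoint path from the GA4GH DRS v1 specification.
DRS_OBJECT_PATH = "/ga4gh/drs/v1/objects/"


def drs_to_https(drs_uri: str) -> str:
    """Expand drs://<hostname>/<id> to https://<hostname>/ga4gh/drs/v1/objects/<id>."""
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    # Percent-encode the identifier so it is safe to place in a URL path.
    return f"https://{parsed.netloc}{DRS_OBJECT_PATH}{quote(object_id, safe='')}"


print(drs_to_https("drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7"))
```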

2. DRS URIs with a Data GUID Namespace

Data Globally Unique Identifiers (GUIDs) provide independence from a specific hostname by using a namespace instead. 
drs://dg.4503/2802a94d-f540-499f-950a-db3c2a9f2dc4
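
The two formats above can be told apart by what follows the drs:// scheme. The sketch below is a rough heuristic, not Terra's actual logic: it assumes that a Data GUID namespace always looks like "dg." followed by letters or digits, as in the example above, and treats everything else as a hostname. The function name classify_drs_uri and the returned labels are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumption: Data GUID namespaces look like "dg.4503" ("dg." plus letters/digits).
_DATA_GUID_NAMESPACE = re.compile(r"^dg\.[0-9A-Za-z]+$")


def classify_drs_uri(uri: str) -> str:
    """Return 'data-guid-namespace', 'hostname', or 'not-drs' for a link."""
    parsed = urlparse(uri)
    if parsed.scheme != "drs":
        return "not-drs"
    if _DATA_GUID_NAMESPACE.match(parsed.netloc):
        return "data-guid-namespace"
    return "hostname"


print(classify_drs_uri("drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7"))
print(classify_drs_uri("drs://dg.4503/2802a94d-f540-499f-950a-db3c2a9f2dc4"))
```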

3. Full standard DRS URLs

Terra is currently in transition from supporting the GA4GH Data Object Service (a precursor to the now-standard GA4GH DRS) to the standard DRS API. Using the full DOS/DRS URIs is not recommended or supported until this transition is complete. 

DRS URIs in Terra workspace data tables

Workflows that take their input from data tables access and process the data without intervention, including data identified by a DRS URI.

DRS URI in a data table (example)
[Screenshot: DRS URI object file link in a data table]

Closeup view
[Screenshot: closeup of a DRS URI object file in a data table]

Clicking on a DRS URI link in a data table will open the File Details dialog, which provides additional information about the file and options for downloading the file.
[Screenshot: File Details dialog for a DRS URI]

Downloading data from Requester Pays buckets

The File Details dialog does not currently support downloading files from Requester Pays buckets. Downloading from the dialog is also unsupported in some other cases; for example, it does not work with some external authentication and authorization services.

For instructions on how to copy data from a Requester Pays bucket using a DRS URI, see Accessing DRS URIs data files.

To learn more about how to organize and access data in the cloud using data tables, see Managing data with tables.

Troubleshooting DRS URI access in Terra

If the data referenced by a DRS URI is access-controlled (i.e. not public), access requires successful authentication and authorization. If your workflow fails immediately, it's usually because the workflow cannot access the data. This is often due to either an expired authorization link or an error in the workflow configuration (e.g. a typo in the attribute name on the configuration form).

1. Make sure your Terra and NIH accounts are linked 

To access data provided by external services, you must have an up-to-date link to that service in your Terra user Profile.

To learn more about linking to external services (including step-by-step instructions), see Linking authorization/accessing controlled data on external servers.

2. Verify with the DRS data provider that their DRS service is available and functioning properly.

Additional DRS URIs Resources

For more information about the GA4GH Data Repository Service (DRS)-specific tools in Terra:

* This package supports several helpful operations, such as:
  - View details about the data
  - Copy/download the data to the Cloud Environment VM or to a Google bucket
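
A reader comment on this article reports that, in TNU version 0.8.2, copy_batch expects a single argument, "manifest": a list of dictionaries with the keys "drs_uri" and "dst". The sketch below only builds a manifest of that shape in plain Python and does not call the package. The helper name build_copy_manifest and the destination paths are hypothetical, and the manifest shape itself is taken from that comment rather than from official documentation, so check it against your installed version.

```python
def build_copy_manifest(destinations: dict[str, str]) -> list[dict[str, str]]:
    """Turn a {drs_uri: destination_path} mapping into a list of
    {"drs_uri": ..., "dst": ...} dictionaries (shape reported by a reader
    for copy_batch in TNU 0.8.2; treat it as an assumption)."""
    manifest = []
    for drs_uri, dst in destinations.items():
        if not drs_uri.startswith("drs://"):
            raise ValueError(f"not a DRS URI: {drs_uri!r}")
        manifest.append({"drs_uri": drs_uri, "dst": dst})
    return manifest


manifest = build_copy_manifest({
    # Hypothetical destination path, used for illustration only.
    "drs://dg.4503/2802a94d-f540-499f-950a-db3c2a9f2dc4": "gs://sample_bucket/copied_file.cram",
})
print(manifest)
```

Validating the scheme before building the manifest catches a common mistake (pasting a gs:// link where a DRS URI is expected) before any copy is attempted.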

DRS in the news

https://www.ga4gh.org/news/drs-api-enabling-cloud-based-data-access-and-retrieval/

Current DRS Documentation

https://ga4gh.github.io/data-repository-service-schemas/docs/

DRS Repository on GitHub

https://github.com/ga4gh/data-repository-service-schemas/blob/master/README.md

 


Comments

    STEVEN GILHOOL

    I followed the above instructions for using copy_batch within a jupyter notebook, but it didn't work. I am using TNU version 0.8.2. Instead, copy_batch expected a single argument, "manifest", which is a list of dictionaries with keys named "drs_uri" and "dst" for the DRS uri and destination path, respectively.
