Data access with the GA4GH Data Repository Service (DRS)

Allie Hajian

The Data Repository Service (DRS) API is a standardized set of access methods that are agnostic to cloud infrastructure. Developed by the Global Alliance for Genomics and Health (GA4GH), DRS enables researchers to access data regardless of the underlying architecture of the repository (e.g. Google Cloud, Azure, AWS) in which it is stored.

Terra supports accessing data through the GA4GH standard Data Repository Service (DRS), which lets researchers access, combine, and analyze data across cloud-storage infrastructures. For example, Terra uses DRS URIs when (meta)data is "handed off" from an external data portal to Terra for analysis. In this case, only DRS URIs referencing the associated data files are passed to Terra, instead of the actual data files, which are often large and numerous.

This article defines what DRS Uniform Resource Identifiers (URIs) are, and why they are used in Terra. It also outlines where they are used and how to access the data they represent in the Terra platform. 

What are Data Repository Service Uniform Resource Identifiers (DRS URIs)?

Varying formats for identifying data stored on different cloud-based infrastructures make it challenging to combine data across cloud infrastructures effectively. DRS defines a generic interface for data repositories to allow access to data in a single, standard way. DRS gives a dataset on any infrastructure a unique ID mapping that allows for flexible retrieval. 

The unique mapping is the DRS Uniform Resource Identifier (URI) - a string of characters that identifies a particular cloud-based resource (similar to URLs) and is agnostic to the cloud infrastructure where it physically exists. DRS URIs allow easy access to data on any cloud-based storage system. With DRS URIs, the ocean of data files becomes an organized file cabinet that enables easy, reliable interoperability between data producers and data consumers, consistent with the FAIR data principles (findable, accessible, interoperable, reusable). 

Example: Google bucket file path versus DRS URI

  Data in a Google bucket is identified by a string of the format:
gs://sample_bucket/sample_file_name.cram


The same file might have a DRS URI that looks like this:
drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7

Example: Format of DRS URIs in Terra

  DRS URIs with hostname and data identifier
Consistent with the DRS standard, Terra supports DRS identifiers that include only the "drs" scheme (i.e. drs://DRS_hostname/data_identifier). This is a compact format that omits the standard "boilerplate" elements of the standard endpoint path.

drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7

DRS URIs with a Data GUID Namespace
Data Globally Unique Identifiers (GUIDs) provide independence from a specific hostname by using a namespace instead. To learn more about Data GUIDs, see dataguids.org.

drs://dg.4503/2802a94d-f540-499f-950a-db3c2a9f2dc4

Full standard DRS URLs
Terra is currently in transition from supporting the GA4GH Data Object Service (a precursor to the now-standard GA4GH DRS) to the standard DRS API. Using the full DOS/DRS URIs is not recommended or supported until this transition is complete. 
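
To make the two compact formats above concrete, here is a minimal Python sketch that splits a DRS URI into its authority (hostname or Data GUID namespace) and data identifier. The function and class names are hypothetical and are not part of Terra or terra-notebook-utils. The check for a leading "dg." is only a heuristic based on the dg.4503 example above, not a rule taken from the DRS specification.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class CompactDrsUri(NamedTuple):
    authority: str      # hostname or Data GUID namespace
    object_id: str      # data identifier
    is_data_guid: bool  # True when the authority looks like "dg.XXXX"


def parse_compact_drs_uri(uri: str) -> CompactDrsUri:
    """Split a compact DRS URI (drs://authority/identifier) into its parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "drs":
        raise ValueError(f"not a DRS URI (expected the drs:// scheme): {uri!r}")
    object_id = parsed.path.lstrip("/")
    if not parsed.netloc or not object_id:
        raise ValueError(f"DRS URI needs an authority and an identifier: {uri!r}")
    # Assumption: Data GUID namespaces start with "dg." (as in dg.4503).
    is_data_guid = parsed.netloc.startswith("dg.")
    return CompactDrsUri(parsed.netloc, object_id, is_data_guid)


hostname_form = parse_compact_drs_uri(
    "drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7")
guid_form = parse_compact_drs_uri(
    "drs://dg.4503/2802a94d-f540-499f-950a-db3c2a9f2dc4")
print(hostname_form)
print(guid_form)
```

A check like this can catch a pasted gs:// path or a truncated identifier before the value is placed in a data table or workflow input.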


DRS URIs in Terra workspace data tables

Any link that references where data in the cloud is physically located can be in DRS URI format: in data tables (in the Data tab), as workflow input parameters (direct links or in a data table), or as data in an interactive analysis (in a Jupyter notebook). Note that workflows that use data tables for input will access and process the data without intervention, including data identified with a DRS URI.

DRS URI in a data table (Example)

DRS URI in a data table screenshot

Closeup view
DRS URI in a data table closeup screenshot

Clicking on a DRS URI link in a data table will open the File Details dialog, which provides additional information about the file and options for downloading the file.

Downloading data from Requester Pays buckets
The File Details dialog does not currently support downloading files from Requester Pays buckets. Downloading from the File Details dialog is also unsupported in some other cases; for example, it does not work with some external authentication and authorization services.

To learn more about how to organize and access data in the cloud using data tables, see this article.

Using DRS URIs as workflow inputs

DRS URIs may be used as inputs to workflows in two ways: 1) via the data table, or 2) via direct paths in the workflow inputs configuration. In both cases, the workflows should access and process the data without further intervention.

DRS URIs in a workspace data table (Example 1)

DRS URI in a data table screenshot

Note that the workflow configuration will reference the table with the format "this.object".

Closeup view

DRS URI in a data table closeup screenshot

DRS URIs entered directly as workflow input parameter values (Example 2)

DRS-URIs-direct-path-in-workflow.png

Closeup view

DRS-URIs-direct-path-in-workflow-closeup_Screen_shot.png


Using DRS URIs in an Interactive analysis

In Notebooks and the Terra Terminal, access to data identified by DRS URIs is provided by a DRS client library. The terra-notebook-utils package includes an API to use with Notebooks and a CLI to use from the Terra Terminal. This package allows you to view details about the data, copy/download the data to the Cloud Environment VM or to a Google bucket, and perform other helpful operations.

Instructions for viewing and copying/downloading data are in the following sections. For additional helpful information, see the terra-notebook-utils README.

Use a current version of terra-notebook-utils

  Because the Terra Cloud Environment is constantly updated, it is very important to use a current version of terra-notebook-utils!

Please use terra-notebook-utils version 0.7.0 or later. Read on for how to install/update terra-notebook-utils.
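
As an illustration of the version requirement above, the following sketch checks which version of terra-notebook-utils is installed in the current Python environment. It uses only the standard library. The helper names are hypothetical, and the comparison is deliberately simplified: it looks at the first three numeric components and ignores pre-release suffixes.

```python
import re
from importlib import metadata

MINIMUM_VERSION = "0.7.0"  # minimum version stated in this article


def version_tuple(version: str) -> tuple:
    """Turn '0.8.2' into (0, 8, 2); only the first three numeric parts are used."""
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(installed: str, minimum: str = MINIMUM_VERSION) -> bool:
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("terra-notebook-utils")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("terra-notebook-utils is not installed")
elif meets_minimum(installed):
    print(f"terra-notebook-utils {installed} is recent enough")
else:
    print(f"terra-notebook-utils {installed} is older than {MINIMUM_VERSION}; please upgrade")
```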

Instructions: Using DRS URIs in Notebooks

The terra-notebook-utils Python API is available for use in Python Notebooks and scripts and is callable from R Notebooks and scripts. When running in Notebooks, the current workspace namespace (Google Billing Project) and workspace name are used by default.

Step 1. Install the latest version of terra-notebook-utils. In a Python Notebook, run

%pip install --upgrade --no-cache-dir terra-notebook-utils

Step 2. Import the terra-notebook-utils drs module. Note that the table module is optional, yet useful and recommended.

from terra_notebook_utils import drs, table

Step 3. (Optional) View details about the data identified by the DRS URI

drs.info("drs://my-drs-uri")

Step 4. Copy/download the data. There are several options available depending on your use case (i.e. whether to copy to the Cloud Environment VM or a Google bucket, and whether to copy a single DRS URI or a list of DRS URIs).

Copy/download commands (from within a notebook)

 

To copy a single DRS URI file to the Cloud Environment VM

drs.copy("drs://my-drs-url", "local_filepath")

To copy a single DRS URI file to a Google bucket

drs.copy("drs://my-drs-url", "gs://my-dst-bucket/my-dst-key")

To copy a list of DRS URIs to the Cloud Environment VM

drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"], "local_directory")

To copy a list of DRS URIs to a Google bucket

drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"],
"gs://my-dst-bucket/prefix")
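
A reader comment on this article reports that in terra-notebook-utils 0.8.2, copy_batch instead expects a single argument, "manifest": a list of dictionaries with the keys "drs_uri" and "dst". The sketch below builds such a list in plain Python. It is based only on that report and has not been checked against the library documentation. The helper name build_copy_manifest and the destination paths are hypothetical, and the library call itself is left as a comment so the example runs without the package installed.

```python
def build_copy_manifest(destinations: dict) -> list:
    """Return one {"drs_uri": ..., "dst": ...} entry per file to copy."""
    manifest = []
    for drs_uri, dst in destinations.items():
        if not drs_uri.startswith("drs://"):
            raise ValueError(f"not a DRS URI: {drs_uri!r}")
        manifest.append({"drs_uri": drs_uri, "dst": dst})
    return manifest


# Placeholder URIs and destinations, in the style of the examples above.
manifest = build_copy_manifest({
    "drs://my-drs-url1": "local_directory/file1",
    "drs://my-drs-url2": "gs://my-dst-bucket/prefix/file2",
})
print(manifest)

# In a Terra notebook with terra-notebook-utils installed, the reported call is:
# from terra_notebook_utils import drs
# drs.copy_batch(manifest)
```

If the two-argument form shown above fails in your environment, trying the manifest form is a reasonable next step; the terra-notebook-utils README is the authority on the current signature.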

To find the DRS URI from within a data table for a given file name

 

The terra-notebook-utils package also provides a useful function for finding the
DRS URI for a given filename within the workspace. This requires the `table` module to
be imported. To fetch a DRS URI from a Terra data table for a given file name, use:

drs_url = table.fetch_drs_url("data table name", "file name")

Instructions: Using DRS URIs in the Terra Terminal

The terra-notebook-utils Python CLI is available for use from the Terra Terminal and within shell scripts.

When running in the Terminal, the workspace namespace (Google Billing Project) and workspace name must be provided either by 1) setting a terra-notebook-utils configuration (recommended), or 2) using environment variables, or 3) providing command-line options for each.

Step 1. Install the latest version of terra-notebook-utils
In the Terra Terminal, run

/usr/local/bin/pip install --upgrade --no-cache-dir terra-notebook-utils

Step 2. (Recommended) Set the terra-notebook-utils configuration 
This configuration applies only to the use of the terra-notebook-utils CLI, not to the API.

tnu config set-workspace my-workspace-name
tnu config set-workspace-namespace my-billing-project

To view the current configuration, run:

tnu config print

Step 3. (Optional) View details about the data identified by the DRS URI

tnu drs info "drs://my-drs-uri"

Step 4. Copy/download the data

There are several options available depending on your use case (i.e. whether to copy to the Cloud Environment VM or a Google bucket, and whether to copy a single DRS URI or a list of DRS URIs).

Copy/download commands (using the terminal)

 

To copy a single DRS URI file to the Cloud Environment VM
tnu drs copy drs://my-drs-url local_filepath

To copy a single DRS URI file to a Google bucket
tnu drs copy drs://my-drs-url gs://my-dst-bucket/my-dst-key

To copy multiple DRS URIs to the Cloud Environment VM
tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst local_directory

To copy multiple DRS URIs to a Google bucket
tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst gs://my-dst-bucket/prefix


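To find the DRS URI in a Terra data table for a given file name

The terra-notebook-utils CLI also includes table commands that can look up the DRS URI for a given file name in a workspace data table. See the terra-notebook-utils README for the exact command syntax.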

Troubleshooting DRS URI access in Terra

If the data referenced by a DRS URI is access-controlled (i.e. not public), access requires successful authentication and authorization. If your workflow fails immediately, it's usually because the workflow cannot access the data. This is often due to either an expired authorization link or an error in the workflow configuration (e.g. a typo in the attribute name on the configuration form).

Troubleshooting tips for DRS URIs

  1. Make sure your Terra and NIH accounts are linked

To access data provided by external services, you must have an up-to-date link to that service in your Terra user Profile.

To learn more about linking to external services (including step-by-step instructions), see this article.

2. Verify with the DRS data provider that their DRS service is available and functioning properly.

 

Additional DRS Resources

For more information about the GA4GH Data Repository Service (DRS)-specific tools in Terra:

* terra-notebook-utils: This package allows you to perform many helpful operations, such as
  - Viewing details about the data
  - Copying/downloading the data to the Cloud Environment VM or to a Google bucket

DRS in the news
https://www.ga4gh.org/news/drs-api-enabling-cloud-based-data-access-and-retrieval/

Current DRS Documentation
https://ga4gh.github.io/data-repository-service-schemas/docs/

DRS Repository on GitHub
 https://github.com/ga4gh/data-repository-service-schemas/blob/master/README.md


Comments

1 comment

  • Comment author
    STEVEN GILHOOL

    I followed the above instructions for using copy_batch within a jupyter notebook, but it didn't work. I am using TNU version 0.8.2. Instead, copy_batch expected a single argument, "manifest", which is a list of dictionaries with keys named "drs_uri" and "dst" for the DRS uri and destination path, respectively.
