Data access with the GA4GH Data Repository Service (DRS)

Allie Hajian

The Data Repository Service (DRS) API is a standardized set of access methods that are agnostic to cloud infrastructure. Developed by the Global Alliance for Genomics and Health (GA4GH), DRS enables researchers to access data regardless of the underlying architecture of the repository (e.g. Google Cloud, Azure, AWS) in which it is stored.

Terra supports accessing data using the GA4GH standard Data Repository Service (DRS) - enabling researchers to access, combine and analyze data across cloud-storage infrastructures.  For example, Terra uses DRS URIs when (meta)data is "handed off" from an external data portal to Terra for analysis. Only DRS URIs referencing the associated data files - which can be stored in any cloud infrastructure - are passed to Terra, instead of the actual data files, which are often large and numerous.

This article defines what DRS Uniform Resource Identifiers (URIs) are, and why they are used in Terra. It also outlines where they are used and how to access the data they represent in the Terra platform. 

What are Data Repository Service Uniform Resource Identifiers (DRS URIs)?

Varying syntax for identifying data stored on different cloud-based infrastructures makes it challenging to combine data across cloud infrastructures effectively. DRS defines a generic interface for data repositories to allow access to data in a single, standard way. DRS gives a dataset on any infrastructure a unique ID mapping that allows for flexible retrieval.

The unique mapping is the DRS Uniform Resource Identifier (URI) - a string of characters that identifies a particular cloud-based resource (similar to URLs) and is agnostic to the cloud infrastructure where it physically exists. DRS URIs allow easy access to data on any cloud-based storage system. With DRS URIs, the ocean of data files becomes an organized file cabinet that enables easy, reliable interoperability between data producers and data consumers, consistent with the FAIR data principles (findable, accessible, interoperable, reusable). 

Example: Google bucket file path versus DRS URI

Data in a Google bucket is identified by a string of the format:
gs://sample_bucket/sample_file_name.cram

The same file might have a DRS URI that looks like this:
drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7

Example: Format of DRS URIs in Terra

DRS URIs with hostname and data identifier
Consistent with the DRS standard, Terra supports DRS identifiers that include only the "drs" scheme (i.e. drs://DRS_hostname/data_identifier). This is a compact format that omits the standard "boilerplate" elements of the standard endpoint path.
drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7

DRS URIs with a Data GUID Namespace
Data Globally Unique Identifiers (GUIDs) provide independence from a specific hostname by using a namespace instead. To learn more about Data GUIDs, see dataguids.org.
drs://dg.4503/2802a94d-f540-499f-950a-db3c2a9f2dc4

Full standard DRS URLs
Terra is currently in transition from supporting the GA4GH Data Object Service (a precursor to the now-standard GA4GH DRS) to the standard DRS API. Using the full DOS/DRS URIs is not recommended or supported until this transition is complete. 
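To make the two compact forms above concrete, here is a minimal sketch that splits a DRS URI into its authority (a hostname or a Data GUID namespace) and its data identifier, using only Python's standard library. The function name `split_drs_uri` is hypothetical, and the rule that a namespace starts with "dg." is an assumption drawn from the single example above. Terra resolves DRS URIs for you, so this is only an aid for reading or validating them.

```python
from urllib.parse import urlsplit


def split_drs_uri(uri):
    """Split a compact DRS URI into (authority, identifier, uses_namespace).

    Hypothetical helper for illustration only; it is not part of Terra
    or terra-notebook-utils.
    """
    parts = urlsplit(uri)
    if parts.scheme != "drs":
        raise ValueError(f"not a DRS URI: {uri!r}")
    authority = parts.netloc
    identifier = parts.path.lstrip("/")
    if not authority or not identifier:
        raise ValueError(f"expected drs://<host-or-namespace>/<identifier>: {uri!r}")
    # Assumption: Data GUID namespaces look like "dg.4503", as in the example above.
    uses_namespace = authority.startswith("dg.")
    return authority, identifier, uses_namespace


print(split_drs_uri("drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7"))
print(split_drs_uri("drs://dg.4503/2802a94d-f540-499f-950a-db3c2a9f2dc4"))
```

A Google bucket path such as gs://sample_bucket/sample_file_name.cram is rejected with a ValueError, because its scheme is not "drs".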

DRS URIs in Terra workspace data tables

Any link that references where data are physically located in the cloud can be in DRS URI format: in data tables (in the Data tab), as workflow input parameters (direct links or in a data table), or as data in an interactive analysis (in a Jupyter notebook). Note that workflows that use data tables for input will access and process the data without intervention, including data identified with a DRS URI.

DRS URI in a data table (Example)

DRS URI in a data table screenshot

Closeup view
DRS URI in a data table closeup screenshot

Clicking on a DRS URI link in a data table will open the File Details dialog, which provides additional information about the file and options for downloading the file:

File Details dialog screenshot

Downloading data from Requester Pays buckets
The File Details dialog does not currently support downloading files from Requester Pays buckets. Downloading from the File Details dialog is also unsupported in some other cases; for example, it does not work with some external authentication and authorization services.

To learn more about how to organize and access data in the cloud using data tables, see Managing data with tables.

Using DRS URIs as workflow inputs

DRS URIs may be used as inputs to workflows in two ways: 1) via the data table, or 2) via direct paths in the workflow inputs configuration. In both cases, the workflows should access and process the data without further intervention.

  • DRS URI in a data table screenshot

    Note that the workflow configuration will reference the table with the format "this.object".

    Closeup view

    DRS URI in a data table closeup screenshot

  • DRS-URIs-direct-path-in-workflow.png

    Closeup view

    DRS-URIs-direct-path-in-workflow-closeup_Screen_shot.png

Configuring workflows using DRS URIs metadata with a PFB or TDR prefix

If you exported your data from a data repository or the Terra Data Repo (TDR), the data table will include a pfb or tdr prefix.
Configure-workflows-inputs_pfb-namespace-in-data-table_Screen_shot.png

You must include the pfb or tdr prefix when running a workflow on data from a table
The (required) attribute syntax is "this.pfb.file-type" or "this.tdr.file-type".

Note that this syntax will show up in the dropdown menu when you click into the attribute field (see screenshot below).

Configure-workflow-inputs_pfb-namespace-in-dropdown_Screen_shot.png

Using DRS URIs in an Interactive analysis

In Notebooks and the Terra Terminal, access to data identified by DRS URIs is provided by a DRS client library. The terra-notebook-utils package includes an API to use with Notebooks and a CLI to use from the Terra Terminal. This package allows you to view details about the data, copy/download the data to the Cloud Environment VM or to a Google bucket, and perform other helpful operations.

Instructions for viewing and copying/downloading data are in the following sections. For additional helpful information, see the terra-notebook-utils README.

Use a current version of terra-notebook-utils
Because the Terra Cloud Environment is constantly updated, it is very important to use a current version of terra-notebook-utils!

Please use terra-notebook-utils version 0.7.0 or later. Read on for how to install/update terra-notebook-utils.
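Once the package is installed (see the steps below), a short check like the following can confirm that the installed version meets the 0.7.0 minimum stated above. This is a sketch only: the helper names `version_tuple` and `meets_minimum` are hypothetical, and the comparison looks only at the first three numeric components of the version string.

```python
import re
from importlib import metadata

MINIMUM_VERSION = "0.7.0"  # minimum version stated in this article


def version_tuple(version):
    """Turn a version string such as '0.8.2' into (0, 8, 2).

    Only the first three numeric components are used; missing ones count as 0.
    """
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed, minimum=MINIMUM_VERSION):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("terra-notebook-utils")
except metadata.PackageNotFoundError:
    print("terra-notebook-utils is not installed")
else:
    status = "OK" if meets_minimum(installed) else "please upgrade"
    print(f"terra-notebook-utils {installed}: {status}")
```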

Instructions: Using DRS URIs in Notebooks

The terra-notebook-utils Python API is available for use in Python Notebooks and scripts and is callable from R Notebooks and scripts. When running in Notebooks, the current workspace namespace (Google Billing Project) and workspace name are used by default.

1. Install the latest version of terra-notebook-utils.
In a Python Notebook, run the following.

%pip install --upgrade --no-cache-dir terra-notebook-utils

2. Import the terra-notebook-utils drs module.
Note that the table module is optional, yet useful and recommended.

from terra_notebook_utils import drs, table

3. (Optional) View details about the data identified by the DRS URI.

drs.info("drs://my-drs-uri")

4. Copy/download the data.
There are several options available depending on your use-case (i.e. whether to copy to the Cloud Environment VM or a Google bucket and whether to copy a single DRS URI or a list of DRS URIs).

  • To copy a single DRS URI file to the Cloud Environment VM
    drs.copy("drs://my-drs-url", "local_filepath")

    To copy a single DRS URI file to a Google bucket
    drs.copy("drs://my-drs-url", "gs://my-dst-bucket/my-dst-key")

    To copy a list of DRS URIs to the Cloud Environment VM
    drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"], "local_directory")

    To copy a list of DRS URIs to a Google bucket
    drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"],
    "gs://my-dst-bucket/prefix")
  • The terra-notebook-utils package also provides a useful function for finding the DRS URI for a given filename within the workspace. This requires the `table` module to be imported.

    To fetch a DRS URI from a Terra data table for a given file name, use:
    drs_url = table.fetch_drs_url("data table name", "file name")
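A reader comment at the end of this article reports that, with terra-notebook-utils version 0.8.2, copy_batch expected a single "manifest" argument: a list of dictionaries with the keys "drs_uri" and "dst". If you run into that behavior, a manifest of that shape can be built in plain Python as sketched here. The helper name `build_copy_manifest` is hypothetical, and the manifest format comes from that comment rather than from the library documentation, so check the terra-notebook-utils README for your installed version before relying on it.

```python
def build_copy_manifest(destinations):
    """Build a list of {"drs_uri": ..., "dst": ...} entries.

    `destinations` maps each DRS URI to its destination path (a local
    path or a gs:// path). Hypothetical helper for illustration only.
    """
    manifest = []
    for drs_uri, dst in destinations.items():
        if not drs_uri.startswith("drs://"):
            raise ValueError(f"not a DRS URI: {drs_uri!r}")
        manifest.append({"drs_uri": drs_uri, "dst": dst})
    return manifest


manifest = build_copy_manifest({
    "drs://my-drs-url1": "gs://my-dst-bucket/prefix/file1",
    "drs://my-drs-url2": "gs://my-dst-bucket/prefix/file2",
})
print(manifest)

# In a Terra notebook, following the reader comment (unverified for other versions):
# from terra_notebook_utils import drs
# drs.copy_batch(manifest)
```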

Instructions: Using DRS URIs in the Terra Terminal

The terra-notebook-utils Python CLI is available for use from the Terra Terminal and within shell scripts.

When running in the Terminal, the workspace namespace (Google Billing Project) and workspace name must be provided either by 1) setting a terra-notebook-utils configuration (recommended), or 2) using environment variables, or 3) providing command-line options for each.

1. Install the latest version of terra-notebook-utils.
In the Terra Terminal, run

/usr/local/bin/pip install --upgrade --no-cache-dir terra-notebook-utils

2. (Recommended) Set the terra-notebook-utils configuration.
This configuration applies only to the use of the terra-notebook-utils CLI, not to the API.

tnu config set-workspace my-workspace-name
tnu config set-workspace-namespace my-billing-project

3. To view the current configuration, run:

tnu config print

4. (Optional) View details of the data identified by the DRS URI.

tnu drs info "drs://my-drs-uri"

5. Copy/download the data.
There are several options available depending on your use-case (i.e. whether to copy to the Cloud Environment VM or a Google bucket and whether to copy a single DRS URI or a list of DRS URIs).

  • To copy a single DRS URI file to the Cloud Environment VM
    tnu drs copy drs://my-drs-url local_filepath

    To copy a single DRS URI file to a Google bucket
    tnu drs copy drs://my-drs-url gs://my-dst-bucket/my-dst-key

    To copy multiple DRS URIs to the Cloud Environment VM
    tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst local_directory

    To copy multiple DRS URIs to a Google bucket
    tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst gs://my-dst-bucket/prefix
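If you drive the CLI from a Python script rather than typing commands by hand, the copy-batch invocation shown above can be assembled as an argument list. This is a minimal sketch: the helper name `copy_batch_command` is hypothetical, and it uses only the `copy-batch` subcommand and `--dst` option that appear in this article.

```python
import shlex


def copy_batch_command(drs_uris, dst):
    """Return the argument list for `tnu drs copy-batch`.

    Hypothetical helper; `dst` is a local directory or a gs:// prefix.
    """
    if not drs_uris:
        raise ValueError("at least one DRS URI is required")
    return ["tnu", "drs", "copy-batch", *drs_uris, "--dst", dst]


command = copy_batch_command(
    ["drs://my-drs-url1", "drs://my-drs-url2"], "gs://my-dst-bucket/prefix"
)
# Print the command as a single shell-safe line for review.
print(shlex.join(command))

# To actually run it from Python in the Terra Terminal environment:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list, instead of one long string, keeps the whole command on a single logical line and avoids shell quoting problems.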

Troubleshooting DRS URI access in Terra

If the data referenced by a DRS URI is access-controlled (i.e. not public), access requires successful authentication and authorization. If your workflow fails immediately, it's usually because the workflow cannot access the data. This is often due to either an expired authorization link or an error configuring the workflow (e.g. a typo in the attribute name on the configuration form).

1. Make sure your Terra and NIH accounts are linked 

To access data provided by external services, you must have an up-to-date link to that service in your Terra user Profile.

To learn more about linking to external services (including step-by-step instructions), see Linking authorization/accessing controlled data on external servers.

2. Verify with the DRS data provider that their DRS service is available and functioning properly.

Additional DRS Resources

For more information about the GA4GH Data Repository Service (DRS)-specific tools in Terra:

* terra-notebook-utils: This package allows you to perform many helpful operations, such as
  - View details about the data
  - Copy/download the data to the Cloud Environment VM or to a Google bucket

DRS in the news
https://www.ga4gh.org/news/drs-api-enabling-cloud-based-data-access-and-retrieval/

Current DRS Documentation
https://ga4gh.github.io/data-repository-service-schemas/docs/

DRS Repository on GitHub
https://github.com/ga4gh/data-repository-service-schemas/blob/master/README.md

 


Comments

1 comment

  • Comment author
    STEVEN GILHOOL

    I followed the above instructions for using copy_batch within a jupyter notebook, but it didn't work. I am using TNU version 0.8.2. Instead, copy_batch expected a single argument, "manifest", which is a list of dictionaries with keys named "drs_uri" and "dst" for the DRS uri and destination path, respectively.
