The Data Repository Service (DRS) API is a standardized set of access methods that are agnostic to cloud infrastructure. Developed by the Global Alliance for Genomics and Health (GA4GH), DRS enables researchers to access data regardless of the underlying architecture of the repository (e.g. Google Cloud, Azure, or AWS) in which it is stored.
Terra supports accessing data using the GA4GH standard Data Repository Service (DRS) - enabling researchers to access, combine and analyze data across cloud-storage infrastructures. For example, Terra uses DRS URIs when (meta)data is "handed off" from an external data portal to Terra for analysis. Only DRS URIs referencing the associated data files - which can be stored in any cloud infrastructure - are passed to Terra, instead of the actual data files, which are often large and numerous.
This article defines what DRS Uniform Resource Identifiers (URIs) are, and why they are used in Terra. It also outlines where they are used and how to access the data they represent in the Terra platform.
What are Data Repository Service Uniform Resource Identifiers (DRS URIs)?
Varying syntax for identifying data stored on different cloud-based infrastructures makes it challenging to combine data across cloud infrastructures effectively. DRS defines a generic interface for data repositories to allow access to data in a single, standard way. DRS gives a dataset on any infrastructure a unique ID mapping that allows for flexible retrieval.
The unique mapping is the DRS Uniform Resource Identifier (URI) - a string of characters that identifies a particular cloud-based resource (similar to URLs) and is agnostic to the cloud infrastructure where it physically exists. DRS URIs allow easy access to data on any cloud-based storage system. With DRS URIs, the ocean of data files becomes an organized file cabinet that enables easy, reliable interoperability between data producers and data consumers, consistent with the FAIR data principles (findable, accessible, interoperable, reusable).
Example: Google bucket file path versus DRS URI
Data in a Google bucket is identified by a string of the format:
gs://sample_bucket/sample_file_name.cram
The same file might have a DRS URI that looks like this:
drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7
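To make the contrast concrete, here is a minimal sketch that splits both example strings with Python's standard urllib.parse module. The describe function is a hypothetical helper written for this illustration; it is not part of Terra or terra-notebook-utils.

```python
from urllib.parse import urlsplit

# Example strings taken from the text above.
gs_path = "gs://sample_bucket/sample_file_name.cram"
drs_uri = "drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7"


def describe(uri: str) -> dict:
    """Split a URI into its scheme, authority, and path-based identifier."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "identifier": parts.path.lstrip("/"),
    }


print(describe(gs_path))
print(describe(drs_uri))
```

The gs:// string names a specific bucket and object on Google Cloud, while the drs:// string names a DRS service and an opaque identifier, so it carries no information about where the file is physically stored.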
Example: Format of DRS URIs in Terra
DRS URIs with hostname and data identifier
Consistent with the DRS standard, Terra supports DRS identifiers that include only the "drs" scheme (i.e. drs://DRS_hostname/data_identifier). This is a compact format that omits the standard "boilerplate" elements of the standard endpoint path.
drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7
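For orientation, the sketch below shows what the omitted "boilerplate" looks like when a hostname-based DRS URI is expanded into an HTTPS request for the object's metadata. It assumes the DRS v1 path /ga4gh/drs/v1/objects/ described in the GA4GH specification; the function name drs_to_https is hypothetical, and Terra performs this resolution for you.

```python
from urllib.parse import urlsplit


def drs_to_https(drs_uri: str) -> str:
    """Expand drs://<hostname>/<id> into the DRS v1 object endpoint.

    Assumption: the service exposes the standard /ga4gh/drs/v1/objects/ path.
    Only hostname-based URIs are handled here.
    """
    parts = urlsplit(drs_uri)
    if parts.scheme != "drs" or not parts.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parts.path.lstrip("/")
    if not object_id:
        raise ValueError(f"missing data identifier: {drs_uri!r}")
    return f"https://{parts.netloc}/ga4gh/drs/v1/objects/{object_id}"


print(drs_to_https(
    "drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7"
))
```

Identifiers that use a namespace instead of a hostname need an extra lookup step to find the service, so this direct expansion does not apply to them.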
DRS URIs with a Data GUID Namespace
Data Globally Unique Identifiers (GUIDs) provide independence from a specific hostname by using a namespace instead. To learn more about Data GUIDs, see dataguids.org.
drs://dg.4503/2802a94d-f540-499f-950a-db3c2a9f2dc4
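The two supported formats can be told apart by looking at the part of the URI that follows drs://. The sketch below is one way to do that. It assumes that a Data GUID namespace looks like "dg." followed by letters or digits, which is inferred from the dg.4503 example above rather than taken from a formal definition; drs_uri_kind is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit

# Assumption: namespaces follow the "dg.<letters/digits>" shape of the
# dg.4503 example; anything else after drs:// is treated as a hostname.
_NAMESPACE = re.compile(r"^dg\.[0-9A-Za-z]+$")


def drs_uri_kind(uri: str) -> str:
    """Return 'hostname', 'data-guid-namespace', or 'invalid'."""
    parts = urlsplit(uri)
    if parts.scheme != "drs" or not parts.netloc or not parts.path.strip("/"):
        return "invalid"
    if _NAMESPACE.match(parts.netloc):
        return "data-guid-namespace"
    return "hostname"


for example in (
    "drs://example.data.service.org/ec5410f1-43df-48a8-8a5d-f2acd4533da7",
    "drs://dg.4503/2802a94d-f540-499f-950a-db3c2a9f2dc4",
    "gs://sample_bucket/sample_file_name.cram",
):
    print(example, "->", drs_uri_kind(example))
```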
Full standard DRS URLs
Terra is currently in transition from supporting the GA4GH Data Object Service (a precursor to the now-standard GA4GH DRS) to the standard DRS API. Using the full DOS/DRS URIs is not recommended or supported until this transition is complete.
DRS URIs in Terra workspace data tables
Any link that references where data in the cloud is physically located can be in DRS URI format: in data tables (in the Data tab), as workflow input parameters (direct links or in a data table), or as data in an interactive analysis (in a Jupyter notebook). Note that workflows that use data tables for input will access and process the data without intervention, including data identified with a DRS URI.
DRS URI in a data table (Example)
Clicking on a DRS URI link in a data table will open the File Details dialog, which provides additional information about the file and options for downloading the file:
Downloading data from Requester Pays buckets
The File Details dialog does not currently support downloading files from Requester Pays buckets. Downloading from the File Details dialog is also unsupported in some other cases; for example, it does not work with some external authentication and authorization services.
To learn more about how to organize and access data in the cloud using data tables, see Managing data with tables.
Using DRS URIs as workflow inputs
DRS URIs may be used as inputs to workflows in two ways: 1) via the data table, or 2) via direct paths in the workflow inputs configuration. In both cases, the workflows should access and process the data without further intervention.
- Note that the workflow configuration will reference the table with the format "this.object".
Configuring workflows using DRS URIs metadata with a PFB or TDR prefix
If you exported your data table from a data repository or the Terra Data Repo, it will include a pfb or tdr prefix in the data table.
You must include the pfb or tdr prefix when running a workflow on data from a table. The (required) attribute syntax is "this.pfb.file-type" or "this.tdr.file-type".
Note that this syntax will show up in the dropdown menu when you click into the attribute field (see screenshot below).
Using DRS URIs in an Interactive analysis
In Notebooks and the Terra Terminal, access to data identified by DRS URIs is provided by a DRS client library. The terra-notebook-utils package includes an API to use with Notebooks and a CLI to use from the Terra Terminal. With this package, you can view details about the data, copy or download the data to the Cloud Environment VM or to a Google bucket, and perform other helpful operations.
Instructions for viewing and copying/downloading data are in the following sections. For additional helpful information, see the terra-notebook-utils README.
Use a current version of terra-notebook-utils
Because the Terra Cloud Environment is constantly updated, it is very important to use a current version of terra-notebook-utils. Please use terra-notebook-utils version 0.7.0 or later. Read on for how to install/update terra-notebook-utils.
Instructions: Using DRS URIs in Notebooks
The terra-notebook-utils Python API is available for use in Python Notebooks and scripts and is callable from R Notebooks and scripts. When running in Notebooks, the current workspace namespace (Google Billing Project) and workspace name are used by default.
1. Install the latest version of terra-notebook-utils.
In a Python Notebook, run the following.
%pip install --upgrade --no-cache-dir terra-notebook-utils
2. Import the terra-notebook-utils drs module.
Note that the table module is optional, yet useful and recommended.
from terra_notebook_utils import drs, table
3. (Optional) View details about the data identified by the DRS URI.
drs.info("drs://my-drs-uri")
4. Copy/download the data.
There are several options available depending on your use case (i.e. whether to copy to the Cloud Environment VM or a Google bucket, and whether to copy a single DRS URI or a list of DRS URIs).
- To copy a single DRS URI file to the Cloud Environment VM:
drs.copy("drs://my-drs-url", "local_filepath")
- To copy a single DRS URI file to a Google bucket:
drs.copy("drs://my-drs-url", "gs://my-dst-bucket/my-dst-key")
- To copy a list of DRS URIs to the Cloud Environment VM:
drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"], "local_directory")
- To copy a list of DRS URIs to a Google bucket:
drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"], "gs://my-dst-bucket/prefix")
The terra-notebook-utils package also provides a useful function for finding the DRS URI for a given file name within the workspace. This requires the `table` module to be imported. To fetch a DRS URI from a Terra data table for a given file name, use:
drs_url = table.fetch_drs_url("data table name", "file name")
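The lookup and copy calls shown in this section can be combined when you need several files at once. The following is a minimal sketch, not part of terra-notebook-utils: copy_files_by_name and its chunk_size parameter are hypothetical, and the lookup and copy functions are passed in as arguments so the logic can be tried out without a Terra environment.

```python
from typing import Callable, Iterable, List


def copy_files_by_name(
    table_name: str,
    file_names: Iterable[str],
    dst: str,
    fetch_drs_url: Callable[[str, str], str],
    copy_batch: Callable[[List[str], str], None],
    chunk_size: int = 50,  # arbitrary default chosen for illustration
) -> List[str]:
    """Look up a DRS URI for each file name, then copy the URIs in chunks.

    Returns the DRS URIs that were submitted for copying.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # dict.fromkeys drops duplicate file names while keeping their order.
    uris = [fetch_drs_url(table_name, name) for name in dict.fromkeys(file_names)]
    for uri in uris:
        if not uri.startswith("drs://"):
            raise ValueError(f"expected a drs:// URI, got {uri!r}")
    # Submit the URIs in fixed-size batches.
    for start in range(0, len(uris), chunk_size):
        copy_batch(uris[start:start + chunk_size], dst)
    return uris
```

In a Terra Notebook, you would pass the real functions from the steps above, for example copy_files_by_name("data table name", ["file name"], "gs://my-dst-bucket/prefix", table.fetch_drs_url, drs.copy_batch).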
Instructions: Using DRS URIs in the Terra Terminal
The terra-notebook-utils Python CLI is available for use from the Terra Terminal and within shell scripts.
When running in the Terminal, the workspace namespace (Google Billing Project) and workspace name must be provided either by 1) setting a terra-notebook-utils configuration (recommended), or 2) using environment variables, or 3) providing command-line options for each.
1. Install the latest version of terra-notebook-utils.
In the Terra Terminal, run:
/usr/local/bin/pip install --upgrade --no-cache-dir terra-notebook-utils
2. (Recommended) Set the terra-notebook-utils configuration.
This configuration applies only to the use of the terra-notebook-utils CLI, not to the API.
tnu config set-workspace my-workspace-name
tnu config set-workspace-namespace my-billing-project
3. To view the current configuration, run:
tnu config print
4. (Optional) View details of the data identified by the DRS URI.
tnu drs info "drs://my-drs-uri"
5. Copy/download the data.
There are several options available depending on your use case (i.e. whether to copy to the Cloud Environment VM or a Google bucket, and whether to copy a single DRS URI or a list of DRS URIs).
- To copy a single DRS URI file to the Cloud Environment VM:
tnu drs copy drs://my-drs-url local_filepath
- To copy a single DRS URI file to a Google bucket:
tnu drs copy drs://my-drs-url gs://my-dst-bucket/my-dst-key
- To copy multiple DRS URIs to the Cloud Environment VM:
tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst local_directory
- To copy multiple DRS URIs to a Google bucket:
tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst gs://my-dst-bucket/prefix
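When the list of DRS URIs comes from a script rather than being typed by hand, the copy-batch command can be assembled programmatically. The sketch below builds the same command line shown above as an argument list; build_copy_batch_command is a hypothetical helper, and it uses only the copy-batch subcommand and --dst option that appear in this article.

```python
import shlex
from typing import List, Sequence


def build_copy_batch_command(drs_uris: Sequence[str], dst: str) -> List[str]:
    """Return the argument list for a 'tnu drs copy-batch' invocation."""
    if not drs_uris:
        raise ValueError("at least one DRS URI is required")
    return ["tnu", "drs", "copy-batch", *drs_uris, "--dst", dst]


command = build_copy_batch_command(
    ["drs://my-drs-url1", "drs://my-drs-url2"],
    "gs://my-dst-bucket/prefix",
)
# shlex.join renders the list as a shell-safe string for review or logging.
print(shlex.join(command))
# To run it from Python in the Terra Terminal environment:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a destination path contains spaces or other special characters.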
Troubleshooting DRS URI access in Terra
If the data referenced by a DRS URI is access-controlled (i.e. not public), access requires successful authentication and authorization. If your workflow fails immediately, it's usually because the workflow cannot access the data. This is often due to either an expired authorization link or an error in the workflow configuration (e.g. a typo in the attribute name on the configuration form).
1. Make sure your Terra and NIH accounts are linked
To access data provided by external services, you must have an up-to-date link to that service in your Terra user Profile.
To learn more about linking to external services (including step-by-step instructions), see Linking authorization/accessing controlled data on external servers.
2. Verify with the DRS data provider that their DRS service is available and functioning properly.
Additional DRS Resources
For more information about the GA4GH Data Repository Service (DRS)-specific tools in Terra:
- Access to data identified by DRS URIs is provided by a DRS client library (terra-notebook-utils package*)
- API to use with Notebooks
- CLI to use from the Terra Terminal
- terra-notebook-utils README
* This package allows you to perform lots of helpful operations, such as
- View details about the data
- Copy/download the data to the Cloud Environment VM or to a Google bucket
DRS in the news
https://www.ga4gh.org/news/drs-api-enabling-cloud-based-data-access-and-retrieval/
Current DRS Documentation
https://ga4gh.github.io/data-repository-service-schemas/docs/
DRS Repository on GitHub
https://github.com/ga4gh/data-repository-service-schemas/blob/master/README.md