How to access data with DRS URIs

Allie Cliffe

Learn how to bring data with a Data Repository Services API Uniform Resource Identifier (DRS URI) into your workspace storage (Google bucket) or Cloud Environment persistent disk to use in an interactive analysis (Jupyter notebook, Galaxy, or RStudio).

Running a workflow with DRS URI inputs? See How to use DRS URIs in a workflow.

Overview

Reasons to copy DRS URI data files

  • To run an interactive analysis (Jupyter notebook, Galaxy, or RStudio) on data with a DRS URI, you first need to pull the data into your workspace Cloud Environment persistent disk.
  • To copy primary data into workspace storage.

Commands to access data with DRS URIs

DRS URI-specific commands (including copy) are provided by a DRS client library, terra-notebook-utils. The package includes an API to use with Python-based Jupyter notebooks and a command line interface (CLI) to use from the Terra terminal. For R-based Notebooks, use the Bioconductor AnVIL package.

What is in the terra-notebook-utils package?

This package includes commands for viewing details about the data, copying/downloading the data to the Cloud Environment VM or external cloud storage (Azure blob or Google bucket), and other helpful operations.

Instructions for viewing and copying/downloading data are in the following sections. For additional helpful information, see the terra-notebook-utils README.

Use a current version of terra-notebook-utils
Because the Terra Cloud Environment is constantly updated, it is very important to use a current version of terra-notebook-utils.

Please use terra-notebook-utils version 0.12.0 or later. Read on for how to install/update terra-notebook-utils.
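
As a quick check before following the steps below, the installed version can be compared against the 0.12.0 minimum from Python. This is a minimal sketch that uses only the standard library; the helper names (`parse_version`, `is_current`, `installed_version`) are illustrative and are not part of terra-notebook-utils.

```python
import re
from importlib import metadata

MINIMUM = (0, 12, 0)  # minimum version recommended in this article


def parse_version(text):
    """Turn a string such as '0.12.3' into (0, 12, 3).

    Only the leading digits of the first three components are used,
    so a suffix such as 'rc1' is ignored.
    """
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_current(installed, minimum=MINIMUM):
    """Return True if the installed version string meets the minimum."""
    return parse_version(installed) >= minimum


def installed_version(package="terra-notebook-utils"):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("terra-notebook-utils is not installed")
    elif is_current(version):
        print(f"terra-notebook-utils {version} is recent enough")
    else:
        print(f"terra-notebook-utils {version} is older than 0.12.0; please upgrade")
```

If the check reports an older version or a missing package, the install step in the sections below brings it up to date.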

How to use DRS URIs in the terminal

The terra-notebook-utils Python CLI is available for use from the Terra Terminal and within shell scripts. 

For instructions on how to access the workspace terminal, see Using the terminal and interactive analysis shell in Terra.

Step-by-step instructions

1. Install the latest version of terra-notebook-utils.
In the Terra Terminal, run

pip install --upgrade --no-cache-dir terra-notebook-utils

2. (Recommended) Set the terra-notebook-utils configuration.
This configuration applies only to the use of the terra-notebook-utils CLI, not to the API.

tnu config set-workspace my-workspace-name
tnu config set-workspace-namespace my-billing-project

Setting the workspace environment variables
When running in the terminal, the workspace namespace (Terra Billing project) and workspace name must be provided in one of three ways: 1) setting a terra-notebook-utils configuration (recommended), 2) using environment variables, or 3) providing command-line options for each.

3. To view the current configuration, run the following.

tnu config print

4. (Optional) View details of the data identified by the DRS URI.

tnu drs info "drs://my-drs-uri"

5. Copy/download the data.
There are several options available depending on your use-case (i.e. whether to copy to the Cloud Environment VM or a Google bucket and whether to copy a single DRS URI or a list of DRS URIs).

  • To copy a single DRS URI file to the Cloud Environment VM
    tnu drs copy drs://my-drs-url local_filepath

    To copy a single DRS URI file to a Google bucket
    tnu drs copy drs://my-drs-url gs://my-dst-bucket/my-dstkey

    To copy multiple DRS URIs to the Cloud Environment VM
    tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst local_directory

    To copy multiple DRS URIs to a Google bucket
    tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst gs://my-dst-bucket/prefix
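
When the list of DRS URIs is long or generated by a script, the `tnu drs copy-batch` command above can be assembled programmatically instead of typed by hand. The sketch below builds the argument list in the same form shown in this section and runs it with `subprocess`. The function names are hypothetical helpers, and it assumes `tnu` is on the PATH of the Terra terminal.

```python
import shlex
import subprocess


def build_copy_batch_command(drs_uris, dst):
    """Return the argument list for `tnu drs copy-batch <uri>... --dst <dst>`."""
    if not drs_uris:
        raise ValueError("at least one DRS URI is required")
    return ["tnu", "drs", "copy-batch", *drs_uris, "--dst", dst]


def run_copy_batch(drs_uris, dst):
    """Print and run the copy-batch command; raises if tnu exits with an error."""
    cmd = build_copy_batch_command(drs_uris, dst)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)


# Example (placeholders from this article, not real URIs):
# run_copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"], "gs://my-dst-bucket/prefix")
```

Passing the arguments as a list avoids shell quoting problems and keeps the destination on the same command line as `--dst`.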

How to use DRS URIs in a Python notebook

The terra-notebook-utils Python API is available in Python notebooks and scripts.

Specifying the destination location with workspace environment variables
When running in notebooks, the current workspace namespace (Terra Billing project) and workspace name are used by default. These specify the destination (i.e. the persistent disk) where the data will be copied.

Step-by-step instructions

1. Install the latest version of terra-notebook-utils.
In a Python Notebook, run the following.

%pip install --upgrade --no-cache-dir terra-notebook-utils

2. Import the terra-notebook-utils drs module.
Note that the table module is optional, yet useful and recommended.

from terra_notebook_utils import drs, table

3. (Optional) View details about the data identified by the DRS URI.

drs.info("drs://my-drs-uri")

4. Copy/download the data.
There are several options available depending on your use-case (i.e. whether to copy to the Cloud Environment VM or a Google bucket and whether to copy a single DRS URI or a list of DRS URIs).

  • To copy a single DRS URI file to the Cloud Environment VM
    drs.copy("drs://my-drs-url", "local_filepath")

    To copy a single DRS URI file to a Google bucket
    drs.copy("drs://my-drs-url", "gs://my-dst-bucket/my-dst-key")

    To copy a list of DRS URIs to the Cloud Environment VM
    drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"], "local_directory")

    To copy a list of DRS URIs to a Google bucket
    drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"],
    "gs://my-dst-bucket/prefix")
  • The terra-notebook-utils package also provides a useful function for finding the DRS URI for a given filename within the workspace. This requires the `table` module to be imported.

    To fetch a DRS URI from a Terra data table for a given file name, use:
    drs_url = table.fetch_drs_url("data table name", "file name")
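
The lookup and copy calls above can be combined so that files are copied by name rather than by DRS URI. The following is a minimal sketch, not part of terra-notebook-utils: `copy_files_by_name` is a hypothetical helper that takes the lookup and copy functions as arguments, so it can be tried with stand-ins before running against real data. It assumes the lookup returns a string beginning with `drs://`.

```python
def copy_files_by_name(table_name, file_names, dst, fetch_drs_url, copy_batch):
    """Look up a DRS URI for each file name, then copy them all to dst.

    fetch_drs_url and copy_batch are passed in by the caller; in a Terra
    notebook these would be table.fetch_drs_url and drs.copy_batch.
    Returns the list of DRS URIs that were handed to copy_batch.
    """
    uris = [fetch_drs_url(table_name, name) for name in file_names]

    # Fail early if a lookup returned something that is not a DRS URI.
    bad = [uri for uri in uris if not str(uri).startswith("drs://")]
    if bad:
        raise ValueError(f"not DRS URIs: {bad}")

    copy_batch(uris, dst)
    return uris


# In a Terra notebook (placeholders from this article):
# from terra_notebook_utils import drs, table
# copy_files_by_name("data table name", ["file name"], "local_directory",
#                    table.fetch_drs_url, drs.copy_batch)
```

Checking the URIs before calling `copy_batch` means a bad table lookup is reported before any data transfer starts.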

Downloading data from Requester Pays buckets

The File Details dialog does not currently support downloading files from Requester Pays buckets. Downloading from the File Details dialog is also unsupported in some other cases; for example, it does not work with some external authentication and authorization services.

For instructions on how to copy data in a Requester Pays bucket with a DRS URI, see Accessing DRS URIs data files.

To learn more about how to organize and access data in the cloud using data tables, see Managing data with tables.

Troubleshooting DRS URI access in Terra

If the data referenced by a DRS URI is access-controlled (i.e. not public), access requires successful authentication and authorization. If your workflow fails immediately, it's usually because the WDL cannot access the data. This is often due to either an expired authorization link or an error configuring the workflow (for example, if there is a typo in the attribute name on the configuration form).  

1. Make sure your Terra and external accounts (such as NIH or BioData Catalyst) are linked.

To access data provided by external services, you must have an up-to-date link to that service in your Terra user Profile - External Identities tab.

To learn more about linking to external services (including step-by-step instructions), see Linking authorization/accessing controlled data on external servers.

2. Verify with the DRS data provider that their DRS service is available and functioning properly.
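
Before starting a long analysis, it can help to confirm that each DRS URI can be resolved with your current credentials. The sketch below is one way to do this, under the assumption that `drs.info` raises an exception when a URI cannot be accessed; the exact exception types are not described in this article, so the sketch catches any exception. `find_inaccessible` is a hypothetical helper, not part of terra-notebook-utils.

```python
def find_inaccessible(drs_uris, info):
    """Call info(uri) for each DRS URI and collect the ones that fail.

    info is passed in by the caller; in a Terra notebook this would be
    drs.info. Returns a dict mapping each failing URI to a short
    description of the error.
    """
    failures = {}
    for uri in drs_uris:
        try:
            info(uri)
        except Exception as exc:  # assumption: any failure surfaces as an exception
            failures[uri] = f"{type(exc).__name__}: {exc}"
    return failures


# In a Terra notebook (placeholder URI from this article):
# from terra_notebook_utils import drs
# problems = find_inaccessible(["drs://my-drs-uri"], drs.info)
# for uri, reason in problems.items():
#     print(uri, "->", reason)
```

If any URIs are reported, check the account links and provider status described in the steps above before retrying.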
