How to access data with DRS URIs

Allie Cliffe

Learn how to bring data with a Data Repository Services API Uniform Resource Identifier (DRS URI) into your workspace storage (Google bucket) or Cloud Environment persistent disk to use in an interactive analysis (Jupyter notebook, Galaxy, or RStudio).

Running a workflow with DRS URI inputs? See How to use DRS URIs in a workflow.

Overview

Reasons to copy DRS URI data files

  • To run an interactive analysis (Jupyter Notebook, Galaxy, or RStudio) on data with a DRS URI, you first need to pull the data into your workspace's Cloud Environment persistent disk.
  • To copy primary data into workspace storage.

Commands to access data with DRS URIs

When working from Python-based notebooks and the command line, use the DRS URI-specific commands provided by a DRS client library, terra-notebook-utils. The package includes an API to use with Python-based Jupyter notebooks and a command line interface (CLI) to use from your workspace's terminal.

When working from R-based notebooks, use the Bioconductor AnVIL package.

What is in the terra-notebook-utils package? This package includes commands for viewing details about the data, copying/downloading the data to the Cloud Environment VM or external cloud storage (Azure blob or Google bucket), and other helpful operations.

Instructions for viewing and copying/downloading data are in the following sections. For additional helpful information, see the terra-notebook-utils README.

Use a current version of terra-notebook-utils
Because the Terra Cloud Environment is constantly updated, it is very important to use a current version of terra-notebook-utils!

Please use terra-notebook-utils version 0.13.0 or later. Read on for how to install/update terra-notebook-utils.
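
To confirm that the installed version meets the 0.13.0 minimum, a quick check along these lines may help. This is a minimal sketch using only the Python standard library; the helper names (`parse_version`, `meets_minimum`, `installed_version`) are hypothetical and not part of terra-notebook-utils. The comparison is rough: it looks only at the first three numeric components, so a pre-release such as 0.13.0rc1 is treated as 0.13.0.

```python
import re
from importlib import metadata

# Minimum version recommended in this article.
MINIMUM = (0, 13, 0)


def parse_version(text):
    """Return the first three numeric components of a version string as a tuple."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    # Pad short versions such as "0.13" so they compare as (0, 13, 0).
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(text, minimum=MINIMUM):
    """Return True if the version string is at least the minimum version."""
    return parse_version(text) >= minimum


def installed_version(package="terra-notebook-utils"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

In a notebook or the terminal, calling `installed_version()` and passing a non-None result to `meets_minimum` indicates whether an upgrade is needed.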

How to use DRS URIs in the terminal

The terra-notebook-utils Python CLI is available for use from the Terra Terminal and within shell scripts. 

To learn how to access the workspace terminal, see Using the terminal and interactive analysis shell in Terra.

Step-by-step instructions

1. Install the latest version of terra-notebook-utils.
In the Terra Terminal, run the following.

pip install --upgrade --no-cache-dir terra-notebook-utils

2. (Recommended) Set the terra-notebook-utils configuration.
This configuration applies only to the use of the terra-notebook-utils CLI, not to the API.

tnu config set-workspace my-workspace-name
tnu config set-workspace-namespace my-billing-project

Setting the workspace environment variables
When running in the terminal, the workspace namespace (Terra Billing project) and workspace name must be provided. There are three ways to do this:
1. Set a terra-notebook-utils configuration (recommended)
2. Use environment variables
3. Provide command-line options for each

3. To view the current configuration, run the following.

tnu config print

4. (Optional) View details of the data identified by the DRS URI.

tnu drs info "drs://my-drs-uri"

5. Copy/download the data.
There are several options available depending on your use case (such as whether to copy to the Cloud Environment VM or a Google bucket and whether to copy a single DRS URI or a list of DRS URIs).

  • To copy a single DRS URI file to the Cloud Environment VM
    tnu drs copy drs://my-drs-url local_filepath

  • To copy a single DRS URI file to a Google bucket
    tnu drs copy drs://my-drs-url gs://my-dst-bucket/my-dstkey

  • To copy multiple DRS URIs to the Cloud Environment VM
    tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst local_directory

  • To copy multiple DRS URIs to a Google bucket
    tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst gs://my-dst-bucket/prefix
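
When the list of DRS URIs is long, it can be easier to assemble the copy-batch command from a script than to type it by hand. The sketch below builds the same `tnu drs copy-batch` invocation shown above and runs it with `subprocess`. The helper names are hypothetical. The environment variable names `WORKSPACE_NAME` and `WORKSPACE_NAMESPACE` are an assumption about how the environment-variable option is spelled; check the terra-notebook-utils README before relying on them.

```python
import os
import subprocess


def build_copy_batch_command(drs_uris, dst):
    """Build the argument list for `tnu drs copy-batch`, skipping blank entries."""
    uris = [uri.strip() for uri in drs_uris if uri.strip()]
    if not uris:
        raise ValueError("no DRS URIs provided")
    bad = [uri for uri in uris if not uri.startswith("drs://")]
    if bad:
        raise ValueError(f"not DRS URIs: {bad}")
    return ["tnu", "drs", "copy-batch", *uris, "--dst", dst]


def workspace_env(workspace, namespace, base=None):
    """Return a copy of the environment with the workspace settings added.

    ASSUMPTION: WORKSPACE_NAME and WORKSPACE_NAMESPACE are the variable
    names the CLI reads; verify against the terra-notebook-utils README.
    """
    env = dict(os.environ if base is None else base)
    env["WORKSPACE_NAME"] = workspace
    env["WORKSPACE_NAMESPACE"] = namespace
    return env


def run_copy_batch(drs_uris, dst, env=None):
    """Run the copy; raises CalledProcessError if tnu exits with an error."""
    command = build_copy_batch_command(drs_uris, dst)
    return subprocess.run(command, env=env, check=True)
```

Passing the command as a list, rather than a single shell string, avoids quoting problems if a URI or destination contains unusual characters. If the configuration from step 2 is already set, `env` can be left as `None`.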

How to use DRS URIs in a Python notebook

The terra-notebook-utils Python API is available in Python notebooks and scripts.

Specifying the files' destination with workspace environment variables
When working in notebooks, the current workspace namespace (Terra Billing project) and workspace name are used by default. These are used to specify the destination (the persistent disk) where the data will be copied.

Step-by-step instructions

1. Install the latest version of terra-notebook-utils.
In a Python Notebook, run the following.

%pip install --upgrade --no-cache-dir terra-notebook-utils

2. Import the terra-notebook-utils drs module.
Note that the table module is optional, yet useful and recommended.

from terra_notebook_utils import drs, table

3. (Optional) View details about the data identified by the DRS URI.

drs.info("drs://my-drs-uri")

4. Copy/download the data.
There are several options available depending on your use case (such as whether to copy to the Cloud Environment VM or a Google bucket and whether to copy a single DRS URI or a list of DRS URIs).

DRS data in GCP only
Currently, the table functionality works only with data stored in Google Cloud. We are working to add functionality for DRS URIs that reference data stored in Azure cloud.

  • To copy a single DRS URI file to the Cloud Environment VM
    drs.copy("drs://my-drs-url", "local_filepath")

  • To copy a single DRS URI file to a Google bucket
    drs.copy("drs://my-drs-url", "gs://my-dst-bucket/my-dst-key")

  • To copy a list of DRS URIs to the Cloud Environment VM
    drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"], "local_directory")

  • To copy a list of DRS URIs to a Google bucket
    drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"], "gs://my-dst-bucket/prefix")

The terra-notebook-utils package also provides a useful function for finding the DRS URI for a given file name within the workspace. This requires the table module to be imported.

  • To fetch a DRS URI from a Terra data table for a given file name
    drs_url = table.fetch_drs_url("data table name", "file name")
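
The lookup and the copy can be combined: resolve each file name to its DRS URI, then copy the whole list in one batch. The following is a minimal sketch, not part of terra-notebook-utils. The helper `copy_named_files` is hypothetical, and it takes the two library functions as arguments so the logic can be tried outside a Terra notebook. How `table.fetch_drs_url` behaves when a file name is not found is not covered here.

```python
def copy_named_files(table_name, file_names, dst, fetch_drs_url, copy_batch):
    """Look up the DRS URI for each file name, then copy them all to dst.

    fetch_drs_url and copy_batch are expected to behave like
    table.fetch_drs_url and drs.copy_batch from terra_notebook_utils.
    Returns a dict mapping each file name to the DRS URI that was copied.
    """
    drs_uris = [fetch_drs_url(table_name, name) for name in file_names]
    copy_batch(drs_uris, dst)
    return dict(zip(file_names, drs_uris))


# In a Terra notebook, after `from terra_notebook_utils import drs, table`:
# copied = copy_named_files(
#     "data table name",           # placeholder table name
#     ["file name 1", "file name 2"],  # placeholder file names
#     "local_directory",
#     table.fetch_drs_url,
#     drs.copy_batch,
# )
```

The returned mapping gives a record of which DRS URI each file name resolved to, which can be useful when checking the copied files later.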

Troubleshooting DRS URI access in Terra

If the data referenced by a DRS URI is access-controlled (i.e., not public), access requires successful authentication and authorization. For example, if a workflow that operates over DRS URIs fails immediately, it's usually because the workflow cannot access the data. This is often due to either an expired authorization link or an error in configuring the workflow (for example, a typo in the attribute name on the configuration form).

1. Make sure your Terra and external (such as NIH or BioData Catalyst) accounts are linked.

To access data provided by external services, you must have an up-to-date link to that service on the External Identities tab of your Terra user Profile.

To learn more about linking to external services (including step-by-step instructions), see How to access controlled data on external servers.

2. Verify the DRS data provider.

Check that the provider's DRS service is available and functioning properly.

Try unlinking and re-linking your external profile
If you've followed the troubleshooting steps above and are still having problems, it can sometimes work to unlink your account and immediately re-link it.
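
Before working through the steps above, it can help to find out which DRS URIs are actually failing. The sketch below requests details for each URI and records any error. The helper `find_inaccessible` is hypothetical. It assumes that `drs.info` raises an exception when a URI cannot be resolved or accessed; that assumption has not been checked against the library's documentation.

```python
def find_inaccessible(drs_uris, info):
    """Return a dict mapping each failing DRS URI to a short error description.

    `info` is expected to behave like drs.info from terra_notebook_utils.
    ASSUMPTION: it raises an exception when the data cannot be accessed.
    """
    failures = {}
    for uri in drs_uris:
        try:
            info(uri)
        except Exception as error:  # broad on purpose: the error types are not known here
            failures[uri] = f"{type(error).__name__}: {error}"
    return failures


# In a Terra notebook, after `from terra_notebook_utils import drs`:
# failures = find_inaccessible(["drs://my-drs-url1", "drs://my-drs-url2"], drs.info)
```

If every URI from one provider fails while others succeed, the account link or that provider's service is the likely cause. If only a few URIs fail, check those URIs for typos.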

 
