Learn how to bring data with a Data Repository Services API Uniform Resource Identifier (DRS URI) into your workspace storage (Google bucket) or Cloud Environment persistent disk to use in an interactive analysis (Jupyter notebook, Galaxy, or RStudio).
Running a workflow with DRS URI inputs? See How to use DRS URIs in a workflow.
Overview
Reasons to copy DRS URI data files
- To run an interactive analysis (Jupyter notebook, Galaxy, or RStudio) on data with a DRS URI, you first need to pull the data into your workspace Cloud Environment persistent disk.
- To copy primary data into workspace storage.
Commands to access data with DRS URIs
DRS URI-specific commands (including copy) are provided by a DRS client library, terra-notebook-utils. The package includes an API to use with Python-based Jupyter notebooks and a command line interface (CLI) to use from the Terra terminal. For R-based notebooks, use the Bioconductor AnVIL package.
What is in the terra-notebook-utils package?
This package includes commands for viewing details about the data, copying/downloading the data to the Cloud Environment VM or external cloud storage (Azure blob or Google bucket), and other helpful operations.
Instructions for viewing and copying/downloading data are in the following sections. For additional helpful information, see the terra-notebook-utils README.
Use a current version of terra-notebook-utils
Because the Terra Cloud Environment is constantly updated, it is very important to use a current version of terra-notebook-utils! Please use terra-notebook-utils version 0.12.0 or later. Read on for how to install/update terra-notebook-utils.
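The version requirement above can be checked from Python before running any copy commands. The following is a minimal sketch using only the standard library; the helper names (parse_version, is_current, installed_version) are hypothetical and are not part of terra-notebook-utils.

```python
import re
from importlib import metadata

MINIMUM_VERSION = "0.12.0"  # minimum version stated in this article


def parse_version(text):
    """Turn '0.12.0' into (0, 12, 0); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_current(installed, minimum=MINIMUM_VERSION):
    """Return True when the installed version is at least the minimum."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that '0.12' and '0.12.0' compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def installed_version(package="terra-notebook-utils"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

If installed_version() returns None, or is_current(installed_version()) is False, install or upgrade the package as described in the step-by-step instructions.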
How to use DRS URIs in the terminal
The terra-notebook-utils Python CLI is available for use from the Terra Terminal and within shell scripts.
For instructions on how to access the workspace terminal, see Using the terminal and interactive analysis shell in Terra.
Step-by-step instructions
1. Install the latest version of terra-notebook-utils.
In the Terra Terminal, run
pip install --upgrade --no-cache-dir terra-notebook-utils
2. (Recommended) Set the terra-notebook-utils configuration.
This configuration applies only to the use of the terra-notebook-utils CLI, not to the API.
tnu config set-workspace my-workspace-name
tnu config set-workspace-namespace my-billing-project
Setting the workspace environment variables
When running in the terminal, the workspace namespace (Terra Billing project) and workspace name must be provided in one of three ways: 1) setting a terra-notebook-utils configuration (recommended), 2) using environment variables, or 3) providing command-line options for each.
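For scripts that launch tnu as a subprocess, the environment-variable option can be sketched as below. The variable names WORKSPACE_NAME and WORKSPACE_NAMESPACE are assumptions made for illustration, and workspace_env is a hypothetical helper; confirm the exact variable names in the terra-notebook-utils README before relying on them.

```python
import os

# ASSUMPTION: these variable names are illustrative only. Check the
# terra-notebook-utils README for the names the CLI actually reads.
NAME_VAR = "WORKSPACE_NAME"
NAMESPACE_VAR = "WORKSPACE_NAMESPACE"


def workspace_env(workspace_name, workspace_namespace, base=None):
    """Return a copy of the environment with the two workspace values set.

    The current process environment is used unless `base` is given, and
    neither os.environ nor `base` is modified.
    """
    env = dict(os.environ if base is None else base)
    env[NAME_VAR] = workspace_name
    env[NAMESPACE_VAR] = workspace_namespace
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when calling a tnu command. Setting the configuration with tnu config remains the recommended approach.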
3. To view the current configuration, run the following.
tnu config print
4. (Optional) View details of the data identified by the DRS URI.
tnu drs info "drs://my-drs-uri"
5. Copy/download the data.
There are several options available depending on your use case (i.e. whether to copy to the Cloud Environment VM or a Google bucket and whether to copy a single DRS URI or a list of DRS URIs).
To copy a single DRS URI file to the Cloud Environment VM
tnu drs copy drs://my-drs-url local_filepath
To copy a single DRS URI file to a Google bucket
tnu drs copy drs://my-drs-url gs://my-dst-bucket/my-dst-key
To copy multiple DRS URIs to the Cloud Environment VM
tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst local_directory
To copy multiple DRS URIs to a Google bucket
tnu drs copy-batch drs://my-drs-url1 drs://my-drs-url2 --dst gs://my-dst-bucket/prefix
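When the copy commands are generated by a script rather than typed by hand, the four cases above reduce to two command shapes: copy for one DRS URI and copy-batch with --dst for several. The sketch below builds the argument list for either case using only the command syntax shown in this article; build_copy_command is a hypothetical helper, not part of the tnu CLI.

```python
def build_copy_command(drs_uris, destination):
    """Return the tnu argument list for copying one or more DRS URIs.

    `drs_uris` may be a single string or a list of strings. `destination`
    is a local path or a gs:// location, passed through unchanged.
    """
    if isinstance(drs_uris, str):
        drs_uris = [drs_uris]
    uris = list(drs_uris)
    if not uris:
        raise ValueError("at least one DRS URI is required")
    for uri in uris:
        if not uri.startswith("drs://"):
            raise ValueError(f"not a DRS URI: {uri!r}")
    if len(uris) == 1:
        # Single file: tnu drs copy <uri> <destination>
        return ["tnu", "drs", "copy", uris[0], destination]
    # Several files: tnu drs copy-batch <uri>... --dst <destination>
    return ["tnu", "drs", "copy-batch", *uris, "--dst", destination]
```

The resulting list can be handed to subprocess.run(command, check=True) in the Terra terminal environment, or joined with shlex.join(command) to print the equivalent shell command.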
How to use DRS URIs in a Python notebook
The terra-notebook-utils Python API is available in Python notebooks and scripts.
Specifying the destination location with workspace environment variables
When running in notebooks, the current workspace namespace (Terra Billing project) and workspace name are used by default. These are used to specify the destination (i.e. the persistent disk) where the data will be copied.
Step-by-step instructions
1. Install the latest version of terra-notebook-utils.
In a Python Notebook, run the following.
%pip install --upgrade --no-cache-dir terra-notebook-utils
2. Import the terra-notebook-utils drs module.
Note that the table module is optional, yet useful and recommended.
from terra_notebook_utils import drs, table
3. (Optional) View details about the data identified by the DRS URI.
drs.info("drs://my-drs-uri")
4. Copy/download the data.
There are several options available depending on your use-case (i.e. whether to copy to the Cloud Environment VM or a Google bucket and whether to copy a single DRS URI or a list of DRS URIs).
To copy a single DRS URI file to the Cloud Environment VM
drs.copy("drs://my-drs-url", "local_filepath")
To copy a single DRS URI file to a Google bucket
drs.copy("drs://my-drs-url", "gs://my-dst-bucket/my-dst-key")
To copy a list of DRS URIs to the Cloud Environment VM
drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"], "local_directory")
To copy a list of DRS URIs to a Google bucket
drs.copy_batch(["drs://my-drs-url1", "drs://my-drs-url2"], "gs://my-dst-bucket/prefix")
The terra-notebook-utils package also provides a useful function for finding the DRS URI for a given file name within the workspace. This requires the table module to be imported.
To fetch a DRS URI from a Terra data table for a given file name, use:
drs_url = table.fetch_drs_url("data table name", "file name")
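The lookup and copy steps can be combined into a small helper. This is a sketch under stated assumptions: it calls only drs.copy, drs.copy_batch, and table.fetch_drs_url as shown in this article, and the function names copy_drs and copy_files_from_table are hypothetical. The drs and table modules are passed in as arguments so the helpers can also be exercised with stand-in objects outside Terra.

```python
def copy_drs(drs_uris, destination, drs_module):
    """Copy one DRS URI (a string) or several (a list) to the destination.

    `drs_module` is expected to be terra_notebook_utils.drs. The
    destination is a local path or gs:// location, passed through as-is.
    """
    if isinstance(drs_uris, str):
        drs_module.copy(drs_uris, destination)  # single file
    else:
        drs_module.copy_batch(list(drs_uris), destination)  # several files


def copy_files_from_table(table_name, file_names, destination,
                          table_module, drs_module):
    """Look up each file name in a data table, then copy the DRS URIs found.

    `table_module` is expected to be terra_notebook_utils.table.
    Returns the list of DRS URIs that were looked up.
    """
    drs_urls = [table_module.fetch_drs_url(table_name, name)
                for name in file_names]
    copy_drs(drs_urls, destination, drs_module)
    return drs_urls
```

In a notebook, after importing drs and table from terra_notebook_utils, a call such as copy_files_from_table("data table name", ["file name"], "local_directory", table, drs) would copy the matching files to the persistent disk.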
Downloading data from Requester Pays buckets
The File Details dialog does not currently support downloading files from Requester Pays buckets. Downloading from the File Details dialog is also unsupported in some other cases, for example with some external authentication and authorization services.
For instructions on how to copy data in a Requester Pays bucket using a DRS URI, see Accessing DRS URIs data files.
To learn more about how to organize and access data in the cloud using data tables, see Managing data with tables.
Troubleshooting DRS URI access in Terra
If the data referenced by a DRS URI is access-controlled (i.e. not public), access requires successful authentication and authorization. If your workflow fails immediately, it's usually because the WDL cannot access the data. This is often due to either an expired authorization link or an error configuring the workflow (for example, if there is a typo in the attribute name on the configuration form).
1. Make sure your Terra and external (such as NIH, BioData Catalyst, etc.) accounts are linked
To access data provided by external services, you must have an up-to-date link to that service in your Terra user Profile - External Identities tab.
To learn more about linking to external services (including step-by-step instructions), see Linking authorization/accessing controlled data on external servers.
2. Verify with the DRS data provider that their DRS service is available and functioning properly.
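Before working through the checks above, it can help to confirm whether a given DRS URI is reachable at all from your Cloud Environment. The sketch below wraps a lookup function, such as drs.info from this article, and reports success or the error raised. The helper name check_drs_access is hypothetical, and because this article does not document which exception types drs.info raises, the sketch deliberately catches any exception.

```python
def check_drs_access(drs_uri, info_fn):
    """Try to fetch details for a DRS URI.

    `info_fn` is a callable taking the DRS URI, for example drs.info.
    Returns (True, details) on success, or (False, message) when the
    lookup raises, which may point to an expired account link or an
    unavailable DRS service.
    """
    try:
        return True, info_fn(drs_uri)
    except Exception as error:  # broad on purpose: error types are not documented here
        return False, f"{type(error).__name__}: {error}"
```

In a notebook, check_drs_access("drs://my-drs-uri", drs.info) returns a pair whose first element tells you whether the lookup succeeded. A failure for access-controlled data is a prompt to re-check your linked external accounts.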