Analyzing data from a workspace bucket in a notebook

Allie Hajian

The virtual machine running a Jupyter Notebook has its own storage, which is distinct from the workspace bucket. To analyze data from your workspace bucket in a notebook, you first need to copy the data to the Cloud Environment's persistent disk.

This article walks you through how to copy data from the workspace bucket to your Cloud Environment's persistent disk for a notebook analysis.

Step 1. Set environment variables in a Jupyter Notebook

Setting the environment variables lets the notebook grab values such as the workspace name and Google bucket directly. This makes notebooks cleaner and more flexible, since you don't have to hardcode these values. Use the syntax below, exactly as it's written.

Python kernel

import os

BILLING_PROJECT_ID = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.environ['WORKSPACE_NAME']
bucket = os.environ['WORKSPACE_BUCKET']

R kernel

project <- Sys.getenv('WORKSPACE_NAMESPACE')
workspace <- Sys.getenv('WORKSPACE_NAME')
bucket <- Sys.getenv('WORKSPACE_BUCKET')
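As a quick sanity check, you can use these variables to build full gs:// object paths for later copy commands. A minimal Python sketch, assuming a hypothetical fallback bucket value and a made-up file name (in a Terra notebook, the bucket always comes from the environment variable set above):

```python
import os

# Outside Terra the environment variable is not set, so fall back to a
# hypothetical example value; inside a Terra notebook it is set for you.
bucket = os.environ.get('WORKSPACE_BUCKET', 'gs://fc-example-bucket')

# Full gs:// path to one object in the bucket (the file name is made up)
object_path = f"{bucket}/data/sample.vcf"
print(object_path)
```

Printing the path before passing it to a copy command is an easy way to catch a missing `gs://` prefix or a typo in the object name.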

Step 2. Copy files in a workspace bucket to a notebook with bash commands

Python kernel

# Copy all files from the workspace bucket to the notebook disk
!gcloud storage cp $bucket/* .

# Run list command to see if file is in the notebook disk
!ls
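The cell above copies every file in the bucket. To copy a single object instead, append its name to the bucket path. Here is a minimal Python sketch of how such a command is assembled, using hypothetical bucket and file names; building the command as a list keeps file names with spaces intact:

```python
import shlex

# Hypothetical values; in a Terra notebook, bucket comes from
# os.environ['WORKSPACE_BUCKET'] as set in Step 1.
bucket = "gs://fc-example-bucket"
file_name = "results.csv"

# Build the copy command as a list of arguments
cmd = ["gcloud", "storage", "cp", f"{bucket}/{file_name}", "."]

# Show the equivalent shell command line
print(shlex.join(cmd))
```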

R kernel

# Copy all files from the workspace bucket to the notebook disk
system(paste0("gcloud storage cp ", bucket, "/* ."), intern = TRUE)

# Run list command to see if the files are on the notebook disk
system("ls", intern = TRUE)

Using bash commands in notebooks

Here are a few tips to keep in mind when running bash commands in a Jupyter notebook (Python or R kernel):

- These commands work only if you have already run the Step 1 cells to set the environment variables.

- The workspace bucket is a Google Cloud Storage bucket, so commands that read or write bucket paths must go through gcloud storage rather than plain bash (in a Python notebook, prefix shell commands with !). Once you execute these cells, the data files should be visible on the notebook's persistent disk.

- To save files generated by the notebook back to the workspace bucket, reverse the source and destination in the copy commands above. To copy an individual file instead of everything, replace * with the file name.

- You can also add the -R flag to recursively copy everything in the file tree from the point you specify (example: gcloud storage cp -R gs://bucket-name/bucket_directory/ .).
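Putting the last two tips together, here is a sketch of saving a generated file back to the bucket from a Python-kernel notebook. The bucket value and file name are hypothetical, and the gcloud command is only printed here, not executed; in a Terra notebook you would run it with the ! prefix:

```python
import pathlib

# Hypothetical bucket; in Terra, use os.environ['WORKSPACE_BUCKET'] instead
bucket = "gs://fc-example-bucket"

# Write a small result file on the notebook's persistent disk
out = pathlib.Path("summary.txt")
out.write_text("n_samples: 42\n")

# The equivalent notebook cell would be:
#   !gcloud storage cp summary.txt $bucket
upload_cmd = f"gcloud storage cp {out} {bucket}"
print(upload_cmd)
```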


Comments


  • Claudia Chu

    How do I get a gsurl I can pass to IGV? I have tried getting the download path from drs.access(drs_bam) and using that link to pass into IGV. But IGV doesn't access these kinds of links.

    I also tried using IGV within my workspace and it wasn't able to recognize any of the bam paths.

    Am I missing something?
  • Pamela Bretscher

    Hi Claudia,

    Thanks for your question! Could you share the workspace where you are seeing this issue with Terra Support by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.

    1. Toggle the "Share with support" button to "Yes"
    2. Click Save

    I'll be happy to take a look at this!

    Kind regards,

    Pamela

