Copying notebook output to a Google bucket
Data generated by running an analysis in a Jupyter notebook is saved to the disk associated with the virtual notebook runtime. When the runtime is deleted, that data is deleted as well. To transfer data generated within a notebook to more permanent storage, follow steps 1 and 2 below, using the Python or R version as appropriate.
Note that you will need to rerun the code cells that set the environment variables (step 1) after pausing or stopping the notebook cloud environment. The environment variables are part of the cloud environment (not the virtual disk associated with the cluster), so they go away when the notebook cluster is stopped or paused.
Contents
1. Setting environment variables in a Jupyter Notebook
- Python kernel
- R kernel
2. Saving output files permanently to a bucket with bash commands
- Python kernel
- R kernel
1. Setting environment variables in a Jupyter Notebook
Setting the environment variables lets the notebook grab values such as the workspace name and Google bucket directly. This makes for cleaner, more flexible notebooks that don't require you to hardcode those values. Use the syntax below:
Python kernel
import os
BILLING_PROJECT_ID = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.environ['WORKSPACE_NAME']
bucket = os.environ['WORKSPACE_BUCKET']
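Because these variables go away when the cloud environment is paused or stopped, it can help to fail fast with a clear message when one is missing instead of getting an empty value later. A minimal sketch (the `get_workspace_env` helper name is made up for illustration):

```python
import os

def get_workspace_env(name):
    """Return a workspace environment variable, raising a clear error if the
    runtime no longer has it (e.g. after the cloud environment was paused)."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(
            f"{name} is not set -- rerun the environment-variable cell "
            "after restarting the cloud environment."
        )
    return value
```

You would then call, for example, `bucket = get_workspace_env('WORKSPACE_BUCKET')`.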
R kernel
project <- Sys.getenv('WORKSPACE_NAMESPACE')
workspace <- Sys.getenv('WORKSPACE_NAME')
bucket <- Sys.getenv('WORKSPACE_BUCKET')
2. Saving output files permanently to a bucket with bash commands
Note: the workspace bucket is a Google bucket, so basic bash file commands in the notebook need to be prefixed with `gsutil` (for example, `gsutil cp` and `gsutil ls`).
These commands will only work if you have first run the cells above that set the environment variables. Once you execute these cells, the data files should be visible in the workspace bucket.
To save all files generated while the notebook runs, use the commands below. To copy individual files, replace `*` with the file name.
Python kernel
# Copy all files in the notebook into the bucket
!gsutil cp ./* $bucket
# Run list command to see if file is in the bucket
!gsutil ls $bucket
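When copying a single file whose name might contain spaces (a recurring theme in the comments below), it can be safer to build the `gsutil` command string with proper quoting first. A sketch, where `gsutil_cp_command` and the file names are hypothetical:

```python
import shlex

def gsutil_cp_command(filename, bucket):
    """Build a gsutil command that copies one local file to the bucket,
    quoting both paths so spaces in names are handled safely."""
    return "gsutil cp " + shlex.quote("./" + filename) + " " + shlex.quote(bucket)
```

In a notebook you could then run the resulting string with `!{cmd}`.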
R kernel
# Copy all files generated in the notebook into the bucket
system(paste0("gsutil cp ./* ", bucket), intern = TRUE)
# Run list command to see if the file is in the bucket
system(paste0("gsutil ls ", bucket), intern = TRUE)
Comments
Note that the new functionality above does not currently work for getting the WORKSPACE_NAME for workspaces that have a space in the name:
https://github.com/DataBiosphere/leonardo/issues/864
Also, an alternative way to get the environment variables is through the Python `os` module.
This removes the need to pull out the 0th element of the SList returned by a `!command`.
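For reference, a direct `os`-based read might look like this (the `get_bucket` helper name is made up for illustration):

```python
import os

def get_bucket():
    """Read the workspace bucket straight from the environment, rather than
    shelling out (e.g. `!echo $WORKSPACE_BUCKET`) and indexing the SList."""
    return os.environ.get("WORKSPACE_BUCKET", "")
```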
The fix for issue 864 should be released this week.
I like the python approach better than shelling out `echo` commands. Here's an R version:
project <- Sys.getenv('WORKSPACE_NAMESPACE')
workspace <- Sys.getenv('WORKSPACE_NAME')
bucket <- Sys.getenv('WORKSPACE_BUCKET')
It looks like this will be made permanent when the issue is closed this week.
@Matt Bookman -- Thanks for an update to the code, I will update that in the documentation.
I recreated my cluster today and confirmed that the WORKSPACE_NAME variable was set correctly for a notebook whose workspace name has a space in it.
So https://github.com/DataBiosphere/leonardo/issues/864 is fixed.
Okay, a little more information was provided in your email that I am sharing here.
If the cluster has not been "recreated" recently, meaning completely replaced, this might not work.
Notebook clusters can be paused and restarted, but that is not the same as being recreated.
"Recreating" a cluster will cause any newly installed programs and notebook outputs to be cleared out, so please back up your work to the workspace bucket. [I am working on a blog post on some simple commands that facilitate this.]
bucket <- Sys.getenv('WORKSPACE_BUCKET')
It returns an empty string for me. Hmm, was the variable changed?
Edit: print(Sys.getenv()) does not show any environment variable that corresponds to it. 'CLUSTER_NAME' shows the ID for the notebook but not the bucket ID of the workspace.
Hi James,
Were you perhaps working on an older Runtime Environment? Just made a fresh Runtime (with R kernel) and was able to run all the above commands successfully.
I also ran print(Sys.getenv()), and while the cluster name is listed at the top, if you scroll to the last few lines of the output from the command you can see the three variables.
If you try again with a new runtime and the variables do not work please let us know!
With a new runtime it still didn't work! `Sys.getenv()` prints only the text below. I think maybe I'm using one of the old custom images (us.gcr.io/broad-dsp-gcr-public/terra-jupyter-bioconductor:0.0.2). I'll see if there's anything newer and get back to you.
They seem to have fixed it!
https://github.com/DataBiosphere/terra-docker/blob/master/terra-jupyter-bioconductor/CHANGELOG.md
Awesome! Are you un-blocked?