Copying notebook output to a Google bucket
Data generated by running an analysis in a Jupyter notebook is saved to the disk associated with the virtual notebook runtime. When the runtime is deleted, that data is deleted as well. To transfer data generated within a notebook to more permanent storage, follow steps 1 and 2 below, using the Python or R version as appropriate.
Note that you will need to rerun the code cells that set the environment variables (step 1) after pausing or stopping the notebook cloud environment. The environment variables are part of the cloud environment (not the virtual disk associated with the cluster), so they go away when the notebook cluster is stopped or paused.
Contents
1. Setting environment variables in a Jupyter Notebook
- Python kernel
- R kernel
2. Saving output files permanently to a bucket with bash commands
- Python kernel
- R kernel
1. Setting environment variables in a Jupyter Notebook
Setting the environment variables lets the notebook grab values such as the workspace name and Google bucket directly. This makes for cleaner, more flexible notebooks that don't require you to hardcode those values. Use the syntax below:
Python kernel
import os
BILLING_PROJECT_ID = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.environ['WORKSPACE_NAME']
bucket = os.environ['WORKSPACE_BUCKET']
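Because these variables go away when the cloud environment is paused or stopped, it can help to fail fast with a clear message when one is missing instead of getting an empty value later. A minimal sketch (the `get_workspace_env` helper name is made up for illustration):

```python
import os

def get_workspace_env(name):
    """Return a workspace environment variable, raising a clear error if the
    runtime no longer has it (e.g. after the cloud environment was paused)."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(
            f"{name} is not set -- rerun the environment-variable cell "
            "after restarting the cloud environment."
        )
    return value
```

You would then call, for example, `bucket = get_workspace_env('WORKSPACE_BUCKET')`.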
R kernel
project <- Sys.getenv('WORKSPACE_NAMESPACE')
workspace <- Sys.getenv('WORKSPACE_NAME')
bucket <- Sys.getenv('WORKSPACE_BUCKET')
2. Saving output files permanently to a bucket with bash commands
Note: the workspace bucket is a Google bucket, so basic bash file commands in the notebook need to be prefixed with `gsutil` (for example, `gsutil cp` and `gsutil ls`).
These commands will only work if you have first run the cells above that set the environment variables. Once you execute these cells, the data files should be visible in the workspace bucket.
To save all files generated while the notebook runs, use the commands below. To copy individual files, replace `*` with the file name.
Python kernel
# Copy all files in the notebook into the bucket
!gsutil cp ./* $bucket
# Run list command to see if file is in the bucket
!gsutil ls $bucket
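When copying a single file whose name might contain spaces (a recurring theme in the comments below), it can be safer to build the `gsutil` command string with proper quoting first. A sketch, where `gsutil_cp_command` and the file names are hypothetical:

```python
import shlex

def gsutil_cp_command(filename, bucket):
    """Build a gsutil command that copies one local file to the bucket,
    quoting both paths so spaces in names are handled safely."""
    return "gsutil cp " + shlex.quote("./" + filename) + " " + shlex.quote(bucket)
```

In a notebook you could then run the resulting string with `!{cmd}`.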
R kernel
# Copy all files generated in the notebook into the bucket
system(paste0("gsutil cp ./* ", bucket), intern = TRUE)
# Run list command to see if the file is in the bucket
system(paste0("gsutil ls ", bucket), intern = TRUE)
Comments
Note that the new functionality above does not currently work for getting the WORKSPACE_NAME for workspaces that have a space in the name:
https://github.com/DataBiosphere/leonardo/issues/864
Also, an alternative way to get the environment variables is through the Python `os` module.
This removes the need to pull out the 0th element of the SList returned by a `!command`.
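For reference, a direct `os`-based read might look like this (the `get_bucket` helper name is made up for illustration):

```python
import os

def get_bucket():
    """Read the workspace bucket straight from the environment, rather than
    shelling out (e.g. `!echo $WORKSPACE_BUCKET`) and indexing the SList."""
    return os.environ.get("WORKSPACE_BUCKET", "")
```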
The fix for issue 864 should be released this week.
I like the python approach better than shelling out `echo` commands. Here's an R version:
project <- Sys.getenv('WORKSPACE_NAMESPACE')
workspace <- Sys.getenv('WORKSPACE_NAME')
bucket <- Sys.getenv('WORKSPACE_BUCKET')
It looks like this will be made permanent when the issue is closed this week.
@Matt Bookman -- Thanks for an update to the code, I will update that in the documentation.
I recreated my cluster today and confirmed that the WORKSPACE_NAME variable was set correctly for a notebook whose workspace name has a space in it.
So https://github.com/DataBiosphere/leonardo/issues/864 is fixed.
Okay, a little more information was provided in your email that I am sharing here.
If the cluster has not been "recreated" recently, meaning completely replaced, this might not work.
Notebook clusters can be paused and restarted, but that is not the same as being recreated.
"Recreating" a cluster will cause any newly installed programs and notebook outputs to be cleared out, so please back up your work to the workspace bucket. [I am working on a blog post on some simple commands that facilitate this.]
bucket <- Sys.getenv('WORKSPACE_BUCKET')
It returns an empty string for me. Hmm, was the variable changed?
Edit: print(Sys.getenv()) does not show any environment variable that corresponds to it. 'CLUSTER_NAME' shows the ID for the notebook but not the bucket ID of the workspace.
Hi James,
Were you perhaps working on an older Runtime Environment? Just made a fresh Runtime (with R kernel) and was able to run all the above commands successfully.
I also ran print(Sys.getenv()), and while the cluster name is listed at the top, if you scroll to the last few lines of the output from the command you can see the three variables.
If you try again with a new runtime and the variables do not work please let us know!
With a new runtime it still didn't work! `Sys.getenv()` prints only the text below. I think maybe I'm using one of the old custom images (us.gcr.io/broad-dsp-gcr-public/terra-jupyter-bioconductor:0.0.2). I'll see if there's anything newer and get back to you.
They seem to have fixed it!
https://github.com/DataBiosphere/terra-docker/blob/master/terra-jupyter-bioconductor/CHANGELOG.md
Awesome! Are you un-blocked?