How (and why) to save data generated in a notebook to a Workspace bucket

Allie Hajian

When working in a notebook, your cloud environment VM includes a detachable persistent disk that maintains generated data even when you recreate the cloud environment. There are times, however, when you will want to copy the data to more permanent cloud storage: to archive data, for example, or to give collaborators or workflows access outside your notebook VM. This article describes how to move data from the notebook to a Google bucket (including your workspace bucket) when working in a Jupyter notebook in Terra.

Additional resources: For a deeper dive into the back end of a Terra notebook and to understand why notebooks have these characteristics, see this article about key notebook components or this article about key notebook operations.

Real-time updates you can make to the Cloud Environment without losing data

When you create a Cloud Environment for your interactive analysis on Terra, you can choose the parameters for detachable persistent disk storage. A 50GB disk is included by default, and when you delete or recreate the Cloud Environment, you have the option of keeping or deleting it.

If your Cloud Environment doesn't include a persistent disk, you can still make some updates without losing generated data. 

  • Increase or decrease the number of CPUs or memory
    During this update, the Cloud Environment will pause, update, and then restart. The update will take a couple of minutes to complete and you will not be able to continue editing or running the notebook while it's completing.

  • Increase the disk size or change the number of workers (when the number of workers is > 2)
    During this update, you can continue to work in your notebook without pausing your Cloud Environment. When the update is finished, you will see a banner confirming the update. 

If you want to simultaneously change both the workers and CPU/memory, we advise doing it sequentially: first update the CPUs/memory, wait for the notebook Cloud Environment to restart, and then adjust the workers.

Any other Cloud Environment changes (e.g. decreasing the disk size or changing the environment type) require deleting the existing Cloud Environment and creating a new one. When you create the new Cloud Environment, any generated data files and installed packages not stored on the Persistent Disk will be lost. Please back up files as appropriate.

Additional resources: To learn more about Persistent Disks in your workspace Cloud Environment, see this article.

How and why to copy output data to the Workspace bucket


Why copy data to the Workspace bucket?


Use generated data as input for a workflow, or share with collaborators
Files generated by the notebook are not automatically saved in the Workspace bucket and are not accessible outside your personal virtual Cloud Environment. You will need to copy data to the Workspace bucket if you want colleagues to have access. This is true even if you are working in a shared Workspace, since Cloud Environments and Persistent Disks are unique to each user.

Save generated data to storage other than PD
In addition, if you opted out of including a Persistent Disk when you created your Cloud Environment, you will lose installed packages and output data generated in a notebook when you delete the Cloud Environment, or reconfigure it in a way that requires recreating it, unless you have explicitly saved your output to the workspace bucket.

Archive data
If you want to archive data, especially if you want to copy it to less expensive Nearline or Coldline storage, you will first need to copy it to an external bucket. 

You will not lose your data if you pause the Cloud Environment, since the VM or cluster goes away but the Persistent Disk does not. In fact, when you re-open your notebook, the VM is created more quickly because the disk does not need to be recreated.

To move generated data to permanent cloud storage, make sure to explicitly save your outputs in the workspace bucket by following the directions below.

1. Set environment variables in a Jupyter Notebook

Setting the environment variables lets the notebook read values such as the workspace name and Google bucket directly. This makes notebooks cleaner and more flexible, because you don't need to hardcode these values. Use the syntax below:

Python kernel

import os

bucket = os.environ['WORKSPACE_BUCKET']
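
If the same notebook might also be run outside Terra (on a local machine, for example), these variables may not be set, and `os.environ[...]` will raise a `KeyError`. The following is a minimal sketch of a more defensive lookup. The helper name `get_workspace_env` is hypothetical and not part of Terra; the variable names are the ones used in this article.

```python
import os

# Variable names used in this article; the helper itself is a hypothetical convenience.
TERRA_VARS = ("WORKSPACE_NAMESPACE", "WORKSPACE_NAME", "WORKSPACE_BUCKET")


def get_workspace_env(environ=os.environ):
    """Return the workspace variables as a dict, or raise a readable error."""
    missing = [name for name in TERRA_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Is this notebook running in a Terra Cloud Environment?"
        )
    return {name: environ[name] for name in TERRA_VARS}
```

In a notebook cell, `bucket = get_workspace_env()['WORKSPACE_BUCKET']` would then give the same value as the direct lookup above.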

R kernel 

project <- Sys.getenv('WORKSPACE_NAMESPACE')
workspace <- Sys.getenv('WORKSPACE_NAME')
bucket <- Sys.getenv('WORKSPACE_BUCKET')

2. Save output files to a bucket with bash commands

Note: the workspace bucket is a Google bucket, so basic file commands in the notebook (such as cp and ls) need to be preceded by "gsutil".

These commands will only work if you have run the commands above to set the environment variables. Once you execute these cells, the data files should be visible in the workspace bucket.

To save all generated files after the notebook runs, use the commands below. If you want to copy individual files, you can replace `*` with the file name to copy.

Python kernel

# Copy all files in the notebook's current working directory into the bucket
!gsutil cp ./* $bucket

# Run list command to confirm the files are in the bucket
!gsutil ls $bucket
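
Without the `-r` option, `gsutil cp` does not copy subdirectories, and the command above places everything at the top level of the bucket. As a hedged sketch of a more selective copy, the helpers below build the `gsutil` command as an argument list and run it with `subprocess`, so a failed copy raises an error instead of passing silently. The function names and the `notebook-outputs` folder are hypothetical; the sketch assumes `bucket` holds the `gs://...` value read from `WORKSPACE_BUCKET`.

```python
import subprocess


def build_copy_command(source, bucket, folder="", recursive=False):
    """Build a gsutil copy command as an argument list (nothing is run here)."""
    destination = bucket.rstrip("/") + "/"
    if folder:
        destination += folder.strip("/") + "/"
    command = ["gsutil", "-m", "cp"]
    if recursive:
        command.append("-r")  # needed to copy a directory and its contents
    return command + [source, destination]


def copy_to_bucket(source, bucket, folder="", recursive=False):
    """Run the copy; check=True raises CalledProcessError if gsutil fails."""
    subprocess.run(build_copy_command(source, bucket, folder, recursive), check=True)
```

For example, `copy_to_bucket("results.csv", bucket, folder="notebook-outputs")` would copy a single (hypothetical) file into a folder in the bucket, and `copy_to_bucket("plots", bucket, recursive=True)` would copy a whole directory.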

R kernel

# Copy all files in the notebook's current working directory into the bucket
system(paste0("gsutil cp ./* ", bucket), intern = TRUE)

# Run list command to confirm the files are in the bucket
system(paste0("gsutil ls ", bucket), intern = TRUE)

What to do if you've lost your notebook data?

Your notebook file (along with any data explicitly saved to your bucket) is stored in the Workspace bucket. This means you can rerun the notebook to regenerate any output data (though you will pay for this, of course).
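
If the outputs were copied to the bucket earlier, they can be copied back into the Cloud Environment instead of being regenerated. The following is a sketch only: the helper names and the example path are hypothetical, and it assumes `bucket` holds the `gs://...` value read from `WORKSPACE_BUCKET`.

```python
import subprocess


def build_restore_command(bucket, remote_path, local_dir="."):
    """Build a gsutil command that copies bucket objects to a local directory."""
    source = bucket.rstrip("/") + "/" + remote_path.lstrip("/")
    return ["gsutil", "-m", "cp", "-r", source, local_dir]


def restore_from_bucket(bucket, remote_path, local_dir="."):
    """Run the copy; check=True raises CalledProcessError if gsutil fails."""
    subprocess.run(build_restore_command(bucket, remote_path, local_dir), check=True)
```

For example, `restore_from_bucket(bucket, "notebook-outputs/results.csv")` would copy that (hypothetical) file into the notebook's current working directory.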
