How (and why) to save data generated in a notebook to workspace storage

Allie Hajian
Your Jupyter Cloud Environment includes a detachable persistent disk (PD) that stores generated data even if you re-create the environment's virtual machine (VM). Sometimes, though, you will want to copy your data to more permanent cloud storage: for example, to archive it, or to give collaborators or workflows access. This article describes how to move data from Jupyter VM storage to a Google bucket (including your workspace storage) when working in notebooks in Terra.

Interested in a deeper dive into Terra's Cloud Environment? To understand what's happening on the back end and why notebooks and RStudio analyses have these characteristics, see this article about key notebook components or this article about key notebook operations.

Why copy generated data to workspace storage?

Below are the primary reasons you might want to copy data generated in a notebook analysis to workspace storage (or an external Google bucket).

Use generated data as input for a workflow

Files generated by a notebook are not automatically saved in workspace storage (Google bucket) and are not accessible outside your personal virtual Jupyter Cloud Environment. In order for your workflows to access generated notebook files as input data, you will need to copy them to workspace storage. 

Share generated data with collaborators - even in a shared workspace

For the same reason, you need to copy data to workspace storage if you want colleagues to have access. This is true even if you are working in a shared workspace, since each user has their own Cloud Environment and persistent disk (PD), which are inaccessible to anyone else.

Archive data

If you want to archive data, especially if you want to copy it to less expensive Nearline or Coldline storage, first copy it to an external bucket. 

Safeguard data when re-creating or deleting the PD

Sometimes you may want to reconfigure your Cloud Environment (e.g., if you are moving between a notebook and an RStudio analysis) or delete your PD. Be careful: in some cases you can lose all or some generated data unless you explicitly save your output to workspace or external storage (i.e., a Google bucket).

For example, if you want to decrease the size of your PD (because you overestimated how much space you would need and don't want to pay for unused space), back up your data before decreasing the disk size, in case the part of the disk that is deleted includes some of your data.

Don't lose data when running both Jupyter and RStudio! You have to re-create the Cloud Environment when swapping between RStudio and Jupyter in the same workspace. Be careful not to delete or reduce your persistent disk when you do this; otherwise, you might lose your data. Instead, only increase the disk size and keep the same disk type.
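As a rough illustration of backing up before you reconfigure or shrink a disk, the sketch below builds a recursive `gcloud storage cp` command that copies a whole directory into a timestamped folder in a bucket. The function name `build_backup_command` and the `backups/` folder name are hypothetical choices made for this example, and the bucket value is assumed to be the `WORKSPACE_BUCKET` variable described in Step 1 below.

```python
import datetime


def build_backup_command(local_dir, bucket, now=None):
    """Build a recursive `gcloud storage cp` command for a timestamped backup.

    The "backups/<timestamp>/" layout is an illustrative convention only.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    # A trailing slash marks the destination as a folder-like prefix.
    destination = f"{bucket.rstrip('/')}/backups/{stamp}/"
    return ["gcloud", "storage", "cp", "--recursive", local_dir, destination]
```

To run the backup from a notebook, you could pass the result to `subprocess.run(..., check=True)`, for instance with a hypothetical directory name such as `results` as `local_dir`. Keeping each backup under its own timestamp means a later backup never overwrites an earlier one.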

How to copy data to workspace storage

To move generated data to permanent cloud storage, make sure to explicitly save your outputs in the workspace bucket by following the directions below in a Jupyter or RStudio notebook.

Step 1. Set environment variables

Setting the environment variables lets the notebook read values such as the workspace name and Google bucket directly. This makes notebooks cleaner and more flexible, because you don't need to hardcode these values.

Run the commands below in a code cell.

  • Python

    import os

    BILLING_PROJECT_ID = os.environ['WORKSPACE_NAMESPACE']
    WORKSPACE = os.environ['WORKSPACE_NAME']
    bucket = os.environ['WORKSPACE_BUCKET']
  • R

    project <- Sys.getenv('WORKSPACE_NAMESPACE')
    workspace <- Sys.getenv('WORKSPACE_NAME')
    bucket <- Sys.getenv('WORKSPACE_BUCKET')
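If a notebook might also be run outside a Terra Cloud Environment, a missing variable shows up as a bare `KeyError` in Python. The sketch below is one possible way to check all three variables up front and report which ones are missing. The helper name `read_workspace_settings` and its `env` parameter are hypothetical and are not part of Terra.

```python
import os

REQUIRED_VARS = ("WORKSPACE_NAMESPACE", "WORKSPACE_NAME", "WORKSPACE_BUCKET")


def read_workspace_settings(env=None):
    """Return the workspace variables as a dict, or raise a descriptive error.

    `env` defaults to os.environ; passing a plain dict makes testing easier.
    """
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

In a Terra notebook you would then write `bucket = read_workspace_settings()["WORKSPACE_BUCKET"]` in place of the direct `os.environ` lookup.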

Step 2. Save output files to a bucket with bash commands

Workspace storage is a Google bucket, so file commands such as cp and ls need to be preceded by "gcloud storage" when you run them from a notebook.

These commands will work only if you have run the commands above to set the environment variables. Once you execute these cells, the data files should be visible in the workspace bucket.

To save all generated files after the notebook runs, use the commands below. If you want to copy individual files, you can replace `*` with the file name to copy.

  • Python

    # Copy all files in the current directory into the bucket
    !gcloud storage cp ./* $bucket

    # List the bucket contents to confirm the files were copied
    !gcloud storage ls $bucket
  • R

    # Copy all files in the current directory into the bucket
    system(paste0("gcloud storage cp ./* ",bucket),intern=TRUE)

    # List the bucket contents to confirm the files were copied
    system(paste0("gcloud storage ls ",bucket),intern=TRUE)
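The copy commands above place everything at the top level of the bucket. If you would rather group outputs under a folder, or run the copy from plain Python instead of the `!` shell escape, something like the following sketch could work. The names `build_copy_command` and `copy_to_bucket`, and the `prefix` parameter, are hypothetical helpers written for this example. The wildcard is expanded in Python with `glob`, so no shell is required.

```python
import glob
import os
import subprocess


def build_copy_command(pattern, bucket, prefix=""):
    """Build a `gcloud storage cp` argument list for files matching `pattern`.

    Only regular files are included; directories are skipped.
    """
    files = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if not files:
        raise FileNotFoundError(f"No files match {pattern!r}")
    # Join bucket and optional folder; the trailing slash marks a folder-like prefix.
    destination = "/".join(part.strip("/") for part in (bucket, prefix) if part) + "/"
    return ["gcloud", "storage", "cp", *files, destination]


def copy_to_bucket(pattern, bucket, prefix=""):
    """Run the copy; raises CalledProcessError if gcloud exits with an error."""
    subprocess.run(build_copy_command(pattern, bucket, prefix), check=True)
```

For example, `copy_to_bucket("./*.csv", bucket, "results")` would copy only the CSV files in the current directory into a `results/` folder in the workspace bucket, assuming `bucket` was set as in Step 1.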

What to do if you lose your notebook data

Your notebook (.ipynb) file is saved in workspace storage (i.e., Google bucket). This means you can rerun the notebook to regenerate any output data (you will pay for this, of course). 
