This document outlines the updated functionality of Jupyter Notebooks on Terra with the addition of a “persistent disk” to your cloud environment. As of September 2020, users are advised to save their data in the directory /home/jupyter-user/notebooks, where the persistent disk is mounted. You can browse the directory structure in the terminal view of your cloud environment.
What is "persistent disk" and how does it work?
Why is the persistent disk so helpful?
What situations benefit from the persistent disk?
How do I use persistent disk?
A note about auto-syncing behavior
Users working in their own Terra workspaces have access to two dedicated storage locations generated on their billing project: the workspace bucket and the virtual machine (VM) backing their cloud environment. This article describes a change that streamlines your work by introducing persistence to the VM’s storage. The VM’s storage is now “detachable” and “persistent”, meaning that it can be detached from a VM prior to its deletion so that it may persist and be reattached to a newly created VM. As a result, users get to keep the portion of their VM’s storage dedicated to storing the packages their notebook code is built upon, the input files necessary for their analysis, and the outputs they’ve generated (without having to remember to move anything).
What is “persistent disk” and how does it work?
When you create a cloud environment using the standard VM option, you automatically get a persistent disk attached. The “disk” in persistent disk refers specifically to the VM storage, which users can access by launching a Cloud Environment, and then opening a terminal view into the VM by clicking on the terminal icon on the Cloud Environment button:
Within the terminal, users can navigate their virtual machine’s file structure using bash commands, just as they would if they were working on a local machine. Until now, the storage accessed this way was not persistent: any time you had to delete your VM for any reason, anything stored here would be lost.
With this change, the VM’s structure now includes a persistent storage location. This disk is mounted to the directory /home/jupyter-user/notebooks, so remember that anything you want to persist has to be saved there. Anything saved outside of this directory is not saved to the persistent disk, and will still be lost on deletion.
When updating or replacing a VM, you’ll now also be prompted to select whether or not to persist your disk.
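As a quick illustration of the rule above, the following minimal sketch checks whether a given file path falls under the persistent disk's mount directory. The helper name `is_on_persistent_disk` is hypothetical and not part of Terra; only the mount path comes from this article. The check is purely lexical, so it does not look at the file system or resolve symlinks.

```python
from pathlib import PurePosixPath

# Mount point of the persistent disk, as stated in this article.
PERSISTENT_ROOT = PurePosixPath("/home/jupyter-user/notebooks")


def is_on_persistent_disk(path, root=PERSISTENT_ROOT):
    """Return True if an absolute POSIX path lies at or under the mount point.

    Hypothetical helper. The comparison is lexical only: it does not touch
    the file system and does not resolve symlinks or ".." components.
    """
    candidate = PurePosixPath(path)
    return candidate == root or root in candidate.parents


# A file under the mount directory is kept; one outside it is not.
print(is_on_persistent_disk("/home/jupyter-user/notebooks/results/out.tsv"))
print(is_on_persistent_disk("/home/jupyter-user/scratch/out.tsv"))
```

A check like this could be placed at the top of a notebook to catch output paths that would be lost on deletion.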
Why is the persistent disk so helpful?
There are a number of reasons why a VM may be deleted, either at the user’s discretion or out of necessity. Some scenarios that require environment deletion:
- If the user decides to make certain changes to their environment (e.g., selecting a different image), they'll delete the old VM before recreating the updated one.
- The VM enters an error state or becomes unresponsive.
- Our Notebooks Best Practices guidelines suggest regularly deleting and recreating VMs in order to run with all of the latest updates described in our release notes.
- In some cases, VMs are automatically deleted every two weeks to ensure they have the latest updates.
In the past, this often meant tedious reinitialization for notebooks that either required time-consuming package installation or needed certain files copied to the VM to use as inputs for the notebook’s analysis. Even worse, if a user forgot or didn’t realize they were deleting the location where they’d stored their outputs, their results would be lost, and the work would have to be repeated from scratch.
The persistent disk allows users to keep important analysis data (installed packages, input data, and output data) across environment deletion and re-creation, removing the tedium of setting up some types of notebook every time, and making the experience more analogous to working on a local machine.
What situations benefit from the persistent disk?
Notebook users must manage a few types of data that can potentially be lost in the process of VM deletion/recreation:
- Input data (e.g. genomics files, tabular data)
- Output data files (e.g. tabular data)
- Figures (e.g. PDFs, PNGs, JPEGs)
- Packages installed on the VM
There are generally three scenarios where the utility of persistence is especially obvious:
- When running an analysis that requires a very lengthy initialization (e.g. package installation).
- When running a notebook that expects a certain input be present on the VM. For example, this Encode tutorial downloads results created by one of its workflows into the notebook VM.
- Whenever it’s desirable to be able to hold on to your outputs/results, either because that’s where you want to keep them organized, or because some outputs plug back into other parts of your analysis.
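For the output and figure cases listed above, a notebook can route everything it writes through one small helper so that results land on the persistent disk by default. This is a minimal sketch, not a Terra feature: the function name `persistent_output_path` and the `outputs` subdirectory are assumptions chosen for illustration, and only the mount path comes from this article.

```python
from pathlib import Path

# Mount point from this article; pass a different `root` when testing elsewhere.
PERSISTENT_ROOT = Path("/home/jupyter-user/notebooks")


def persistent_output_path(filename, subdir="outputs", root=PERSISTENT_ROOT):
    """Return a path for `filename` under root/subdir, creating the directory.

    Hypothetical helper: `subdir` defaults to "outputs", an arbitrary name.
    """
    out_dir = Path(root) / subdir
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename


# Example use inside a notebook (left commented so nothing is written here):
# results_df.to_csv(persistent_output_path("summary.tsv"), sep="\t")
# fig.savefig(persistent_output_path("qc_plot.png"))
```

Keeping the destination in one place makes it harder to accidentally write results to a directory that is lost when the VM is deleted.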
How do I use persistent disk?
When you click on the Cloud Environment button, you see this window, outlining the various options for configuring your environment. At the bottom of this window, you'll see a box for entering the size of your persistent disk.
If you modify the configuration of an environment you've been using, you'll see the "Update" button activate.
Clicking this button will show you this message, letting you know that your work will be preserved through deletion and recreation:
On the other hand, you can click "Delete Environment Options" to see the options shown below. If you don't want to save the contents of your detachable persistent disk, you can select the "Delete everything, including persistent disk" option. If you select this, just make sure you've moved anything you do wish to keep from the VM backing the cloud environment to another location, such as your workspace bucket. Selecting the default option, "Keep persistent disk, delete application and compute profile", will delete the VM after detaching the persistent disk. This disk will be automatically reattached the next time you spin up a cloud environment, assuming you select the standard VM.
Clicking “Delete” here will result in the following window, where you can select a configuration before creating a new VM. If you choose the standard VM, it will automatically reattach the saved disk. If you choose a Spark mode (clicking the “Customize” button shown below will show additional options), this storage will NOT reattach to that cloud environment, because Spark and Hail application configurations don't support the persistent disk feature. The disk will instead be saved until the next time you choose the standard VM option and click “Create”.
You can also click “Delete Persistent Disk” at this stage, if you’ve changed your mind about saving whatever is stored there. If you do that, you’ll see a menu similar to the previous one, but with only the option of deleting the persistent disk. Note that you can't delete a persistent disk that's attached to a cloud environment without first deleting that environment. If you go to your Cloud Environments page under your profile (through the hamburger menu), you'll see separate items for the Environment application itself and the detachable persistent disk. You can delete either of these from this location, but again, the option to delete the detachable disk will be deactivated until you've detached the disk by deleting the Environment first.
If you need to identify your persistent disk in the Google Cloud Console, you can click on the "details" button for the persistent disk to see its name.
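Before choosing to delete everything, files can be copied from the persistent disk to the workspace bucket. The sketch below only builds and prints a `gsutil` copy command rather than running it. It makes several assumptions: `gsutil` is available in the cloud environment terminal, the bucket URL is exposed through a `WORKSPACE_BUCKET` environment variable, and the `outputs` source directory and `persistent-disk-backup` destination prefix are placeholder names chosen for illustration.

```python
import os
import shlex


def build_backup_command(source_dir, bucket, prefix="persistent-disk-backup"):
    """Build (but do not run) a gsutil command copying source_dir to the bucket.

    Hypothetical helper; `prefix` is an arbitrary folder name inside the bucket.
    """
    destination = f"{bucket.rstrip('/')}/{prefix}/"
    # -m parallelizes the transfer, -r copies the directory recursively.
    return ["gsutil", "-m", "cp", "-r", source_dir, destination]


# Assumption: WORKSPACE_BUCKET holds the gs:// URL of the workspace bucket.
# The fallback is a placeholder, not a real bucket.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://your-workspace-bucket")
command = build_backup_command("/home/jupyter-user/notebooks/outputs", bucket)
print(shlex.join(command))
```

After reviewing the printed command, it could be run from the terminal, or from Python with `subprocess.run(command, check=True)`.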
A note about auto-syncing behavior
You should be aware that the feature that enables Terra to frequently auto-save your notebook back to the workspace may also affect files that you store on the VM's persistent disk. When you use a notebook in a Terra workspace, the VM creates subdirectories named after the workspace in the /notebooks/ location, and Terra's auto-syncing feature regularly interacts with the notebooks in these subdirectories.
If you're storing anything on the VM's persistent disk that you don't want to be affected by the auto-syncing behavior (for example, notebooks that you would like to keep private), we recommend keeping these files in a specifically named subdirectory under /notebooks/ that is not named after a workspace, such as /notebooks/no-sync/, to avoid problems.
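The recommendation above can be expressed as a simple path check. This is a minimal sketch under two assumptions: that the /notebooks/ location refers to the mount directory /home/jupyter-user/notebooks, and that you supply the names of your own workspaces. The helper `is_outside_workspace_dirs` is hypothetical, and it says nothing about how Terra itself decides which files to sync.

```python
from pathlib import PurePosixPath

# Assumed to be the /notebooks/ location referred to in this article.
NOTEBOOKS_ROOT = PurePosixPath("/home/jupyter-user/notebooks")


def is_outside_workspace_dirs(path, workspace_names, root=NOTEBOOKS_ROOT):
    """Return True if path is under root but not inside a workspace-named subdirectory.

    Hypothetical helper; `workspace_names` is a list you provide yourself.
    """
    candidate = PurePosixPath(path)
    try:
        relative = candidate.relative_to(root)
    except ValueError:
        return False  # not under the notebooks directory at all
    if not relative.parts:
        return False  # the root itself, not something stored inside it
    return relative.parts[0] not in set(workspace_names)


workspaces = ["my-workspace"]  # placeholder workspace name
print(is_outside_workspace_dirs(
    "/home/jupyter-user/notebooks/no-sync/private.ipynb", workspaces))
print(is_outside_workspace_dirs(
    "/home/jupyter-user/notebooks/my-workspace/analysis.ipynb", workspaces))
```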