Terra attaches a persistent disk (PD) to your cloud compute where you can store data (such as generated data and installed libraries) even if you delete or update your cloud environment VM. PDs also act as a safeguard to protect your data in case something goes wrong with the VM.
A minimal cost per hour is associated with maintaining the disk even when the cloud compute is paused or deleted.
If you delete your cloud compute, but keep your PD, the PD will be reattached when creating the next cloud compute.
This document outlines the updated functionality of Jupyter Notebooks on Terra following the addition of a “persistent disk” to your cloud environment (September 2020). Save your data in the directory /home/jupyter-user/notebooks, where the disk is mounted. You can browse the directory structure in the cloud environment terminal.
What is "persistent disk" and how does it work?
Why is the persistent disk so helpful?
What situations benefit from the persistent disk?
How do I use persistent disk?
A note about auto-syncing behavior
Terra workspaces have two dedicated storage locations: the workspace bucket and the storage on the virtual machine (VM) backing your cloud environment. This article describes a change that streamlines your work by introducing persistence to the VM storage. The VM storage is “detachable”, meaning it can be detached from a VM before the VM is deleted, and “persistent”, meaning it exists independently of the VM and can be reattached to a new one. The advantage is that you can keep the portion of the VM dedicated to storing the packages your notebook code is built upon, the input files necessary for your analysis, and the outputs you’ve generated, without having to remember to move anything to the workspace bucket.
What is “persistent disk” and how does it work?
When you create a cloud environment using the standard VM option, you automatically get a persistent disk attached. The “disk” in persistent disk refers specifically to the VM storage, which users can access by launching a Cloud Environment, and then opening a terminal view into the VM by clicking on the terminal icon on the Cloud Environment button:
Within the terminal, users can navigate their virtual machine’s file structure using bash commands, just as they would on a local machine. Until now, the storage accessed this way was not persistent: any time you had to delete your cluster for any reason, anything stored here would be lost.
With this change, the VM’s structure now includes a persistent storage location. This disk is mounted to the directory /home/jupyter-user/notebooks, so anything you want to persist has to be saved there. Anything saved outside of this directory is not saved to the persistent disk and will still be lost on deletion.
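As a quick illustration of the rule above, the following minimal sketch checks whether a file path falls under the persistent disk mount point. The mount point comes from this article; the function name and the example paths are hypothetical. The check is purely lexical: it assumes normalized absolute paths and does not resolve symlinks or ".." segments.

```python
from pathlib import PurePosixPath

# Mount point stated in this article; everything else here is illustrative.
PERSISTENT_DISK_MOUNT = PurePosixPath("/home/jupyter-user/notebooks")


def is_on_persistent_disk(path: str) -> bool:
    """Return True if a normalized absolute path lies under the persistent disk mount."""
    candidate = PurePosixPath(path)
    return candidate == PERSISTENT_DISK_MOUNT or PERSISTENT_DISK_MOUNT in candidate.parents


# Hypothetical example paths.
print(is_on_persistent_disk("/home/jupyter-user/notebooks/my-workspace/results.tsv"))  # True
print(is_on_persistent_disk("/home/jupyter-user/results.tsv"))  # False
print(is_on_persistent_disk("/tmp/scratch.txt"))  # False
```

A check like this could run before writing large outputs, so that files meant to survive environment deletion do not end up outside the mounted directory.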
When updating or replacing a VM, you’ll now also be prompted to select whether or not to persist your disk.
Why is the persistent disk so helpful?
A cloud environment may be deleted for a number of reasons, either at the user’s discretion or because the user is forced to do so. Some scenarios that involve environment deletion:
- If the user decides to make certain changes to their environment (e.g. changing the cloud compute profile), they'll delete the old cloud environment before recreating the updated one.
- The cloud environment enters an error state or becomes unresponsive.
- Our Notebooks Best Practices guidelines suggest regularly deleting and recreating VMs in order to run with all of the latest updates described in our release notes.
- In some cases, cloud environments are automatically deleted every two weeks to ensure they have the latest updates.
In the past, this often meant tedious reinitialization for notebooks that either required time-consuming package installation or needed certain files copied to the cloud environment for use as inputs to an interactive analysis. Even worse, if users forgot or didn’t realize they were deleting the location where they’d stored their outputs, their results would be lost and the work would have to be repeated from scratch.
The persistent disk allows users to keep important analysis data (installed packages, input data, and output data) across environment deletion and re-creation, removing the tedium of setting up some types of notebooks every time and making the experience more analogous to working on a local machine.
What situations benefit from the persistent disk?
Notebook users must manage a few types of data that can potentially be lost in the process of VM deletion/recreation:
- Input data (e.g. genomics files, tabular data)
- Output data files (e.g. tabular data)
- Figures (e.g. PDFs, PNGs, JPEGs)
- Packages installed on the cloud environment
There are three scenarios where persistence is especially useful:
- When running an analysis that requires a very lengthy initialization (e.g. package installation).
- When running a notebook that expects a certain input to be present on the cloud environment. For example, this Encode tutorial downloads results created by one of its workflows into the notebook cloud environment.
- Whenever it’s desirable to hold on to your outputs or results, either because that’s where you want to keep them organized or because some outputs plug back into other parts of your analysis.
How do I use persistent disk?
When you click on the Cloud Environment button, you see this window, outlining the various options for configuring your environment. At the bottom is a box for entering the size of your persistent disk.
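When choosing a disk size, it can help to know how much space you are currently using. The following is a minimal sketch, not part of Terra itself, that reports usage for the filesystem holding a given path using Python's standard library. The function name is hypothetical; in a Terra notebook you would pass the mount point described in this article.

```python
import shutil


def disk_usage_gib(path: str) -> dict:
    """Return total, used, and free space in GiB (2**30 bytes) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": round(usage.total / gib, 2),
        "used_gib": round(usage.used / gib, 2),
        "free_gib": round(usage.free / gib, 2),
    }


# In a Terra notebook you would pass the mount point from this article,
# "/home/jupyter-user/notebooks"; "." is used here so the sketch runs anywhere.
print(disk_usage_gib("."))
```

Comparing the used space against a smaller target size before updating is one way to avoid surprises when resizing.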
If you modify the configuration of an environment you've been using, you'll see the "Update" button activate.
Clicking this button will show you this message, letting you know that your work will be preserved through deletion and recreation:
Important! Please note that decreasing the size of your persistent disk will cause active R code and any files on the PD to be removed, meaning that you could lose things you're working on if you choose to decrease the PD size in the middle of an analysis. Updating the PD with a smaller disk size will trigger a warning message to this effect:
On the other hand, you can click "Delete Environment Options" to see the options shown below. If you don't want to save the contents of your detachable persistent disk, you can select "Delete everything, including persistent disk." If you select this, make sure you've first moved anything you wish to keep from the VM backing the cloud environment to another location, such as your workspace bucket. Selecting the default option, "Keep persistent disk, delete application and compute profile", will delete the VM after detaching the persistent disk. This disk will be automatically reattached the next time you spin up a cloud environment, assuming you select the standard VM.
Clicking “Delete” here will bring up the following window, where you can select a configuration before creating a new VM. If you choose the standard VM, it will automatically reattach the saved disk. If you choose a Spark mode (clicking the “Customize” button shown below will show additional options), this storage will NOT reattach to that cloud environment, because Spark and Hail application configurations don't support the persistent disk feature. The disk will instead be saved until the next time you choose the standard VM option and click “Create”.
You can also click “Delete Persistent Disk” at this stage if you’ve changed your mind about saving whatever is stored there. If you do, you’ll see a menu similar to the previous one, but with only the option of deleting the persistent disk. Note that you can't delete a persistent disk that's attached to a cloud environment without first deleting that environment. If you go to your Cloud Environments page under your profile (through the hamburger menu), you'll see separate items for the Environment application itself and the detachable persistent disk. You can delete either of these from this location, but again, the option to delete the detachable disk will be deactivated until you've detached the disk by deleting the Environment first.
If you need to identify your persistent disk in the Google Cloud Console, you can click on the "details" button for the persistent disk to see the name of your persistent disk.
In some cases, you may want to copy data from your interactive cloud environment to another location, to keep from losing work while deleting or modifying your persistent disk. For detailed instructions on copying files from your interactive environment to your workspace bucket, see this article.
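As a rough sketch of what such a copy can look like from inside a notebook, the following builds a `gsutil cp -r` command for a local directory. It assumes that the workspace bucket URL is exposed through a `WORKSPACE_BUCKET` environment variable and that `gsutil` is installed in the environment; the helper name, the example paths, and the fallback bucket are hypothetical placeholders. Follow the linked article for the authoritative steps.

```python
import os
import shlex


def build_copy_command(local_path: str, bucket: str, dest_prefix: str = "") -> list:
    """Build a `gsutil cp -r` command that copies a local file or directory into a bucket."""
    destination = f"{bucket.rstrip('/')}/{dest_prefix.strip('/')}".rstrip("/") + "/"
    return ["gsutil", "cp", "-r", local_path, destination]


# Assumption: the workspace bucket URL is available in WORKSPACE_BUCKET.
# The fallback below is a made-up placeholder, not a real bucket.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")
command = build_copy_command(
    "/home/jupyter-user/notebooks/my-workspace/outputs", bucket, "backups"
)
print(shlex.join(command))
# To actually run the copy: subprocess.run(command, check=True)
```

Building the command as a list and printing it first lets you confirm the source and destination before anything is copied.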
A note about auto-syncing behavior
You should be aware that the feature that enables Terra to frequently auto-save your notebook back to the workspace may also affect files that you store on the VM's persistent disk. When you use a notebook in a Terra workspace, the VM creates subdirectories named after the workspace in the /notebooks/ location, and Terra's auto-syncing feature regularly interacts with the notebooks in these subdirectories.
If you're storing anything on the VM's persistent disk that you don't want to be affected by the auto-syncing behavior (for example, notebooks that you would like to keep private), we recommend keeping these files in a subdirectory under /notebooks/ that is not named after a workspace, such as /notebooks/no-sync/.
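The recommendation above can be sketched as a small helper that creates such a subdirectory and refuses any name that collides with a workspace name. The function name, its parameters, and the workspace name used below are hypothetical; the example runs against a temporary directory so it can be tried anywhere, whereas on Terra the root would be the notebooks directory described in this article.

```python
import tempfile
from pathlib import Path


def make_no_sync_dir(notebooks_root: str, workspace_names: set, name: str = "no-sync") -> Path:
    """Create a subdirectory under the notebooks root that is not named after a workspace."""
    if name in workspace_names:
        raise ValueError(f"'{name}' matches a workspace name; pick a different directory name")
    target = Path(notebooks_root) / name
    target.mkdir(parents=True, exist_ok=True)
    return target


# Demonstration against a temporary directory standing in for the notebooks root.
with tempfile.TemporaryDirectory() as root:
    created = make_no_sync_dir(root, {"my-workspace"})
    print(created.name)  # no-sync
```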