Overview: Cloud environment storage (detachable persistent disks)

Terra attaches a persistent disk (PD) to your Cloud Environment virtual machine (VM). Data stored on the PD, including generated data and installed libraries, is preserved even if you delete or update your Cloud Environment.

Persistent Disks incur costs
Maintaining the PD costs money even when the Cloud Environment is paused or deleted. The monthly cost is shown in the user interface in the Cloud Environments section under your profile, as well as in the Cloud Environment Setup pane when you're customizing your environment. The default PD costs $2 per month.

Always save important data to workspace storage (Google Bucket)!
Think of Persistent Disk storage as temporary: a place for data while you are working in a Notebook or other interactive analysis app. If something happens to your Cloud Environment (for example, it enters an error state or is accidentally deleted), it is usually not possible to recover the data. For this reason, we strongly recommend copying any data you want to keep to workspace storage. See Saving data from an interactive analysis to workspace storage for step-by-step instructions.

Terra storage overview

Terra workspaces have two dedicated storage locations: the workspace bucket and the Cloud Environment virtual machine (VM) Persistent Disk (PD). Like a USB drive, the PD can be detached from the VM before deleting or recreating the Cloud Environment, then attached to a new one. The PD lets you keep your notebook's code packages, input files, and outputs - without having to move anything to workspace storage (i.e., Google bucket).

Collaborators cannot access your persistent disk
The PD is unique to your Cloud Environment, and your Cloud Environment is unique to you. To share your data with colleagues who have access to your workspace, or to use the data as input to a workflow, you need to copy the data from the PD to the workspace's storage bucket.
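
As a rough illustration of copying data from the PD to the workspace bucket, the sketch below wraps gsutil in a small helper. It assumes the bucket URL is available in a WORKSPACE_BUCKET environment variable and that gsutil is installed on the VM; the function name save_to_bucket, the DRY_RUN switch, and the pd-backup destination folder are hypothetical names chosen for this example.

```shell
# Hypothetical helper: copy a file or directory from the PD to the workspace bucket.
# Assumes WORKSPACE_BUCKET holds a gs:// URL and gsutil is on the PATH.
save_to_bucket() {
  src="$1"
  dest_prefix="${2:-pd-backup}"   # hypothetical folder name inside the bucket

  if [ -z "${WORKSPACE_BUCKET:-}" ]; then
    echo "WORKSPACE_BUCKET is not set" >&2
    return 1
  fi
  if [ ! -e "$src" ]; then
    echo "No such file or directory: $src" >&2
    return 1
  fi

  # With DRY_RUN set, print the gsutil command instead of running it.
  ${DRY_RUN:+echo} gsutil -m cp -r "$src" "${WORKSPACE_BUCKET%/}/${dest_prefix}/"
}
```

From a VM terminal, a call such as save_to_bucket /home/jupyter/my-results (an illustrative path) would copy that directory into the bucket; setting DRY_RUN=1 first shows the command without copying anything.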

What is a persistent disk and how does it work?

The default Cloud Environment virtual machine automatically comes with 50GB of storage (the persistent disk) attached.
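
The two figures quoted in this article (a 50 GB default disk at $2 per month) imply roughly $0.04 per GB per month. The sketch below uses that rate for a back-of-the-envelope estimate. It assumes cost scales linearly with disk size, which may not hold for every disk type or region; the cost shown in the Terra user interface is the authoritative number. The function name estimate_pd_cost is hypothetical.

```shell
# Hypothetical helper: estimate the monthly PD cost in USD.
# Assumption: linear pricing at the rate implied by the default disk
# ($2 / 50 GB = $0.04 per GB per month). Pass a second argument to override the rate.
estimate_pd_cost() {
  size_gb="$1"
  rate="${2:-0.04}"
  awk -v s="$size_gb" -v r="$rate" 'BEGIN { printf "%.2f\n", s * r }'
}

estimate_pd_cost 50    # prints 2.00, matching the default PD cost
```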

You can access files on the PD by launching a Jupyter Cloud Environment and opening a VM terminal. To launch a terminal, click the terminal icon in the right sidebar of any workspace page while a Jupyter Cloud Environment is running.

Note: The terminal will open in a new browser tab.

Types of persistent disks
Standard: standard hard disk drives, the least expensive option.
Solid State Drive: solid state drives (SSD), which are more expensive but faster and more power-efficient than standard hard disk drives.
Balanced: SSD-backed disks that sit between Standard and Solid State Drive disks in performance and cost. They are an alternative to SSD persistent disks when cost matters.

The PD file directory

From a VM terminal in your Terra workspace, you can navigate the PD’s file structure using bash commands, just as you would on a local machine. Any files you want to save (or "persist") must be saved to the directory where the PD is mounted. Anything saved outside this directory is not saved to the persistent disk and will be lost when the Cloud Environment is deleted.

RStudio PDs are mounted to the directory /home/rstudio
Jupyter PDs are mounted to /home/jupyter
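
Because only files under the mount directory are kept, it can be useful to check a path before relying on it. The following is a minimal sketch, assuming an absolute path is given and the mount points listed above apply; the function name is_on_pd and the PD_MOUNT variable are hypothetical, and the check is a plain string comparison that does not resolve symlinks or relative paths.

```shell
# PD_MOUNT defaults to the Jupyter mount point; set it to /home/rstudio for RStudio.
PD_MOUNT="${PD_MOUNT:-/home/jupyter}"

# Hypothetical helper: succeed only if the absolute path is the PD mount
# directory itself or lies somewhere beneath it.
is_on_pd() {
  case "$1" in
    "$PD_MOUNT"|"$PD_MOUNT"/*) return 0 ;;
    *) return 1 ;;
  esac
}

if is_on_pd /tmp/scratch.csv; then
  echo "saved on the persistent disk"
else
  echo "NOT on the persistent disk; will be lost on delete"
fi
```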

To find the name of the mount point for your Jupyter Cloud Environment PD, run lsblk from the VM terminal. This will return the MOUNTPOINT for any associated disks.
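
To narrow the lsblk listing down to devices that are actually mounted, one option is to filter its raw output. This is a sketch only: the function name mounted_devices is hypothetical, and it expects input in the two-column form produced by lsblk -rno NAME,MOUNTPOINT.

```shell
# Hypothetical helper: read "NAME MOUNTPOINT" lines on stdin and print only
# the devices that have a mount point.
mounted_devices() {
  awk 'NF >= 2 { print $1 " -> " $2 }'
}

# Usage on the VM (not run here):
#   lsblk -rno NAME,MOUNTPOINT | mounted_devices
```

The line whose mount point is /home/jupyter (or /home/rstudio) corresponds to the persistent disk.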

Persistent disks save time and reduce errors

The persistent storage option saves time because you don't have to reinitialize the Cloud Environment each time you use an app that requires a long time to install packages or load input files.

Preserving your persistent disk also safeguards your data after you delete or update your Cloud Environment. These files might include:

Input data (e.g., genomics files, tabular data)
Generated data files (from an interactive analysis)
Figures (e.g., PDFs, PNGs, JPEGs)
Packages installed on the Cloud Environment

When is persistence especially useful?
- When running an analysis that requires a very lengthy initialization (e.g., package installation).
- When running a notebook that expects a certain input from the Cloud Environment. For example, this Encode tutorial downloads results created by one of its workflows into the Jupyter Cloud Environment for further analysis.
- Whenever you want to save your outputs/results, either to keep them organized or because some outputs are used in other parts of your analysis.

Scenarios when you might need to delete or re-create the Cloud Environment:
Persistence is also useful when you need to delete or re-create the Cloud Environment but want to keep the progress on your analysis. Deleting or re-creating the Cloud Environment is necessary in the following situations:
- To make certain types of changes to the Cloud Environment. Some changes to the compute profile (e.g., adding GPUs) cannot be applied to an existing VM. If you want to use the data in your PD with such a configuration, you will need to delete the existing VM and create a new one.
- To run your notebook with the latest updates described in our release notes. Our Notebooks Best Practices guidelines suggest recreating Cloud Environments regularly for this reason.
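
Before deleting or re-creating a Cloud Environment, it can help to review what is currently on the PD so that nothing important is overlooked. The sketch below lists the largest top-level items in a directory; the function name largest_items is hypothetical, and the default directory assumes the Jupyter mount point.

```shell
# Hypothetical helper: list the largest top-level items in a directory,
# biggest first. Sizes are in kilobytes (du -k). Hidden files are skipped
# because the * glob does not match them.
largest_items() {
  dir="${1:-/home/jupyter}"
  count="${2:-10}"
  du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

Running largest_items with no arguments from a VM terminal shows up to ten entries under /home/jupyter; anything in that list you want to keep beyond the life of the disk should be copied to workspace storage first.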

Comments

Nicole Deflaux
- Edited January 21, 2021 19:59
Sometimes `pip install --upgrade <pkg>` does not work successfully and people need to troubleshoot.

Now that package installs are written to the Terra detachable persistent disk, one approach is to delete and recreate that disk to troubleshoot. However, it's easy to forget that you have important files on that disk, delete it during a troubleshooting session, and then regret the deletion.

An alternative is to troubleshoot by starting from an empty `packages/` directory. For example, open a terminal and then run the following commands to move all currently installed packages to another directory so that they are no longer visible to pip, Python, and Jupyter:
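```
cd "$HOME/notebooks"

# Name the stash directory after today's date (YYYYMMDD).
export PKG_STASH_DIR="packages-as-of-$(date +"%Y%m%d")"

# -p avoids an error if the stash directory already exists.
mkdir -p "$PKG_STASH_DIR"

# Move all the currently installed packages out of the existing destination directory for package installations.
mv packages/* "$PKG_STASH_DIR"/
```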
Now you can retry the `pip` commands to install the packages again!
Tiffany Miller
- August 25, 2021 20:40
Note that RStudio Rmd files do not auto-sync to the Google bucket. This feature should be released by the end of 2021.
