Terra provides infrastructure for writing and running interactive analyses with Jupyter Notebooks, which are files that contain analysis code and embedded documentation. This article aims to strengthen your interactive analysis work by giving you a deeper understanding of the key components (e.g., Billing Projects) of Terra's notebook environment.
A second article addresses key operations and how they impact your work.
Content for this article was contributed by Matt Bookman from Verily Life Sciences based on work done in Terra for AMP PD, a public/private partnership collaborating toward biomarker discovery to advance the development of Parkinson’s Disease therapies.
Glossary of Key Components
This section defines terms and explains where notebook files and their output live, both when you are working with the files and when you are not.
- Jupyter kernel
- Notebook service (Leonardo)
- Compute Engine VM
- Notebook runtime environment cluster boot disk (sometimes called "persistent disk")
- Docker containers
- Cloud Storage
- Dataproc Cluster
The Jupyter kernel is the computer program that runs while you have a notebook open. The kernel process maintains the runtime state of the Jupyter notebook.
Terra supports R and Python kernels. When cells in the notebook are executed, they are interpreted by this language-specific kernel. You select the Jupyter kernel when you create a notebook.
Note about accessing analysis output: Until you save a notebook, the output is stored in the kernel only and will be deleted if you stop the kernel before saving.
The Notebook Service manages the compute environment you use to edit and run your notebook. In Terra, the notebook service is "Leonardo," and the two terms are often used interchangeably.
Notebook Runtime Environment (aka "Cluster"; aka "Compute Engine VM")
When you interact with your notebook in a web browser on your own computer, the characters you type and code you execute are all sent to the Jupyter kernel process running on a Google Compute Engine virtual machine (VM) or runtime environment. Much of the discussion in this document involves understanding the Compute Engine VM as a host for your notebooks.
In the rest of this article, your "notebook VM" refers to the Compute Engine VM that hosts your notebooks.
When you create your notebook environment, by default you create a single VM. However, the Terra environment supports more powerful clusters of VMs using Google Cloud Dataproc. Use of a VM cluster is an advanced topic that is outside of the scope of this document.
Note about accessing analysis output: Even if you save a notebook, the output, which is attached to the VM or runtime environment, will be lost if you delete or reconfigure the VM. See this article for information on how to save output more permanently to a Google bucket.
Cluster boot disk (aka "Persistent Disk")
A virtual machine needs a disk for storing data files, the operating system, or other software. The name of Google Compute Engine's block storage is Persistent Disk. It is called "persistent" because the disk itself can persist even when the VM to which it is attached is stopped or paused. However, information on the boot disk is lost if you delete or update the runtime environment. For this reason, we try to avoid the (somewhat misleading) term "persistent disk."
Docker is a (branded) container technology for packaging software for rapid deployment onto a machine. A Docker container is like a sandboxed virtual machine that exists wholly inside the Compute Engine Virtual Machine. Software and tools for the Terra notebook environment are packaged together and can be deployed on a compute environment in a Docker container.
The ability to create custom Docker images is not yet a feature on Terra: all notebooks run in a default Docker container configured for genomics computation.
Every Terra Workspace has an associated Google Cloud Storage bucket for long-term storage of notebooks and other files. Notebook files (and only notebook files) are automatically saved to your workspace bucket (see section on Saving Notebooks below). You can save other files (such as output files from batch processing) to your workspace bucket manually (see this article for more details on how to do this).
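As a minimal sketch of the manual save described above, the Python snippet below builds and runs a `gsutil cp` command that copies a file from the notebook VM's disk into the workspace bucket. It assumes that the runtime exposes the bucket path in an environment variable named `WORKSPACE_BUCKET` and that `gsutil` is installed on the VM; the helper names and the `notebook-outputs` folder are hypothetical and chosen only for illustration.

```python
import os
import posixpath
import subprocess


def build_copy_command(local_path, bucket, prefix="notebook-outputs"):
    """Return the gsutil argument list that copies local_path into the bucket.

    `prefix` is a hypothetical folder name inside the bucket.
    """
    bucket = bucket.rstrip("/")
    if not bucket.startswith("gs://"):
        bucket = "gs://" + bucket
    destination = posixpath.join(bucket, prefix, os.path.basename(local_path))
    return ["gsutil", "cp", local_path, destination]


def copy_to_workspace_bucket(local_path, prefix="notebook-outputs"):
    """Copy a file from the notebook VM's disk to the workspace bucket."""
    # Assumption: the notebook runtime sets WORKSPACE_BUCKET (e.g. "gs://...").
    bucket = os.environ["WORKSPACE_BUCKET"]
    # check=True raises CalledProcessError if gsutil reports a failure.
    subprocess.run(build_copy_command(local_path, bucket, prefix), check=True)
```

Calling `copy_to_workspace_bucket("results.tsv")` from a notebook cell would then place the file under the `notebook-outputs` folder of the workspace bucket, where it survives deletion or reconfiguration of the VM.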
Notebook runtime environment versus Cloud Storage
The term "persistent disk" is sometimes used to refer to the disk that is associated with the notebook runtime environment. The vocabulary can create confusion: Just how persistent is the disk? Is there something truly special about this persistence? The term "persistent disk" originates from the early days of Google Compute Engine, when the disks associated with VMs lived and died along with the VM. When disks were added that could live whether or not the VM was running, the persistence was particularly noteworthy.
Note that to use a persistent disk, you must still attach it to a running VM. This is different from a Cloud Storage bucket, which exists as a point of storage that can be accessed, using APIs, without a VM.
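To make the contrast concrete, the sketch below builds the Cloud Storage JSON API URL for downloading an object. Any machine with suitable credentials can send an authenticated HTTP GET to such a URL, with no VM attached to the bucket. The bucket and object names are placeholders, and the function name is hypothetical; a real request also needs an OAuth 2.0 bearer token, which is omitted here.

```python
from urllib.parse import quote


def object_media_url(bucket, object_name):
    """Return the JSON API URL that serves an object's contents.

    The object name is percent-encoded as a single path segment, so any
    "/" inside it becomes "%2F".
    """
    return "https://storage.googleapis.com/storage/v1/b/{}/o/{}?alt=media".format(
        quote(bucket, safe=""), quote(object_name, safe="")
    )


# Placeholder names, for illustration only.
print(object_media_url("my-bucket", "notebooks/analysis.ipynb"))
```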
Multiple Billing Projects
Cloud resources, such as Cloud Storage buckets and Compute Engine instances, exist within Google Cloud Projects. Within Terra, these projects are referred to as "Billing Projects".
Notebooks belong to workspaces and workspaces belong to Billing Projects.
Thus if you have workspaces in two different billing projects, and you work in notebooks in those two different billing projects, you will have separate Compute Engine resources (see below):
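The ownership chain can also be sketched as a small data model. The Python below is only an illustration of the relationship stated above, not Terra's implementation: the class and function names are hypothetical, and each key in the grouped result stands for a billing project that would have its own, separate Compute Engine resources.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Notebook:
    name: str
    workspace: str        # the workspace the notebook belongs to
    billing_project: str  # the billing project the workspace belongs to


def notebooks_by_billing_project(notebooks):
    """Group notebook names by the billing project that owns their workspace."""
    grouped = defaultdict(list)
    for nb in notebooks:
        grouped[nb.billing_project].append(nb.name)
    return dict(grouped)


# Hypothetical example: three notebooks across two billing projects.
notebooks = [
    Notebook("qc.ipynb", "workspace-a", "project-1"),
    Notebook("plots.ipynb", "workspace-b", "project-1"),
    Notebook("gwas.ipynb", "workspace-c", "project-2"),
]
grouped = notebooks_by_billing_project(notebooks)
# Two keys, so two separate sets of Compute Engine resources.
```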