Terra's Jupyter Notebooks environment Part II: Key operations
Terra uses a standard Jupyter Notebooks server implementation, so the interface and core capabilities are all basically the same as what you would see in any other setting. As a result, you can take advantage of the wealth of documentation and tutorials available on the Internet for learning how to use the various menu options, widgets and so on that we're not going to cover in detail here.
The one thing that is truly different about how Jupyter Notebooks work in Terra versus typical local installations is how the computing environment is set up. This article walks through what happens when you perform key operations with your notebooks in Terra. Understanding what is happening behind the curtain can help you avoid pitfalls while making your notebook analysis process more efficient.
For more details about key components of a Jupyter notebook, see Part I here.
Content for this article was contributed by Matt Bookman from Verily Life Sciences based on work done in Terra for AMP PD, a public/private partnership collaborating toward biomarker discovery to advance the development of Parkinson’s Disease therapies.
Contents
- Creating a notebook
- Opening a notebook
- Saving a notebook
- Editing notebooks
- Running notebooks
- Installing software and dependencies
- Stopping a runtime
- Deleting a cluster
- Opening a notebook (read-only)
- Collaboration in a shared workspace
- Opening notebook (playground mode)
- Using Notebooks when you have Multiple Billing Projects
Creating a notebook
When you create a notebook, the Notebook Service creates a notebook file (.ipynb) and saves it in your workspace bucket.
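To give a sense of what such a file contains: a notebook file is a JSON document. The sketch below builds a minimal empty notebook by hand, assuming the standard nbformat version 4 layout. The file name is hypothetical, and the file that Terra's Notebook Service actually creates may carry additional metadata (kernel information, for example) that is not shown here.

```python
import json
import tempfile
from pathlib import Path


def empty_notebook():
    """Return a minimal notebook document (assumed nbformat v4 layout)."""
    return {
        "cells": [],        # no code or markdown cells yet
        "metadata": {},     # kernel and language info would normally go here
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def write_notebook(path):
    """Serialize the empty notebook to `path` as JSON and return the path."""
    path = Path(path)
    path.write_text(json.dumps(empty_notebook(), indent=1))
    return path


# Hypothetical file name, written to a temporary directory for illustration.
out = write_notebook(Path(tempfile.gettempdir()) / "example-notebook.ipynb")
print(out.name)
```

Because the file is plain JSON, it can be copied to and from the workspace bucket like any other file.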
Opening a notebook
When you open a notebook, the Notebook Service executes several steps in order:
- Create a Cloud Environment (aka cluster, or notebook VM) if one doesn't already exist for your Cloud project - this includes a Boot Disk and a Detachable Persistent Disk
- Start a Docker container
- Localize the notebook to your VM or cluster
- Open the notebook in Jupyter kernel
1. Create a Cloud Environment
If a cloud environment (the notebook VM) does not already exist for you in your Cloud Project, the notebook service will first create one, along with an associated boot disk that persists until you delete the environment and a Detachable Persistent Disk that can be re-attached to a different Cloud Environment after deleting the existing one.
Note on billing for notebooks: Billing for your Cloud Environment will begin now and will continue until you stop or delete it. Every time you open a notebook, a new Jupyter kernel is created. If you have multiple notebooks open and running they will all consume resources (memory and CPU) on the same Cloud Environment.
Note on billing for Detachable Persistent Disks: When you delete your Cloud Environment, you can choose to keep your Detachable Persistent Disk. If you do, you will incur a charge of $2.00/month (50 GB disk).
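As a rough worked example of the figure above, the quoted price implies a rate of $0.04 per GB per month. The sketch below assumes the cost scales linearly with disk size, which is an assumption rather than something this article states; check current pricing before relying on it.

```python
# Figures quoted in the note above: a 50 GB disk costs $2.00 per month.
QUOTED_MONTHLY_COST_USD = 2.00
QUOTED_DISK_SIZE_GB = 50

# Implied rate, ASSUMING the cost scales linearly with disk size.
RATE_USD_PER_GB_MONTH = QUOTED_MONTHLY_COST_USD / QUOTED_DISK_SIZE_GB


def monthly_disk_cost(size_gb):
    """Estimate the monthly cost in USD of keeping a disk of `size_gb`."""
    return round(size_gb * RATE_USD_PER_GB_MONTH, 2)


print(monthly_disk_cost(50))   # the quoted case
print(monthly_disk_cost(200))  # a larger, hypothetical disk
```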
Note on region/zone of notebook VMs: Notebook VMs are created in one of the us-central1 zones.
Note that your Cloud Environment on Terra is yours and yours alone. No one else can view or access your notebook (a billing project owner can delete it but not open it). The reason for this is security. We store your Google credentials on the Google VM, which cannot be shared with other users:
2. Start Docker container
After setting up the application compute, the Notebook Service will start a Docker container with all the core software that your notebook will run. Because it is a Docker container, not a true VM, you will not be able to do some things (such as running Docker inside the container). Things you create inside your notebook's container will exist as long as you do not delete the detachable persistent disk (or the Cloud Environment, if you don't have a detachable persistent disk). Inside the Docker container, your user id is jupyter-user and your HOME directory (on the detachable persistent disk) is
/home/jupyter-user/notebooks/
3. Localize notebook
A Jupyter extension managed by the Notebook Service copies your notebook from the workspace bucket to the boot disk attached to your Cloud Environment (Notebook VM).
Note that the notebook file is copied to a workspace-specific directory in the HOME directory of the Jupyter user inside the Docker container. Opening notebooks from multiple workspaces within the same Cloud project will result in separate directories inside the HOME directory.
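A minimal sketch of how the localized paths might look, so you know roughly where to find your files from a Jupyter terminal. The HOME directory is the one given in this article; naming the subdirectory after the workspace, and the absence of any intermediate subdirectories, are assumptions made for illustration. The workspace and notebook names are hypothetical.

```python
from pathlib import PurePosixPath

# HOME directory of jupyter-user inside the container, as given in the article.
HOME = PurePosixPath("/home/jupyter-user/notebooks")


def localized_path(workspace_name, notebook_name):
    """Return where a notebook would land after localization.

    ASSUMPTION: the workspace-specific directory is named after the workspace
    and the notebook sits directly inside it.
    """
    return HOME / workspace_name / notebook_name


# Two hypothetical workspaces in the same Cloud project get separate directories.
print(localized_path("workspace-a", "analysis.ipynb"))
print(localized_path("workspace-b", "analysis.ipynb"))
```

Under these assumptions, notebooks with the same name in different workspaces do not collide on disk.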
4. Open notebook in Jupyter kernel
The Notebook Server loads your notebook file from the boot disk and starts the Jupyter kernel.
Saving a notebook
If you open a notebook in "Edit" mode, Terra will autosave every five seconds. When you (or Terra) save a notebook, the current copy (in the Jupyter kernel) is first saved to the notebook file on your boot disk:
Then the file on disk is delocalized (copied) to your workspace bucket by a Jupyter extension:
Editing a notebook
When you edit a notebook, the changes are initially only reflected in the Jupyter kernel process. Changes are not saved to disk or copied to Cloud Storage unless you are in "Edit" mode and you explicitly save them or the Jupyter autosave process kicks in. For Jupyter on Terra, the autosave frequency is every 5 seconds.
While there are unsaved changes, Jupyter displays a notification:
When Terra autosaves changes, you will see this notification:
Running notebook code
When you run a code cell in a notebook, the cell execution creates output associated with that code. The output is only stored in the Jupyter kernel runtime until you save the notebook. If you have a detachable persistent disk, the output is saved there.
Analysis outputs generated in a Jupyter notebook (for example, files written by your code) are not copied to Cloud Storage until you explicitly save them.
For step-by-step instructions on saving outputs to a Google bucket, see this article.
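As a hedged sketch of what copying an output file to a bucket can look like from a notebook cell: the code below assumes that the gsutil command-line tool is available in the notebook environment and that the workspace bucket URL is exposed in a WORKSPACE_BUCKET environment variable. Neither is stated in this article, and the file name, fallback bucket URL and destination folder are hypothetical. It runs in dry-run mode by default, so it only prints the command.

```python
import os
import subprocess


def build_copy_command(local_path, bucket_url, dest_dir="notebook-outputs"):
    """Build a `gsutil cp` command that copies a local file into the bucket."""
    file_name = os.path.basename(local_path)
    destination = f"{bucket_url.rstrip('/')}/{dest_dir}/{file_name}"
    return ["gsutil", "cp", local_path, destination]


def copy_to_bucket(local_path, bucket_url, dry_run=True):
    """Print the copy command (dry run) or execute it with gsutil."""
    command = build_copy_command(local_path, bucket_url)
    if dry_run:
        print(" ".join(command))
    else:
        # Raises CalledProcessError if gsutil exits with a non-zero status.
        subprocess.run(command, check=True)
    return command


# ASSUMPTION: WORKSPACE_BUCKET holds a gs:// URL; fall back to a placeholder.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")
copy_to_bucket("results.csv", bucket)
```

Pass dry_run=False to perform the copy once the printed command looks right.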
Installing software and dependencies
You will likely need to install libraries or tools on your notebook VM to extend the basic functionality of the kernel. Both "pip install" and "install.packages()" place packages in the $HOME/notebooks/packages/ directory of the jupyter-user in the Docker container, which is on the detachable persistent disk. Installing libraries may take a long time the first time you install them, but because they are installed on the detachable persistent disk, they will be available automatically when you rerun a notebook (i.e. after you restart, or even recreate, the runtime).
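Since installed packages can persist between sessions, one way to avoid paying the installation time twice is to install only when a module is missing. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it handles top-level module names only.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if a top-level module can be found by the current kernel."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name, package_name=None):
    """Install a package with pip only if its module is missing.

    Returns True if an installation was performed, False otherwise.
    `package_name` covers cases where the pip name differs from the module name.
    """
    if is_installed(module_name):
        return False
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", package_name or module_name]
    )
    return True


# "json" ships with Python, so nothing is installed here.
print(ensure_installed("json"))
```

Calling pip through sys.executable makes sure the package is installed for the same Python interpreter that runs the notebook kernel.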
Availability of software within the same Terra Billing Project
All of your workspaces within the same Terra Billing Project share the same notebook VM and its available software. No matter whether you install software from a notebook or from a Jupyter terminal, the software is available to all notebooks on the notebook VM across all workspaces in the same Terra Billing Project.
Note that this means if you stop or delete a Cloud Environment in one workspace, it will also affect every other workspace you have under the same Billing Project.
Stopping a Cloud Environment
When you're done working and close the notebook, Terra tells Google Cloud to stop the Cloud Environment but save its state. The saved state includes the state of the Jupyter Notebooks container, with any modifications you may have made (by installing packages, for example), and any files present on its local storage partition. That way, you can resume working at any time with minimal effort: when you reopen the notebook, Terra restarts the VM and restores the notebook runtime to its saved state.
After a period of inactivity, Terra will automatically save the notebook and pause (stop) your notebook runtime, to save you from incurring additional costs. Inactivity includes when your computer goes to sleep.
If your kernel is active, however, Terra will not pause the runtime (to prevent long-running jobs from aborting). Note that autopause will resume after 24 hours, even if your kernel is active.
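The autopause rules described above can be summarized as a small decision function. This is only a sketch of the logic as described in this article, not Terra's implementation: the idle threshold value is hypothetical, and treating the 24-hour limit as "time the kernel has been continuously active" is an assumption about what the limit is measured from.

```python
from datetime import timedelta

# After this long, autopause applies again even with an active kernel.
KERNEL_OVERRIDE_LIMIT = timedelta(hours=24)


def should_autopause(idle_time, idle_threshold, kernel_active,
                     kernel_active_time=timedelta(0)):
    """Decide whether the runtime would be paused under the rules above.

    ASSUMPTION: `kernel_active_time` is how long the kernel has been busy.
    """
    if idle_time < idle_threshold:
        return False  # the user is still considered active
    if kernel_active and kernel_active_time < KERNEL_OVERRIDE_LIMIT:
        return False  # a running kernel holds off the pause, up to the limit
    return True


threshold = timedelta(minutes=30)  # hypothetical inactivity threshold

# Idle user, no running kernel.
print(should_autopause(timedelta(minutes=45), threshold, kernel_active=False))
# Idle user, kernel busy for two hours.
print(should_autopause(timedelta(minutes=45), threshold, True, timedelta(hours=2)))
# Idle user, kernel busy for more than 24 hours.
print(should_autopause(timedelta(hours=30), threshold, True, timedelta(hours=30)))
```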
You can explicitly pause (stop) your notebook by selecting the "Stop Cluster" button in Terra:
What is kept when you pause (stop) a notebook runtime (boot disk, software)
When a notebook runtime is stopped, its Compute Engine VM goes away, but the boot disk does not. When you re-open your notebook, the notebook VM is more quickly created as the disk does not need to be recreated. You do not need to reinstall your software.
What is lost when you stop a cluster (notebook state)
Any running Jupyter kernel processes are gone, so the notebook state is lost. This includes calculated and other variables, including environment variables. To restore notebook state you will need to open the notebook and re-run the relevant cells of the notebook.
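If re-running cells is expensive, one common workaround is to checkpoint important results to a file on disk and load them back in a fresh kernel. The following is a minimal sketch using Python's pickle module; the helper names, file name and variables are hypothetical, and a temporary directory is used only so the example runs anywhere. In practice you would choose a location on a disk that survives a stop. Only load pickle files that you created yourself.

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(path, **variables):
    """Write the given variables to `path` as a pickled dictionary."""
    with open(path, "wb") as handle:
        pickle.dump(variables, handle)


def load_checkpoint(path):
    """Read back the dictionary of variables saved by save_checkpoint."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical checkpoint location and values, for illustration only.
checkpoint = Path(tempfile.gettempdir()) / "analysis-checkpoint.pkl"
save_checkpoint(checkpoint, sample_count=42, mean_depth=31.5)

# In a new kernel, after the runtime restarts:
restored = load_checkpoint(checkpoint)
print(restored["sample_count"], restored["mean_depth"])
```

Note that environment variables are not captured this way and would still need to be set again.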
Deleting a notebook Cloud Environment (also the boot disk)
If you do not need your notebook VM or cluster and want to save on the cost of the boot disk, or if you want to pick up a new feature that requires you to rebuild your notebook VM, you can delete the cluster by clicking on the trash icon at the top right of the screen:
What is deleted along with the runtime (boot disk, installed software, any output not explicitly saved)
- Boot disk
- Installed software
- Any output not explicitly saved to the workspace bucket or other external file
When you recreate your runtime, you will need to reinstall any additional libraries or tools that you had installed previously.
What is kept when deleting the runtime (saved notebook files and data)
You can choose to keep or delete your detachable persistent disk. Your notebooks and any data explicitly saved to your bucket are still in long term storage in the workspace bucket as described in the Save notebook section.
Opening notebook (read-only)
Often you only want to read a notebook, rather than edit or run it. Opening a notebook read-only does not require creation of a VM, so it is much faster than opening a notebook for editing.
In the Notebooks tab, click on the three vertical dots icon for your notebook, and select "Open read-only":
A server-side process will render the notebook file from Cloud Storage and display in your browser:
Collaboration in a shared Terra workspace
Your notebook VM is specific to you. Each individual user will have a separate notebook cluster. As a result, any work they do in the notebook will not affect the state of your own cloud environment.
However, the workspace bucket and the notebooks in the bucket are shared. The system will automatically save any changes collaborators make to the shared document in the workspace, so it's important to set expectations clearly with your collaborators about whether it's okay for them to modify the notebook or whether they should work in a separate copy.
These two conditions also mean that you need to share your workspace in order for a collaborator to be able to see your notebook.
Terra will "lock" the notebook document in the workspace whenever someone is actively working with it, to avoid having multiple people making conflicting changes at the same time. When this happens, your collaborator can open the notebook in the read-only preview mode, or they can open it in a special "playground" mode that allows them to make changes and run code in their own cloud environment, but does not save any changes to the original notebook file. This falls a bit short of the ideal collaborative experience that you could envision based on Google Docs, for example, but it provides a reasonable compromise given the constraints at play. To learn more about "Edit" versus "Playground" modes in Terra, see this article.
Opening notebooks (playground mode)
If you try to open a notebook (in your cluster) while a collaborator in the same workspace already has the same notebook open (in their own cluster), Terra will only allow you to open it in "Playground" mode. While in playground mode, you can run cells, but you cannot save the modified notebook. Any output generated while in playground mode is also not saved.
Using Notebooks when you have Multiple Billing Projects
Cloud resources, such as Cloud Storage buckets and Compute Engine instances, exist within Google Cloud Projects. Within Terra, these are referred to as "Billing Projects".
Notebooks belong to workspaces and workspaces belong to Billing Projects
Thus, if you have workspaces in two different billing projects, and you work in notebooks in those two different billing projects, you will have separate Compute Engine resources.
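The relationship between workspaces, billing projects and Compute Engine resources can be sketched as a small lookup. This is an illustration of the rule described in this article (one notebook VM per user per billing project), not a Terra API; all workspace and billing project names are hypothetical.

```python
def cloud_environments_needed(workspace_to_billing_project, workspaces_used):
    """Return the billing projects in which you would have a notebook VM.

    One Cloud Environment is shared by all of your workspaces in the same
    billing project, so the result contains one entry per distinct project.
    """
    return {workspace_to_billing_project[name] for name in workspaces_used}


# Hypothetical setup: two workspaces share a billing project, one does not.
workspaces = {
    "workspace-a": "billing-project-1",
    "workspace-b": "billing-project-1",
    "workspace-c": "billing-project-2",
}

environments = cloud_environments_needed(workspaces, ["workspace-a", "workspace-b", "workspace-c"])
print(len(environments))
```

In this example, notebooks in workspace-a and workspace-b run on the same VM and share installed software, while notebooks in workspace-c run on a separate VM that is billed separately.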
Comments
Thon de Boer If the kernel is active, the Autopause function gets overridden. So the expected behavior is that notebooks with running kernels will persist, even if the browser window is idle or closed. See https://support.terra.bio/hc/en-us/articles/360029761352 .
What happens if you navigate away from the notebook page with any long running kernels? I think it simply kills the kernel running the notebook, no? So, I need to keep the window open as long as the kernel is running?