This article walks through what happens behind the scenes when you perform key operations with your notebooks. Understanding these mechanics can make your work more efficient and help you avoid pitfalls (such as overwriting a collaborator's work in a notebook!).
Content for this article was contributed by Matt Bookman from Verily Life Sciences based on work done in Terra for AMP PD, a public/private partnership collaborating toward biomarker discovery to advance the development of Parkinson’s Disease therapies.
- Creating a notebook
- Opening a notebook
- Saving a notebook
- Editing notebooks
- Running notebooks
- Installing software
- Stopping a cluster
- Deleting a cluster
- Opening a notebook (read-only)
- Collaboration in a shared workspace
- How to avoid overwriting collaborator's work in a shared workspace
For more details about key components of a Jupyter notebook, see Part I here.
Creating a notebook
When you create a notebook, the Notebook Service creates a notebook (.ipynb) file and saves it in your workspace bucket:
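As a rough illustration, the Cloud Storage location of a notebook file can be composed as shown below. This is only a sketch: it assumes the bucket URL is exposed in an environment variable named WORKSPACE_BUCKET and that notebooks are stored under a notebooks/ prefix. Neither detail is stated in this article, and the helper name and fallback bucket are hypothetical, so verify them in your own workspace.

```python
import os


def notebook_gcs_path(notebook_name, bucket=None, prefix="notebooks"):
    """Return the assumed Cloud Storage path of a notebook file."""
    # Assumption: the workspace bucket URL is available as WORKSPACE_BUCKET.
    # The fallback value is a placeholder, not a real bucket.
    bucket = bucket or os.environ.get("WORKSPACE_BUCKET", "gs://my-workspace-bucket")
    # Notebook files use the .ipynb extension.
    if not notebook_name.endswith(".ipynb"):
        notebook_name += ".ipynb"
    # Assumption: notebooks are stored under a "notebooks/" prefix.
    return f"{bucket.rstrip('/')}/{prefix}/{notebook_name}"


print(notebook_gcs_path("my-analysis", bucket="gs://example-bucket"))
```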
Opening a notebook
When you open a notebook, the Notebook Service executes several steps in order:
- Create a runtime environment (aka cluster, or notebook VM) if one doesn't already exist for you in your Cloud project
- Start a Docker container
- Localize the notebook to your VM
- Open the notebook in Jupyter Kernel
1. Create a runtime environment
If a runtime environment (the notebook VM) does not already exist for you in your Cloud Project, the Notebook Service will first create one, along with an associated boot disk that persists until you delete the environment.
Note on billing for notebooks: Billing for your runtime begins now and continues until you stop or delete it. Every time you open a notebook a new Jupyter Kernel is created, so if you have multiple notebooks open and running they will all consume resources (memory and CPU) on the same machine.
Note that a runtime environment is yours and yours alone. No one else can view or access your notebook (a project manager can delete it but not open it). The reason for this is security: we store your Google credentials on the Google VM, which cannot be shared with other users:
2. Start Docker container
After setting up the cluster, the Notebook Service will start a Docker container with all of the core software that your notebook will run. Because it is a Docker container, not a true VM, you will not be able to do some things (such as run a Docker container within the Docker container). Things you create inside your notebook's Docker container will exist as long as you do not delete the runtime environment.
Inside the Docker container, your user id is jupyter-user and your HOME directory (on the persistent disk) is /home/jupyter-user.
3. Localize notebook
A Jupyter extension managed by the Notebook Service copies your notebook from the workspace bucket to the boot disk attached to your cluster.
Note that the notebook file is copied to a workspace-specific directory in the HOME directory of the jupyter-user inside the Docker container. Opening notebooks from multiple workspaces within the same Cloud project will result in separate directories inside the HOME directory.
4. Open notebook in Jupyter Kernel
The Notebook Server loads your notebook file from the boot disk and starts the Jupyter Kernel.
Saving a notebook
When you save a notebook, the current copy (in the Jupyter Kernel) is first saved to the notebook file on your boot disk:
Then the file on disk is de-localized (copied) to your workspace bucket by a Jupyter extension.
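The two-step save described above can be modeled with ordinary files. The sketch below is not the Notebook Service's code: it uses local temporary directories as stand-ins for the boot disk and the workspace bucket, and the function name is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def save_notebook(in_memory_json, local_file, bucket_dir):
    """Model of a save: write to local disk, then copy to the 'bucket'."""
    local_file = Path(local_file)
    # Step 1: the current copy is written to the notebook file on disk.
    local_file.write_text(in_memory_json)
    # Step 2: the file on disk is de-localized (copied) to the bucket.
    bucket_copy = Path(bucket_dir) / local_file.name
    shutil.copyfile(local_file, bucket_copy)
    return bucket_copy


# Demo with temporary directories standing in for the disk and the bucket.
with tempfile.TemporaryDirectory() as tmp:
    disk = Path(tmp, "disk")
    disk.mkdir()
    bucket = Path(tmp, "bucket")
    bucket.mkdir()
    saved = save_notebook('{"cells": []}', disk / "demo.ipynb", bucket)
    saved_text = saved.read_text()
    print(saved.name, saved_text)
```

The point of the model is that the bucket copy only changes when a save happens; edits that exist only in memory never reach step 1.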
Editing a notebook
When you edit a notebook, the changes are initially only reflected in the Jupyter Kernel process. Changes are not saved to disk or copied to Cloud Storage until you explicitly save them or until the Jupyter autosave process kicks in. For Jupyter on Terra, the autosave frequency is every 5 seconds.
While there are unsaved changes, Jupyter displays a notification:
When Terra autosaves changes, you will see this notification:
Running a notebook
When you run a code cell in a notebook, the cell execution creates output associated with that code. The output exists only in the Jupyter Kernel process until you save the notebook.
Analysis outputs generated in a Jupyter notebook are not saved to disk or copied to Cloud Storage until you explicitly save the notebook or until the Jupyter autosave process kicks in.
For Jupyter on Terra, the autosave frequency is every 5 seconds.
Installing software and dependencies
You will likely need to install libraries or tools on your notebook VM to extend the basic functionality of the kernel. When you do this, the software will be installed on the boot disk in the HOME directory of the jupyter-user in the Docker container. Installing libraries may take a long time the first time, but because they are installed on the boot disk, they remain available when you restart the runtime and rerun a notebook.
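One way to avoid repeating a slow install is to check whether a library is already importable before installing it. The sketch below is an illustration, not a Terra-provided utility; the helper names are hypothetical. It assumes pip is available for the kernel's Python and uses pip's --user option, which installs under the user's HOME directory (this option is not accepted inside some virtual environments).

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if a top-level module can be imported by this kernel."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package):
    """Build a pip command that installs into the user's HOME directory."""
    # sys.executable ensures pip runs for the same Python as the kernel.
    return [sys.executable, "-m", "pip", "install", "--user", package]


def ensure_installed(module_name, package=None):
    """Install the package only if the module is missing.

    Returns True if an install was run, False if nothing was needed.
    """
    if is_installed(module_name):
        return False
    subprocess.check_call(pip_install_command(package or module_name))
    return True


# "json" ships with Python, so no install is triggered here.
print(ensure_installed("json"))
```

After a fresh install you may need to restart the kernel before the new library can be imported.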
Availability of software within the same Cloud Project
All workspaces within the same Cloud Project share the same notebook VM and its available software. No matter whether you install software from a notebook or from a Jupyter terminal, the software is available to all notebooks on the notebook VM across all workspaces in the same Cloud Project.
Stopping a cluster
After a period of inactivity, Terra will automatically pause (stop) your notebook VM. Inactivity includes when your computer goes to sleep.
If your kernel is active, however, Terra will not pause the runtime (to prevent long-running jobs from aborting).
You can explicitly pause (stop) your notebook by selecting the "Stop Cluster" button in Terra:
What is not lost when you pause (stop) a cluster (boot disk, software):
When a notebook VM is stopped, its Compute Engine VM goes away, but the boot disk does not. When you re-open your notebook, the notebook VM is created more quickly because the disk does not need to be recreated. You do not need to reinstall your software.
What is lost when you stop a cluster (notebook state):
Any running Jupyter kernel processes are gone, so notebook state is lost. To restore notebook state, you will need to open the notebook and re-run the relevant cells.
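Because the boot disk survives a stop, one option for shortening the re-run is to cache expensive intermediate results in a file. The following is a minimal sketch of that idea, not a Terra feature; load_or_compute and the cache file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_file, compute):
    """Return the cached result if present; otherwise compute and cache it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        # An earlier kernel already did the work; reuse its result.
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with cache_file.open("wb") as fh:
        pickle.dump(result, fh)
    return result


# Demo in a temporary directory; in a notebook you would choose a path
# on the disk that persists across stops.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "expensive_result.pkl"
    first = load_or_compute(cache, lambda: sum(range(1000)))  # computed
    second = load_or_compute(cache, lambda: -1)               # read from cache
    print(first, second)
```

Only unpickle files that you wrote yourself, and remember that a cache on the boot disk is removed if you delete the cluster.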
Deleting the cluster (also the boot disk)
If you do not need your cluster (your notebook VM) and want to save on the cost of the boot disk, or if you want to pick up a new feature that requires you to rebuild your notebook VM, you can delete the cluster by clicking on the trash icon at the top right of the screen:
What is deleted along with the cluster (boot disk, installed software, any output not explicitly saved)
- Boot disk
- Installed software
- Any output not explicitly saved to the workspace bucket or other external file
When you recreate your cluster, you will need to reinstall any additional libraries or tools that you had installed previously.
What is not deleted along with the cluster (saved notebook files and data):
Your notebooks and any data explicitly saved to your bucket are still in long-term storage in the workspace bucket, as described in the Saving a notebook section.
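Files on the boot disk that you want to keep therefore need to be copied to the bucket before you delete the cluster. The sketch below builds a gsutil cp command for that purpose. It assumes the gsutil command-line tool is installed on the notebook VM; the backups/ prefix, the example paths, and the function names are hypothetical.

```python
import os
import subprocess


def backup_command(local_path, bucket, dest_prefix="backups"):
    """Build a 'gsutil cp' command that copies one file to the bucket."""
    # The destination prefix is an arbitrary choice for this example.
    dest = f"{bucket.rstrip('/')}/{dest_prefix}/{os.path.basename(local_path)}"
    return ["gsutil", "cp", local_path, dest]


def backup(local_path, bucket):
    """Run the copy; raises CalledProcessError if gsutil reports a failure."""
    subprocess.check_call(backup_command(local_path, bucket))


# Show the command without running it.
print(" ".join(backup_command("/home/jupyter-user/results.csv", "gs://example-bucket")))
```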
Opening notebook (read-only)
Often you only want to read a notebook, rather than edit or run it. Opening a notebook read-only does not require creation of a VM, so it is much faster than opening a notebook for editing.
In the Notebooks tab, click on the snowman (three vertical dots) icon for your notebook, and select "Open read-only":
A server-side process will render the notebook file from Cloud Storage and display it in your browser:
Collaboration in a shared Terra workspace
Your notebook VM is specific to you. Each individual user will have a separate notebook cluster.
However, the workspace bucket and the notebooks in the bucket are shared.
These two conditions mean that you need to share your workspace in order for a collaborator to be able to see your notebook. But be careful! When working on the same notebook in a shared workspace at the same time as a colleague, there is a risk of overwriting each other’s work.
How to avoid overwriting collaborator's changes
If you open a notebook for editing (in your cluster) while a collaborator in the same workspace opens the same notebook (in their own cluster), changes you make can overwrite the changes your collaborator makes and vice versa. This is true even if you just run cells in the notebook, since running a cell updates the output associated with that cell.
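The overwrite can be pictured with a toy model in which a dictionary stands in for the shared workspace bucket and each save replaces the whole file. This is a simplification for illustration only; the names are hypothetical and it does not reproduce how Terra synchronizes files.

```python
# A dictionary stands in for the shared workspace bucket.
bucket = {"analysis.ipynb": "original"}


def open_notebook(name):
    """Each collaborator gets their own copy on their own cluster."""
    return bucket[name]


def save_notebook(name, content):
    """Saving replaces the whole file in the shared bucket."""
    bucket[name] = content


# Both collaborators open the same version of the notebook.
alice_copy = open_notebook("analysis.ipynb") + " + Alice's edit"
bob_copy = open_notebook("analysis.ipynb") + " + Bob's edit"

# Each saves in turn; the later save wins.
save_notebook("analysis.ipynb", alice_copy)
save_notebook("analysis.ipynb", bob_copy)

print(bucket["analysis.ipynb"])
```

In this model Alice's edit is absent from the final file because Bob's copy never contained it.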
It is important to coordinate with your collaborators, to avoid overwriting each other's changes. A common pattern for working in this shared environment is to work on separate notebooks and merge changes manually.
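If you follow the separate-notebooks pattern, a consistent naming scheme makes it easier to see whose copy is whose. The helper below is one hypothetical convention, not something Terra enforces.

```python
from pathlib import PurePosixPath


def personal_copy_name(notebook_name, user):
    """Derive a per-user notebook name, e.g. analysis-alice.ipynb."""
    path = PurePosixPath(notebook_name)
    # Insert the user name between the base name and the extension.
    return f"{path.stem}-{user}{path.suffix}"


print(personal_copy_name("analysis.ipynb", "alice"))
```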