Notebooks 101: How not to lose data stored or generated in a notebook cloud environment
If you remember the days before Google Docs, you know firsthand the pain of losing work you thought was safe: working for hours on a paper only to have it vanish if your computer shut down and you hadn't saved it. Notebooks are wonderful for interactive data analysis, but there are a few quirks that can lead you to lose work in much the same way if you're not careful.
This article describes what and when you need to save so you don't lose parts of your analysis unintentionally when working in a Jupyter notebook in Terra.
Contents
- What real-time updates can you make to Notebook compute resources without losing data?
- How to not lose output files
- What to do if you lose your notebook data
Note: For a deeper dive into the back end of a Terra notebook and to understand why notebooks have these characteristics, see this article about key notebook components or this article about key notebook operations.
What real-time updates can you make to Notebook compute resources without losing data?
- You can increase or decrease the number of CPUs or the amount of memory.
  During this update, the Notebook cloud environment will stop the runtime, apply the update, and then restart. The update takes a couple of minutes to complete, and you will not be able to edit or run the Notebook until it finishes.
- You can increase the disk size or change the number of workers (when the number of workers is > 2).
  During this update, you can continue to work in your Notebook without stopping your cloud environment runtime. When the update is finished, you will see a confirmation banner.
If you want to simultaneously change both the workers and CPU/memory, we advise doing this sequentially: first update the CPUs/memory, wait for the Notebook cloud environment to restart, and then adjust the workers.
Any other runtime changes (e.g. decreasing the disk size or changing the environment type) require deleting the existing runtime and creating a new one. When you create the new cloud environment runtime, any non-notebook files and installed packages will be lost. Please back up files as appropriate before making these changes.
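Because installed packages are lost when a runtime is recreated, it can help to record what was installed before making such a change. The following is a minimal sketch, not an official Terra feature: the function name `write_package_snapshot` and the file name `requirements-snapshot.txt` are hypothetical, and the snapshot only covers Python packages visible to the notebook's interpreter (not R or system packages).

```python
from importlib import metadata
from pathlib import Path


def write_package_snapshot(path="requirements-snapshot.txt"):
    """Write one 'name==version' line per installed Python distribution.

    Hypothetical helper for illustration; returns the sorted list of lines.
    """
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    lines = sorted(pins, key=str.lower)
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    pinned = write_package_snapshot()
    print(f"Recorded {len(pinned)} packages")
```

The snapshot is written to the cloud environment disk, so it is itself a non-notebook file: it needs to be copied to the workspace bucket to survive the runtime being deleted. In a new runtime, a file in this format can typically be passed to `pip install -r`.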
How to not lose output files
The key issue is that files generated by the notebook are not automatically saved in the workspace bucket. They live on the disk associated with the notebook cloud environment, and that disk is deleted when you delete a cluster or reconfigure it in certain ways. When that happens, you lose installed packages, along with any output data generated in the notebook that you have not explicitly saved to the workspace bucket.
You will not lose your data if you pause (stop) a cluster, since the cluster goes away but the cloud environment disk does not. In fact, when you re-open your notebook, the cluster is created more quickly because the disk does not need to be recreated. As an added bonus, you do not need to reinstall your software.
To avoid losing your data, make sure to explicitly save your outputs in the workspace bucket. You can find step-by-step instructions on how to do this from within the notebook in this article.
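As a rough illustration of saving outputs explicitly, the sketch below builds a `gsutil cp` command that copies a file or folder from the cloud environment disk to the workspace bucket. It assumes the notebook environment exposes the bucket address in a `WORKSPACE_BUCKET` environment variable and has `gsutil` installed. The function names and the `notebook-outputs` folder are hypothetical, chosen only for this example.

```python
import os
import shlex
import subprocess


def build_upload_command(local_path, bucket, dest_prefix="notebook-outputs"):
    """Return the gsutil argument list that copies local_path into the bucket."""
    if not bucket.startswith("gs://"):
        raise ValueError("bucket should look like gs://<bucket-name>")
    destination = f"{bucket.rstrip('/')}/{dest_prefix.strip('/')}/"
    # -r copies folders recursively and also works for a single file.
    return ["gsutil", "cp", "-r", local_path, destination]


def save_to_workspace_bucket(local_path, dry_run=True):
    """Copy local_path to the workspace bucket, or just print the command."""
    # Assumption: the environment sets WORKSPACE_BUCKET to a gs:// address.
    bucket = os.environ.get("WORKSPACE_BUCKET")
    if bucket is None:
        raise RuntimeError("WORKSPACE_BUCKET is not set in this environment")
    command = build_upload_command(local_path, bucket)
    if dry_run:
        print(shlex.join(command))  # show what would run without copying
    else:
        subprocess.run(command, check=True)  # raises if the copy fails
    return command
```

Calling `save_to_workspace_bucket("results.csv")` only prints the command. Passing `dry_run=False` runs it. Doing this at the end of each analysis step, rather than once at the end of the session, limits how much work a deleted disk can cost.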
What to do if you've lost your notebook data?
Your notebooks and any data explicitly saved to your bucket are still in long term storage in the workspace bucket. This means you can rerun the notebook to regenerate any output data (though you will pay for this, of course).
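Data that was explicitly saved to the bucket does not have to be regenerated: it can be copied back onto the new cloud environment disk. The sketch below shows one way this might look, assuming a `WORKSPACE_BUCKET` environment variable and `gsutil` are available in the notebook environment. The function names and the `notebook-outputs` folder in the usage note are hypothetical.

```python
import os
import subprocess


def build_restore_command(bucket, prefix, local_dir="."):
    """Return the gsutil argument list that copies a bucket folder to local disk."""
    if not bucket.startswith("gs://"):
        raise ValueError("bucket should look like gs://<bucket-name>")
    source = f"{bucket.rstrip('/')}/{prefix.strip('/')}"
    # -r copies everything stored under the given bucket folder.
    return ["gsutil", "cp", "-r", source, local_dir]


def restore_from_workspace_bucket(prefix, local_dir=".", dry_run=True):
    """Copy a saved folder back from the workspace bucket, or return the command only."""
    # Assumption: the environment sets WORKSPACE_BUCKET to a gs:// address.
    bucket = os.environ["WORKSPACE_BUCKET"]
    command = build_restore_command(bucket, prefix, local_dir)
    if not dry_run:
        os.makedirs(local_dir, exist_ok=True)
        subprocess.run(command, check=True)  # raises if the copy fails
    return command
```

For example, `restore_from_workspace_bucket("notebook-outputs", dry_run=False)` would copy that folder from the bucket into the current directory. Anything that was never saved to the bucket still has to be regenerated by rerunning the notebook.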