Interactive applications - such as Jupyter Notebooks and Galaxy - run on virtual machines or clusters of machines in your Cloud Environment. When running an interactive application in Terra, you can adjust the configuration of your Cloud Environment to fit your computational needs. This article gives an overview of the components that make up your Cloud Environment and step-by-step instructions for how to customize them.
Cloud Environment components overview
The workspace Cloud Environment is the sum of all the components of the virtual machine or cluster of machines that run your interactive analysis application. Cloud Environments consist of
1. an application configuration (i.e., packages and software)
2. cloud compute
3. a persistent disk (PD)
To see your Cloud Environment configuration, click the gear icon at the top right of any workspace page to reveal the form below. To customize your VM (including clusters!), select the "Customize" button at the bottom right of the form.
Note that your Cloud Environment is unique to you, and there is a separate Cloud Environment for each workspace. Colleagues, even in a shared workspace, will not be able to access anything stored in your Cloud Environment PD. In addition, their Cloud Environment settings can be different from yours. To ensure a consistent analysis environment across all team Cloud Environments, we strongly recommend using one of the default Cloud Environments, or using a startup script or custom Docker.
The application configuration includes software and dependencies that are pre-installed in the Cloud Environment container. Terra includes several pre-configured environments for biomedical use-cases like Bioconductor or Hail analyses.
Terra offers five varieties of pre-configured application configurations, plus a custom option, in the drop-down menu. The versions and libraries included in each pre-configured option are also listed in the drop-down.
- Terra-maintained Jupyter environments
- Community-maintained Jupyter environments (verified partners)
- Community-maintained RStudio environments (verified partners)
- Custom environments
- Project-specific environments
If your analysis requires software packages that are not part of the default or pre-configured configurations, you can start your interactive application and install the ones you need on the persistent disk. However, this approach can turn into a maintenance headache if you have multiple notebooks that require the same configuration commands.
For that reason, consider moving some of those software installation steps into the application configuration itself.
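If you do install packages from within a notebook, a small guard can keep re-runs cheap. The sketch below is illustrative: `ensure_package` is a hypothetical helper, not a Terra API, and the assumption that a `--user` install lands on the persistent disk is noted in the comments.

```python
# Illustrative helper: install a package only if it is missing, so that
# re-running the cell after a restart is essentially free.
import importlib.util
import subprocess
import sys

def ensure_package(name: str) -> bool:
    """Return True if the package was already importable, else install it."""
    if importlib.util.find_spec(name) is not None:
        return True
    # --user installs under the home directory, which is assumed here to
    # be backed by the persistent disk in a Terra Jupyter environment.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", name])
    return False

ensure_package("json")  # stdlib module, already present, so nothing is installed
```

If several notebooks need the same guard, that is exactly the signal to promote those installs into the application configuration instead.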
The compute power is the CPU and RAM available to your application, which determines how much processing can be done at a time. Customizing the compute power allows you to balance cost and functionality. For example, if your analysis is running slowly, it could mean the CPUs and memory allotted are insufficient for the computations you're doing. It may be worth the cost of increasing the compute power so your analysis completes sooner.
To learn more about resource quotas that could impact the compute power available to you, see Google Cloud quotas: What are they and how do you request more?.
Note that more compute power costs more, and you don't want to request (and pay for) significantly more than your computation needs. Running a high-powered notebook costs a certain amount per unit time no matter what computations are done. You don't need (or want to pay for) a high-performance parallel Spark cluster if you're running a simple, non-parallel computation.
Featured and template workspaces and notebooks will include recommended Cloud Environment configurations.
To learn more, see Understanding and controlling Cloud costs.
Setting a custom compute power - Step-by-step instructions
Continuing down the cloud environment configuration form, you'll see options for configuring the compute power, including defaults for moderate, increased, and high performance. If the defaults are not adequate for your needs, you can select a custom compute profile, where you specify the CPUs, memory, and disk size of your primary machine. You can also spin up a Spark cluster of parallel machines, and specify the number of secondary machines and their CPUs, memory, and disk sizes. To configure a custom compute power, follow the steps below.
1. Select the Custom profile from the drop-down menu.
2. In the new form that appears, choose the specification of your primary machine. See the example below.
| CPUs | Memory (GB) | Disk size (GB) |
|------|-------------|----------------|
| 8    | 30          | 100            |
If you only want one virtual machine, you're done!
3. To configure as a Spark cluster (for parallel processing), first check "Configure as a Spark cluster".
4. Fill in the values for the secondary (worker) machines.
| CPUs | Memory (GB) | Disk size (GB) |
|------|-------------|----------------|
| 4    | 15          | 500            |
The cost of the requested compute power will show at the bottom of the form. For example, when requesting a Spark cluster, your screen will look like this screenshot.
Size your compute power appropriately
You pay a fixed amount while a notebook is running, whether or not you are doing active calculations. The cost is based on the compute power of your virtual machine or cluster, not how much computation is being done. So you want to have enough power to do your computations in a reasonable amount of time, but not a lot of extra that you will be paying for and not using.
Note that Terra automatically pauses a notebook after twenty minutes of inactivity.
To learn more about controlling cloud costs in a notebook, see Controlling cloud costs - sample use cases.
Your Cloud Environment comes with Persistent Disk (PD) storage by default that lets you keep files stored in your Cloud Environment even after you delete the VM or cluster. The PD can be kept when you delete your Cloud Environment, and reattached to a newly created VM.
Using the persistent disk as storage lets you keep the packages your notebook code is built upon, input files necessary for your analysis, and outputs you’ve generated, without having to move anything to permanent cloud storage.
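As a concrete sketch, a notebook cell can simply write outputs under the home directory, which is assumed here to be the directory backed by the persistent disk in a Terra Jupyter Cloud Environment; the `~/outputs` path and file contents below are illustrative.

```python
# Write an analysis output under the home directory; on a Terra VM the
# home directory is assumed to be backed by the persistent disk.
import csv
import os

out_dir = os.path.expanduser("~/outputs")
os.makedirs(out_dir, exist_ok=True)

out_path = os.path.join(out_dir, "summary.csv")
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample", "coverage"])
    writer.writerow(["NA12878", 31.4])
```

Files written this way survive deleting and recreating the VM, as long as the persistent disk itself is kept.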
Data in PD is not available outside the user's Cloud Environment
Note that because the PD is not accessible from outside the Cloud Environment, data generated in a notebook cannot be used as input for a workflow analysis, and it is not accessible by other collaborators using a shared workspace.
To learn more about how to save data generated in a notebook to permanent cloud storage (including the workspace bucket), see How (and why) to save data generated in a notebook to a Workspace bucket.
Your Cloud Environment runs in a GCP location. By default, Cloud Environments will run in the
us-central1 region. If your workspace bucket is located outside of the US, you will be able to modify the location of your Cloud Environment.
Recommended best practice is to choose the same location for your workspace bucket and Cloud Environment in order to minimize cross-region egress costs. To learn more, see US regional versus Multi-regional US buckets: trade-offs.
Note that the location of a Cloud Environment cannot be changed once created. To have a new location you must create a new Cloud Environment.
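The practical check is simply whether the bucket's location contains the Cloud Environment's region. The helper below is an illustrative heuristic only; the function and the way it matches location strings are not from any Terra or Google API.

```python
# Heuristic check for a location mismatch between a bucket and a
# Cloud Environment region. Bucket locations are reported in upper case;
# a multi-region location like "US" contains regions such as us-central1.
def same_location(bucket_location: str, env_region: str) -> bool:
    b = bucket_location.lower()
    r = env_region.lower()
    return r == b or r.startswith(b)

assert same_location("US-CENTRAL1", "us-central1")       # exact match
assert same_location("US", "us-central1")                # multi-region contains region
assert not same_location("EUROPE-WEST2", "us-central1")  # cross-region: egress costs
```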
How to customize your Cloud Environment
If the default or project-specific environments don't fit your needs, you can use a custom Docker Image or include a startup script. Anyone using the same Docker image or startup script will have the exact same environment, which is critical for reproducibility.
To learn more about developing and using custom Docker images in Terra, see these articles on working with containers/Docker. Note that you can also use a custom environment to revert to a previous version of the pre-configured environments.
Changing the Cloud Environment can mean files generated or stored in the application memory will be lost when Terra recreates the Cloud Environment. To avoid this, keep the Persistent Disk option (the default) or copy your files to the workspace bucket, and set the right environment and compute power before doing any work.
If you don't have a Persistent Disk, see the section below to understand what changes you **can make** without losing generated data.
If the work you're doing in your notebook includes mostly short-running commands that don't amount to much computation cost, this isn't a big problem: there is an option in the Jupyter Notebooks menu to re-run all code cells (or all up to a certain point) so you can simply regenerate the previous state. However, if some of your work involves massive computations that would not be trivial to re-run, you may want a better strategy.
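One simple strategy is to checkpoint expensive results to disk, so a restart only costs a reload rather than a full re-run. A minimal sketch, where the file name and the stand-in computation are illustrative:

```python
# Cache an expensive result on disk; after a Cloud Environment restart,
# re-running this cell reloads the result instead of recomputing it.
import os
import pickle

CHECKPOINT = "expensive_result.pkl"

def expensive_computation():
    # Stand-in for a long-running step you would not want to repeat
    return sum(i * i for i in range(1_000_000))

if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT, "rb") as f:
        result = pickle.load(f)
else:
    result = expensive_computation()
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(result, f)
```

If the checkpoint lives on the persistent disk (or is copied to the workspace bucket), it also survives recreating the Cloud Environment.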
To adjust the virtual environment and/or compute power of your application, first click on the gear icon in the widget at the top right of your workspace:
This will reveal the form at left, with the current values of your Cloud Environment (see default values in screenshot below). To make changes, click the "Customize" button at the bottom right.
You can modify the cloud environment at any time, even if you've already started working in an application (i.e. notebook).
You'll see Terra's Cloud Environment configuration panel (screenshot below). Note that it is much simpler than the equivalent Google Cloud Platform interface!
You'll specify what you want in the configuration panel and let Terra recreate your cloud environment with the new specifications.
1. Application configuration
2. Cloud compute
3. VM location
4. Persistent Disk
Don't forget to save the configuration after changing any values. This recreates the application compute with the new values, which can take five to ten minutes.
You can further customize using a Docker image or startup script to specify exactly the environment you need. See detailed instructions below.
It's not necessary to guess up front the resources you're going to need to do your work. You can start with minimal settings, then dial them up if you run into limitations.
Setting a custom environment with a Custom Docker Image
Setting a custom environment with a startup script
1. Scroll down to the Compute Power box, which allows you to modify the VM resource allocations.
2. Choose the Custom option from the drop-down menu.
For more detail, check out this tutorial about creating your own startup script, uploading it to a Google bucket, and using it to launch a custom cloud environment.
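As a rough sketch of the shape of such a script, the cell below writes a minimal startup script to a local file. The package names and the bucket path in the comment are examples only, not recommendations.

```python
# Write a minimal startup script; a real script would install whatever
# your analyses need. The package list here is purely an example.
script = "\n".join([
    "#!/usr/bin/env bash",
    "set -euo pipefail",
    "pip3 install --upgrade pandas seaborn",
    "",
])
with open("startup.sh", "w") as f:
    f.write(script)

# You would then upload it to a bucket you can read, for example:
#   gsutil cp startup.sh gs://your-bucket/startup.sh
# and paste that gs:// URI into the Cloud Environment configuration form.
```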
What real-time updates can you make to Cloud Environment compute resources without losing data?
1. Increase or decrease the number of CPUs or the VM memory
During this update, Terra will pause the cloud environment, apply the change, and then restart it. The update takes a couple of minutes to complete, and you will not be able to continue editing or running the notebook while it's in progress.
2. Increase the disk size or change the number of workers (when the number of workers is > 2)
During this update, you can continue to work in your Notebook without pausing your cloud environment. When the update is finished, you will see a confirmation banner.
Note that if you want to simultaneously change both the workers and CPU/memory, we advise doing this sequentially.
1. First update the CPUs/memory.
2. Wait for the Notebook Cloud Environment to restart.
3. Adjust the workers.
- Decreasing the Persistent Disk size
- Deleting the Persistent Disk (when recreating or deleting the Cloud Environment)
Note that this is true no matter what kind of interactive analysis you are running, including RStudio, Jupyter Notebooks, or Galaxy. Please back up files as appropriate.
How to save interactive analysis outputs to the Workspace bucket
To avoid losing your data, make sure to explicitly save your outputs in the workspace bucket. You can find step-by-step instructions, and exact code to do this within the notebook, below.
Python kernel instructions
1. Set the environment variables.
import os
BILLING_PROJECT_ID = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.environ['WORKSPACE_NAME']
bucket = os.environ['WORKSPACE_BUCKET']
2. Copy all files in the notebook into the workspace bucket.
!gsutil cp ./* $bucket
# Run list command to verify file is in the bucket
!gsutil ls $bucket
Note: the workspace bucket is a Google Cloud Storage bucket, so file operations on it use `gsutil` commands, prefixed with `!` so the notebook runs them in the shell. These commands will only work if you have run the cell above to set the environment variables. Once you execute these cells, the data files should be visible in the workspace bucket.
If you want to copy individual files, you can replace `*` with the file name to copy.
R kernel instructions
1. Set the environment variables.
project <- Sys.getenv('WORKSPACE_NAMESPACE')
workspace <- Sys.getenv('WORKSPACE_NAME')
bucket <- Sys.getenv('WORKSPACE_BUCKET')
2. Copy all files in the notebook into the workspace bucket.
# Copy all files generated in the notebook into the bucket
system(paste0("gsutil cp ./* ", bucket), intern=TRUE)
# Run list command to see if the file is in the bucket
system(paste0("gsutil ls ", bucket), intern=TRUE)
Note: the workspace bucket is a Google Cloud Storage bucket, so file operations on it use `gsutil` commands, wrapped in `system()` so R runs them in the shell. These commands will only work if you have run the cell above to set the environment variables. Once you execute these cells, the data files should be visible in the workspace bucket.
If you want to copy individual files, you can replace `*` with the file name to copy.
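The copy also works in reverse: to pull a file from the workspace bucket back onto the persistent disk, swap the source and destination. The sketch below only builds the command string that a notebook cell would run with `!`; the fallback bucket value and the file name are placeholders.

```python
import os

# In a Terra notebook WORKSPACE_BUCKET is set for you; the fallback
# value here is a placeholder for running outside Terra.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")

# Copy one file from the bucket into the current directory (on the PD):
cmd = f"gsutil cp {bucket}/summary.csv ."
print(cmd)
```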