Understanding and adjusting your Cloud Environment

Interactive applications - such as Jupyter Notebooks and Galaxy - run on virtual machines or clusters of machines in your Cloud Environment. When running an interactive application in Terra, you can adjust the configuration of your Cloud Environment to fit your computational needs. This article gives an overview of the components that make up your Cloud Environment, along with step-by-step instructions for customizing them.

Cloud Environment components overview

The workspace Cloud Environment is the sum of all the components of the virtual machine or cluster of machines that run your interactive analysis application. Cloud Environments consist of

    1) an application configuration (i.e. packages and software)
    2) cloud compute
    3) a persistent disk (PD)

To see your Cloud Environment configuration, click the gear icon at the top right of any workspace page to reveal the form below. To customize your VM (including clusters!), select the "Customize" button at the bottom right of the form. 

[Screenshot: Cloud Environment default settings form]

Note that your Cloud Environment is unique to you, and there is a separate Cloud Environment for each workspace. Colleagues, even in a shared workspace, will not be able to access anything stored in your Cloud Environment PD, and their Cloud Environment settings can be different from yours. To ensure a consistent analysis environment across all team Cloud Environments, we strongly recommend using one of the default Cloud Environments, or sharing a startup script or custom Docker image.

Application configuration 

The application configuration includes the software and dependencies pre-installed in the Cloud Environment container. Terra includes several pre-configured environments for biomedical use cases, such as Bioconductor or Hail analyses.

Pre-configured options

Terra has five varieties of pre-configured application configurations - plus a custom option - available in the dropdown menu. The software versions and libraries included in each pre-configured option are also listed in the dropdown.

If pre-configured application configurations don't meet your needs, you can customize with a Docker image or startup script.

Pre-installed software and dependencies - Reproducible and efficient

  If your analysis requires software packages that are not part of the default or pre-configured configurations, you could start your interactive application by installing the ones you need on the Persistent Disk. However, this approach can turn into a maintenance headache if you have multiple notebooks that require the same configuration commands.
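For example, a notebook might begin with a setup cell like the sketch below (the package is illustrative):

# Install a package that isn't part of the pre-configured environment
# (illustrative). Without a custom image or startup script, this cell
# must be re-run every time the Cloud Environment is recreated.
!pip install plotnine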

Here are three reasons to move some of those software installation steps into the application configuration proper.

Efficiency
You don't have to spend time (and $$) running setup code to install what's needed for your analysis every time you spin up your application.

Simplified setup
Customizing the application configuration simplifies and standardizes the setup needed to run an application. 

Reproducibility
Custom Docker containers and startup scripts put you in control of exactly what version of programming languages and packages to include. Recording the image URL within the notebook you wish to share is a quick-and-easy way to spin up and share an identical environment, ensuring that colleagues get the same results.
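One lightweight convention is to record the image in the notebook itself, as in this sketch (the image name and tag are illustrative):

# Record the Cloud Environment image this notebook was developed against,
# so colleagues can launch an identical environment (illustrative values)
ENVIRONMENT_IMAGE = "us.gcr.io/broad-dsp-gcr-public/terra-jupyter-python:1.0.0"
print(f"Developed against Cloud Environment image: {ENVIRONMENT_IMAGE}")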

Compute power

The compute power is the CPU and RAM available to your application, which determine how much processing can be done at a time. Customizing the compute power allows you to balance cost and functionality. For example, if your analysis is running slowly, the CPUs and memory allotted may be insufficient for the computations you're doing. It may be worth the cost of increasing the compute power so your analysis finishes sooner.

To learn more about resource quotas that could impact the compute power available to you, see Google Cloud quotas: What are they and how do you request more?.

Compute power comes with a cost! Find the balance that's right for you.

  Note that more compute power costs more, and you don't want to request (and pay for) significantly more than your computation needs. Running a high-powered notebook costs a certain amount per unit time no matter what computations are done. You don't need (or want to pay for) a high-performance parallel Spark cluster if you're running a simple, non-parallel computation.

Featured and template workspaces and notebooks will include recommended and project-specific configurations, as well as estimated costs to run (where possible). Since it is fairly straightforward to adjust the compute power, you can start with an estimated configuration and then dial it up or down as needed. Just be careful to save any generated data you want to keep when recreating the Cloud Environment (see below).

To learn more, see Understanding and controlling Cloud costs.

Setting a custom compute power - Step-by-step instructions

Continuing down the Cloud Environment configuration form, you'll see options for configuring the compute power, including defaults for moderate, increased, and high performance. If the defaults are not adequate for your needs, you can select a custom compute profile, where you can specify the CPUs, memory, and disk size of the primary machine. You can also spin up a Spark cluster of parallel machines and specify the number of secondary machines and their CPUs, memory, and disk sizes. To configure a custom compute power, follow the steps below.

1. Select the Custom profile from the dropdown menu.

2. In the new form that appears, choose the specification of your primary machine. See the example below. 

CPUs: 8 | Memory (GB): 30 | Disk size (GB): 100

If you only want one virtual machine, you're done!

3. To configure a Spark cluster (for parallel processing), first check the "Configure as a Spark cluster" box.

4. Fill in the values for the secondary (worker) machines.

Workers: 120 | Preemptibles: 100
CPUs: 4 | Memory (GB): 15 | Disk size (GB): 500

The cost of the requested compute power will show at the bottom of the form. For example, when requesting a Spark cluster, your screen will look like the screenshot below.

[Screenshot: Spark cluster compute power configuration]

Cost-saving recommendations

  Size your compute power appropriately
You pay a fixed amount while a notebook is running, whether or not you are doing active calculations. The cost is based on the compute power of your virtual machine or cluster, not how much computation is being done. You want enough power to complete your computations in a reasonable amount of time, but not a lot of extra capacity that you'll pay for without using.

Note that Terra automatically pauses a notebook after twenty minutes of inactivity.

To learn more about controlling cloud costs in a notebook, see Controlling cloud costs - sample use cases.

Persistent Disk

Your Cloud Environment comes with Persistent Disk (PD) storage by default, which lets you keep files stored in your Cloud Environment even after you delete the VM or cluster. The PD can be kept when you delete your Cloud Environment and reattached to a newly created VM.

Using the persistent disk as storage lets you keep the packages your notebook code is built upon, input files necessary for your analysis, and outputs you’ve generated, without having to move anything to permanent cloud storage.
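For example, anything written under the PD mount point survives VM deletion. A minimal sketch, assuming Terra's Jupyter environment (where the PD is mounted at /home/jupyter; other applications may mount it elsewhere):

# Write a toy output file to the persistent disk so it survives VM deletion
# (/home/jupyter is the PD mount point in Terra's Jupyter environment)
import pandas as pd

results = pd.DataFrame({"sample": ["s1", "s2"], "score": [0.91, 0.87]})
results.to_csv("/home/jupyter/results.csv", index=False)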

Data in PD is not available outside the user's Cloud Environment
Note that because the PD is not accessible from outside the Cloud Environment, data generated in a notebook cannot be used as input for a workflow analysis, and it is not accessible by other collaborators using a shared workspace.

To learn more about how to save data generated in a notebook to permanent cloud storage (including the workspace bucket), see How (and why) to save data generated in a notebook to a Workspace bucket.

Compute Location

Your Cloud Environment runs in a GCP location. By default, Cloud Environments will run in the us-central1 region. If your workspace bucket is located outside of the US, you will be able to modify the location of your Cloud Environment.

Recommended best practice is to choose the same location for your workspace bucket and Cloud Environment in order to minimize cross-region egress costs. To learn more, see US regional versus Multi-regional US buckets: trade-offs.
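If you're not sure where your bucket is located, you can check from a notebook. A minimal sketch (gsutil's ls -L -b flags print bucket metadata, including its location):

# Print the workspace bucket's metadata; the region appears as "Location constraint"
import os
bucket = os.environ['WORKSPACE_BUCKET']
!gsutil ls -L -b $bucket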

Note that the location of a Cloud Environment cannot be changed once it is created. To use a different location, you must create a new Cloud Environment.

Learn more about Terra's Jupyter notebook environment 

  - Key components
- Key operations
- Best Practices


How to customize your Cloud Environment 

If the default or project-specific environments don't fit your needs, you can use a custom Docker image or include a startup script. Anyone using the same Docker image or startup script will have the exact same environment, which is critical for reproducibility.

To learn more about developing and using custom Docker images in Terra, see these articles on working with containers/Docker. Note that you can also use a custom environment to revert back to a previous version of the pre-configured environments. 

Cloud Environment considerations 

  Changing the Cloud Environment can mean that files generated or stored in the application's memory will be lost when Terra recreates the Cloud Environment. To avoid this, keep the Persistent Disk option (the default) or copy your files to the Workspace bucket, and set the right environment and power before doing any work.

If you don't have a Persistent Disk, see the section below to understand what changes you can make without losing generated data.

If the work you're doing in your notebook consists mostly of short-running commands that don't cost much to compute, this isn't a big problem: the Jupyter Notebooks menu has an option to re-run all code cells (or all cells up to a certain point), so you can simply regenerate the previous state. However, if some of your work involves massive computations that would not be trivial to re-run, you may want a better strategy.


To adjust the virtual environment and/or compute power of your application, first click on the gear icon in the widget at the top right of your workspace:

[Screenshot: gear icon for configuring the Cloud Environment]

[Screenshot: Cloud Environment default settings form]

This will reveal the form shown above, with the current values of your Cloud Environment (see the default values in the screenshot). To make changes, click the "Customize" button at the bottom right.

You can modify the cloud environment at any time, even if you've already started working in an application (i.e. notebook).

You'll see Terra's Cloud Environment configuration panel (screenshot below). Note that it is much simpler than the equivalent Google Cloud Platform interface!

You'll specify what you want in the configuration panel and let Terra recreate your cloud environment with the new specifications.

1. Application configuration
Includes preconfigured application configurations with popular packages. The Custom option offers the ability to use a custom Docker image.

2. Cloud compute
Dropdown options include standard VM, Spark master node and Spark cluster. This is also where you can specify a custom startup script. 

3. VM location 
Your Cloud Environment will default to the workspace bucket's region. Note that if you change the location from the value proposed by the UI, you may incur egress charges whenever your bucket and your interactive analysis Cloud Environment are in different locations.

4. Persistent Disk
Learn more about detachable persistent disks for notebook applications in Terra here.

See Customizing where your data are stored and analyzed.

[Screenshot: Cloud Environment configuration panel]

Don't forget to save the configuration after changing any values. This will recreate the application compute with the new values, which can take five to ten minutes.

You can further customize using a Docker image or startup script to specify exactly the environment you need. See detailed instructions below.

It's not necessary to guess up front the resources you're going to need to do your work. You can start with minimal settings, then dial them up if you run into limitations.

Setting a custom environment with a Custom Docker Image

1. Select "Custom" from the Environment dropdown menu.

2. Input the container image, using the format <image name>:<tag>.

Note that custom environments must be based off one of the Terra Jupyter Notebook base images or a Project-Specific image.
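Once the custom environment is running, it can be worth sanity-checking what the image provides. A minimal sketch to run in a notebook cell:

# Confirm the Python version provided by the custom image
import sys
print(sys.version)

# List the first few installed packages (the ! prefix runs a shell command)
!pip list | head -n 20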

Setting a custom environment with a startup script

1. Scroll down to the Compute Power box, which allows you to modify the VM resource allocations.

2. Choose the Custom option from the dropdown menu.

3. Input the path to the startup script in the field labeled URI (Uniform Resource Identifier, a close cousin of the URL, or Uniform Resource Locator).

For more detail, check out this tutorial about creating your own startup script, uploading it to a Google bucket, and using it to launch a custom cloud environment.
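For example, you can stage the script in your workspace bucket from a notebook and use its gs:// URI in the form. A hedged sketch (startup_script.sh is a hypothetical filename):

# Upload a local startup script (hypothetical filename) to the workspace
# bucket, then print the gs:// URI to paste into the configuration form
import os
bucket = os.environ['WORKSPACE_BUCKET']
!gsutil cp startup_script.sh $bucket
print(f"{bucket}/startup_script.sh")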

What real-time updates can you make to Cloud Environment compute resources without losing data?

1. Increase or decrease the number of CPUs or the VM memory
During this update, Terra will pause the Cloud Environment, update it, and then restart it. The update will take a couple of minutes to complete, and you will not be able to continue editing or running the Notebook while it's completing.

2. Increase the disk size or change the number of workers (when the number of workers is > 2)
During this update, you can continue to work in your Notebook without pausing your cloud environment. When the update is finished, you will see a confirmation banner. 

Note that if you want to change both the workers and the CPUs/memory, we advise doing so sequentially:

1. First update the CPUs/memory.
2. Wait for the Notebook Cloud Environment to restart.
3. Adjust the workers.

Cloud Environment changes that can cause you to lose work

  - Decreasing the Persistent Disk size

- Deleting the Persistent Disk (when recreating or deleting the Cloud Environment)

Note that this is true no matter what kind of interactive analysis you are running, including RStudio, Jupyter Notebooks, or Galaxy. Please back up files as appropriate. 

How to save interactive analysis outputs to the Workspace bucket

To avoid losing your data, make sure to explicitly save your outputs in the workspace bucket. You can find step-by-step instructions - and the exact code to do this within the notebook - below.

Python kernel instructions 

1. Set the environment variables.

import os

# Terra sets these environment variables in every notebook Cloud Environment
BILLING_PROJECT_ID = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.environ['WORKSPACE_NAME']
bucket = os.environ['WORKSPACE_BUCKET']

2. Copy all files in the notebook into the workspace bucket.

# Copy everything in the notebook's working directory to the workspace bucket
!gsutil cp ./* $bucket
# Run the list command to verify the files are in the bucket
!gsutil ls $bucket

Note: the Workspace bucket is a Google Cloud Storage bucket, so file operations on it use gsutil rather than plain bash commands (in a Python notebook, shell commands are prefixed with "!"). These commands will only work if you have run the cell above to set the environment variables. Once you execute these cells, the data files should be visible in the workspace bucket.

If you want to copy individual files, you can replace `*` with the file name to copy.
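For instance, to copy one named file (the filename is illustrative):

# Copy a single output file instead of everything in the directory
!gsutil cp ./my_analysis_results.csv $bucket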

R kernel instructions

1. Set the environment variables.

# Terra sets these environment variables in every notebook Cloud Environment
project <- Sys.getenv('WORKSPACE_NAMESPACE')
workspace <- Sys.getenv('WORKSPACE_NAME')
bucket <- Sys.getenv('WORKSPACE_BUCKET')

2. Copy all files in the notebook into the workspace bucket.

# Copy all files generated in the notebook into the bucket
system(paste0("gsutil cp ./* ", bucket), intern = TRUE)
# Run the list command to see if the files are in the bucket
system(paste0("gsutil ls ", bucket), intern = TRUE)

Note: the Workspace bucket is a Google Cloud Storage bucket, so file operations on it use gsutil, called here through R's system(). These commands will only work if you have run the cell above to set the environment variables. Once you execute these cells, the data files should be visible in the workspace bucket.

If you want to copy individual files, you can replace `*` with the file name to copy.
