Customizing where your data are stored and analyzed

Allie Hajian

Terra sets default values for storage and compute regions that satisfy most use-cases for data stored in North America. But there are cases when specifying the region where your data are stored and analyzed can decrease costs. This article outlines why you might want to customize the region where your data are stored and analyzed, and how to do it.  

Terra no longer recommends using US multi-region buckets. Following Google Cloud Storage pricing changes, multi-region buckets cost more to store data in, and you incur data transfer charges every time you run an analysis on data stored in a multi-region bucket. For more details, see this blog.

Google Cloud regionality overview

Data files "in the cloud" actually exist on physical storage devices and computers here on earth. Google Cloud has data centers that house these machines all around the globe and allows users to request that their data be stored in particular "regions". You can read all about Google's data centers and regions in the Google documentation.

Likewise, the virtual machines (VMs) you use to process that data (for example, in a workflow or notebook analysis) exist on physical machines in Google's regional datacenters.

Google Cloud has a set of regions; a region consists of a set of zones

For example, the us-central1 region has four zones: us-central1-a, us-central1-b, us-central1-c, and us-central1-f.

Cloud Storage buckets can be regional or multi-regional

  • Terra recommends using single-region buckets.
  • Terra's default bucket location is us-central1.

GCE VMs are zonal

  • For most compute needs, the zones within a region are equivalent.
  • For some compute resources, such as GPUs or preemptible VMs, availability and capacity differ.

Why customize your data storage/compute region?

Terra assigns default storage and compute regions that simplify workspace setup for researchers. However, there are cases when you may want to override the defaults.

  • You work with data that is not in us-central1 and want to reduce costs
    Moving data between regions - called data transfer - requires network infrastructure and comes with a cost. For example, you incur data transfer charges if you analyze data stored in one region using VM compute resources in a different region.

  • You have limited oversight of the workflow and compute choices of labs in your organization and can't enforce standardization on a single region, such as us-central1.

  • You have existing US multiregional buckets and want to copy data between these buckets and new buckets. Remember - a copy between regional and multiregional buckets incurs data transfer charges.

Will customizing the VM region save you money? Consider where all the data for your study are stored (including the workspace bucket, external storage buckets, and consortia data from data repositories like Gen3), and especially whether they are stored in one region (different from us-central1). It costs more to store data for quick, easy access in multiple regions. On the other hand, the savings of regional storage can disappear if you need to analyze the data in a different - or many different - region(s).

For more detailed cost information, see Regional or Multi-regional US buckets: tradeoffs
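The tradeoff above can be sketched with some rough arithmetic. The prices below are illustrative assumptions, not official rates - check the Google Cloud pricing pages for current numbers:

```python
# Illustrative price assumptions (NOT official rates), in USD:
#   regional storage in us-central1 vs US multi-region storage, per GB per month,
#   and inter-region data transfer within the US, per GB moved.
STORAGE_PER_GB_MONTH = {"us-central1": 0.020, "us-multi-region": 0.026}
US_INTER_REGION_TRANSFER_PER_GB = 0.01

def monthly_cost(gb_stored, location, gb_transferred_cross_region=0):
    """Rough monthly bill: storage plus any cross-region data transfer."""
    storage = gb_stored * STORAGE_PER_GB_MONTH[location]
    transfer = gb_transferred_cross_region * US_INTER_REGION_TRANSFER_PER_GB
    return round(storage + transfer, 2)

# 10 TB stored regionally and analyzed in the same region:
print(monthly_cost(10_000, "us-central1"))           # 200.0
# The same 10 TB, but fully localized each month to VMs in another region:
print(monthly_cost(10_000, "us-central1", 10_000))   # 300.0
```

As the sketch suggests, the savings of regional storage evaporate quickly if every analysis pulls the data into a different region.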

Where data are stored and analyzed in Terra

Terra consists of Terra-managed infrastructure (including the Terra control plane and workspace metadata) and user-managed cloud components for storing and analyzing data.


Terra-managed location

  • Terra control plane
  • Workspace metadata

User-managed location

  • Workspace bucket
  • Workflows VM
  • Cloud Environment VM

Terra storage and analysis region defaults

  • Workspace buckets: us-central1
  • Workflow VMs: default to the workspace bucket region (or us-central1 for US multi-region buckets)
  • Cloud Environment VMs: default to the workspace bucket region (or us-central1 for US multi-region buckets)

To learn more, see Terra architecture and where your files live in it

Background on Terra defaults

Terra assigns regions to the workspace storage and compute by default. The defaults simplify decision making for researchers; you can focus on storage and compute pricing as you're unlikely to encounter cross-region data transfer (egress) charges. Centralizing all storage and compute in a single region can reduce your storage costs and avoid data transfer - reducing your total cloud bill.

Data storage and data transfer costs

Note: Data transfer from the US to another continent is much more expensive! See this pricing guide for the full list of up-to-date costs of Google Cloud data transfer. 

Be aware of Storage + Compute + Network Data Transfer pricing! Take care to avoid data transfer charges that result from using VMs in a region other than the workspace bucket region.

The default behavior is that workflow VMs will be located in the same region as the workspace bucket (or us-central1 for US multi-regional buckets).

If you're using regional storage in the US, this will typically be us-central1. When running workflows, be careful that those workflows don't explicitly set available zones in regions other than your workspace bucket's region.

How to customize your workspace bucket storage region

If you are working with data in an external Google bucket outside of us-central1, you may want to set the location of your workspace bucket(s) to the same region the data are stored in. This reduces costs by storing and analyzing the data in a single region.

If you change the default location of your workspace bucket(s), be aware that you may incur data transfer charges when copying data from one region to another (examples below).

  • Copying data from a multiregional bucket to a regional bucket.
  • Copying data from a regional bucket to a regional bucket when the regions are different.
  • Copying data from a regional bucket to a VM in a different region.
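The copy scenarios above reduce to a simple rule of thumb: charges apply whenever the source and destination locations differ. A minimal sketch (the helper name and location strings are illustrative, not part of any Terra API):

```python
def incurs_transfer_charge(src_location: str, dst_location: str) -> bool:
    """Simplified rule of thumb for the cases listed above: a copy incurs
    data transfer charges whenever the source and destination locations
    differ. Locations may be a region ('us-central1'), a multi-region
    ('us'), or the region of a VM."""
    return src_location.lower() != dst_location.lower()

print(incurs_transfer_charge("us", "us-central1"))           # True: multi-region -> regional
print(incurs_transfer_charge("us-central1", "us-east1"))     # True: different regions
print(incurs_transfer_charge("us-central1", "us-central1"))  # False: same region
```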

The benefits of single region versus multi-region: Using a single US region means saving money month after month on all workspace bucket storage. Storage costs are often the single largest cost for life sciences projects on the cloud - especially if you are paying to store your own primary data.

For many organizations and individual labs, the best long-term choice is to select a single region, such as Google's oldest (and presumably largest), us-central1, and use it for storage as well as all compute VMs. Terra's default region for Workflow and Cloud Environment VMs has historically been us-central1.

Customizing your workspace bucket - step-by-step instructions

You will have the opportunity to select either us-central1 (Iowa) (default) or northamerica-northeast1 (Montreal) from the dropdown when you create or clone a workspace.


Selecting other workspace bucket regions: The bucket location drop-down menu lists a limited number of regions. If you want to select a different region, you can create your workspace through the createWorkspace API endpoint instead. Set the bucketLocation field to your preferred Google bucket location.

Note that you will have to authorize Swagger (using the same login credentials that you use to log into Terra) before executing the API call.

Once the workspace is created, you can use and manage it through the Terra website just like any other workspace.
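As a sketch, the request body for the createWorkspace call might look like the following. The namespace and name values are placeholders, and field names other than bucketLocation should be confirmed against the Swagger schema before you rely on them:

```python
import json

# Hypothetical createWorkspace request body. The namespace and name values
# are placeholders; confirm the exact field set on Terra's Swagger page.
payload = {
    "namespace": "my-billing-project",   # your Terra billing project (placeholder)
    "name": "my-regional-workspace",     # new workspace name (placeholder)
    "attributes": {},
    "bucketLocation": "us-west1",        # the Google bucket region you want
}
print(json.dumps(payload, indent=2))
```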

How to customize your workflow compute region 

Every WDL workflow allows you to specify a list of Compute Engine zones as a default for the VMs in that workflow. In addition, you can specify a list of Compute Engine zones for each individual task's VM, which will override the default for the workflow.

Default zones for workflow VMs

All zones in the workspace bucket region, or us-central1 for US multi-regional workspace buckets.

Implications of changing the region of your workflow VM from the default: All analyses in Terra are run on VMs. If you change the location from the default, you may incur data transfer charges if your bucket location and workflow VM location are different.

Any bucket - regional or multi-regional - whose location doesn't match the VM region will result in data transfer charges for localizing data when running the workflow.

Customizing your workflow VM location - step-by-step instructions

To specify zones on a per-task basis, you can either:

  • Provide a hard-coded list in the workflow WDL
  • Allow a user-input value to be used as the list of zones

For example, here is a hard-coded list of zones:

runtime {
  docker: "python:slim"
  disks: "local-disk 200 HDD"
  memory: "4G"
  cpu: 1
  zones: "us-central1-a us-central1-b us-central1-c us-central1-f"
}

and here is a workflow input option, called runtime_zones:

workflow MyWorkflow {
   String runtime_zones

   call MyTask { input: runtime_zones = runtime_zones }
}

# Minimal example task that passes the zones input through to its runtime
task MyTask {
   String runtime_zones

   command { echo "running" }

   runtime {
      docker: "python:slim"
      disks: "local-disk 200 HDD"
      memory: "4G"
      cpu: 1
      zones: runtime_zones
   }
}

which you can set on the workflow submission page.

How to customize your Cloud Environment region 

Historically, Cloud Environment VMs - used for Jupyter Notebooks, RStudio, and Galaxy - have been created in the us-central1 region. Now you can choose the region for your Cloud Environment VM right in Terra.

For step-by-step instructions, see How to Customize your Cloud Environment in Understanding and Adjusting your Cloud Environment

Cloud Environment caveats: This functionality is currently EXPERIMENTAL, and the option to change the region is only available to users with non-US workspace buckets (which are only available through the API).

There is a data transfer cost risk for the rendering of the user interface, which flows through the Leo proxy in us-central1.

This functionality is currently supported for standard VMs and Spark single nodes. Spark cluster support will be added in a future release.

What is the default behavior? 

Your Cloud Environment will default to the workspace bucket region. 

Implications of changing the Cloud Environment region from the default: All analyses in Terra are run on VMs. If you change the location from the value proposed by the UI, you may incur data transfer charges if your bucket location and interactive analysis Cloud Environment location are different.

Any regional or multiregional bucket mismatched with the Cloud Environment region will result in data transfer charges for copying data from the bucket to the Cloud Environment.




  • Comment author
    Zih-Hua Fang


    When will it be possible to customize the workspace bucket storage region?

  • Comment author
    Allie Hajian
    • Edited

    Zih-Hua Fang Thanks for the question. We are actively working on this functionality and hoping to release it soon, though I can't say exactly when that will be. The best way to stay current on new releases is by "following" the release notes section in Terra Support (see the blue button at the top right of the article).

  • Comment author
    Nicholas Youngblut

    My entire non-profit institute uses us-west1 for all data storage (e.g., many Tb of sequence data). Will us-west1 (and other regions) be supported anytime soon? We are testing out Terra Bio, but the lack of support for other regions makes us question whether using Terra is a good fit for our needs.

  • Comment author
    Allie Cliffe

    Nicholas Youngblut - The engineering team is currently targeting January 2024 for multi-region support for Terra on Azure. Additionally, it's possible to create a GCP workspace in any region using the API (though it's important to note that because Terra has core services in us-central1, that could result in egress costs if you're not careful). 

    For additional information and next steps, please reach out to frontline support, who would be happy to help.

  • Comment author
    Nicholas Youngblut

    Allie Cliffe would you advise that we transfer all of our existing GCP Cloud Storage data from us-west1 to us-central1? The total amount of data is <100 Tb, so hopefully <$1000. We do not want to use the API for creating GCP workspaces, since some users do not have such skills. 

    In regards to "Terra on Azure", is Terra expanding from GCP to include Azure, or migrating from GCP to Azure?

  • Comment author
    Allie Cliffe

    Nicholas Youngblut Terra was originally built on Google infrastructure but is expanding to Azure infrastructure as well. Both versions of Terra will exist. Whether your workspaces use Azure or Google infrastructure depends on what funds your Terra Billing Project. The goal is to have very similar functionality, regardless of whether you are on Terra GCP or Terra on Azure. There are some key differences at the back end, with different pros and cons. See Getting Started (Terra on Azure) for more details. 

    I would recommend reaching out to support to ask if it would be advantageous for you to transfer your data from GCS us-west1 to us-central1. It depends on your use case - what kind of analysis you will be doing, how hard it is to change the compute to a non-default region of storage, etc.

