Customizing where your data are stored and analyzed

Allie Hajian

Terra sets default values for storage and compute regions that satisfy most use-cases for data stored in North America. But there are cases when being able to specify the region where your data are stored and analyzed can decrease costs. This article outlines why you might want to customize the region where your data are stored and analyzed, and how to do it.  

For more information, see US regional versus multi-regional US buckets: trade-offs

GCP regionality overview

Data "in the cloud" actually exists on physical storage devices and computers here on Earth. Google Cloud has data centers that house these machines all around the globe, and allows users to request that their data be stored in particular "regions". You can read all about Google's datacenters and regions in the Google documentation.

Likewise, the virtual machines (VMs) you use to process that data (for example, in a workflow or notebook analysis) exist on physical machines in Google's regional datacenters.

GCP has a set of regions, and each region consists of a set of zones

For example, the us-central1 region has four zones: us-central1-a, us-central1-b, us-central1-c, and us-central1-f.
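
Because a zone name is its region name plus a short suffix, you can recover the region from any zone name. Here is a minimal Python sketch of that relationship. The function name region_of_zone is hypothetical, and the sketch assumes the "<region>-<zone letter>" naming pattern shown in the example above.

```python
def region_of_zone(zone: str) -> str:
    """Return the region part of a Compute Engine zone name.

    Assumes the '<region>-<zone letter>' naming used in the example above.
    """
    region, _, suffix = zone.strip().rpartition("-")
    if not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


zones = ["us-central1-a", "us-central1-b", "us-central1-c", "us-central1-f"]
regions = {region_of_zone(z) for z in zones}
print(regions)  # all four zones map to the single region us-central1
```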

Cloud Storage buckets are regional, but can be multi-regional

  • Terra's default bucket location is US multi-region.
  • You may optionally select an individual region, us-central1.

GCE VMs are zonal

  • For most compute needs, the zones within a region are equivalent.
  • For some compute resources, such as GPUs or preemptible VMs, availability and capacity differ.

Why customize your data storage/compute region?

Terra assigns default storage and compute regions that simplify workspace setup for researchers. However, there are cases when you may want to override the defaults. See if one of the use-cases below meets your needs.

I work with data in the US and want to reduce costs

Moving data (called egress) requires infrastructure and comes with a cost. For example, you will incur egress charges if you analyze data stored in one region using VM compute resources in a different region.

To determine whether customizing the VM region will save you money, consider where all the data for your study are stored (including the workspace bucket, external storage buckets, and consortia data from data repositories like Gen3), and especially whether they are all stored in one region. Multi-regional storage, which gives quick, easy access from multiple regions, costs more than regional storage. On the other hand, the savings from regional storage can disappear if you need to analyze the data in a different region, or in many different regions.

For more information, see Regional or Multi-regional US buckets: tradeoffs

I want to keep life simple (take the defaults, willing to pay more for storage costs)

The default storage and compute regions will work for you if

  • You prefer consistent (though higher) storage costs to avoid the risk of unexpected egress fees
  • You need access to data stored in multiple regions

Overview: Where data are stored and analyzed in Terra

Terra consists of Terra-managed infrastructure (including the Terra control plane and workspace metadata) and user-managed cloud components for storing and analyzing data.

Terra-infrastructure-regions_Diagram.png

Terra-managed location

  • Terra control plane
  • Workspace metadata

User-managed location

  • Workspace bucket
  • Workflow VMs
  • Cloud Environment VMs

Terra storage and analysis region defaults

  • Workspace buckets: US multi-regional
  • Workflow VMs: default to the workspace bucket region, or to us-central1 for US multi-regional workspace buckets.
  • Cloud Environment VMs: default to the workspace bucket region, or to us-central1 for US multi-regional workspace buckets.
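
The default rule above can be written as a small function. This is only an illustrative Python sketch, not Terra's implementation: the function name is hypothetical, and it assumes the US multi-regional location is written as "US".

```python
US_MULTI_REGION = "US"              # assumed label for the US multi-regional location
FALLBACK_VM_REGION = "us-central1"  # default VM region for US multi-regional buckets


def default_vm_region(bucket_location: str) -> str:
    """Default region for workflow and Cloud Environment VMs, per the list above."""
    location = bucket_location.strip()
    if location.upper() == US_MULTI_REGION:
        # A multi-regional bucket has no single region, so us-central1 is used.
        return FALLBACK_VM_REGION
    # A regional bucket: VMs default to the bucket's own region.
    return location.lower()


print(default_vm_region("US"))           # us-central1
print(default_vm_region("us-central1"))  # us-central1
```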

To learn more, see Terra architecture and where your files live in it

Background on Terra defaults

Terra assigns regions to the workspace storage and compute by default. The defaults simplify decision-making for researchers; you can focus on storage and compute pricing, as you are unlikely to encounter cross-region egress charges. However, the defaults can increase your monthly cloud costs. If you have time and are able to centralize all of your storage and compute in a single region, you can reduce your storage costs and avoid egress, reducing your total cloud bill.

Example data storage and egress savings

Cost of data storage in the US

  • US multi-regional: $0.026 / GB / mo
  • US regional: $0.02 / GB / mo

Cost of network egress in the US   

  • US multi-regional: no cost (to VMs in any US region)
  • US regional: no cost (to VMs in the same US region); $0.01 / GB (to VMs in a different US region)

Note that egress from the US to another continent is much more expensive than this example! See this pricing guide for the full list of up-to-date costs of GCP egress. 
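
To see how these numbers trade off, here is a rough Python sketch that compares monthly costs using the example prices listed above. The 10 TB bucket size is a hypothetical value chosen for illustration, and the prices may be out of date, so check the pricing guide before relying on the result.

```python
from decimal import Decimal

# Example prices quoted above (USD); they may not match current GCP pricing.
MULTI_REGIONAL_STORAGE = Decimal("0.026")  # per GB per month
REGIONAL_STORAGE = Decimal("0.02")         # per GB per month
CROSS_REGION_EGRESS = Decimal("0.01")      # per GB, regional bucket to a VM in another US region


def monthly_cost(stored_gb, cross_region_egress_gb, regional):
    """Monthly storage plus US cross-region egress cost for one bucket."""
    if regional:
        return (REGIONAL_STORAGE * stored_gb
                + CROSS_REGION_EGRESS * cross_region_egress_gb)
    # US multi-regional: egress to VMs in any US region is free.
    return MULTI_REGIONAL_STORAGE * stored_gb


stored_gb = 10_000  # hypothetical 10 TB workspace bucket
multi = monthly_cost(stored_gb, 0, regional=False)
regional_same_region = monthly_cost(stored_gb, 0, regional=True)
savings = multi - regional_same_region
break_even_gb = savings / CROSS_REGION_EGRESS

print(multi, regional_same_region, savings, break_even_gb)
```

With these example prices, the 10 TB bucket costs $260 per month as multi-regional and $200 per month as regional. The $60 monthly saving is used up once about 6,000 GB per month is read by VMs in a different US region.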

Be aware of Storage + Compute + Network Egress pricing!

 

You will need to be careful to avoid egress charges that result from using VMs in a different region than the workspace bucket region.

The default behavior is that workflow VMs will be located in the same region as the workspace bucket (or us-central1 for US multi-regional buckets).

If you switch to regional storage in the US, the workspace bucket region will be us-central1. When running workflows, take care that they do not explicitly set the available zones to include regions other than that of your workspace bucket.


How to customize your workspace bucket storage region

You may want to set the location of your workspace bucket(s) to be us-central1 to reduce storage costs by storing and analyzing in a single region.

If you change the location of your workspace bucket(s), be aware that egress charges can be incurred copying data from one region to another (examples below).

  • Copying data from a multi-regional bucket to a regional bucket when that region is not part of the multi-region
  • Copying data from a regional bucket to a regional bucket when the regions are different
  • Copying data from a regional bucket to a VM in a different region

The benefits of single-region versus multi-region

Selecting a single US region saves money month after month on all workspace bucket storage. Storage is often the single largest cost for life sciences projects in the cloud, especially if you are paying to store your own primary data.

For many organizations and individual labs, the best long-term choice is to select a single region, such as us-central1 (Google's oldest and presumably largest region), and use it for both storage and compute VMs. Terra's default region for Workflow and Cloud Environment VMs has historically been us-central1.

When multi-region storage is the best option 

 
  • You have consistently high, ongoing workflow compute requirements that depend on the full capacity of multiple US regions.

  • You don't store much data in your workspace bucket (this could be the case if your primary data is stored by an organization such as All of Us, AMP-PD, BioData Catalyst or AnVIL and you don't generate or keep large amounts of secondary data).

  • You have limited oversight of workflow and compute choices of labs in your organization and would not be able to enforce standardization on a single region, such as us-central1.

  • An unexpected accidental egress charge would have a greater impact to your organization than consistently higher storage costs.

  • You have existing US multi-regional buckets and would want to copy data between these buckets and new buckets. Remember that a copy between regional and multi-regional buckets incurs egress charges.


Customizing your workspace bucket - step-by-step instructions

You will have the opportunity to select either US multi-regional (default) or us-central1 (Iowa) from the dropdown when you create or clone a workspace.

Data-regionality_Choose-workspace-bucket-location_Screen_shot.png

How to customize your workflow compute region 

Every WDL workflow allows you to specify a list of Compute Engine zones as a default for the VMs in that workflow. In addition, you can specify a list of Compute Engine zones for each individual task's VM, which will override the default for the workflow.

Default zones for workflow VMs

All zones in the workspace bucket region

or

us-central1 for US multi-regional workspace bucket

Implications of changing the region of your VM from the default

  All analyses in Terra are run on VMs. If you change the location from the default, you may incur egress charges if your bucket location and workflow VM location are different.
  • A US multi-regional bucket allows for a VM in any US region, without risk of incurring egress charges for copying data from the bucket to the VM.

  • Any regional bucket mismatched with the VM region will result in egress charges for localizing data when running the workflow.
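
The two rules above can be expressed as a simple check. The Python sketch below is illustrative only. The function name is hypothetical, and it assumes that "US" denotes the US multi-region and that US regions are the ones whose names start with "us-". Other multi-regions are not modeled.

```python
def localization_incurs_egress(bucket_location: str, vm_region: str) -> bool:
    """Apply the two rules above to a bucket-to-VM copy.

    Assumption: "US" is the US multi-region and US regions start with "us-".
    """
    bucket = bucket_location.strip().lower()
    vm = vm_region.strip().lower()
    if bucket == "us":
        # Multi-regional bucket: free to any US region, charged elsewhere.
        return not vm.startswith("us-")
    # Regional bucket: free only when the VM is in the same region.
    return bucket != vm


print(localization_incurs_egress("US", "us-east1"))              # False
print(localization_incurs_egress("us-central1", "us-central1"))  # False
print(localization_incurs_egress("us-central1", "us-west1"))     # True
```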


Customizing your workflow VM location - step-by-step instructions

To specify zones on a per-task basis 

  • Provide a hard-coded list in the workflow WDL
  • Allow a user-input value to be used as the list of zones

For example, here is a hard-coded list of zones:

runtime {
   docker: "python:slim"
   disks: "local-disk 200 HDD"
   memory: "4G"
   cpu: 1
   zones: "us-central1-a us-central1-b us-central1-c us-central1-f"
}

and here is a workflow input option, called runtime_zones:

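workflow MyWorkflow {
   # Workflow-level input, set on the submission page
   String runtime_zones

   ...

   # MyTask is a placeholder name for a task in this workflow
   call MyTask { input: runtime_zones = runtime_zones }
}

task MyTask {
   String runtime_zones

   ...

   # The runtime block belongs to a task, not to the workflow itself
   runtime {
      docker: "python:slim"
      disks: "local-disk 200 HDD"
      memory: "4G"
      cpu: 1
      zones: runtime_zones
   }
}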

which you can set on the workflow submission page:
Data-regionality_Specify-runtime-zones-in-UI_Screen_shot.png
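
Before submitting a workflow against a regional workspace bucket, it can help to check that a zones value stays inside the bucket region. Here is a minimal Python sketch of such a check. The function name is hypothetical, and it assumes the space-separated zones format and the "<region>-<zone letter>" naming used in the examples above. For a US multi-regional bucket, any US zone is fine, so this check applies only to regional buckets.

```python
from typing import List


def zones_outside_region(zones: str, bucket_region: str) -> List[str]:
    """Return the zones in a space-separated zones string that are not in bucket_region."""
    outside = []
    for zone in zones.split():
        # The region is everything before the final "-<letter>" suffix.
        region = zone.rpartition("-")[0]
        if region != bucket_region:
            outside.append(zone)
    return outside


print(zones_outside_region("us-central1-a us-central1-b us-east1-b", "us-central1"))
# ['us-east1-b']
```

An empty list means every zone is in the bucket region, so localizing data to the workflow VMs should not incur cross-region egress.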

How to customize your Cloud Environment region 

Historically, Cloud Environment VMs - used for Jupyter notebooks, RStudio, and Galaxy - have been created in the us-central1 region. Now you can choose the region for your Cloud Environment VM right in the UI.

For step-by-step instructions, see How to Customize your Cloud Environment in Understanding and Adjusting your Cloud Environment

Caveats

  • Note that this functionality is currently EXPERIMENTAL.
  • The option to change the region is only available to users with non-US workspace buckets (which are only available through the API).
  • There is an egress risk for the rendering of the user interface, which flows through the Leo proxy in us-central1.
  • This functionality is currently supported for standard VMs and Spark master nodes. Spark cluster support will be added in a future release.

What is the default behavior? 

Your Cloud Environment will default to the workspace bucket region. 

Implications of changing the region of your cloud compute engine from the default

  All analyses in Terra are run on VMs. If you change the location from the value proposed by the UI, you may incur egress charges if your bucket location and interactive analysis Cloud Environment location are different.
  • A US multi-regional bucket allows for a Cloud Environment in any US region, without risk of incurring egress charges for copying data from the bucket to the Cloud Environment.

  • Any regional bucket mismatched with the Cloud Environment region will result in egress charges for copying data from the bucket to the Cloud Environment.

Frequently asked questions

Why is <region> not available in the Terra UI?

Initially, we are only offering the us-central1 regional option for the Workspace bucket location.

This is to help the research communities in the US avoid creating regional data silos when they create regional buckets. For example, if one data generator chooses to locate data in us-west1, while another arbitrarily chooses us-central1, any cross-analysis of that data will incur egress charges.

Access to life science research data is rarely latency-sensitive; it is generally throughput-sensitive. Thus, locating data on the west or east coast of the US provides little value compared with the cost to the community of data siloing.

Regional research data for the US is directed to:

  • us-central1 (Iowa)

Note that Terra buckets default to US multi-regional, which avoids egress charges within Google's US data centers but has a higher storage cost.

Comments

2 comments

  • Comment author
    Zih-Hua Fang

    Hi,

    When will it be possible to customize the workspace bucket storage region?

  • Comment author
    Allie Hajian
    • Edited

    Zih-Hua Fang, thanks for the question. We are actively working on this functionality and hoping to release soon, though I can't say exactly when it will be. The best way to stay current on new releases is by "following" the release notes section in Terra Support (see the blue button at the top right of the article).
