Customizing where your data are stored and analyzed

Allie Hajian

Terra sets default values for storage and compute regions that satisfy most use-cases for data stored in North America. But there are cases when being able to specify the region where your data are stored and analyzed can decrease costs. This article outlines why you might want to customize the region where your data are stored and analyzed, and how to do it.  

Note: Due to a change in GCP pricing policy coming in October 2022, Terra is changing the default regionality for new workspaces from "US multi-region" to "us-central1" as of June 13, 2022. See this blog for more details!

For more information, see US regional versus multi-regional US buckets: trade-offs

GCP regionality overview

Data "in the cloud" actually exist on physical storage devices and computers here on Earth. Google Cloud has data centers all around the globe that house these machines, and it allows users to request that their data be stored in particular "regions". You can read more about Google's data centers and regions in the Google documentation.

Likewise, the virtual machines (VMs) you use to process that data (for example, in a workflow or notebook analysis) exist on physical machines in Google's regional datacenters.

GCP has a set of regions, and each region consists of a set of zones
For example, the us-central1 region has four zones: us-central1-a, us-central1-b, us-central1-c, and us-central1-f.

Cloud Storage buckets are regional, but can be multi-regional

  • Terra's default bucket location is us-central1.
  • You may optionally select a different location, such as US multi-region. Note that we are working behind the scenes to include more regions to choose from.

GCE VMs are zonal

  • For most compute needs, the zones within a region are equivalent.
  • For some compute resources, such as GPUs or preemptible VMs, availability and capacity differ.

Why customize your data storage/compute region?

Terra assigns default storage and compute regions that simplify workspace setup for researchers. However, there are cases when you may want to override the defaults. See if one of the use-cases below meets your needs.

I work with data in the US and want to reduce costs

Moving data out of a region (called egress) requires infrastructure and comes with a cost. You will incur egress charges, for example, if you analyze data stored in one region using VM compute resources in a different region.

To determine whether customizing the VM region will save you money, consider where all the data for your study are stored (including the workspace bucket, external storage buckets, and consortium data from data repositories like Gen3), and especially whether they are all stored in one region. It costs more to store data for quick, easy access in multiple regions. On the other hand, the savings from regional storage can disappear if you need to analyze the data in a different region, or in many different regions.

For more information, see Regional or Multi-regional US buckets: tradeoffs

I want to keep life simple (take the defaults, willing to pay more for storage costs)

The default storage and compute regions will work for you if

  • You prefer consistent (though higher) storage costs to avoid the risk of unexpected egress fees
  • You need access to data stored in multiple regions

Where data are stored and analyzed in Terra

Terra consists of Terra-managed infrastructure (including the Terra control plane and workspace metadata) and user-managed cloud components for storing and analyzing data.

Terra-infrastructure-regions_Diagram.png

Terra-managed location
  • Terra control plane
  • Workspace metadata

User-managed location
  • Workspace bucket
  • Workflow VM
  • Cloud Environment VM

Terra storage and analysis region defaults

  • Workspace buckets: us-central1
  • Workflow VMs: default to the workspace bucket region or us-central1 
  • Cloud Environment VMs: default to the workspace bucket region, or us-central1

To learn more, see Terra architecture and where your files live in it

Background on Terra defaults

Terra assigns regions to the workspace storage and compute by default. The defaults simplify decision-making for researchers; you can focus on storage and compute pricing, as you are unlikely to encounter cross-region egress charges. However, the defaults can make your monthly cloud costs higher. If you have time and are able to centralize all of your storage and compute in a single region, you can reduce your storage costs and avoid egress, which reduces your total cloud bill.

Example data storage and egress savings

Cost of data storage in the US

  • US multi-regional: $0.026 / GB / mo
  • US regional: $0.02 / GB / mo

Cost of network egress in the US   

  • US multi-regional: no cost (to VMs in any US region)
  • US regional: no cost (to VMs in the same US region); $0.01 / GB (to VMs in a different US region)

Note that egress from the US to another continent is much more expensive than this example! See this pricing guide for the full list of up-to-date costs of GCP egress. 
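To see how these numbers interact, here is a minimal sketch that compares monthly costs using only the example prices listed above. The 10,000 GB of stored data and the function and variable names are hypothetical, chosen for illustration; check the pricing guide for current prices before relying on any result.

```python
# Example prices quoted in this article (USD); these are not live GCP prices.
STORAGE_PER_GB_MONTH = {"multi-regional": 0.026, "regional": 0.02}
CROSS_REGION_EGRESS_PER_GB = 0.01  # regional bucket to a VM in a different US region


def monthly_cost(stored_gb, bucket_type, cross_region_read_gb=0.0):
    """Storage cost plus cross-region egress (egress applies to regional buckets only)."""
    storage = stored_gb * STORAGE_PER_GB_MONTH[bucket_type]
    egress = 0.0
    if bucket_type == "regional":
        egress = cross_region_read_gb * CROSS_REGION_EGRESS_PER_GB
    return storage + egress


stored_gb = 10_000  # hypothetical: about 10 TB in the workspace bucket

multi = monthly_cost(stored_gb, "multi-regional")
regional_same = monthly_cost(stored_gb, "regional")
# Worst case for this sketch: every stored GB is read once by a VM in another US region.
regional_cross = monthly_cost(stored_gb, "regional", cross_region_read_gb=stored_gb)

# GB of cross-region reads per month at which the regional storage savings are used up.
break_even_gb = (multi - regional_same) / CROSS_REGION_EGRESS_PER_GB

print(f"US multi-regional:                ${multi:.2f} / mo")
print(f"US regional, same-region compute:  ${regional_same:.2f} / mo")
print(f"US regional, cross-region compute: ${regional_cross:.2f} / mo")
print(f"Break-even cross-region reads:     {break_even_gb:.0f} GB / mo")
```

In this hypothetical case, regional storage saves about $60 per month as long as compute stays in the bucket's region, and the savings are gone once roughly 6,000 GB per month are read from a different US region.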

Be aware of Storage + Compute + Network Egress pricing! You will need to be careful to avoid egress charges that result from using VMs in a different region than the workspace bucket region.

The default behavior is that workflow VMs will be located in the same region as the workspace bucket (or us-central1 for US multi-regional buckets).

If you're using Regional storage in the US, this will be us-central1. Take care when running workflows that those workflows do not explicitly set the available zones to include regions other than that of your workspace bucket.
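The default described above can be written as a tiny rule. This is an illustrative sketch only, not Terra's actual implementation; the function name is hypothetical, and it assumes a US multi-regional bucket reports its location as "US".

```python
def default_workflow_region(bucket_location):
    """Return the region workflow VMs default to for a given workspace bucket location.

    Assumption: a US multi-regional bucket has the location "US";
    any other value is treated as a single region name.
    """
    if bucket_location.upper() == "US":
        # US multi-regional buckets: workflow VMs default to us-central1.
        return "us-central1"
    # Regional buckets: workflow VMs default to the bucket's own region.
    return bucket_location.lower()


for location in ["US", "us-central1", "northamerica-northeast1"]:
    print(location, "->", default_workflow_region(location))
```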

How to customize your workspace bucket storage region

You may want to set the location of your workspace bucket(s) to be us-central1 to reduce storage costs by storing and analyzing in a single region.

If you change the location of your workspace bucket(s), be aware that egress charges can be incurred copying data from one region to another (examples below).

  • Copying data from a multi-regional bucket to a regional bucket when that region is not part of the multi-region
  • Copying data from a regional bucket to a regional bucket when the regions are different
  • Copying data from a regional bucket to a VM in a different region

The benefits of single-region versus multi-region

Selecting a single US region means saving money month after month on all workspace bucket storage. Storage costs are often the single largest cost for life sciences projects in the cloud, especially if you are paying to store your own primary data.

For many organizations and individual labs, the best long-term choice is to select a single region, such as Google's oldest (and presumably largest) region, us-central1, and use it for both storage and compute VMs. Terra's default region for Workflow and Cloud Environment VMs has historically been us-central1.

When multi-region storage is the best option

  • You have consistently high, ongoing workflow compute requirements that depend on the full capacity of multiple US regions.
  • You don't store much data in your workspace bucket (this could be the case if your primary data are stored by an organization such as All of Us, AMP-PD, BioData Catalyst or AnVIL and you don't generate or keep large amounts of secondary data).
  • You have limited oversight of the workflow and compute choices of labs in your organization and would not be able to enforce standardization on a single region, such as us-central1.
  • An unexpected, accidental egress charge would have a greater impact on your organization than consistently higher storage costs.
  • You have existing US multi-regional buckets and want to copy data between these buckets and new buckets. Remember that a copy between regional and multi-regional buckets incurs egress charges.

Customizing your workspace bucket - step-by-step instructions

You will have the opportunity to select US multi-region, us-central1 (Iowa, the default), or northamerica-northeast1 (Montreal) from the dropdown when you create or clone a workspace.

Regional-bucket-Montreal_Screen_shot.png

How to customize your workflow compute region 

Every WDL workflow allows you to specify a list of Compute Engine zones as a default for the VMs in that workflow. In addition, you can specify a list of Compute Engine zones for each individual task's VM, which will override the default for the workflow.

Default zones for workflow VMs

All zones in the workspace bucket region, or us-central1 for US multi-regional workspace buckets.

Implications of changing the region of your workflow VM from the default

All analyses in Terra are run on VMs. If you change the location from the default, you may incur egress charges if your bucket location and workflow VM location are different.

  • A US multi-regional bucket allows for a VM in any US region, without risk of incurring egress charges for copying data from the bucket to the VM.
  • Any regional bucket mismatched with the VM region will result in egress charges for localizing data when running the workflow.
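Before launching a workflow, you could check a zones list against your bucket location along the lines of the sketch below. It is only an illustration of the two rules above: the function names are hypothetical, it assumes a US multi-regional bucket reports its location as "US", that a zone name is its region plus a one-part suffix (as in us-central1-a), and that any region whose name starts with "us-" counts as a US region.

```python
def region_of(zone):
    """Return the region part of a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def mismatched_zones(bucket_location, zones):
    """List the zones that could trigger egress charges for this bucket.

    bucket_location: "US" (assumed name for US multi-region) or a region name.
    zones: a space-separated string, as used for the WDL runtime `zones` value.
    """
    location = bucket_location.lower()
    flagged = []
    for zone in zones.split():
        region = region_of(zone)
        if location == "us":
            # US multi-regional bucket: a VM in any US region avoids egress charges.
            matches = region.startswith("us-")
        else:
            # Regional bucket: the VM must be in the same region.
            matches = region == location
        if not matches:
            flagged.append(zone)
    return flagged


print(mismatched_zones("us-central1", "us-central1-a us-west1-b"))
print(mismatched_zones("US", "us-central1-a us-west1-b"))
```

An empty list means every requested zone is in a region that matches the bucket; any zone that is returned is one to remove or reconsider.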

Customizing your workflow VM location - step-by-step instructions

To specify zones on a per-task basis 

  • Provide a hard-coded list in the workflow WDL
  • Allow a user-input value to be used as the list of zones

For example, here is a hard-coded list of zones:

runtime {
   docker: "python:slim"
   disks: "local-disk 200 HDD"
   memory: "4G"
   cpu: 1
   zones: "us-central1-a us-central1-b us-central1-c us-central1-f"
}

and here is a workflow input option, called runtime_zones:

workflow MyWorkflow {
   String runtime_zones

   ...

   # Abbreviated: in a complete WDL, the runtime block sits inside a task
   runtime {
      docker: "python:slim"
      disks: "local-disk 200 HDD"
      memory: "4G"
      cpu: 1
      zones: runtime_zones
   }
}

which you can set on the workflow submission page:
Data-regionality_Specify-runtime-zones-in-UI_Screen_shot.png
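If you would rather prepare the value in a JSON inputs file than type it by hand, a small helper like the one below can build the space-separated zones string. This is a sketch under stated assumptions: the helper name is hypothetical, the zone letters are the four us-central1 zones listed earlier in this article, and the key follows the usual WDL inputs convention of workflow name, a dot, then the input name.

```python
import json


def zones_input(workflow_name, input_name, region, zone_letters):
    """Build a one-entry inputs dictionary holding a space-separated list of zones."""
    zones = " ".join(f"{region}-{letter}" for letter in zone_letters)
    return {f"{workflow_name}.{input_name}": zones}


# MyWorkflow and runtime_zones come from the WDL example above.
inputs = zones_input("MyWorkflow", "runtime_zones", "us-central1", "abcf")
print(json.dumps(inputs, indent=2))
```

Keep the zones within the same region as your workspace bucket to avoid the egress charges described above.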

How to customize your Cloud Environment region 

Historically, Cloud Environment VMs - used for Jupyter notebooks, RStudio, and Galaxy - have been created in the us-central1 region. Now you can choose the region for your Cloud Environment VM right in the UI.

For step-by-step instructions, see How to Customize your Cloud Environment in Understanding and Adjusting your Cloud Environment

Cloud Environment caveats

  • This functionality is currently EXPERIMENTAL, and the option to change the region is only available to users with non-US workspace buckets (which are only available through the API).
  • There is an egress risk for the rendering of the user interface, which flows through the Leo proxy in us-central1.
  • This functionality is currently supported for standard VMs and Spark single nodes. Spark cluster support will be added in a future release.

What is the default behavior? 

Your Cloud Environment will default to the workspace bucket region. 

Implications of changing the Cloud Environment compute region from the default

All analyses in Terra are run on VMs. If you change the location from the value proposed by the UI, you may incur egress charges if your bucket location and interactive analysis Cloud Environment location are different.

  • A US multi-regional bucket allows for a Cloud Environment in any US region, without risk of incurring egress charges for copying data from the bucket to the Cloud Environment.
  • Any regional bucket mismatched with the Cloud Environment region will result in egress charges for copying data from the bucket to the Cloud Environment.

Frequently asked questions

Why is <region> not available in the Terra UI?

Initially, we are only offering the us-central1 regional option for the Workspace bucket location.

This is to help the research communities in the US avoid creating regional data silos when they create regional buckets. For example, if one data generator chooses to locate data in us-west1, while another arbitrarily chooses us-central1, any cross-analysis of that data will incur egress charges.

Access to life science research data is rarely latency sensitive; it is generally throughput sensitive. Thus, locating data on the west or east coast of the US provides little value compared with the cost to the community of data siloing.

Regional research data for the US is directed to:

  • us-central1 (Iowa)

Note that US multi-regional buckets avoid egress charges within Google's US data centers but have a higher storage cost.

UPDATE (12-09-2021): We are now offering the northamerica-northeast1 (Montreal) regional option for the Workspace bucket location. As with us-central1 for the US, regional research data for Canada is directed to northamerica-northeast1. We will be adding more regional options. To be notified when new regions are added, see the release notes.


Comments

6 comments

  • Zih-Hua Fang

    Hi,

    When will it be possible to customize the workspace bucket storage region?

  • Allie Hajian

    Zih-Hua Fang Thanks for the question. We are actively working on this functionality and hoping to release it soon, though I can't say exactly when that will be. The best way to stay current on new releases is by "following" the release notes section in Terra Support (see the blue button at the top right of the article).

  • Nicholas Youngblut

    My entire non-profit institute uses us-west1 for all data storage (e.g., many Tb of sequence data). Will us-west1 (and other regions) be supported anytime soon? We are testing out Terra Bio, but the lack of support for other regions makes us question whether using Terra is a good fit for our needs.

  • Allie Cliffe

    Nicholas Youngblut - The engineering team is currently targeting January 2024 for multi-region support for Terra on Azure. Additionally, it's possible to create a GCP workspace in any region using the API (though it's important to note that because Terra has core services in us-central1, that could result in egress costs if you're not careful). 

    For additional information and next steps, please reach out to frontline at support@terra.bio, who would be happy to help. 

  • Nicholas Youngblut

    Allie Cliffe would you advise that we transfer all of our existing GCP Cloud Storage data from us-west1 to us-central1? The total amount of data is <100 Tb, so hopefully <$1000. We do not want to use the API for creating GCP workspaces, since some users do not have such skills. 

    In regards to "Terra on Azure", is Terra expanding from GCP to include Azure, or migrating from GCP to Azure?

  • Allie Cliffe

    Nicholas Youngblut Terra was originally built on Google infrastructure but is expanding to Azure infrastructure as well. Both versions of Terra will exist. Whether your workspaces use Azure or Google infrastructure depends on what funds your Terra Billing Project. The goal is to have very similar functionality, regardless of whether you are on Terra GCP or Terra on Azure. There are some key differences at the back end, with different pros and cons. See Getting Started (Terra on Azure) for more details. 

    I would recommend reaching out to support (support@terra.bio) to ask if it would be advantageous for you to transfer your data from GCS us-west1 to us-central1. It depends on your use case - what kind of analysis you will be doing, how hard it is to change the compute to a non-default region of storage, etc. 

