Customizing where your data are stored and analyzed

Allie Hajian

Terra sets default values for storage and compute regions that satisfy most use-cases for data stored in North America. But there are cases when specifying the region where your data are stored and analyzed can decrease costs. This article outlines why you might want to customize the region where your data are stored and analyzed, and how to do it.  

Note: Due to a change in Google Cloud pricing policy coming in October 2022, Terra is changing the default regionality for new workspaces from "US multi-region" to "us-central1" as of June 13, 2022. See this blog for more details!

For more information, see US regional versus multi-regional US buckets: trade-offs

Google Cloud regionality overview

Data files "in the cloud" actually exist on physical storage devices and computers here on earth. Google Cloud has data centers that house these machines all around the globe, and allow users to request that their data be stored in particular "regions". You can read all about Google's datacenters and regions in the Google documentation.

Likewise, the virtual machines (VMs) you use to process that data (for example, in a workflow or notebook analysis) exist on physical machines in Google's regional datacenters.

Google Cloud has a set of regions, and each region consists of a set of zones.
For example, the us-central1 region has four zones: us-central1-a, us-central1-b, us-central1-c, and us-central1-f.

Cloud Storage buckets are regional, but can be multiregional

  • Terra's default bucket location is us-central1.
  • Optionally, you may select a different region, such as US multi-region. Note: We are working behind the scenes to include more regions to choose from.

GCE VMs are zonal

  • For most compute needs, the zones within a region are equivalent.
  • For some compute resources, such as GPUs or preemptible VMs, availability and capacity differ.

Why customize your data storage/compute region?

Terra assigns default storage and compute regions that simplify workspace setup for researchers. However, there are cases when you may want to override the defaults. See if one of the use-cases below meets your needs.

I work with data in the US and want to reduce costs

Moving data (called data transfer) requires infrastructure and comes with a cost. For example, you incur data transfer charges if you analyze data stored in one region using VM compute resources in a different region.

To check whether customizing the VM region will save you money, consider where all the data for your study are stored (including the workspace bucket, external storage buckets, and consortia data from data repositories like Gen3), and especially whether they are all stored in one region. Storing data for quick, easy access in multiple regions costs more. On the other hand, the savings from regional storage can disappear if you need to analyze the data in a different region, or in many different regions.

For more information, see Regional or Multi-regional US buckets: tradeoffs

I want to keep life simple (take the defaults, willing to pay more for storage costs)

The default storage and compute regions will work for you if

  • You prefer consistent (though higher) storage costs to avoid the risk of unexpected data transfer fees.
  • You need access to data stored in multiple regions.

Where data are stored and analyzed in Terra

Terra consists of Terra-managed infrastructure (including the Terra control plane and workspace metadata) and user-managed cloud components for storing and analyzing data.

Terra-infrastructure-regions_Diagram.png

Terra-managed location
  • Terra control plane
  • Workspace metadata

User-managed location
  • Workspace bucket
  • Workflows VM
  • Cloud Environment VM

Terra storage and analysis region defaults

  • Workspace buckets: us-central1
  • Workflow VMs: default to the workspace bucket region (or us-central1 for US multi-regional buckets)
  • Cloud Environment VMs: default to the workspace bucket region, or us-central1

To learn more, see Terra architecture and where your files live in it

Background on Terra defaults

Terra assigns regions to the workspace storage and compute by default. The defaults simplify decision making for researchers: you can focus on storage and compute pricing, as you're unlikely to encounter cross-region data transfer charges. However, the defaults can increase your monthly cloud costs. If you have time and can centralize all your storage and compute in a single region, you can reduce your storage costs and avoid data transfer charges, reducing your total cloud bill.

Example data storage and data transfer savings

Cost of data storage in the US

  • US multiregional: $0.026 / GB / mo
  • US regional: $0.02 / GB / mo

Cost of network data transfer in the US   

  • US multiregional: no cost (to VMs in any US region)
  • US regional: no cost (to VMs in the same US region); $0.01 / GB (to VMs in a different US region)

Note: Data transfer from the US to another continent is much more expensive than this example! See this pricing guide for the full list of up-to-date costs of Google Cloud data transfer. 
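To see how the example prices above combine, here is a minimal Python sketch. It uses only the US figures quoted in this article, which may be out of date, so check the pricing guide before relying on them. The function names and the 10,000 GB scenario are illustrative assumptions, and compute costs are ignored.

```python
# Example US prices quoted in this article (USD); check the pricing guide for current values.
STORAGE_MULTIREGION_PER_GB_MONTH = 0.026
STORAGE_REGIONAL_PER_GB_MONTH = 0.02
TRANSFER_CROSS_REGION_PER_GB = 0.01  # regional bucket -> VM in a different US region


def monthly_cost(stored_gb, cross_region_gb, regional):
    """Estimate one month's storage plus data transfer cost (USD) for a bucket.

    stored_gb: data kept in the bucket for the month
    cross_region_gb: data read by VMs in a different US region during the month
    regional: True for a US regional bucket, False for a US multiregional bucket
    """
    if regional:
        return (stored_gb * STORAGE_REGIONAL_PER_GB_MONTH
                + cross_region_gb * TRANSFER_CROSS_REGION_PER_GB)
    # US multiregional buckets have no transfer cost to VMs in any US region.
    return stored_gb * STORAGE_MULTIREGION_PER_GB_MONTH


def break_even_transfer_gb(stored_gb):
    """Monthly cross-region transfer (GB) at which regional storage stops being cheaper."""
    monthly_saving = stored_gb * (STORAGE_MULTIREGION_PER_GB_MONTH
                                  - STORAGE_REGIONAL_PER_GB_MONTH)
    return monthly_saving / TRANSFER_CROSS_REGION_PER_GB


if __name__ == "__main__":
    stored = 10_000  # hypothetical: 10,000 GB in the workspace bucket
    print(f"multiregional: ${monthly_cost(stored, 0, regional=False):.2f}")
    print(f"regional, same-region VMs: ${monthly_cost(stored, 0, regional=True):.2f}")
    print(f"break-even cross-region transfer: {break_even_transfer_gb(stored):.0f} GB per month")
```

With these example prices, regional storage saves $0.006 per GB per month, so a bucket holding 10,000 GB saves about $60 a month. That saving is used up once roughly 6,000 GB a month is read by VMs in a different US region.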

Be aware of Storage + Compute + Network Data Transfer pricing! Take care to avoid data transfer charges that result from using VMs in a region other than the workspace bucket region.

The default behavior is that workflow VMs will be located in the same region as the workspace bucket (or us-central1 for US multi-regional buckets).

If you're using regional storage in the US, this will be us-central1. When running workflows, make sure they don't explicitly set zones that belong to a region other than your workspace bucket's region.
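One way to spot a mismatch before submitting is to compare the zones a workflow requests with the bucket region. The Python sketch below assumes the usual Compute Engine naming pattern, in which a zone name is the region name plus a one-letter suffix (for example, us-central1-a is in us-central1). The function names are illustrative and are not part of Terra.

```python
def zone_to_region(zone):
    """Derive the region from a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    return zone.rsplit("-", 1)[0]


def mismatched_zones(zones, bucket_region):
    """Return the zones that lie outside the workspace bucket region.

    zones: space-separated zone names, as written in a WDL runtime 'zones' attribute
    bucket_region: the workspace bucket region, e.g. 'us-central1'
    """
    target = bucket_region.lower()
    return [z for z in zones.split() if zone_to_region(z.lower()) != target]


if __name__ == "__main__":
    requested = "us-central1-a us-central1-b us-west1-b"
    print(mismatched_zones(requested, "us-central1"))
```

An empty list means every requested zone is in the bucket region. Any zone that is returned could lead to data transfer charges when the workflow localizes data.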

How to customize your workspace bucket storage region

You may want to set the location of your workspace bucket(s) to be us-central1 to reduce storage costs by storing and analyzing in a single region.

If you change the location of your workspace bucket(s), be aware that you may incur data transfer charges when copying data from one region to another (examples below).

  • Copying data from a multiregional bucket to a regional bucket when that region is not part of the multiregion.
  • Copying data from a regional bucket to a regional bucket when the regions are different.
  • Copying data from a regional bucket to a VM in a different region.

The benefits of single region versus multiregion

Selecting a single US region will mean saving money month after month on all workspace bucket storage. Storage costs are often the single largest cost for life sciences projects on Cloud, especially if you are paying to store your own primary data.

For many organizations and individual labs, the best long-term choice is to select a single region, such as us-central1 (Google's oldest and presumably largest), and use it for both storage and compute VMs. Terra's default region for Workflow and Cloud Environment VMs has historically been us-central1.

When multiregion storage is the best option

  • You have consistently high, ongoing workflow compute requirements that depend on the full capacity of multiple US regions.
  • You don't store much data in your workspace bucket (i.e., your primary data is stored by an organization such as All of Us, AMP-PD, BioData Catalyst or AnVIL and you don't generate or keep large amounts of secondary data).
  • You have limited oversight of workflow and compute choices of labs in your organization and can't enforce standardization in a single region, such as us-central1.
  • An unexpected, accidental data transfer charge has a greater impact on your organization than consistently higher storage costs.
  • You have existing US multiregional buckets and want to copy data between these buckets and new buckets. Remember: a copy between regional and multiregional buckets incurs data transfer charges.

Customizing your workspace bucket - step-by-step instructions

You can select US multi-region, us-central1 (Iowa, the default), or northamerica-northeast1 (Montreal) from the dropdown when you create or clone a workspace.

Regional-bucket-Montreal_Screen_shot.png

How to customize your workflow compute region 

Every WDL workflow allows you to specify a list of Compute Engine zones as a default for the VMs in that workflow. In addition, you can specify a list of Compute Engine zones for each individual task's VM, which will override the default for the workflow.

Default zones for workflow VMs

All zones in the workspace bucket region, or us-central1 for US multiregional workspace buckets.

Implications of changing the region of your workflow VM from the default

All analyses in Terra are run on VMs. If you change the location from the default, you may incur data transfer charges if your bucket location and workflow VM location are different.

A US multiregional bucket allows for a VM in any US region, without risk of incurring data transfer charges for copying data from the bucket to the VM.

Any regional bucket mismatched with the VM region will result in data transfer charges for localizing data when running the workflow.
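The two rules above can be written as a small check. This is a minimal Python sketch, not Terra's own logic. It assumes that the US multi-region is written as "US", that US regions are the ones whose names start with "us-", and that a zone name is its region plus a one-letter suffix. The function name is illustrative.

```python
def localization_incurs_transfer(bucket_location, vm_zone):
    """Return True if copying data from the bucket to the VM would cross regions.

    bucket_location: 'US' for the US multi-region, or a region such as 'us-central1'
    vm_zone: the Compute Engine zone of the VM, e.g. 'us-central1-a'
    """
    vm_region = vm_zone.rsplit("-", 1)[0].lower()
    location = bucket_location.lower()
    if location == "us":
        # US multiregional bucket: no charge to VMs in any US region (assumed 'us-' prefix).
        return not vm_region.startswith("us-")
    # Regional bucket: no charge only when the VM is in the same region.
    return location != vm_region


if __name__ == "__main__":
    for bucket, zone in [("US", "us-west1-b"),
                         ("us-central1", "us-central1-a"),
                         ("us-central1", "us-east1-b")]:
        print(bucket, zone, localization_incurs_transfer(bucket, zone))
```

The check covers only where the data and the VM sit. The actual charge depends on how much data the workflow localizes and on current Google Cloud pricing.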

Customizing your workflow VM location - step-by-step instructions

To specify zones on a per-task basis 

  • Provide a hard-coded list in the workflow WDL
  • Allow a user-input value to be used as the list of zones

For example, here is a hard-coded list of zones

runtime {
   docker: "python:slim"
   disks: "local-disk 200 HDD"
   memory: "4G"
   cpu: 1
   zones: "us-central1-a us-central1-b us-central1-c us-central1-f"
}

and here is a workflow input option, called runtime_zones

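workflow MyWorkflow {
   String runtime_zones

   ...

   # Pass the workflow-level input to the task (MyTask is an illustrative name)
   call MyTask { input: runtime_zones = runtime_zones }
}

task MyTask {
   String runtime_zones

   ...

   runtime {
      docker: "python:slim"
      disks: "local-disk 200 HDD"
      memory: "4G"
      cpu: 1
      zones: runtime_zones
   }
}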

which you can set on the workflow submission page:
Data-regionality_Specify-runtime-zones-in-UI_Screen_shot.png

How to customize your Cloud Environment region 

Historically, Cloud Environment VMs (used for Jupyter Notebooks, RStudio, and Galaxy) have been created in the us-central1 region. Now you can choose the region for your Cloud Environment VM right in the UI.

For step-by-step instructions, see How to Customize your Cloud Environment in Understanding and Adjusting your Cloud Environment

Cloud Environment caveats

Note: This functionality is currently EXPERIMENTAL, and the option to change the region is only available to users with non-US workspace buckets (which are only available through the API).

There is a data transfer cost risk for the rendering of the user interface, which flows through the Leo proxy in us-central1.

This functionality is currently supported for standard VMs and Spark single nodes. Spark cluster support will be added in a future release.

What is the default behavior? 

Your Cloud Environment will default to the workspace bucket region. 

Implications of changing cloud compute engine region from the default

All analyses in Terra are run on VMs. If you change the location from the value proposed by the UI, you may incur data transfer out charges if your bucket location and interactive analysis Cloud Environment location are different.

A US multiregional bucket allows for a Cloud Environment in any US region, without risk of incurring data transfer charges for copying data from the bucket to the Cloud Environment.

Any regional bucket mismatched with the Cloud Environment region will result in data transfer charges for copying data from the bucket to the Cloud Environment.

Frequently asked questions

Why is <region> not available in the Terra UI?

Initially, we offer only the us-central1 regional option for the Workspace bucket location.

This is to help the research communities in the US avoid creating regional data silos when they create regional buckets. For example, if one data generator chooses to locate data in us-west1, while another arbitrarily chooses us-central1, any cross-analysis of that data will incur data transfer charges.

Access to life science research data is rarely latency-sensitive; it is generally throughput-sensitive. Thus, locating data files on the west or east coast of the US provides little value compared with the cost to the community of data siloing.

Regional research data for US is directed to:

- us-central1 (Iowa)

Note: Terra buckets default to us-central1 (Iowa), which has a lower storage cost and no data transfer costs within us-central1.

UPDATE (12-09-2021): We now offer the northamerica-northeast1 (Montreal) regional option for the Workspace bucket location. In a similar fashion to the above for the US, regional research data for Canada is directed to northamerica-northeast1. We will add additional regional options. To be notified when new regions are added, see the release notes.


Comments

6 comments

  • Comment author
    Zih-Hua Fang

    Hi,

    When will it be possible to customize the workspace bucket storage region?

  • Comment author
    Allie Hajian
    • Edited

    Zih-Hua Fang Thanks for the question. We are actively working on this functionality and hoping to release soon, though I can't say exactly when it will be. The best way to stay current on new releases is by "following" the release notes section in Terra Support (see the blue button at the top right of the article).

  • Comment author
    Nicholas Youngblut

    My entire non-profit institute uses us-west1 for all data storage (e.g., many Tb of sequence data). Will us-west1 (and other regions) be supported anytime soon? We are testing out Terra Bio, but the lack of support for other regions makes us question whether using Terra is a good fit for our needs.

  • Comment author
    Allie Cliffe

    Nicholas Youngblut - The engineering team is currently targeting January 2024 for multi-region support for Terra on Azure. Additionally, it's possible to create a GCP workspace in any region using the API (though it's important to note that because Terra has core services in us-central1, that could result in egress costs if you're not careful). 

    For additional information and next steps, please reach out to frontline at support@terra.bio, who would be happy to help. 

  • Comment author
    Nicholas Youngblut

    Allie Cliffe would you advise that we transfer all of our existing GCP Cloud Storage data from us-west1 to us-central1? The total amount of data is <100 Tb, so hopefully <$1000. We do not want to use the API for creating GCP workspaces, since some users do not have such skills. 

    In regards to "Terra on Azure", is Terra expanding from GCP to include Azure, or migrating from GCP to Azure?

  • Comment author
    Allie Cliffe

    Nicholas Youngblut Terra was originally built on Google infrastructure but is expanding to Azure infrastructure as well. Both versions of Terra will exist. Whether your workspaces use Azure or Google infrastructure depends on what funds your Terra Billing Project. The goal is to have very similar functionality, regardless of whether you are on Terra GCP or Terra on Azure. There are some key differences at the back end, with different pros and cons. See Getting Started (Terra on Azure) for more details. 

    I would recommend reaching out to support (support@terra.bio) to ask if it would be advantageous for you to transfer your data from GCS us-west1 to us-central1. It depends on your use case - what kind of analysis you will be doing, how hard it is to change the compute to a non-default region of storage, etc. 

