Customizing where your data are stored and analyzed

Allie Hajian

Terra sets default values for storage and compute regions that satisfy most use cases for data stored in North America. However, specifying the region where your data are stored and analyzed can sometimes decrease costs. This article outlines why you might want to customize that region, and how to do it.

Terra no longer recommends using US multi-region buckets

This follows Google Cloud Storage pricing changes that increase the cost of storage in multi-region buckets and add data transfer charges every time you run an analysis on data stored in a multi-region bucket. For more details, see this blog.

Google Cloud regionality overview

Data files "in the cloud" actually exist on physical storage devices and computers here on Earth. Google Cloud has data centers around the globe that house these machines, and it allows users to request that their data be stored in particular "regions". You can read all about Google's data centers and regions in the Google documentation.

Likewise, the virtual machines (VMs) you use to process that data (for example, in a workflow or notebook analysis) exist on physical machines in Google's regional datacenters.

Google Cloud has a set of regions; a region consists of a set of zones

For example, the us-central1 region has four zones: us-central1-a, us-central1-b, us-central1-c, and us-central1-f.
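Because a zone name is the region name plus a short suffix, the region can be recovered from any zone string. The following is a minimal Python sketch of that naming relationship; the function name `region_of_zone` is hypothetical and not part of Terra or Google Cloud tooling.

```python
def region_of_zone(zone: str) -> str:
    """Return the region part of a Compute Engine zone name.

    Assumes the usual "<region>-<suffix>" naming, e.g. "us-central1-a".
    """
    region, sep, suffix = zone.strip().rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


# The four zones listed for us-central1 all map back to the same region.
zones = ["us-central1-a", "us-central1-b", "us-central1-c", "us-central1-f"]
regions = {region_of_zone(z) for z in zones}
print(regions)
```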

Cloud Storage buckets are regional, but can be multiregional

  • Terra recommends using single-region buckets.
  • Terra's default bucket location is us-central1.

GCE VMs are zonal

  • For most compute needs, the zones within a region are equivalent.
  • For some compute resources, such as GPUs or preemptible VMs, availability and capacity differ.

Why customize your data storage/compute region?

Terra assigns default storage and compute regions that simplify workspace setup for researchers. However, there are cases when you may want to override the defaults.

  • You work with data that is not in us-central1 and want to reduce costs
    Moving data (called data transfer) requires infrastructure and comes with a cost. For example, you incur data transfer charges if you analyze data stored in one region using VM compute resources in a different region.

  • You have limited oversight of workflow and compute choices of labs in your organization and can't enforce standardization in a single region, such as us-central1.

  • You have existing US multiregional buckets and want to copy data between these buckets and new buckets. Remember: a copy between regional and multiregional buckets incurs data transfer charges.

Will customizing the VM region save you money?

Consider where all the data for your study are stored (including the workspace bucket, external storage buckets, and consortia data from data repositories like Gen3), especially if they are all stored in one region other than us-central1. It costs more to store data in multiple regions for quick, easy access. On the other hand, the savings from regional storage can disappear if you need to analyze the data in a different region, or in many different regions.

For more detailed cost information, see Regional or Multi-regional US buckets: tradeoffs

Where data are stored and analyzed in Terra

Terra consists of Terra-managed infrastructure (including the Terra control plane and workspace metadata) and user-managed cloud components for storing and analyzing data.

Terra-infrastructure-regions_Diagram.png

Terra-managed location

  • Terra control plane
  • Workspace metadata

User-managed location

  • Workspace bucket
  • Workflows VM
  • Cloud Environment VM

Terra storage and analysis region defaults

  • Workspace buckets: us-central1
  • Workflow VMs: default to the workspace bucket region (or us-central1 for US multi-regional buckets)
  • Cloud Environment VMs: default to the workspace bucket region, or us-central1

To learn more, see Terra architecture and where your files live in it

Background on Terra defaults

Terra assigns regions to the workspace storage and compute by default. The defaults simplify decision making for researchers: you can focus on storage and compute pricing because you're unlikely to encounter cross-region data transfer (egress) charges. Centralizing all storage and compute in a single region can reduce your storage costs and avoid data transfer charges, reducing your total cloud bill.

Data storage and data transfer costs

Note: Data transfer from the US to another continent is much more expensive! See this pricing guide for the full list of up-to-date costs of Google Cloud data transfer. 

Be aware of Storage + Compute + Network Data Transfer pricing! Take care to avoid data transfer charges that result from using VMs in a region other than the workspace bucket region.

The default behavior is that workflow VMs will be located in the same region as the workspace bucket (or us-central1 for US multi-regional buckets).

If you're using regional storage in the US with Terra's defaults, this will be us-central1. When running workflows, make sure they don't explicitly set zones in regions other than your workspace bucket's region.
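One way to catch a mismatch before submitting is to compare each zone in a workflow's zones string against the workspace bucket region. The sketch below is illustrative only: `zones_outside_region` is a hypothetical helper, and it assumes zones are given as a space-separated string, as in the WDL runtime examples later in this article.

```python
def zones_outside_region(zones: str, bucket_region: str) -> list[str]:
    """List zones from a space-separated string that are not in bucket_region."""
    expected = bucket_region.strip().lower()
    # A zone belongs to the region named by everything before its last "-".
    return [z for z in zones.split() if z.lower().rpartition("-")[0] != expected]


runtime_zones = "us-central1-a us-central1-b us-west1-b"
mismatched = zones_outside_region(runtime_zones, "us-central1")
print(mismatched)
```

An empty result means every listed zone is in the bucket's region; any zone that is returned would place VMs in a different region from the data.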

How to customize your workspace bucket storage region

If you are working with data in an external Google bucket outside of us-central1, you may want to set the location of your workspace bucket(s) to the same region where the data are stored. This reduces costs by keeping storage and analysis in a single region.

If you change the default location of your workspace bucket(s), be aware that you may incur data transfer charges when copying data from one region to another (examples below).

  • Copying data from a multiregional bucket to a regional bucket.
  • Copying data from a regional bucket to a regional bucket when the regions are different.
  • Copying data from a regional bucket to a VM in a different region.
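The cases in this list can be told apart from the two location names alone. The following Python sketch is a simplification for illustration, not a statement of Google's billing rules: `transfer_case` is a hypothetical helper, and it treats any difference in location as a potential data transfer charge. For the bucket-to-VM case, pass the VM's region as the destination.

```python
from typing import Optional

# Cloud Storage multi-region location names.
MULTI_REGIONS = {"US", "EU", "ASIA"}


def transfer_case(source: str, destination: str) -> Optional[str]:
    """Describe how a copy crosses locations, or return None if it does not."""
    src, dst = source.strip().upper(), destination.strip().upper()
    if src == dst:
        return None  # same location: the setup this article recommends
    if src in MULTI_REGIONS and dst not in MULTI_REGIONS:
        return "multiregional to regional"
    if src not in MULTI_REGIONS and dst not in MULTI_REGIONS:
        return "regional to a different region"
    return "regional or multiregional to multiregional"


print(transfer_case("US", "us-central1"))
print(transfer_case("us-west1", "us-central1"))
print(transfer_case("us-central1", "us-central1"))
```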

The benefits of single region versus multiregion

Using a single US region means saving money month after month on all workspace bucket storage. Storage is often the single largest cost for life sciences projects in the cloud, especially if you are paying to store your own primary data.

For many organizations and individual labs, the best long-term choice is to select a single region, such as Google's oldest (and presumably largest), us-central1, and use it for storage as well as all compute VMs. Terra's default region for Workflow and Cloud Environment VMs has historically been us-central1.

Customizing your workspace bucket - step-by-step instructions

You will have the opportunity to select either us-central1 (Iowa) (default) or northamerica-northeast1 (Montreal) from the dropdown when you create or clone a workspace.

Screenshot-of-clone-workspace-modal.png

Selecting other workspace bucket regions

The bucket location drop-down menu lists a limited number of regions. If you want to select a different region, you can create your workspace through the createWorkspace API endpoint instead. Set the bucketLocation field to your preferred Google bucket location.

Note that you will have to authorize Swagger (using the same login credentials that you use to log into Terra) before executing the API call.

Once the workspace is created, you can use and manage it through the Terra website just like any other workspace.
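As a rough illustration, the request body for the createWorkspace call might be assembled as in the Python sketch below. Only the bucketLocation field comes from this article; the other field names (`namespace`, `name`, `attributes`) and the example values are assumptions, so check the request schema shown on the Swagger page before using them.

```python
import json


def workspace_request(namespace: str, name: str, bucket_location: str) -> dict:
    """Build a request body for creating a workspace in a chosen bucket region."""
    return {
        "namespace": namespace,  # assumed field: the Terra billing project
        "name": name,  # assumed field: the new workspace's name
        "attributes": {},  # assumed field: no workspace attributes set here
        "bucketLocation": bucket_location,  # field named in this article
    }


# Hypothetical project and workspace names, with a region not in the dropdown.
body = workspace_request("my-billing-project", "my-workspace", "us-west1")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into the request body field on the Swagger page once you have authorized.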

How to customize your workflow compute region 

Every WDL workflow allows you to specify a list of Compute Engine zones as a default for the VMs in that workflow. In addition, you can specify a list of Compute Engine zones for each individual task's VM, which will override the default for the workflow.

Default zones for workflow VMs

All zones in the workspace bucket region -- or -- us-central1 for US multiregional workspace buckets.
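The default rule can be written as a small function. This is a sketch of the rule as stated in this article, not Terra's actual implementation; `default_workflow_region` is a hypothetical name, and other multi-region locations are rejected because the article does not describe a default for them.

```python
def default_workflow_region(bucket_location: str) -> str:
    """Return the region whose zones workflow VMs use by default."""
    location = bucket_location.strip().lower()
    if location == "us":
        # US multiregional workspace buckets fall back to us-central1.
        return "us-central1"
    if location in {"eu", "asia"}:
        # Not covered by this article, so do not guess.
        raise ValueError(f"no documented default for {bucket_location!r}")
    # Regional buckets: VMs run in the bucket's own region.
    return location


print(default_workflow_region("US"))
print(default_workflow_region("northamerica-northeast1"))
```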

Implications of changing the region of your workflow VM from the default

All analyses in Terra are run on VMs. If you change the location from the default, you may incur data transfer charges if your bucket location and workflow VM location are different.

Any bucket, regional or multiregional, whose location does not match the VM region will result in data transfer charges for localizing data when running the workflow.

Customizing your workflow VM location - step-by-step instructions

To specify zones on a per-task basis 

  • Provide a hard-coded list in the workflow WDL
  • Allow a user-input value to be used as the list of zones

For example, here is a hard-coded list of zones:

runtime {
   docker: "python:slim"
   disks: "local-disk 200 HDD"
   memory: "4G"
   cpu: 1
   zones: "us-central1-a us-central1-b us-central1-c us-central1-f"
}

and here is a workflow input option, called runtime_zones:

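workflow MyWorkflow {
   String runtime_zones

   # MyTask is a placeholder task name used for illustration
   call MyTask { input: runtime_zones = runtime_zones }
}

task MyTask {
   # Space-separated list of zones, passed in from the workflow input
   String runtime_zones

   ...

   runtime {
      docker: "python:slim"
      disks: "local-disk 200 HDD"
      memory: "4G"
      cpu: 1
      zones: runtime_zones
   }
}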

which you can set on the workflow submission page:
Data-regionality_Specify-runtime-zones-in-UI_Screen_shot.png
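If you prefer to prepare the input value ahead of time, the zones string can be generated rather than typed by hand. The Python sketch below assumes the workflow is named MyWorkflow as in the example above, so the input key is `MyWorkflow.runtime_zones`; the helper `zones_for_region` is hypothetical, and the zone suffixes must be ones that actually exist in your chosen region.

```python
import json


def zones_for_region(region: str, suffixes: list[str]) -> str:
    """Join a region and its zone suffixes into a space-separated zones string."""
    return " ".join(f"{region}-{suffix}" for suffix in suffixes)


# The four us-central1 zones listed earlier in this article.
zones = zones_for_region("us-central1", ["a", "b", "c", "f"])

# Inputs are keyed as "<workflow name>.<input name>".
inputs = {"MyWorkflow.runtime_zones": zones}
print(json.dumps(inputs, indent=2))
```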

How to customize your Cloud Environment region 

Historically, Cloud Environment VMs - used for Jupyter Notebooks, RStudio, and Galaxy - have been created in the us-central1 region. Now you can choose the region for your Cloud Environment VM right in Terra.

For step-by-step instructions, see How to Customize your Cloud Environment in Understanding and Adjusting your Cloud Environment

Cloud Environment caveats

Note: This functionality is currently EXPERIMENTAL, and the option to change the region is only available to users with non-US workspace buckets (which are only available through the API).

There is a data transfer cost risk for the rendering of the user interface, which flows through the Leo proxy in us-central1.

This functionality is currently supported for standard VMs and Spark single nodes. Spark cluster support will be added in a future release.

What is the default behavior? 

Your Cloud Environment will default to the workspace bucket region. 

Implications of changing cloud compute engine region from the default

All analyses in Terra are run on VMs. If you change the location from the value proposed by the UI, you may incur data transfer out charges if your bucket location and interactive analysis Cloud Environment location are different.

Any regional or multiregional bucket mismatched with the Cloud Environment region will result in data transfer charges for copying data from the bucket to the Cloud Environment.

Comments

6 comments

  • Comment author
    Zih-Hua Fang

    Hi,

    When will it be possible to customize the workspace bucket storage region?

  • Comment author
    Allie Hajian
    • Edited

    Zih-Hua Fang Thanks for the question. We are actively working on this functionality and hoping to release soon, though I can't say exactly when it will be. The best way to stay current on new releases is by "following" the release notes section in Terra Support (see the blue button at the top right of the article).

  • Comment author
    Nicholas Youngblut

    My entire non-profit institute uses us-west1 for all data storage (e.g., many Tb of sequence data). Will us-west1 (and other regions) be supported anytime soon? We are testing out Terra Bio, but the lack of support for other regions makes us question whether using Terra is a good fit for our needs.

  • Comment author
    Allie Cliffe

    Nicholas Youngblut - The engineering team is currently targeting January 2024 for multi-region support for Terra on Azure. Additionally, it's possible to create a GCP workspace in any region using the API (though it's important to note that because Terra has core services in us-central1, that could result in egress costs if you're not careful). 

    For additional information and next steps, please reach out to frontline at support@terra.bio, who would be happy to help. 

  • Comment author
    Nicholas Youngblut

    Allie Cliffe would you advise that we transfer all of our existing GCP Cloud Storage data from us-west1 to us-central1? The total amount of data is <100 Tb, so hopefully <$1000. We do not want to use the API for creating GCP workspaces, since some users do not have such skills. 

    In regards to "Terra on Azure", is Terra expanding from GCP to include Azure, or migrating from GCP to Azure?

  • Comment author
    Allie Cliffe

    Nicholas Youngblut Terra was originally built on Google infrastructure but is expanding to Azure infrastructure as well. Both versions of Terra will exist. Whether your workspaces use Azure or Google infrastructure depends on what funds your Terra Billing Project. The goal is to have very similar functionality, regardless of whether you are on Terra GCP or Terra on Azure. There are some key differences at the back end, with different pros and cons. See Getting Started (Terra on Azure) for more details. 

    I would recommend reaching out to support (support@terra.bio) to ask if it would be advantageous for you to transfer your data from GCS us-west1 to us-central1. It depends on your use case - what kind of analysis you will be doing, how hard it is to change the compute to a non-default region of storage, etc. 

