For researchers with data in the US, Google Cloud (and now Terra) provides a choice between creating Cloud Storage buckets in a single US region (us-central1) or using a US multi-region bucket. This article outlines the most relevant trade-offs to consider within Terra when selecting between these choices, including:
- Storage costs: Regional versus multi-regional buckets (US)
- Network egress costs: Regional versus multi-regional buckets (US)
- Geo-redundancy considerations
- Available compute capacity considerations
- Trade-offs in Terra
- Anticipated additional storage options in Terra
- Additional resources
For additional details on location choices (including specific locations), see the Google Cloud documentation.
Source material for this article was contributed by Matt Bookman and the Verily Life Sciences solutions team as part of the design and engineering rollout of Terra support for data regionality.
Storage costs: Regional versus multi-regional buckets (US)
If you don't need your data to be stored in multiple US regions, you can save on storage costs in Terra by changing from the default (multi-regional) to regional Cloud Storage buckets. Read on to understand the trade-offs beyond storage costs - namely egress costs and compute availability - if you choose regional storage buckets.
Example Google Cloud costs
1000 30x WGS samples - assuming an average of 17.5 GB per CRAM and 7.5 GB per gVCF (25 GB per sample).
| | US multi-regional | US regional | Savings |
|---|---|---|---|
| Published pricing | $0.026 / GB / mo | $0.02 / GB / mo | 23% |
| Example cost | 12 mo * 1000 * 25 GB * $0.026 / GB / mo = $7,800 / year | 12 mo * 1000 * 25 GB * $0.02 / GB / mo = $6,000 / year | $1,800 / year |
For more details on the costs of various storage options, see GCP documentation.
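As a sanity check, the storage-cost example above can be reproduced in a few lines of Python. The prices are the US list prices quoted in this article and may change; check current GCP pricing before relying on them.

```python
# Storage-cost example: 1000 WGS samples at ~25 GB each
# (17.5 GB CRAM + 7.5 GB gVCF), stored for 12 months.
SAMPLES = 1000
GB_PER_SAMPLE = 17.5 + 7.5           # CRAM + gVCF
MONTHS = 12

MULTI_REGION_PRICE = 0.026           # $ / GB / month (US multi-region, list price)
REGIONAL_PRICE = 0.020               # $ / GB / month (US regional, list price)

def annual_storage_cost(price_per_gb_month: float) -> float:
    """Annual storage cost for the whole cohort at a given rate."""
    return MONTHS * SAMPLES * GB_PER_SAMPLE * price_per_gb_month

multi = annual_storage_cost(MULTI_REGION_PRICE)   # $7,800 / year
regional = annual_storage_cost(REGIONAL_PRICE)    # $6,000 / year
savings_pct = (multi - regional) / multi * 100    # ~23%
```

The 23% figure in the table is simply the price difference ($0.006) divided by the multi-region rate ($0.026).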
Network egress costs: Regional versus multi-regional buckets (US)
What are egress costs?
When data in Google Cloud Storage moves from one region to another - for example, out of the source region to run compute in a different region - there can be charges for using the network between those regions. These are network egress costs.
Egress costs tradeoffs
You'll want to consider whether you can keep all of your compute engine (Terra Cloud Environment VMs or Workflow VMs) in the same region as your data (i.e. avoid egress costs), or if you are willing to pay the egress costs when you must compute outside the storage region.
Egress costs within the US with multi-regional (default) and regional buckets
| Data storage region | Compute (VM) region | Egress pricing | Example cost* |
|---|---|---|---|
| US multi-regional bucket | Any US region | $0.00 | $0.00 |
| US regional bucket | Same as data storage (US) | $0.00 | $0.00 |
| US regional bucket | Different US region | $0.01/GB | $800 |
* Example use-case - 1000 30x WGS samples (assume 80 GB paired FASTQ files).
Although $800 of egress charges for processing 1000 samples is not insignificant, consider:
- There should be little reason to run compute workflows outside of your storage region
(so you should generally pay $0 in egress to run this processing)
- You'd recoup the $800 (equivalent to the egress costs in this example) in under 2 months of the storage savings from choosing regional storage for those same FASTQ files: $0.006 / GB / month saved on 80,000 GB is $480 / month.
For more details on network egress costs, see GCP documentation here.
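The break-even reasoning above can be checked with the same back-of-the-envelope arithmetic. Again, the prices are the list prices quoted in this article and may change.

```python
# Egress example: 1000 WGS samples with ~80 GB of paired FASTQs each,
# processed in a different US region than the storage bucket.
SAMPLES = 1000
FASTQ_GB = 80
EGRESS_PRICE = 0.01                  # $ / GB, between US regions

egress_cost = SAMPLES * FASTQ_GB * EGRESS_PRICE       # $800 one-time

# Monthly savings from storing those same FASTQs regionally
# ($0.026 - $0.020 = $0.006 / GB / month):
monthly_savings = SAMPLES * FASTQ_GB * 0.006          # $480 / month
breakeven_months = egress_cost / monthly_savings      # ~1.7 months
```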
Geo-redundancy considerations
Storing data in a US multi-regional bucket means that copies of your data will be stored in multiple distinct locations. Per GCP documentation:
"data will be stored in at least two separate geographic places separated by at least 100 miles."
Should a disaster render a single US region inaccessible, your data will continue to be available.
This does not mean that if you choose a single region, there is only a single copy of your data, however. Per GCP documentation:
"all Cloud Storage data is redundant within at least one geographic place as soon as you upload it."
An additional advantage to geo-redundancy is that users may observe quicker access (reduced latency) to data in Cloud Storage. However, this is less relevant to typical life sciences users, because typical use cases:
- are throughput-limited workflows and analyses, not latency-limited (i.e. interactive web/phone apps).
- in Terra access data from a VM in the same region (latency in Terra is determined by how close your VM is to your data, not how close you are to your data).
To learn more about GCP geo-redundancy, see the Google documentation.
Available compute capacity considerations
Choosing to store data in a single region means that you'll want to put all of your compute within the same region to avoid network egress charges. Using a single region for compute comes with the possibility of reduced capacity for computation, notably for large workflows. Google Cloud's capacity is large, but there are times when VMs with certain hardware requirements are unavailable.
See the Google Cloud documentation for more information on limitations of particular compute configurations.
Trade-offs in Terra
For many organizations and individual labs, the best long-term choice is to select a single region for buckets. In the US, Terra uses the us-central1 region by default.
Avoiding data silos with us-central1 or multi-region buckets
For data in the US, the Terra interface supports creating workspace buckets in US multi-region or us-central1. Terra's default region for Workflow and Cloud Environment VMs has historically been us-central1. Excluding other US regions (such as us-east) is intended to help the community avoid unintentionally creating data silos in different regions, since cross-analysis of data in different regions is inherently more expensive due to network egress charges.
If you have specific need to create workspaces with buckets in these other US regions, please contact Terra Support.
Selecting a single US region will mean saving money - month after month - on workspace bucket storage.
While compute costs can be significant for large data processing jobs, storage costs are typically the largest cost for life sciences projects in the cloud. Storage costs accumulate month to month, whereas compute costs are often bursty and short-lived.
So long as you are able to ensure workflow and Cloud Environment VMs run in your selected region, you'll get all of the cost savings while avoiding egress costs.
Case study - How AMP-PD saved more than $20k by switching to regional storage
In 2018, AMP PD chose to standardize on us-central1 for storage and compute. Looking only at storage costs for the large outputs (CRAMs and VCFs for almost 10,000 WGS samples, and BAMs for over 8,000 RNASeq samples), per year the project saves approximately:
- WGS CRAMs and VCFs: 12 mo * 216 TiB * ($0.026 - $0.020) / GB / mo = $17,086
- RNASeq BAMs: 12 mo * 143 TiB * ($0.026 - $0.020) / GB / mo = $11,323
That is a substantial savings, just for storage of outputs! Storing inputs such as the much larger FASTQ files accumulates even more savings.
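A rough reconstruction of the AMP-PD estimate, assuming a TiB-to-GB conversion (1 TiB = 1024^4 bytes ≈ 1099.5 GB) and the $0.006 / GB / month price difference quoted earlier. The results land within a fraction of a percent of the article's figures; the small gap comes from rounding the volumes to whole TiB.

```python
# Annual savings from regional vs. multi-regional storage for the
# AMP-PD output volumes (216 TiB of WGS CRAMs/VCFs, 143 TiB of RNASeq BAMs).
GB_PER_TIB = 1024**4 / 1e9           # ~1099.51 GB per TiB
PRICE_DELTA = 0.026 - 0.020          # $ / GB / month saved by going regional
MONTHS = 12

def annual_savings(tib: float) -> float:
    """Yearly savings for a given data volume in TiB."""
    return MONTHS * tib * GB_PER_TIB * PRICE_DELTA

wgs = annual_savings(216)     # ~$17,100 (article quotes $17,086)
rnaseq = annual_savings(143)  # ~$11,320 (article quotes $11,323)
```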
Anticipated additional storage options in Terra
Google Cloud has additional storage and regionality capabilities that we hope to add to Terra one day, including:
- Multi-region EU or ASIA
If you are comfortable working in the GCP console, see Accessing advanced GCP features in Terra to take advantage of these capabilities in non-Terra GCP projects.
Additional resources
To learn more about regional selections, see Best practices for Compute Engine regions selection.
To learn more about Cloud pricing, see Understanding and controlling Cloud costs.
To learn more about Terra storage and compute location controls, see Customizing where your data are stored and analyzed.