For researchers with data in the US, Google Cloud (and now Terra) provides a choice between storing data in Cloud Storage buckets in a single US region or in a US multi-region bucket. This article outlines the most relevant trade-offs to consider within Terra when selecting between these choices, including:
- Comparing storage costs: Regional versus multi-regional buckets (US)
- Comparing network egress costs: Regional versus multi-regional buckets (US)
- Geo-redundancy considerations
- Available compute capacity considerations
- Trade-offs in Terra
- Anticipated additional storage options in Terra
- Additional resources
Content for this article was contributed by Matt Bookman from Verily Life Sciences, based on work done in Terra for AMP PD, a public/private partnership collaborating toward biomarker discovery to advance the development of Parkinson’s Disease therapies.
Additional details on location choices (including specific locations) can be found in the Google Cloud documentation.
Comparing storage costs: Regional versus multi-regional buckets (US)
If you don't need your data stored in multiple US regions, you can achieve cost savings by changing from the default (multi-regional) to regional cloud storage buckets.
1,000 30x WGS samples, assuming an average of 17.5 GB per CRAM and 7.5 GB per gVCF
| | US multi-regional | US regional | Savings |
| --- | --- | --- | --- |
| Published pricing | $0.026 / GB / mo | $0.02 / GB / mo | 23% |
| Example cost | $7,800 / year | $6,000 / year | $1,800 / year |
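The figures in the table follow from simple arithmetic. As a quick check, this sketch uses the per-sample sizes above (17.5 GB CRAM + 7.5 GB gVCF) and the published per-GB prices:

```python
# Reproduce the storage-cost comparison for 1,000 30x WGS samples.
SAMPLES = 1000
GB_PER_SAMPLE = 17.5 + 7.5            # CRAM + gVCF, in GB
MULTI_REGIONAL = 0.026                # $/GB/month, US multi-regional
REGIONAL = 0.020                      # $/GB/month, US regional

total_gb = SAMPLES * GB_PER_SAMPLE    # 25,000 GB

def yearly_cost(price_per_gb_month: float) -> float:
    return total_gb * price_per_gb_month * 12

multi = yearly_cost(MULTI_REGIONAL)
regional = yearly_cost(REGIONAL)
print(f"multi-regional: ${multi:,.0f}/yr, regional: ${regional:,.0f}/yr, "
      f"savings: ${multi - regional:,.0f} ({1 - REGIONAL / MULTI_REGIONAL:.0%})")
```

Running this reproduces the $7,800 versus $6,000 yearly costs and the 23% savings shown above.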
For more details on the costs of various storage options, see GCP documentation here.
Read on to understand the trade-offs beyond storage costs (egress costs and compute availability) that come into play if you choose regional storage buckets.
Comparing network egress costs: Regional versus multi-regional buckets (US)
What are egress costs?
When data moves from one region to another in Google Cloud, there can be charges for using the network between those regions (i.e. moving data out of the source region). These are network egress costs.
Egress cost trade-offs
You'll want to consider whether you can keep all of your Compute Engine VMs (Terra Cloud Environments or workflow VMs) in the same region as your data (avoiding egress costs), or whether you are willing to pay egress costs when you must compute outside the storage region.
Egress costs as a function of data storage and compute region options
| Data storage region | Compute (VM) region | Egress pricing | Example cost* |
| --- | --- | --- | --- |
| US multi-regional bucket | Any US region | $0.00 | $0.00 |
| US regional bucket | Same as data storage (US) | $0.00 | $0.00 |
| US regional bucket | Different US region | $0.01 / GB | $800 |
* Example use case: 1,000 30x WGS samples (assuming 80 GB of paired FASTQ files per sample).
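The table above amounts to a small lookup: egress is free unless a regional bucket is read from a different region. A sketch, where the location names are illustrative rather than an exhaustive list:

```python
# Egress price per GB for the storage/compute combinations in the table above.
def egress_price_per_gb(storage_location: str, compute_region: str) -> float:
    """storage_location is 'US' (multi-regional) or a specific US region name."""
    if storage_location == "US":          # US multi-regional bucket:
        return 0.00                       # free egress to any US region
    if storage_location == compute_region:
        return 0.00                       # regional bucket, same-region VM
    return 0.01                           # regional bucket, different US region

# Example: 1,000 samples x 80 GB paired FASTQs, computed outside the bucket region.
total_gb = 1000 * 80
cost = total_gb * egress_price_per_gb("us-central1", "us-east1")
print(f"${cost:,.0f}")
```

This reproduces the $800 example cost in the last row of the table.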
Although $800 of egress charges for processing 1,000 samples is not insignificant, consider:
- There should be little reason to run compute workflows outside of your storage region, so you will generally pay $0 in egress for this processing.
- Storing the FASTQs regionally saves $0.006 / GB / mo, so you'd recoup the $800 (equivalent to the egress costs in this example) in less than 2 months.
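The "less than 2 months" figure follows from applying the $0.006 / GB / mo price difference to the 80 TB of FASTQs in this example:

```python
# Months of regional-storage savings needed to recoup a one-time $800 egress bill.
total_gb = 1000 * 80                      # 80 GB paired FASTQs per sample
savings_per_month = total_gb * 0.006      # $0.026 - $0.020 per GB per month
months_to_recoup = 800 / savings_per_month
print(f"${savings_per_month:,.0f}/mo saved; recouped in {months_to_recoup:.1f} months")
```

At $480/mo in storage savings, the break-even point lands at roughly 1.7 months.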
For more details on network egress costs, see GCP documentation here.
Geo-redundancy considerations
Storing data in a US multi-regional bucket means that copies of your data will be stored in multiple distinct locations. Per GCP documentation, data will be stored in "at least two separate geographic places separated by at least 100 miles."
Should a disaster render a single US region inaccessible, your data will continue to be available.
Note that this does not mean that if you choose a single region, there is only a single copy of your data. Per GCP documentation: "all Cloud Storage data is redundant within at least one geographic place as soon as you upload it."
An additional advantage of geo-redundancy is that users may observe quicker access (reduced latency) to data in Cloud Storage. However, this is less relevant to typical life sciences users, because typical use cases:
- are throughput-limited workflows and analyses, not latency-limited ones (such as interactive web or phone apps)
- access data from within the storage region when run in Terra (latency in Terra is determined by how close your VM is to your data, not how close you are to your data)
To learn more about GCP geo-redundancy, see the Google documentation.
Available compute capacity considerations
Using a single region for compute comes with the possibility of reduced capacity for computation, notably for large workflows. Google Cloud's capacity is large, but there are times when VMs with certain hardware requirements are unavailable in a given region.
See the Google Cloud documentation for more information on limitations of particular compute configurations.
Trade-offs in Terra
For many organizations and individual labs, the best long-term choice is to select a single region for buckets. In the US, Terra makes available the us-central1 region.
Avoiding data silos with us-central1 or multi-region buckets
For data in the US, the Terra interface supports creating workspace buckets in US multi-region or us-central1. Terra's default region for Workflow and Cloud Environment VMs has historically been us-central1. The exclusion of us-east and us-west regions is intended to help the community avoid unintentionally creating data silos in different regions since cross-analysis of data in different regions is inherently more expensive due to network egress charges. If you have specific need to create workspaces with buckets in these other US regions, please contact Terra Support.
Selecting a single US region will mean saving money month after month on workspace bucket storage.
While compute costs can be significant for large data processing jobs, storage costs are typically the largest cost for life sciences projects on Cloud. Storage costs accumulate month upon month, whereas compute costs are often bursty and short-lived.
So long as you are able to ensure workflow and Cloud Environment VMs run in your selected region, you'll get all of the cost savings while avoiding egress costs.
Case study - How AMP-PD saved more than $20k by switching to regional storage
In 2018, AMP PD chose to standardize on us-central1 for storage and compute. Looking only at storage costs for the large outputs (CRAMs and VCFs for almost 10,000 WGS samples and BAMs for over 8,000 RNASeq samples), the project saves more than $20,000 per year.
That is a substantial savings just for storage of outputs. Storing inputs such as the much larger FASTQ files accumulates even more savings!
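A back-of-the-envelope estimate lands in the same ballpark. This sketch reuses the per-sample WGS output sizes from the earlier example (17.5 GB CRAM + 7.5 GB gVCF); the 5 GB per RNASeq BAM is a hypothetical placeholder, not an AMP PD-published number:

```python
# Rough estimate of AMP PD's yearly storage savings from regional buckets.
PRICE_DIFF = 0.026 - 0.020              # $/GB/mo saved by regional storage

wgs_gb = 10_000 * (17.5 + 7.5)          # CRAM + gVCF outputs for ~10,000 WGS samples
rnaseq_gb = 8_000 * 5                   # assumed 5 GB per RNASeq BAM (hypothetical)
yearly_savings = (wgs_gb + rnaseq_gb) * PRICE_DIFF * 12
print(f"~${yearly_savings:,.0f}/yr")    # on the order of $20k/yr
```

Under these assumptions the WGS outputs alone account for roughly $18,000/yr, with the RNASeq BAMs pushing the total past $20,000/yr.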
Anticipated additional storage options in Terra
Google Cloud has additional storage and regionality capabilities that we hope to one day add to Terra, including:
- Multi-region EU or ASIA
If you're comfortable working in the GCP console, see how you can take advantage of these capabilities in non-Terra GCP projects: Accessing Advanced GCP features in Terra
Additional resources
To learn more about regional selections, see: Best practices for Compute Engine regions selection
For details about Cloud pricing, see: Understanding and controlling Cloud costs
To learn more about Terra storage and compute location controls, see: Customizing where your data are stored and analyzed