This article has been DEPRECATED since multi-region buckets have been phased out. Google Cloud provides a choice between creating Cloud Storage buckets for storing data in one US region (us-central1) or using a US multi-region bucket. This article outlines the costs and most relevant trade offs to consider within Terra when selecting between these choices.
As of October 2023, Terra's default bucket location is us-central1. If you don't need your data to be stored in multiple US regions, you can save storage costs in Terra by using the default bucket location. Read on to understand considerations/trade-offs beyond storage costs - i.e., data transfer (formerly "egress") costs and compute availability - if you choose regional storage buckets.
Terra no longer supports US multi-region buckets for workspace storage. This change was implemented as a result of Google Storage pricing changes that increased the cost of storage in multi-region buckets and added data transfer charges every time you run an analysis on data stored in a multi-region bucket. For more details, see the blog post Moving away from multi-region storage buckets.
Trade offs to consider
For additional details on location choices (including specific locations), see the Google Cloud documentation.
Credit: Source material for this article was contributed by Matt Bookman and the Verily Life Sciences solutions team as part of the design and engineering rollout of Terra support for data regionality.
Storage costs: Regional versus multi-regional buckets (US)
If you don't need your data to be stored in multiple US regions, you can save storage costs in Terra by using regional rather than multi-regional cloud storage buckets. Read on to understand considerations/trade-offs beyond storage costs - i.e., data transfer (formerly "egress") costs and compute availability - if you choose regional storage buckets.
Example Google Cloud costs
1000 30x WGS samples, assuming an average of 17.5 GB CRAM and 7.5 GB gVCF per sample.
| US multi-regional | US regional | Savings |
Published pricing | $0.026 / GB / mo | $0.02 / GB / mo | 23% |
Example cost | 12 mo * 1000 * 25 GB * $0.026 / GB / mo = $7,800 / year | 12 mo * 1000 * 25 GB * $0.02 / GB / mo = $6,000 / year | $1,800 / year |
For more details on the costs of various storage options, see Google Cloud documentation.
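The example cost calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the table: the prices are the published figures quoted above, and the function and variable names are hypothetical, not part of any Terra or Google Cloud API.

```python
# Published per-GB monthly storage prices quoted in the table above (USD).
MULTI_REGION_PRICE = 0.026
REGIONAL_PRICE = 0.020


def annual_storage_cost(samples, gb_per_sample, price_per_gb_month, months=12):
    """Return the cost in USD of storing all samples for `months` months."""
    return months * samples * gb_per_sample * price_per_gb_month


# 17.5 GB CRAM + 7.5 GB gVCF per sample, as assumed in the example above.
gb_per_sample = 17.5 + 7.5

multi = annual_storage_cost(1000, gb_per_sample, MULTI_REGION_PRICE)
regional = annual_storage_cost(1000, gb_per_sample, REGIONAL_PRICE)
savings = multi - regional

print(f"US multi-regional: ${multi:,.0f} / year")
print(f"US regional:       ${regional:,.0f} / year")
print(f"Savings:           ${savings:,.0f} / year ({savings / multi:.0%})")
```

Changing the sample count or per-sample size lets you estimate the same comparison for your own project. Check the current Google Cloud price list before relying on the two price constants.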
Data transfer costs: Regional versus multi-regional buckets (US)
What are data transfer (formerly "egress") costs?
When data in Google Cloud storage move from one region to another, there can be charges for using the network between those regions (i.e., moving data out of the source region - to do compute in a different region, for example). These are network data transfer out costs.
Data transfer costs tradeoffs
You'll want to consider whether you can keep all of your compute engine (Terra Cloud Environment VMs or Workflow VMs) in the same region as your data (i.e., avoid data transfer costs), or if you are willing to pay the data transfer costs when you must compute outside the storage region (including from US multi-region to a specific region).
Data transfer costs in Terra
Data storage region | Compute (VM) region | Data transfer pricing | Example cost* |
US multi-regional bucket | Any US region | $0.02 / GB** | $1,600** |
US regional bucket | Same as data storage (US) | $0.00 | $0.00 |
US regional bucket | Different US region | $0.01 / GB | $800 |
* Example use-case - 1000 30x WGS samples (assume 80 GB paired FASTQ files).
** Note that the amount of Always Free Internet data transfer out will increase from 1 GB per month to 100 GB per month to each qualifying destination.
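As a rough sketch, the example costs in the table above follow from multiplying the total data volume by the per-GB transfer price. The snippet below assumes each 80 GB file is transferred once and ignores the Always Free allowance. The scenario labels and function name are hypothetical.

```python
def transfer_cost(samples, gb_per_sample, price_per_gb):
    """Return the cost in USD of moving every sample out of its storage region once."""
    return samples * gb_per_sample * price_per_gb


# Per-GB data transfer prices from the table above (USD).
SCENARIOS = {
    "US multi-regional bucket -> any US region": 0.02,
    "US regional bucket -> same region": 0.00,
    "US regional bucket -> different US region": 0.01,
}

# 1000 30x WGS samples with 80 GB of paired FASTQ files each.
costs = {
    name: transfer_cost(1000, 80, price) for name, price in SCENARIOS.items()
}

for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}")
```

If a workflow reads the same data several times from another region, the cost scales with the number of transfers.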
In making your decision on regional vs. multi-regional, consider:
- There should be little reason to run compute workflows outside of your storage region (so you should generally pay $0 in data transfer costs to use your data).
- You'll save $900 in less than 2 months on storage of 1000 80 GB files by choosing regional storage instead of multi-region ($0.026 - $0.020 = $0.006 / GB / month).
For more details on network data transfer out costs, see Google Cloud documentation here.
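The "less than 2 months" figure above can be checked with a short calculation. This is a minimal sketch that uses the storage prices quoted in this article. The helper names are hypothetical.

```python
# Per-GB monthly storage prices quoted in this article (USD).
MULTI_REGION_PRICE = 0.026
REGIONAL_PRICE = 0.020


def monthly_storage_savings(files, gb_per_file):
    """Return the USD saved per month by storing the files regionally."""
    return files * gb_per_file * (MULTI_REGION_PRICE - REGIONAL_PRICE)


def months_to_save(target_usd, files, gb_per_file):
    """Return how many months of regional storage it takes to save `target_usd`."""
    return target_usd / monthly_storage_savings(files, gb_per_file)


monthly = monthly_storage_savings(1000, 80)
months = months_to_save(900, 1000, 80)

print(f"Monthly savings: ${monthly:,.0f}")
print(f"Months to save $900: {months:.2f}")
```

Under these assumptions the monthly savings are about $480, so $900 is reached in a little under two months.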
Geo-redundancy considerations
- Storing data in a US multi-regional bucket means that copies of your data will be stored in multiple distinct locations. Should a disaster render a single US region inaccessible, your data will continue to be available.
  Per Google Cloud documentation: "data will be stored in at least two separate geographic places separated by at least 100 miles."
- However, choosing a single region does not mean that there is only a single copy of your data.
  Per Google Cloud documentation: "all Cloud Storage data is redundant within at least one geographic place as soon as you upload it."
- An additional advantage of geo-redundancy is that users may observe quicker access (reduced latency) to data in Cloud Storage.
- However, this is less relevant to typical life sciences users, because typical use cases:
  - are throughput-limited workflows and analyses, not latency-limited applications (e.g., interactive web/phone apps).
  - in Terra involve within-region data access (latency in Terra is determined by how close your VM is to your data, not how close you are to your data).
To learn more about Google Cloud geo-redundancy, see the Google documentation.
Available compute capacity considerations
Choosing to store data in a single region means that you'll want to put all of your compute within the same region to avoid network data transfer charges. Using a single region for compute comes with the possibility of reduced capacity for computation, notably for large workflows. Google Cloud's capacity is large, but there are times when VMs with certain hardware requirements are unavailable.
More information on limitations for particular compute configurations
- Preemptible VMs
- GPUs
- CPUs
Tradeoffs in Terra
For many organizations and individual labs, the best long-term choice is to select a single region for workspace storage (i.e., Google buckets). In the US, Terra uses the us-central1 region by default.
Avoiding data silos with us-central1 buckets
For data in the US, Terra defaults to workspace buckets and compute VMs in us-central1. The exclusion of other regions is intended to help the community avoid unintentionally creating data silos in different regions, since cross analysis of data in different regions is inherently more expensive due to network data transfer charges.
If you have a specific need to create workspaces with buckets in other US regions, please contact Terra Support.
Selecting a single US region will mean saving money - month after month - on workspace bucket storage. While compute costs can be significant for large data-processing jobs, storage costs are typically the largest cost for life sciences projects in the cloud. Storage costs accumulate month to month, whereas compute costs are often short-lived.
So long as you can ensure workflow and Cloud Environment VMs run in your selected region, you'll get all of the cost savings while avoiding data transfer costs.
Case study - How AMP-PD saved more than $20k by switching to regional storage
In 2018, AMP PD chose to standardize on us-central1 for storage and compute. Looking only at storage costs for the large outputs (CRAMs and VCFs for almost 10,000 WGS samples and BAMs for over 8,000 RNASeq samples), per year, the project saves approximately:
- 12 mo * 216 TiB * ($0.026 - $0.020) / GB / mo = $17,086
- 12 mo * 143 TiB * ($0.026 - $0.020) / GB / mo = $11,323
That is a substantial savings, just for storage of outputs! Storing inputs such as the much larger FASTQ files accumulates even more savings.
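The savings figures above can be approximated with the sketch below. It assumes the TiB sizes are converted to decimal GB before the per-GB price difference is applied. The article does not state the conversion it used, so this is an assumption, and the function name is hypothetical. Because the sizes listed above are rounded, the results land within a few dollars of the published figures rather than matching them exactly.

```python
BYTES_PER_TIB = 2**40
BYTES_PER_GB = 10**9

# Difference between multi-regional and regional storage prices (USD / GB / mo).
PRICE_DIFF = 0.026 - 0.020


def annual_savings_usd(size_tib, months=12):
    """Return approximate yearly savings in USD for `size_tib` TiB of data."""
    # Assumption: TiB are converted to decimal GB before pricing.
    size_gb = size_tib * BYTES_PER_TIB / BYTES_PER_GB
    return months * size_gb * PRICE_DIFF


# The two output sizes listed in the case study above, in TiB.
sizes_tib = [216, 143]
savings = [annual_savings_usd(size) for size in sizes_tib]
total = sum(savings)

for size, saved in zip(sizes_tib, savings):
    print(f"{size} TiB: about ${saved:,.0f} per year")
print(f"Total: about ${total:,.0f} per year")
```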
Final notes
To simplify and automate cost savings for all data stored in Terra (GCP), all workspace buckets have Autoclass enabled by default. Google Cloud has additional storage and regionality capabilities that we hope to add to Terra, including:
- Nearline
- Coldline
- Archive
- Dual-region
- Multi-region EU or ASIA
Additional resources
If you are comfortable working in Google Cloud console, see Accessing advanced Google Cloud features in Terra to take advantage of these capabilities in non-Terra Google Cloud projects.
To learn more about regional selections, see Best practices for Compute Engine regions selection.
To learn more about Cloud pricing, see Understanding and controlling Cloud costs.
To learn more about Terra storage and compute location controls, see Customizing where your data are stored and analyzed.