For researchers with data in the US, Google Cloud (and now Terra) provides a choice between creating Cloud Storage buckets that store data in one US region (us-central1) or using a US multi-region bucket. This article outlines the most relevant trade-offs to consider within Terra when selecting between these choices.
Trade-offs to consider
For additional details on location choices (including specific locations), see the Google Cloud documentation.
Credit: Source material for this article was contributed by Matt Bookman and the Verily Life Sciences solutions team as part of the design and engineering rollout of Terra support for data regionality.
Storage costs: Regional versus multi-regional buckets (US)
If you don't need your data to be stored in multiple US regions, you can save storage costs in Terra by changing from the default (multi-regional) to regional cloud storage buckets. Read on to understand considerations/trade-offs beyond storage costs - i.e., egress costs and compute availability - if you choose regional storage buckets.
Example Google Cloud costs
1000 30x WGS samples, assuming an average of 17.5 GB CRAM and 7.5 GB gVCF per sample.
| US multi-regional | US regional | Savings |
Published pricing | $0.026 / GB / mo | $0.02 / GB / mo | 23% |
Example cost | 12 mo * 1000 * 25 GB * $0.026 / GB / mo = $7,800 / year | 12 mo * 1000 * 25 GB * $0.02 / GB / mo = $6,000 / year | $1,800 |
For more details on the costs of various storage options, see Google Cloud documentation.
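The example cost arithmetic above can be reproduced with a short script. This is a minimal sketch, assuming the published prices quoted in the table; the function and variable names are illustrative only and are not part of any Terra or Google Cloud API.

```python
# Published per-GB monthly prices quoted in the table above (USD).
# The dictionary keys are illustrative labels, not Google Cloud identifiers.
PRICE_PER_GB_MONTH = {
    "us-multi-regional": 0.026,
    "us-regional": 0.020,
}


def annual_storage_cost(num_samples, gb_per_sample, price_per_gb_month, months=12):
    """Return the cost in USD of storing every sample for `months` months."""
    return months * num_samples * gb_per_sample * price_per_gb_month


# 17.5 GB CRAM + 7.5 GB gVCF per sample, as in the example above.
gb_per_sample = 17.5 + 7.5

multi = annual_storage_cost(1000, gb_per_sample, PRICE_PER_GB_MONTH["us-multi-regional"])
regional = annual_storage_cost(1000, gb_per_sample, PRICE_PER_GB_MONTH["us-regional"])
savings = multi - regional

print(f"Multi-regional: ${multi:,.0f} / year")
print(f"Regional:       ${regional:,.0f} / year")
print(f"Savings:        ${savings:,.0f} / year ({savings / multi:.0%})")
```

To estimate costs for your own project, change the sample count and per-sample size; check the current Google Cloud pricing before relying on the per-GB prices.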
Network egress costs: Regional versus multi-regional buckets (US)
What are egress costs?
When data in Google Cloud storage move from one region to another, there can be charges for using the network between those regions (i.e., moving data out of the source region - to do compute in a different region, for example). These are network egress costs.
Update on egress costs (October 2022)
Google changed pricing for egress across Google Cloud as of October 1, 2022.
Google has agreed to delay these pricing changes taking effect within Terra until October 2023.
For more details, see the blog post Moving away from multi-region storage buckets.
To make sure they are receiving the discount, managers of non-Broad Google Cloud billing accounts should contact their Google representatives to determine whether they have received, or are eligible to receive, the pricing extension, and to discuss options.
Egress costs tradeoffs
You'll want to consider whether you can keep all of your compute (Terra Cloud Environment VMs or workflow VMs) in the same region as your data and so avoid egress costs, or whether you are willing to pay egress costs when you must compute outside the storage region (including from a US multi-region bucket to a specific region).
Egress costs in Terra (until October 2023)
Data storage region | Compute (VM) region | Egress pricing | Example cost* |
US multi-regional bucket | Any US region | $0.00 | $0.00 |
US regional bucket | Same as data storage (US) | $0.00 | $0.00 |
US regional bucket | Different US region | $0.01/GB | $800 |
Egress costs in Terra (starting October 2023)
Data storage region | Compute (VM) region | Egress pricing | Example cost* |
US multi-regional bucket | Any US region | $0.02 / GB** | $1,600** |
US regional bucket | Same as data storage (US) | $0.00 | $0.00 |
US regional bucket | Different US region | $0.01/GB | $800 |
* Example use-case - 1000 30x WGS samples (assume 80 GB paired FASTQ files).
** Note that the amount of Always Free Internet egress will increase from 1 GB per month to 100 GB per month to each qualifying egress destination.
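The example costs in the two tables above follow from multiplying the data moved by the per-GB egress price. The sketch below assumes each of the 1000 samples (80 GB of paired FASTQ files) is read once, and it ignores the Always Free allowance mentioned in the note; the dictionary keys and function name are illustrative labels, not Google Cloud or Terra identifiers.

```python
# Per-GB egress prices (USD) copied from the two tables above.
# Keys are (storage location, compute location) labels made up for this sketch.
EGRESS_UNTIL_OCT_2023 = {
    ("multi-regional", "any US region"): 0.00,
    ("regional", "same region"): 0.00,
    ("regional", "different US region"): 0.01,
}
# From October 2023, only the multi-regional row changes.
EGRESS_FROM_OCT_2023 = {
    **EGRESS_UNTIL_OCT_2023,
    ("multi-regional", "any US region"): 0.02,
}


def egress_cost(gb_moved, price_per_gb):
    """Cost in USD of moving `gb_moved` GB once at the given per-GB price."""
    return gb_moved * price_per_gb


# 1000 samples with 80 GB of paired FASTQ files each, read once.
gb_moved = 1000 * 80

for label, prices in [("until Oct 2023", EGRESS_UNTIL_OCT_2023),
                      ("from Oct 2023", EGRESS_FROM_OCT_2023)]:
    for (storage, compute), price in prices.items():
        cost = egress_cost(gb_moved, price)
        print(f"{label}: {storage} bucket, VM in {compute}: ${cost:,.0f}")
```

Note that egress is charged each time data leaves the storage location, so reading the same files repeatedly from a different region multiplies these figures.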
In making your decision on regional vs. multi-regional, consider:
- There should be little reason to run compute workflows outside of your storage region (so you should generally pay $0 in egress to use your data).
- You'll save $900 in less than 2 months on storage of 1000 80 GB files by choosing regional storage instead of multi-region ($0.026 - $0.020 = $0.006 / GB / month).
For more details on network egress costs, see Google Cloud documentation here.
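The storage-savings comparison above can be checked with a few lines of arithmetic. This is a rough sketch using the published storage prices quoted earlier in this article; the function names are hypothetical and chosen only for illustration.

```python
def monthly_storage_savings(num_files, gb_per_file,
                            multi_price=0.026, regional_price=0.020):
    """Monthly savings in USD from storing the files regionally instead of multi-regionally."""
    return num_files * gb_per_file * (multi_price - regional_price)


def months_to_save(target_usd, savings_per_month):
    """Number of months of storage savings needed to accumulate `target_usd`."""
    return target_usd / savings_per_month


# 1000 files of 80 GB each, as in the example above.
per_month = monthly_storage_savings(1000, 80)
months = months_to_save(900, per_month)

print(f"Savings per month: ${per_month:,.0f}")
print(f"Months to save $900: {months:.2f}")
```

The same two functions can be used to compare the monthly storage savings against a one-time egress charge you expect to pay.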
Geo-redundancy considerations
- Storing data in a US multi-regional bucket means that copies of your data will be stored in multiple distinct locations. Should a disaster render a single US region inaccessible, your data will continue to be available. Per Google Cloud documentation: "data will be stored in at least two separate geographic places separated by at least 100 miles."
- This does not mean, however, that choosing a single region leaves you with only a single copy of your data. Per Google Cloud documentation: "all Cloud Storage data is redundant within at least one geographic place as soon as you upload it."
- An additional advantage of geo-redundancy is that users may observe quicker access (reduced latency) to data in Cloud Storage.
- However, this is less relevant to typical life sciences users, because typical use cases:
  - are throughput-limited workflows and analyses, not latency-limited applications (e.g., interactive web/phone apps).
  - in Terra involve within-region data access (latency in Terra is determined by how close your VM is to your data, not how close you are to your data).
To learn more about Google Cloud geo-redundancy, see the Google documentation.
Available compute capacity considerations
Choosing to store data in a single region means that you'll want to put all of your compute within the same region to avoid network egress charges. Using a single region for compute comes with the possibility of reduced capacity for computation, notably for large workflows. Google Cloud's capacity is large, but there are times when VMs with certain hardware requirements are unavailable.
More information on limitations for particular compute configurations
- Preemptible VMs
- GPUs
- CPUs
Tradeoffs in Terra
For many organizations and individual labs, the best long-term choice is to select a single region for workspace storage (i.e., Google buckets). In the US, Terra uses the us-central1 region by default.
Avoiding data silos with us-central or multi-region buckets
For data in the US, the Terra interface supports creating workspace buckets in US multi-region or us-central1. Terra's default region for Workflow and Cloud Environment VMs has historically been us-central1. The exclusion of us-east and us-west regions is intended to help the community avoid unintentionally creating data silos in different regions, since cross analysis of data in different regions is inherently more expensive due to network egress charges.
If you have a specific need to create workspaces with buckets in these other US regions, please contact Terra Support.
Selecting a single US region will mean saving money - month after month - on workspace bucket storage. While compute costs can be significant for large data-processing jobs, storage costs are typically the largest cost for life sciences projects in the cloud. Storage costs accumulate month to month, whereas compute costs are often short-lived.
So long as you can ensure workflow and Cloud Environment VMs run in your selected region, you'll get all of the cost savings while avoiding egress costs.
When you should choose US multi-regional storage
- You have consistently high, ongoing workflow compute requirements that depend on the full capacity of multiple US regions.
- You have limited oversight of workflow and compute choices of labs in your organization and would not be able to enforce standardization in a single region, such as us-central1.
- An unexpected accidental egress charge has a greater impact to your organization than consistently higher storage costs.
- You have existing US multi-regional buckets and want to copy data between these buckets and new buckets. Remember - a copy between regional and multi-regional buckets incurs egress charges.
Case study - How AMP-PD saved more than $20k by switching to regional storage
In 2018, AMP PD chose to standardize on us-central1 for storage and compute. Looking only at storage costs for the large outputs (CRAMs and VCFs for almost 10,000 WGS samples and BAMs for over 8,000 RNASeq samples), per year, the project saves approximately:
- 12 mo * 216 TiB * ($0.026 - $0.020) / GB / mo = $17,086
- 12 mo * 143 TiB * ($0.026 - $0.020) / GB / mo = $11,323
That is a substantial savings, just for storage of outputs! Storing inputs such as the much larger FASTQ files accumulates even more savings.
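The case-study figures mix TiB (storage size) with per-GB prices, so a unit conversion is involved. The sketch below assumes the conversion was to decimal GB (1 TiB = 2^40 bytes, 1 GB = 10^9 bytes), which is an assumption on our part that comes close to the quoted numbers; the results land within about 0.1% of the figures above, with the small gap presumably due to rounding of the TiB totals. The names used here are illustrative.

```python
# Assumption: sizes are converted from TiB (2**40 bytes) to decimal GB (10**9 bytes).
GB_PER_TIB = 2**40 / 10**9

# Price difference between multi-regional and regional storage (USD / GB / month).
PRICE_DIFF_PER_GB_MONTH = 0.026 - 0.020


def annual_savings_usd(tib_stored, months=12):
    """Yearly storage savings in USD for `tib_stored` TiB kept in a regional bucket."""
    return months * tib_stored * GB_PER_TIB * PRICE_DIFF_PER_GB_MONTH


first = annual_savings_usd(216)   # first line of the case study
second = annual_savings_usd(143)  # second line of the case study

print(f"216 TiB: about ${first:,.0f} per year")
print(f"143 TiB: about ${second:,.0f} per year")
print(f"Total:   about ${first + second:,.0f} per year")
```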
Final notes
Google Cloud has additional storage and regionality capabilities that we hope to add to Terra, including:
- Nearline
- Coldline
- Archive
- Dual-region
- Multi-region EU or ASIA
Additional resources
If you are comfortable working in Google Cloud console, see Accessing advanced Google Cloud features in Terra to take advantage of these capabilities in non-Terra Google Cloud projects.
To learn more about regional selections, see Best practices for Compute Engine regions selection.
To learn more about Cloud pricing, see Understanding and controlling Cloud costs.
To learn more about Terra storage and compute location controls, see Customizing where your data are stored and analyzed.