Sharing research data and resource files can help accelerate science and save the research community money on storage. However, storing and sharing data incur Cloud costs for the data owner. This document is intended to help researchers and organizations manage those costs.
This document is focused on
This document is not about:
- Authorizing/controlling access
- Choosing what to store/share
Creating these tips was motivated by the ongoing addition of regionality capabilities into Terra. For more information on regionality, see US Regional or Multi-regional US buckets: trade-offs.
Source material for this article was contributed by Matt Bookman and the Verily Life Sciences solutions team as part of the design and engineering rollout of Terra support for data regionality.
When storing and sharing data in Google Cloud Storage (GCS), there are three categories of costs to consider: storage, retrieval, and network egress. Storage costs are always incurred by you as the data owner. Retrieval and network egress costs can optionally be passed on to the data users via the requester pays bucket option.
This document explains these charges and provides a framework to appropriately share costs and mitigate egress charges.
When creating a Cloud Storage bucket, you have a few key choices to make:
- Storage Class
- Multiregional versus Regional
Google offers multiple storage classes, each with a different cost structure. This document focuses on the choice of Standard storage. Other ("colder") storage classes come with more complex cost structures (adding in retrieval costs, discussed below), which tend to reduce the overall value of the data to the community. All Workspace buckets in Terra are Standard storage class.
Multiregional versus Regional
Within the Standard storage class, you can store data in a single Google Cloud region (such as us-central1) or in a multiregion (such as US). The trade-off is between lower storage cost (e.g., $0.020 / GB / month in us-central1 versus $0.026 / GB / month in US) and being able to access the data from more locations without network egress charges (discussed below).
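To make the storage side of this trade-off concrete, here is a minimal sketch of the arithmetic. It uses only the two example prices quoted above, which may be out of date. The function names and the 100 TB dataset size are illustrative assumptions, not part of Terra or Google Cloud.

```python
# Example prices quoted in this article (USD per GB per month).
# Assumption: these may not match current Google Cloud pricing.
PRICE_PER_GB_MONTH = {
    "us-central1": 0.020,  # Regional
    "US": 0.026,           # Multiregional
}


def monthly_storage_cost(size_gb: float, location: str) -> float:
    """Estimate the monthly Standard storage cost in USD."""
    return size_gb * PRICE_PER_GB_MONTH[location]


def monthly_savings_regional(size_gb: float) -> float:
    """Estimate the monthly savings of us-central1 over the US multiregion."""
    return monthly_storage_cost(size_gb, "US") - monthly_storage_cost(
        size_gb, "us-central1"
    )


if __name__ == "__main__":
    size_gb = 100_000  # hypothetical 100 TB dataset (1 TB taken as 1,000 GB)
    for location in PRICE_PER_GB_MONTH:
        cost = monthly_storage_cost(size_gb, location)
        print(f"{location}: ${cost:,.2f} per month")
    print(f"Regional savings: ${monthly_savings_regional(size_gb):,.2f} per month")
```

At these example prices, the hypothetical 100 TB dataset costs about $600 less per month in the Regional bucket. That saving has to be weighed against the egress exposure discussed below.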
You also need to consider where to store your data (i.e., what particular region or multiregion). This decision will likely be driven by expected access patterns. For example, if your data files are most likely accessed from machines in the United States, selecting either Regional us-central1 or US multi-regional will be the best choice to reduce or eliminate network egress charges.
Retrieval charges are incurred for any retrieval of objects stored in a Cloud Storage class other than Standard (i.e., Nearline, Coldline, or Archive). Note: Currently, these archival storage options are not available for Workspace buckets.
For data shared with the community, predicting access patterns is very difficult. If the data you share are highly valuable, access is likely to be frequent, cutting into the cost savings of the colder storage option.
For maximum savings to you as a data owner, choose a colder storage class and make your bucket requester pays. This passes the retrieval cost on to the members of the community who use the data. If you choose this option, please communicate it to community members, as retrieval costs are not commonly expected.
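The interaction between colder storage, access frequency, and requester pays can be sketched as a small cost model. This is only an illustration: the function names are hypothetical, and the colder-class storage and retrieval prices in the example are placeholder values, not actual Google Cloud prices.

```python
def owner_monthly_cost(
    size_gb: float,
    storage_price: float,
    retrieved_gb: float = 0.0,
    retrieval_price: float = 0.0,
    requester_pays: bool = False,
) -> float:
    """Estimate what the data owner pays per month (USD).

    Storage is always paid by the owner. Retrieval is paid by the owner
    unless the bucket is requester pays.
    """
    storage = size_gb * storage_price
    retrieval = 0.0 if requester_pays else retrieved_gb * retrieval_price
    return storage + retrieval


def break_even_retrieved_gb(
    size_gb: float,
    standard_price: float,
    cold_price: float,
    retrieval_price: float,
) -> float:
    """GB retrieved per month at which a colder class stops being cheaper
    than Standard for an owner who pays for retrieval."""
    if retrieval_price <= 0:
        raise ValueError("retrieval_price must be positive")
    return size_gb * (standard_price - cold_price) / retrieval_price


if __name__ == "__main__":
    # Placeholder prices for illustration only (USD per GB).
    standard_price = 0.020
    cold_price = 0.010
    retrieval_price = 0.01
    size_gb = 10_000

    limit = break_even_retrieved_gb(
        size_gb, standard_price, cold_price, retrieval_price
    )
    print(f"Colder storage is cheaper below {limit:,.0f} GB retrieved per month")
```

With requester_pays=True the owner's cost no longer depends on how much is retrieved, which is why that combination gives the owner the largest savings.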
Network egress charges apply when data are transferred out of a storage region. For example, if data stored in a Regional bucket in us-central1 are copied to a Compute Engine virtual machine (VM) in us-east1, egress charges are incurred. Similarly, if data stored in a multiregional bucket in the US are copied to a Compute Engine VM in europe-west1, egress charges are incurred.
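The rule described above can be sketched as a small helper. This is a simplified model that covers only the cases in this article (a Regional bucket, or the US multiregion, accessed from a Compute Engine VM). It is not a substitute for the Google Cloud pricing documentation. The function name and the prefix-matching approach are assumptions made for illustration.

```python
# Assumption: only the US multiregion is modeled, matching the examples
# in this article. Its member regions are taken to be those named "us-...".
MULTIREGION_PREFIXES = {"us": "us-"}


def incurs_egress(bucket_location: str, vm_region: str) -> bool:
    """Return True if copying from the bucket to the VM leaves the
    bucket's storage location, under the simplified model above."""
    bucket = bucket_location.lower()
    vm = vm_region.lower()

    if "-" in bucket:
        # Regional bucket: no egress only when the VM is in the same region.
        return bucket != vm

    if bucket not in MULTIREGION_PREFIXES:
        raise ValueError(f"multiregion not modeled in this sketch: {bucket_location}")

    # Multiregional bucket: no egress when the VM region is inside it.
    return not vm.startswith(MULTIREGION_PREFIXES[bucket])


if __name__ == "__main__":
    examples = [
        ("us-central1", "us-central1"),
        ("us-central1", "us-east1"),
        ("US", "us-east1"),
        ("US", "europe-west1"),
    ]
    for bucket_location, vm_region in examples:
        charged = incurs_egress(bucket_location, vm_region)
        print(f"{bucket_location} -> {vm_region}: egress charged = {charged}")
```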
By default, the data owner pays the egress charges. Egress charges can be passed along to the community by turning on the requester pays storage bucket option. To learn more about requester pays buckets and how to turn on this option, see Using requester pays workspaces/buckets in Terra.
Making your data available to the community is a great way to accelerate science. Choices you make around sharing that data will depend on your situation (volume of data, frequency of access, organization and project funding). In this section, we make a recommendation based on a common set of assumptions.
- Your data are most used on a particular continent or in a particular country.
- Your data are primarily accessed from Cloud VMs (versus machines off-Cloud).
- Your program is sufficiently funded to store the data in Standard storage.
- Your program is NOT sufficiently funded to absorb unplanned spikes in charges (such as from network egress charges).
With that framing, we recommend setting up your storage as follows:
- Create a Cloud project designated for data sharing, separate from other Cloud resources (to simplify access controls).
- Pick the bucket location that best aligns with the location(s) of highest usage.
- If the data files are large and you need to reduce storage costs, make the bucket Regional (in the US, choose us-central1 to align with other existing programs).
- Make your bucket requester pays.
- Grant the storage.buckets.get permission to the public (so the bucket storage options can be discovered programmatically or from "gsutil ls -Lb").
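Part of the setup above can be sketched as gsutil command lines. The Python helper below only builds and prints the commands and does not run them. The helper name, project ID, and bucket name are hypothetical placeholders. Granting public access is intentionally left out, since it depends on which IAM role you choose to carry the storage.buckets.get permission.

```python
import shlex
from typing import List


def owner_setup_commands(
    project_id: str, bucket_name: str, location: str = "us-central1"
) -> List[List[str]]:
    """Build gsutil commands for a Standard, requester pays bucket."""
    bucket_url = f"gs://{bucket_name}"
    return [
        # Create a Standard-class bucket in the chosen location.
        ["gsutil", "mb", "-p", project_id, "-c", "standard", "-l", location, bucket_url],
        # Turn on requester pays for the bucket.
        ["gsutil", "requesterpays", "set", "on", bucket_url],
        # Inspect the bucket's location and storage class, billing this project.
        ["gsutil", "-u", project_id, "ls", "-Lb", bucket_url],
    ]


if __name__ == "__main__":
    # Hypothetical project and bucket names.
    for command in owner_setup_commands("my-sharing-project", "my-shared-data"):
        print(shlex.join(command))
```

Review the printed commands before running them in a shell, and replace the placeholder names with your own dedicated data-sharing project and bucket.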
Then communicate to the community:
- Your data storage location.
- Your bucket's configuration as requester pays.
- How users can copy data to a bucket in their own region, if it differs from yours and they need frequent access (rather than incurring egress charges on each use).
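For the last point, the guidance you give users could look like the following sketch, which builds a one-time gsutil copy from a requester pays bucket into the user's own bucket. The helper name, project ID, and bucket paths are hypothetical placeholders. The command is printed rather than executed.

```python
import shlex
from typing import List


def copy_to_own_bucket_command(
    billing_project: str, source_url: str, destination_url: str
) -> List[str]:
    """Build a gsutil command that copies data out of a requester pays
    bucket, billing the copy to the user's own project via -u."""
    return [
        "gsutil",
        "-u", billing_project,  # project billed for requester pays access
        "-m",                   # copy in parallel
        "cp", "-r",
        source_url,
        destination_url,
    ]


if __name__ == "__main__":
    # Hypothetical names for illustration.
    command = copy_to_own_bucket_command(
        "my-billing-project",
        "gs://shared-data-bucket/dataset",
        "gs://my-own-regional-bucket/dataset",
    )
    print(shlex.join(command))
```

The user pays the egress charge once for the copy. Later reads from VMs in the same region as their own bucket then avoid repeated egress charges, at the price of paying to store the copy.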
With the above configuration choices, you'll have consistent storage costs and end users will pay any network egress charges. End users will be informed and can make better choices in how they interact with your data.
Preventing egress (experimental)
If you are interested in further helping the community avoid accidental egress charges to themselves when they use your data, you may be able to use VPC service controls within a Cloud Organization to put a service perimeter around the Google Cloud project that contains your Cloud Storage bucket.
Please refer to this document on how to configure GCS to avoid network egress charges.