Sharing research data and resource files can help accelerate science and save the research community money on storage costs. However, storage and sharing incur Cloud costs for the data owner. This document is intended to help researchers and organizations manage those costs.
Source material for this article was contributed by Matt Bookman and the Verily Life Sciences solutions team as part of the design and engineering rollout of Terra support for data regionality.
This document focuses on:
- Data in Google Cloud Storage
- Costs related to storage and sharing
This document does not cover:
- Authorizing/controlling access
- Choosing what to store/share
Creating these tips was motivated by the ongoing addition of regionality capabilities into Terra. For more information on regionality, see US Regional or Multi-regional US buckets: trade-offs.
Cost Structure
When storing and sharing data in Google Cloud Storage (GCS), there are three categories of costs to consider: storage, retrieval, and network data transfer out. Storage costs are always incurred by you as the data owner. Retrieval and network data transfer costs can optionally be passed on to the data users via the requester pays bucket option.
This document explains these charges and provides a framework to appropriately share costs and mitigate data transfer charges.
Storage costs
When creating a Cloud Storage bucket, your costs will depend on a few key choices:
- Storage Class
- Multiregional versus Regional
- Location
Storage Class
Google offers multiple storage classes, each with a different cost structure. This document focuses on Standard storage buckets. All Workspace buckets in Terra are Standard storage class. Other ("colder") storage classes come with more complex cost structures (adding in retrieval costs, discussed below), which tend to reduce the overall value of the data to the community and are not available for Terra workspaces.
Multiregional versus Regional
Within the Standard storage class, you can store data in a single Google Cloud Region (such as us-central1) or in a multiregion (such as US). The trade-off is between cost and data accessibility. Storage costs are lower for Regional than for Multiregional buckets (e.g., $0.020 / GB / month in us-central1 versus $0.026 / GB / month in US). However, accessing a bucket from outside its region (which is more likely for a Regional than for a Multiregional bucket) will incur network data transfer charges (discussed below).
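To make the storage side of this trade-off concrete, the short Python sketch below compares monthly storage costs using the two example prices quoted above. The dataset size is a hypothetical value chosen for illustration, and actual prices may differ from these figures.

```python
# Example prices in USD per GB per month, taken from the figures quoted above.
REGIONAL_PRICE = 0.020       # Standard storage, us-central1
MULTIREGIONAL_PRICE = 0.026  # Standard storage, US multiregion


def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Return the monthly storage cost in USD for a dataset of the given size."""
    return size_gb * price_per_gb_month


# Hypothetical dataset size: 10 TB, counted here as 10,000 GB.
size_gb = 10_000

regional = monthly_storage_cost(size_gb, REGIONAL_PRICE)
multiregional = monthly_storage_cost(size_gb, MULTIREGIONAL_PRICE)

print(f"Regional (us-central1): ${regional:,.2f} per month")
print(f"Multiregional (US):     ${multiregional:,.2f} per month")
print(f"Difference:             ${multiregional - regional:,.2f} per month")
```

This covers storage only. Any savings from a Regional bucket need to be weighed against the network data transfer charges that apply when users read the data from another region.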
Location
You also need to consider where to store your data (i.e., in which particular region or multiregion). This decision will likely be driven by expected access patterns. For example, if your data files are most likely to be accessed from machines in the United States, selecting either Regional us-central1 or Multiregional US will be the best choice to reduce or eliminate network data transfer charges.
Retrieval costs
Retrieval charges apply to every Cloud Storage class other than Standard (i.e., Nearline, Coldline, and Archive) and are incurred whenever objects in those classes are retrieved. Note: Currently, these archival storage options are not available for Workspace buckets.
For data shared with the community, predicting access patterns is very difficult. If the data you share are highly valuable, access is likely to be frequent, cutting into the cost savings of the colder storage option.
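As a rough illustration of how frequent access erodes the savings, the sketch below estimates how much data can be retrieved per month before a colder class costs the owner more than Standard storage. The Standard price is the us-central1 example figure used earlier in this document. The cold storage and retrieval prices are hypothetical placeholders, not quoted Google prices, and the function names are made up for this example.

```python
STANDARD_PRICE = 0.020   # USD per GB per month (example figure from this document)
COLD_PRICE = 0.010       # USD per GB per month (hypothetical placeholder)
RETRIEVAL_PRICE = 0.010  # USD per GB retrieved (hypothetical placeholder)


def monthly_owner_cost(size_gb: float, storage_price: float,
                       retrieved_gb: float = 0.0,
                       retrieval_price: float = 0.0) -> float:
    """Monthly cost to the owner: storage plus any retrieval charges."""
    return size_gb * storage_price + retrieved_gb * retrieval_price


def break_even_retrieval_gb(size_gb: float, standard_price: float,
                            cold_price: float, retrieval_price: float) -> float:
    """GB retrieved per month at which cold storage costs the same as Standard."""
    return size_gb * (standard_price - cold_price) / retrieval_price


size_gb = 1_000  # hypothetical dataset size
threshold = break_even_retrieval_gb(size_gb, STANDARD_PRICE, COLD_PRICE, RETRIEVAL_PRICE)
print(f"Savings disappear once about {threshold:,.0f} GB are retrieved per month")
```

With these placeholder prices, the savings are gone once users retrieve roughly the full dataset once per month. Substitute current published prices before relying on the result.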
For maximum savings to you as a data owner, choose a colder storage class and choose to make your bucket requester pays. This passes on the retrieval cost to members of the community who use that data. If you choose this option, please communicate this to community members, as retrieval costs are not commonly expected.
To learn more, see Using requester pays workspaces/buckets in Terra and Best practices for accessing external resources.
Data transfer
Network data transfer charges apply when data are transferred out of a storage region. For example, if data stored in a Regional bucket in us-central1 are copied to a Compute Engine virtual machine (VM) in us-east1, data transfer out charges are incurred. Similarly, if data stored in a multiregional bucket in the US are copied to a Compute Engine VM in europe-west1, data transfer charges are incurred.
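The rule in the two examples above can be sketched as a small check. This is a deliberately simplified model, not Google's billing logic: the function name and the prefix mapping are assumptions made for illustration, and the mapping covers only the US multiregion.

```python
# Assumption for illustration: regions whose names start with "us-" are
# treated as being inside the US multiregion. Other multiregions are not
# modeled here and would need their own, carefully checked entries.
MULTIREGION_PREFIXES = {"US": "us-"}


def leaves_bucket_location(bucket_location: str, vm_region: str) -> bool:
    """Return True if copying from the bucket to the VM moves data out of
    the bucket's region or multiregion (i.e., transfer charges are expected)."""
    bucket = bucket_location.strip()
    vm = vm_region.strip().lower()

    prefix = MULTIREGION_PREFIXES.get(bucket.upper())
    if prefix is not None:
        # Multiregional bucket: data stays "in location" for any member region.
        return not vm.startswith(prefix)

    # Regional bucket (or an unmodeled multiregion): only an exact match
    # counts as staying in location, which errs on the side of caution.
    return bucket.lower() != vm


print(leaves_bucket_location("us-central1", "us-east1"))  # Regional bucket, other region
print(leaves_bucket_location("US", "europe-west1"))       # Multiregional bucket, other continent
print(leaves_bucket_location("US", "us-central1"))        # Multiregional bucket, member region
```

A check like this could be used to warn users before they start a large copy. The actual charge depends on the source and destination, so consult current network pricing for amounts.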
By default, the data owner pays the data transfer charges. These charges can be passed along to the community by turning on the requester pays storage bucket option. To learn more about requester pays buckets and how to turn on this option, see Using requester pays workspaces/buckets in Terra.
Recommendations
Making your data available to the community is a great way to accelerate science. Choices you make around sharing that data will depend on your situation (volume of data, frequency of access, organization and project funding). In this section, we make a recommendation based on a common set of assumptions.
Assumptions
- Your data are most used on a particular continent or in a particular country.
- Your data are primarily accessed from Cloud VMs (versus machines off-Cloud).
- Your program is sufficiently funded to store the data in Standard storage.
- Your program is NOT sufficiently funded to absorb unplanned spikes in charges (such as from network data transfer charges).
With that framing, we recommend setting up your storage as follows:
- Create a Cloud project designated for data sharing, separate from other Cloud resources (to simplify access controls).
- Pick the bucket location that best aligns with the location(s) of highest usage.
- If the data files are large and you need to reduce storage costs, make the bucket Regional (in the US, choose us-central1 to align with other existing programs).
- Make your bucket requester pays.
- Grant storage.buckets.get permissions to the public (so the bucket storage options can be discovered programmatically or from gsutil ls -Lb).
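As one possible way to apply and check these settings programmatically, the sketch below uses the google-cloud-storage Python client. It assumes that package is installed and that the caller is authenticated as the bucket owner. The helper names and the bucket name are hypothetical, and the helpers accept any object that exposes the same attributes as the client's Bucket class.

```python
def summarize_bucket(bucket) -> dict:
    """Collect the storage options a data user would want to discover.

    `bucket` is expected to be a google.cloud.storage Bucket whose
    metadata has already been loaded.
    """
    return {
        "name": bucket.name,
        "location": bucket.location,            # e.g. "US" or "US-CENTRAL1"
        "location_type": bucket.location_type,  # e.g. "region" or "multi-region"
        "storage_class": bucket.storage_class,  # e.g. "STANDARD"
        "requester_pays": bucket.requester_pays,
    }


def enable_requester_pays(bucket) -> None:
    """Turn on requester pays and save the change to Cloud Storage."""
    bucket.requester_pays = True
    bucket.patch()


def main(bucket_name: str) -> None:
    # Imported here so the helpers above can be used without the package.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket(bucket_name)  # loads the bucket metadata
    enable_requester_pays(bucket)
    print(summarize_bucket(bucket))


# Example call (hypothetical bucket name):
# main("my-shared-data-bucket")
```

Granting the public permission is not shown here, because how to do so (for example, through a custom role) depends on your organization's access policies.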
Then communicate to the community:
- Your data storage location.
- Your bucket's configuration as requester pays.
- How users can copy your data to a bucket in their own region. This is relevant if their region is different from your data's and they need frequent access to the data.
With the above configuration choices, you'll have consistent storage costs and end users will pay any network data transfer charges. End users will be informed and can make better choices in how they interact with your data.
Preventing data transfer (experimental)
If you are interested in further helping the community avoid paying accidental data transfer charges when they use your data, you may be able to use VPC service controls within a Cloud Organization to put a service perimeter around the Google Cloud project that contains your Cloud Storage bucket.
Please refer to this document on how to configure GCS to avoid network data transfer charges.