Data Publishers tips

Allie Hajian

Sharing research data and resource files can help accelerate science and save the research community money on storage costs. However, storing and sharing data does incur Cloud costs for the data owner. This document is intended to help researchers and organizations with useful data manage those costs.

This document is focused on

This document is not about

  • Authorizing/controlling access
  • Choosing what to store/share

These tips were motivated by the ongoing addition of regionality capabilities to Terra. For more information on regionality, see US Regional or Multi-regional US buckets: trade-offs.

Source material for this article was contributed by Matt Bookman and the Verily Life Sciences solutions team as part of the design and engineering rollout of Terra support for data regionality.

Cost Structure

When storing and sharing data in Google Cloud Storage (GCS), there are three categories of costs to consider: storage, retrieval, and network egress. While the storage costs will always be incurred by you as the data owner, retrieval and network egress costs can optionally be passed on to the data users via the requester pays bucket option.

This document explains these charges and provides a framework to appropriately share costs and mitigate egress charges.

Storage costs

When creating a Cloud Storage bucket, you have a few key choices to make:

  • Storage Class
  • Multi-regional versus Regional
  • Location

Storage Class

Google offers multiple storage classes, each with a different cost structure. This document focuses on the choice of Standard storage. Other ("colder") storage classes come with more complex cost structures (adding in retrieval costs, discussed below), which tend to reduce the overall value of the data to the community. All Workspace buckets in Terra are Standard storage class. 

Multi-regional versus Regional

Within the Standard storage class, you have a choice to store data in a single Google Cloud Region (such as us-central1) or in a multi-region (such as US). The trade-off here is between reduced storage cost (e.g. $0.020 / GB / month in us-central1 versus $0.026 / GB / month in US) and being able to access this data without network egress charges (discussed below).
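
To make the trade-off concrete, the sketch below compares monthly storage costs using the two example prices quoted above. The function name and the 50 TB dataset size are illustrative assumptions, and actual prices may have changed, so check current Google Cloud pricing before relying on these figures.

```python
# Example prices quoted in this article (USD per GB per month).
# These may be out of date; treat them as illustrative only.
PRICE_PER_GB_MONTH = {
    "us-central1": 0.020,  # Regional
    "US": 0.026,           # Multi-regional
}


def monthly_storage_cost(size_gb: float, location: str) -> float:
    """Return the estimated monthly Standard storage cost in USD."""
    return size_gb * PRICE_PER_GB_MONTH[location]


if __name__ == "__main__":
    size_gb = 50_000  # hypothetical 50 TB dataset (1 TB taken as 1,000 GB)
    regional = monthly_storage_cost(size_gb, "us-central1")
    multi = monthly_storage_cost(size_gb, "US")
    print(f"Regional (us-central1): ${regional:,.2f} per month")
    print(f"Multi-regional (US):    ${multi:,.2f} per month")
    print(f"Difference:             ${multi - regional:,.2f} per month")
```

At these example prices, the hypothetical 50 TB dataset costs about $300 more per month in the US multi-region than in us-central1. That difference is what you would weigh against the egress charges discussed below.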

Location

You'll also need to consider where to store your data (i.e. what particular region or multi-region). This decision will likely be driven by expected access patterns. For example, if your data is most likely to be accessed from machines in the United States, selecting either Regional us-central1 or US multi-regional will be the best choice to reduce or eliminate network egress charges.

Retrieval costs

Retrieval charges apply to any Cloud Storage classes other than Standard (i.e. Nearline, Coldline, and Archive). Note that these archival storage options are not currently available for Workspace buckets. Retrieval charges are incurred for any retrieval of objects in these Cloud Storage classes. 

For data shared with the community, predicting access patterns is very difficult. If the data you are sharing is highly valuable, access would likely be frequent, cutting into the cost savings of the colder storage option.
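
One way to reason about this is to estimate how much data would need to be retrieved each month before a colder class stops saving money. The sketch below is a rough model only: the Standard rate is the us-central1 example quoted in this article, while the colder storage and retrieval rates are placeholders rather than quoted Google prices, and the function names are hypothetical. It also ignores other charges a colder class may carry.

```python
STANDARD_RATE = 0.020        # USD per GB per month (example quoted in this article)
COLD_STORAGE_RATE = 0.010    # placeholder, NOT a quoted Google price
COLD_RETRIEVAL_RATE = 0.010  # placeholder USD per GB retrieved, NOT a quoted Google price


def monthly_cost(size_gb, storage_rate, retrieval_rate=0.0, gb_retrieved=0.0):
    """Monthly cost in USD: storage plus any retrieval charges."""
    return size_gb * storage_rate + gb_retrieved * retrieval_rate


def break_even_retrieval_gb(size_gb, standard_rate, cold_storage_rate, cold_retrieval_rate):
    """GB retrieved per month at which the colder class costs the same as Standard."""
    storage_savings = size_gb * (standard_rate - cold_storage_rate)
    return storage_savings / cold_retrieval_rate


if __name__ == "__main__":
    size_gb = 10_000  # hypothetical dataset size
    threshold = break_even_retrieval_gb(
        size_gb, STANDARD_RATE, COLD_STORAGE_RATE, COLD_RETRIEVAL_RATE
    )
    print(f"Colder storage saves money only below {threshold:,.0f} GB retrieved per month")
```

With these placeholder rates, the colder class stops saving money once the community retrieves about as much data each month as the bucket holds. Substitute current published rates to get a meaningful threshold for your own data.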

For maximum savings to you as a data owner, you could choose a colder storage class and choose to make your bucket requester pays. This would pass on the retrieval cost to members of the community who use that data. If you choose this option, please communicate this to community members, as retrieval costs are not commonly expected. 

To learn more, see Using requester pays workspaces/buckets in Terra and Best practices for accessing external resources.  

Egress

Network egress charges apply when data is transferred out of a storage region. For example, if data stored in a Regional bucket in us-central1 is copied to a Compute Engine VM in us-east1, egress charges are incurred. Similarly, if data stored in a Multi-regional bucket in the US is copied to a Compute Engine VM in europe-west1, egress charges are incurred.
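
The two examples above can be expressed as a small check. This is a deliberately simplified sketch, not Google's billing logic: it assumes that a Regional bucket is free to read only from its own region, and that the US multi-region is free to read from any region whose name starts with "us-". The function name is hypothetical, and other multi-regions are intentionally left out.

```python
# Assumption: only the US multi-region is modeled, matched by region-name prefix.
MULTI_REGION_PREFIXES = {"US": "us-"}


def incurs_egress(bucket_location: str, vm_region: str) -> bool:
    """Return True if copying from the bucket to the VM would incur egress
    charges under the simplified model described above."""
    bucket = bucket_location.strip()
    vm = vm_region.strip().lower()
    prefix = MULTI_REGION_PREFIXES.get(bucket.upper())
    if prefix is not None:
        # Multi-regional bucket: no charge for VMs in a matching region.
        return not vm.startswith(prefix)
    # Regional bucket: no charge only when the VM is in the same region.
    return bucket.lower() != vm


if __name__ == "__main__":
    for bucket, vm in [
        ("us-central1", "us-east1"),
        ("US", "europe-west1"),
        ("US", "us-east1"),
        ("us-central1", "us-central1"),
    ]:
        print(f"{bucket} -> {vm}: egress charged = {incurs_egress(bucket, vm)}")
```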

By default, the data owner pays for the egress charges. Egress charges can be passed along to the community by turning on the requester pays storage bucket option. To learn more about requester pays buckets and how to turn on this option, see Using requester pays workspaces/buckets in Terra.

Recommendations

Making your data available to the community is a great way to accelerate science. Choices you make around sharing that data will depend on your situation (volume of data, frequency of access, organization and project funding). In this section, we make a recommendation based on a common set of assumptions.

Assumptions

  • Your data is most used on a particular continent or in a particular country.
  • Your data is primarily accessed from Cloud VMs (versus machines off-Cloud).
  • Your program is sufficiently funded to store the data in Standard storage.
  • Your program is NOT sufficiently funded to absorb unplanned spikes in charges (such as from network egress charges).

With that framing, we recommend setting up your storage as follows:

  • Create a Cloud project designated for data sharing, separate from other Cloud resources (to simplify access controls).
  • Pick the bucket location that best aligns with the location(s) of highest usage.
  • If the data is large and you need to reduce storage costs, make the bucket Regional (in the US, choose us-central1 to align with other existing programs).
  • Make your bucket requester pays.
  • Grant the storage.buckets.get permission to the public (so the bucket storage options can be discovered programmatically or from "gsutil ls -Lb").
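
As a rough illustration of the bucket-related steps above, the sketch below assembles the corresponding gsutil commands and prints them without running anything. The project and bucket names are placeholders and the helper function is hypothetical. The public storage.buckets.get grant is left out because the IAM role used to grant it is a choice this article does not specify.

```python
import shlex


def bucket_setup_commands(project_id: str, bucket_name: str, location: str):
    """Build (but do not run) gsutil commands for a Standard, requester pays bucket."""
    url = f"gs://{bucket_name}"
    return [
        # Create a Standard-class bucket in the chosen location.
        ["gsutil", "mb", "-p", project_id, "-c", "standard", "-l", location, url],
        # Turn on requester pays for the bucket.
        ["gsutil", "requesterpays", "set", "on", url],
        # Inspect the bucket's settings; -u names the project to bill for the request.
        ["gsutil", "-u", project_id, "ls", "-Lb", url],
    ]


if __name__ == "__main__":
    # Placeholder names; replace with your own project, bucket, and location.
    for cmd in bucket_setup_commands("my-sharing-project", "my-shared-data", "us-central1"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them leaves room to review each one before it touches a real project.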

Then communicate to the community:

  • Your data storage location
  • Your bucket's configuration as requester pays
  • How users can copy data to a bucket in their own region, if it differs from yours and they need frequent access (rather than incurring egress charges for each use)
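
For the last point, the instructions you give users might include a command along these lines. The sketch below only builds the gsutil command and prints it. The helper name, bucket names, and billing project are placeholders. The -u option names the project billed for access to a requester pays bucket.

```python
import shlex


def requester_pays_copy_command(billing_project: str, source_url: str, destination_url: str):
    """Build (but do not run) a gsutil command that copies data out of a
    requester pays bucket, billing the given project for the access."""
    for url in (source_url, destination_url):
        if not url.startswith("gs://"):
            raise ValueError(f"Expected a gs:// URL, got: {url}")
    return ["gsutil", "-u", billing_project, "cp", "-r", source_url, destination_url]


if __name__ == "__main__":
    # Placeholder names; a user would substitute their own project and buckets.
    cmd = requester_pays_copy_command(
        "users-billing-project",
        "gs://my-shared-data/release-1",
        "gs://users-regional-bucket/",
    )
    print(shlex.join(cmd))
```

A user who copies the data once pays egress for that copy and storage for their own bucket, instead of paying egress on every later read.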

With the above configuration choices, you will have consistent storage costs and end users will pay any network egress charges. End users will also be informed and able to make better choices in how they interact with your data.

Preventing egress (experimental)

If you are interested in further helping the community avoid accidental egress charges to themselves when they use your data, you may be able to use VPC service controls within a Cloud Organization to put a service perimeter around the Google Cloud project that contains your Cloud Storage bucket.

Please refer to this document on how to configure GCS to avoid network egress charges.
