Overview: Controlling Google Cloud costs on Terra

Allie Hajian

This document provides information on the costs of using key cloud services (Google Cloud Storage, Google Compute Engine, and Google BigQuery), as well as examples to help you make informed decisions on controlling costs on Terra. For up-to-date billing information, see the documentation for Google Cloud Pricing.

Content for this article was contributed by Matt Bookman from Verily Life Sciences based on work done in Terra for AMP PD, a public/private partnership collaborating toward biomarker discovery to advance the development of Parkinson’s Disease therapies.

Cloud costs overview

The Terra platform is free to use. For example, you can browse showcase workspaces and the Data Library as soon as you register for an account. However, operations in Terra - such as running workflows, running Jupyter Notebooks, and accessing and storing data - may incur Google Cloud charges. These charges are billed by Google Cloud and paid through your Terra Billing project. 

Storage | Compute and disks | Query processing | Data transfer out (egress) | Data retrieval

Common use-case examples

See several examples below of things you might do on Terra and their associated Google Cloud costs.

Running a workflow

  • Converting a CRAM to a BAM
  • Aligning a genomic sample to a reference and performing variant calling using GATK Best Practices WDLs
  • Aligning a transcriptomic sample to a reference using STAR

Running a notebook

  • Performing quality control checks on genomic data
  • Analyzing genomic variants
  • Visualization and analysis (in R or Python) on outputs from running a workflow

Storing files in Cloud Storage (Google Buckets or BigQuery)

  • Notebooks (ipynb files)
  • Clinical data (CSV or TSV files)
  • Genomic data
  • Transcriptomics data

Google Cloud Storage

Google Cloud Storage (GCS) is an object store where objects are stored in buckets. You can think of it as a place to store files in a structure similar to folders or directories. For more details, see How subdirectories work.

Storage has a cost. Additionally, accessing or moving data out of GCS may incur charges (data transfer out), depending on where the data is stored and where the data will be accessed from.

Storage cost considerations (primary drivers)

Below are questions that may influence the cost of data storage for you or anyone accessing your data. To learn more, see Customizing where your data are stored and analyzed.  

How much are you storing?

A key cloud concept is that you only pay for what you use. Thus you don't need to preallocate storage in GCS (like buying an array of disks); you simply pay for what you store.

Where do you store it?

Google Cloud Storage provides several different storage classes, each with different pricing. The options are primarily based on where you want to store the data and how frequently you access it. Storing data in more locations (multiple "regions") is more expensive than storing data in fewer locations ("regional").

See US multi-region versus regional storage: tradeoffs. For more information, read about Google Cloud regions and bucket locations.

How frequently do you access it?

Data you access frequently should be stored in a more expensive storage tier. Data you access infrequently can be stored in less expensive "cold" storage. 

Storage classes cost options

Multiregional versus regional storage 

Multiregional storage is the most expensive option at $0.026 per GB per month. Regional storage is less expensive at $0.020 - $0.023 per GB per month (depending on what region - see Google's pricing tables for details). 

Multiregional storage is most appropriate for data that need to be accessed quickly and frequently from many locations (e.g., for a website or gaming).  

This is not typically the case for genomic or transcriptomic research data. With these data types, overall access frequency is low and the emphasis is on managing storage costs.
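As a rough illustration of the price difference, the sketch below estimates the monthly bill for the same data in each location type. The per-GB prices are the ones quoted in this article and may be out of date. The function name, the 100 TB dataset, and the choice of 1 TB = 1,000 GB are assumptions made for the example.

```python
# Prices quoted in this article (USD per GB per month); check Google's
# pricing tables for current values.
MULTIREGIONAL_PER_GB = 0.026
REGIONAL_PER_GB = 0.020  # low end of the $0.020 - $0.023 range

GB_PER_TB = 1000  # assumption: decimal terabytes


def monthly_storage_cost(size_tb: float, price_per_gb: float) -> float:
    """Return the estimated monthly storage cost in USD."""
    return size_tb * GB_PER_TB * price_per_gb


if __name__ == "__main__":
    size_tb = 100  # hypothetical dataset size
    multi = monthly_storage_cost(size_tb, MULTIREGIONAL_PER_GB)
    regional = monthly_storage_cost(size_tb, REGIONAL_PER_GB)
    print(f"Multiregional: ${multi:,.2f} per month")
    print(f"Regional:      ${regional:,.2f} per month")
    print(f"Regional is {regional / multi:.0%} of the multiregional cost")
```

Swapping in the price for your own region gives a quick estimate before you decide where to create a bucket.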

Nearline and Coldline Storage

For data that are accessed very infrequently, Google Cloud offers Nearline and Coldline storage. These storage classes offer significantly reduced costs for storage ($0.010 per GB per month for Nearline and $0.004 per GB per month for Coldline), but add a retrieval charge ($0.01 per GB for Nearline and $0.05 per GB for Coldline).

These storage classes are most appropriate for archiving data, for example, after processing FASTQs into BAMs or CRAMs.
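To see how storage and retrieval charges combine, here is a minimal sketch that totals both for one month. The prices are the ones quoted in this article (with regional storage at the low end of its range). The function name and the example sizes are hypothetical, and the sketch ignores other charges such as minimum storage durations and operation fees.

```python
# Figures quoted in this article (USD per GB); they may be out of date.
STORAGE_PER_GB_MONTH = {"regional": 0.020, "nearline": 0.010, "coldline": 0.004}
RETRIEVAL_PER_GB = {"regional": 0.0, "nearline": 0.01, "coldline": 0.05}


def monthly_cost(storage_class: str, stored_gb: float, retrieved_gb: float) -> float:
    """Estimated monthly cost in USD: storage plus retrieval."""
    storage = stored_gb * STORAGE_PER_GB_MONTH[storage_class]
    retrieval = retrieved_gb * RETRIEVAL_PER_GB[storage_class]
    return storage + retrieval


if __name__ == "__main__":
    stored_gb = 10_000   # hypothetical: 10 TB of archived FASTQs
    retrieved_gb = 500   # hypothetical: 500 GB read back during the month
    for storage_class in STORAGE_PER_GB_MONTH:
        cost = monthly_cost(storage_class, stored_gb, retrieved_gb)
        print(f"{storage_class:>9}: ${cost:,.2f}")
```

Raising `retrieved_gb` shows how quickly retrieval charges erode the savings from colder classes.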

Data transfer cost considerations

Data transfer charges apply when copying GCS data out of the region(s) where the data are stored.

Data transfer examples

  • Copying data stored in one region to a compute engine virtual machine (VM) or Cloud Environment persistent disk in another region.
  • Copying data stored in one region to a GCS bucket in another region.
  • Copying data stored in a multi-region bucket to a regional GCS bucket.
  • Downloading data to your workstation or laptop.

Network data transfer charges vary, but copying data to a Google Cloud storage location within the United States typically costs $0.01 per GB.

Data transfer out of GCS

Downloading to your local workstation, laptop, or anywhere else outside of Google Cloud is subject to General network pricing charges. 

Amount of data | Cost to transfer (per GB)
0-1 TB | $0.12
1-10 TB | $0.11
10+ TB | $0.08

Accessing GCS data from within the same Cloud region where the data are stored incurs no data transfer charges.
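The download tiers in the table above can be turned into a rough estimator. This sketch assumes the tiers are marginal (each rate applies only to the data that falls inside that tier) and that 1 TB is 1,000 GB. Both are assumptions, as is the function name; confirm the current rates and rules in Google's network pricing documentation.

```python
# Tiers from the table above: (upper bound in GB, USD per GB).
TIERS = [
    (1_000, 0.12),         # first 1 TB
    (10_000, 0.11),        # next 9 TB
    (float("inf"), 0.08),  # everything above 10 TB
]


def download_cost(total_gb: float) -> float:
    """Estimated cost in USD, charging each tier's rate only for the GB in that tier."""
    cost = 0.0
    lower = 0.0
    for upper, price_per_gb in TIERS:
        if total_gb <= lower:
            break
        gb_in_tier = min(total_gb, upper) - lower
        cost += gb_in_tier * price_per_gb
        lower = upper
    return cost


if __name__ == "__main__":
    for gb in (500, 2_000, 20_000):  # hypothetical download sizes
        print(f"{gb:>6} GB: ${download_cost(gb):,.2f}")
```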

Retrieval cost considerations

Retrieval costs apply only to the "cold storage" classes: Nearline and Coldline.

Retrieval applies when you:

  1. Copy data from a cold storage bucket
  2. Move data within a cold storage bucket (a move is a copy followed by a deletion)

Google Compute Engine

Google Compute Engine (GCE) provides virtual machines (VMs) and block storage (disks) which can be used for running analyses such as converting a CRAM file to a BAM file or running a Jupyter Notebook to transform and visualize data.

Compute and disks concepts (VMs)

GCE allows you to create and destroy VMs as you need them. You can create VMs of different shapes and sizes (CPU and memory) for different workloads.

GCE follows the cloud philosophy that you only pay for what you use: you are billed for VMs and disks only from the time you create them to the time you destroy them. To be clear, however, you "use" (i.e., build up charges for) CPU, memory, and disk space for as long as your VM is running, even if it sits idle.

GCE's virtualization offers additional flexibility in that you can "stop" a running VM (at which point you stop being charged for the CPU and memory, but continue accruing charges for the disk) and "start" it again later. You can even change the amount of CPU and memory when you restart the VM.

Saving money with preemptible VMs

GCE offers significantly reduced costs for using preemptible VMs. If you have a workflow that will run in fewer than 24 hours, you can save up to 80% by using preemptible VMs. To learn more, see Controlling Cloud costs - sample use cases.

Compute costs

Detailing GCE pricing flexibility is beyond the scope of this document. See pricing details from the GCE Pricing documentation.

Questions to ask

  • How many CPUs does my compute task require?
  • How much memory does my compute task require?
  • How much disk does my compute task require?
  • Can my compute task finish in fewer than 24 hours?

If your compute need is for a long-running compute node, you should use a "full priced VM", since a preemptible VM lasts 24 hours at most. If your compute need is for fewer than 24 hours, and you can manage the complexity of preemption at any time within that 24 hours, a preemptible VM will cost almost 80% less. For more information, see the Google documentation on Preemption selection.
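The trade-off can be sketched as a small calculation. The 80% discount is the "up to" figure from this article, so treat it as a best case. The hourly rate in the example is hypothetical, and the sketch ignores the cost of work lost to preemption.

```python
PREEMPTIBLE_DISCOUNT = 0.80  # best case quoted in this article; actual discounts vary
MAX_PREEMPTIBLE_HOURS = 24   # a preemptible VM lasts 24 hours at most


def vm_cost(hourly_rate: float, hours: float, preemptible: bool = False) -> float:
    """Estimated compute cost in USD for one VM running for `hours`."""
    if preemptible:
        if hours > MAX_PREEMPTIBLE_HOURS:
            raise ValueError("task is too long for a single preemptible VM")
        hourly_rate *= 1 - PREEMPTIBLE_DISCOUNT
    return hourly_rate * hours


if __name__ == "__main__":
    rate = 0.10  # hypothetical full price in USD per hour
    print(f"Full price:  ${vm_cost(rate, 10):.2f}")
    print(f"Preemptible: ${vm_cost(rate, 10, preemptible=True):.2f}")
```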

Disk costs

GCE offers a range of disk types, including

  • Network-attached magnetic disks (persistent disk standard).
  • Network-attached solid state disks (persistent disk SSD).
  • Locally-attached solid state disks (local SSD).

What disk type is right for you?

In general, you pay more for larger disks and for higher-performance disks. Most life sciences workflows are not I/O bound, so the least expensive disk (Persistent Disk Standard) is typically the best choice. If your workflow is I/O bound, however, you may find that using Local SSDs on a preemptible instance is the best choice.

Data transfer costs

Data transfer charges apply when copying data out of the zone that a compute engine VM is running in.

Data transfer examples

  • Downloading data to your workstation or laptop.
  • Copying data from a VM in one zone to a VM in another zone.
  • Copying data from a VM in one region to a GCS bucket in another region.
  • Analyzing, in GCE, data stored with a different cloud provider.

No data transfer charges accrue for data copied between VMs in the same zone, or between a VM and a GCS bucket in the same region.

Note that the amount of Always Free Internet data transfer is currently 100 GB per month to each qualifying data transfer destination.

Google BigQuery

Google BigQuery (BQ) is a database where "tables" are stored in "datasets," including both tabular data and nested data. You can issue SQL queries to filter and retrieve data in BigQuery.

See this Google blog on Cost Optimization Best Practices for BigQuery.

Storing and accessing data in BigQuery have associated costs!

When you query data in BigQuery, consider just how much data your query "touches," as BigQuery query billing is based on the amount of data that the query engine "looks at" to satisfy the request.

BigQuery Storage Costs

BigQuery storage costs are $0.02 per GB per month for the first 90 days after table creation and $0.01 per GB per month from then on.
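As a quick sketch of the two-rate model described above, the function below picks the rate from the table's age. The rates are the ones quoted in this article and are assumed to be monthly. The function name and the example table size are hypothetical.

```python
ACTIVE_PER_GB = 0.02     # USD per GB, first 90 days after table creation
LONG_TERM_PER_GB = 0.01  # USD per GB, after 90 days
ACTIVE_DAYS = 90


def bq_monthly_storage_cost(table_gb: float, table_age_days: int) -> float:
    """Estimated monthly BigQuery storage cost in USD for one table."""
    rate = ACTIVE_PER_GB if table_age_days <= ACTIVE_DAYS else LONG_TERM_PER_GB
    return table_gb * rate


if __name__ == "__main__":
    table_gb = 1_000  # hypothetical 1 TB table
    print(f"New table: ${bq_monthly_storage_cost(table_gb, 30):.2f} per month")
    print(f"Old table: ${bq_monthly_storage_cost(table_gb, 120):.2f} per month")
```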

Query costs

When you run a query, you're charged by the number of bytes processed in the columns you select or filter on, even if you set an explicit limit on the number of records returned. Be careful about which columns you put in your SELECT lists and WHERE clauses.

BigQuery query costs are $5.00 per TB, with the first 1 TB per month free.
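To get a feel for how query charges add up over a month, here is a minimal sketch using the price and free allowance quoted above. The function name and the per-query sizes are hypothetical, and the current BigQuery pricing page should be treated as the authority on rates and units.

```python
from typing import Iterable

PRICE_PER_TB = 5.00      # USD per TB processed, as quoted in this article
FREE_TB_PER_MONTH = 1.0  # first 1 TB per month is free


def monthly_query_cost(tb_processed_per_query: Iterable[float]) -> float:
    """Estimated monthly query cost in USD from the TB each query processed."""
    total_tb = sum(tb_processed_per_query)
    billable_tb = max(0.0, total_tb - FREE_TB_PER_MONTH)
    return billable_tb * PRICE_PER_TB


if __name__ == "__main__":
    # Hypothetical month: three queries that scan 0.5, 1.5, and 2.0 TB.
    print(f"${monthly_query_cost([0.5, 1.5, 2.0]):.2f}")
```

Because billing follows bytes processed rather than rows returned, selecting fewer columns lowers the inputs to this calculation, while adding a LIMIT does not.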

Resources for controlling query costs

BigQuery offers a number of features to help control query costs; see the BigQuery documentation on controlling costs.

BigQuery data transfer costs

BigQuery does not include explicit network data transfer charges; however, BigQuery limits the amount of data you can transfer. A query response has a maximum size of 10 GB compressed.

Helpful hint: When issuing a query that returns a large amount of data, write the results to another BigQuery table or a GCS bucket.

Controlling storage costs (large data)

While many life sciences projects commit time and energy to optimizing their data-processing workflows, long-term storage costs often dominate the budget. The reason for the high storage costs is the huge amount of data generated in the life sciences, such as genomic and transcriptomic data. The following sections provide tips for keeping storage costs of large data under control.

1. Use regional storage

For life sciences data, there is rarely a reason to make data available in multiple Google Cloud regions. The cost of regional storage is 77% of that for multiregional storage. The easiest way to save your project 23% is to put your data and compute in a single region.

2. Compress large data

Compression rates vary, but some common options are:

  • Compress STAR-generated BAMs (and index them) with samtools (discussed above).
  • Convert WGS BAMs to CRAMs (and index them) with samtools.
  • Compress VCFs with bgzip (and index them with tabix).

3. Move data to cold storage (Nearline or Coldline)

Deciding whether you can move large files to cold storage can be tricky. If you move files that are accessed frequently, the access charges can wipe away the storage savings. Note that Terra workspace buckets have autoclass enabled by default. Autoclass automatically transitions objects in your bucket to appropriate storage classes based on each object's access pattern. The feature moves data that is not accessed to colder storage classes to reduce storage cost and moves data that is accessed to Standard storage to optimize future accesses.

However, much life science data goes through a life cycle of

  1. Source data is generated.
  2. Source data is processed into smaller summary information.
  3. Summary information is used extensively.
  4. Source data is used rarely.

FASTQ files for genomics and transcriptomics fit this model and are large. Moving these files to Nearline after initial processing can save a project a lot of money on its largest data.
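One way to judge whether a move to cold storage pays off is to compute how often the data can be read back before retrieval charges cancel the storage savings. The sketch below uses the prices quoted earlier in this article, with regional storage at $0.020 per GB per month as the baseline. The function name is hypothetical, and the calculation ignores minimum storage durations and operation fees.

```python
def break_even_reads_per_month(
    baseline_per_gb: float, cold_per_gb: float, retrieval_per_gb: float
) -> float:
    """Full reads of the data per month at which retrieval charges equal the storage savings."""
    monthly_saving_per_gb = baseline_per_gb - cold_per_gb
    return monthly_saving_per_gb / retrieval_per_gb


if __name__ == "__main__":
    regional = 0.020  # USD per GB per month, low end of the regional range
    nearline = break_even_reads_per_month(regional, 0.010, 0.01)
    coldline = break_even_reads_per_month(regional, 0.004, 0.05)
    print(f"Nearline breaks even at {nearline:.2f} full reads per month")
    print(f"Coldline breaks even at {coldline:.2f} full reads per month")
```

If you expect to read the files more often than the break-even figure, the colder class will likely cost more than regional storage.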

4. Clean up intermediate files promptly

WDL-based workflows on large files, such as FASTQs, BAMs, and gVCFs, often have intermediate stages where large files are sharded or converted to different formats, creating many artifacts that get stored in Google Cloud Storage. Leaving these files in Cloud Storage can result in significant costs associated with running workflows. If the workflow succeeds, clean up the intermediate files, especially the large ones.

Controlling compute costs

Many people in the life sciences are familiar with working in an HPC environment. In this case, they have a compute cluster available to them. This cluster is typically a modest fixed size and is often shared with other researchers and departments. The primary driver for computation is toward having jobs finish quickly and minimizing compute resources (CPUs, memory, and disk).

In this environment, suppose you have 1,000 samples to process, each sample takes a day, and you have computing capacity to run 100 samples concurrently. Processing will then finish in 10 days (if all goes well). If you can reduce the time to process a single sample by 30%, you'll finish your processing in a week.
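The fixed-cluster arithmetic in this example can be written out as a short sketch. It assumes samples run in full batches, one batch after another, with no failures or queueing delays. The function and parameter names are illustrative.

```python
import math


def wall_clock_days(samples: int, concurrent_slots: int, days_per_sample: float) -> float:
    """Days needed to process all samples when only `concurrent_slots` run at once."""
    batches = math.ceil(samples / concurrent_slots)
    return batches * days_per_sample


if __name__ == "__main__":
    print(wall_clock_days(1_000, 100, 1.0))  # baseline runtime per sample
    print(wall_clock_days(1_000, 100, 0.7))  # per-sample runtime reduced by 30%
```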

With cloud computing, generally, you are not constrained by resources in the same way. If you want to run 1,000 samples concurrently, you can do that (just be sure to request more Compute Engine Quota; if working in Terra, see this article). Reducing runtimes and compute resources will save you money, but you have other money-saving knobs to turn, notably preemptible VMs. Life science workflow runners, like Cromwell, are designed to take advantage of preemptible VMs.

To save on compute costs, approach optimization in the following order:

  1. Use preemptible VMs.
  2. Reduce the number of CPUs (they are the most expensive resource).
  3. Reduce the amount of memory (add monitoring to your workflows).
  4. Reduce the amount of disk used (add monitoring to your workflows).

Below are some specific suggestions around preemptible VMs and monitoring.

1. Use Preemptible VMs

Cromwell can use preemptible VMs, and for each task you can set a number of automatic retries before falling back to a full-priced VM.

Some additional details to know about using preemptible VMs

  • Smaller VMs are less likely to be preempted than large VMs.
  • Preemption rates are lower during nights and weekends.
  • IO-bound workflows may benefit from using Local SSDs on preemptible instances.
  • Preemptions tend to happen early in a VM's lifetime.

This last bullet point is important to understand. It is explained further in Google's documentation:

Generally, Compute Engine avoids preempting too many instances from a single customer and will preempt instances that were launched most recently. In the long run, this strategy helps minimize lost work across your cluster. Compute Engine does not charge you for instances if they are preempted in the first minute after they start running. 

So while running on a preemptible VM and getting preempted adds cost overhead (cutting into your savings), such preemptions tend to happen early and the additional cost is modest.

2. Monitor peak use

It is difficult to save on CPUs, memory, and disks if you don't know your peak usage while workflows are running. Adding a little bit of monitoring can go a long way to help understand these usage requirements.
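As one possible starting point, the sketch below polls disk usage and the system load average and keeps the highest value seen for each. It is not a Terra or Cromwell feature: the function names are hypothetical, it uses only the Python standard library, `os.getloadavg()` works on Unix-like systems only, and it omits memory because the standard library has no portable call for system-wide memory use.

```python
import os
import shutil
import time


def sample_usage(path: str = "/") -> dict:
    """Take one reading of disk used (GB) and the 1-minute load average."""
    disk = shutil.disk_usage(path)
    load_1min, _, _ = os.getloadavg()  # Unix only
    return {"disk_used_gb": disk.used / 1e9, "load_1min": load_1min}


def update_peaks(peaks: dict, sample: dict) -> dict:
    """Return a copy of `peaks` holding the larger value seen for each metric."""
    merged = dict(peaks)
    for key, value in sample.items():
        merged[key] = max(merged.get(key, value), value)
    return merged


def monitor(duration_s: float, interval_s: float = 5.0, path: str = "/") -> dict:
    """Poll usage until `duration_s` has elapsed and return the peak values."""
    peaks: dict = {}
    deadline = time.monotonic() + duration_s
    while True:
        peaks = update_peaks(peaks, sample_usage(path))
        if time.monotonic() >= deadline:
            break
        time.sleep(interval_s)
    return peaks


if __name__ == "__main__":
    print(monitor(duration_s=10, interval_s=2))
```

In a workflow task, a script like this would typically run in the background alongside the main command, with the peaks written to the task log when the command finishes.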

Observations you may make about a workflow stage, once you add monitoring:

This workflow stage for the largest sample uses

  • about the same <cpu, memory, disk> as the smallest sample, or
  • much more <cpu, memory, disk> than the smallest sample.

With this information, you can decide whether it is worthwhile to adjust cpu, memory, or disk on a per-sample basis.

You might also observe:

  • This workflow stage runs a sequence of commands, and the disk usage never goes down. If you clean up intermediate files while running, you can allocate less disk space for each workflow.
  • This workflow stage runs a sequence of commands; some are multithreaded and take advantage of more CPUs, and some commands are single-threaded. If you make this a multistage workflow, you can use a single CPU VM for some steps and reduce total CPU cost.
  • This workflow runs on an n1-standard machine, but it never uses all of the memory. You can change to an n1-highcpu machine (or a custom VM).

Controlling data transfer costs

Use preemptible VMs to copy or move from a multiregional bucket to a regional bucket.

More details on controlling data transfer costs

Moving data from a multiregional bucket to a regional bucket incurs data transfer charges at a rate of $0.01/GB.

For example, this means that moving 100 TB of data from a Terra workspace bucket (single region US) to your own multi-regional bucket will cost $1,000.

Suppose that the 100 TB of data is made up of one thousand 100 GB files. You could create a workflow on Terra that runs 1,000 concurrent n1-standard-1 preemptible VMs, each with a 200 GB disk, to:

  • Copy file from multiregional bucket to VM
  • Copy file from VM to Regional bucket
  • Remove the file from the multiregional bucket

Each VM + disk would cost approximately $0.02 per hour and would finish in less than 1 hour. Your cost for transfer is thus on the order of $20.
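The arithmetic behind the two figures above can be checked with a short sketch. The rates are the ones quoted in this section, 1 TB is taken as 1,000 GB, and the one-hour runtime is used as an upper bound. The function names are illustrative.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def direct_transfer_cost(total_tb: float, price_per_gb: float = 0.01) -> float:
    """Data transfer charge in USD at the per-GB rate quoted in this section."""
    return total_tb * GB_PER_TB * price_per_gb


def vm_fleet_cost(num_vms: int, hourly_rate: float = 0.02, hours: float = 1.0) -> float:
    """Compute cost in USD for `num_vms` VMs (with disks) running for `hours`."""
    return num_vms * hourly_rate * hours


if __name__ == "__main__":
    print(f"Transfer charge: ${direct_transfer_cost(100):,.2f}")
    print(f"VM fleet:        ${vm_fleet_cost(1_000):,.2f}")
```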

To learn more, see Controlling Cloud costs - sample use cases.


Comments

14 comments

  • Comment author
    Brendan Reardon

    Allie Hajian, this is an amazing article and should be required reading for any new users on Terra. Well done! To that end, do you know if there is any word of Terra supporting many of the wonderful features that you mentioned such as regional / archival storage or intermediate file clean up? 

  • Comment author
    Allie Hajian

    Brendan Reardon Thanks for the positive feedback! Note that this content came from our amazing Verily partner, Matt Bookman. I followed up on your questions; the good news is that the two features you mention are definitely on the Terra team's radar: in-app functionality for intermediate file cleanup is in active development (2020 Q1), and regional/archival storage is on the lower priority list (late 2020 at the earliest). For breaking news about when things happen, make sure to check in weekly with release notes. Sometimes changes happen so quickly even the comms team feel like we're playing catch up with new features!

  • Comment author
    lck

    I'm struggling a bit with understanding and controlling our costs -- starting in April, a big chunk of our costs (~20%) began to go to "logs ingestion." I would love some advice about how I can control this. I don't have the proper permissions to create logging exclusions, apparently, and I'm having a hard time figuring out what to do. This wasn't in our estimated budget for the project I'm working on, which is causing some trouble. Would appreciate any help or pointers to support resources for someone who is definitely NOT a Google Cloud expert! Thank you.

  • Comment author
    Allie Hajian

    lck This sounds frustrating! Your best bet is to email support@terra.bio and open a support ticket for this. 

  • Comment author
    Leonhard Gruenschloss

    Is there any way to select the cloud region that Terra is using to bring up GCE instances? It might be worth mentioning in case that's hardcoded to be us-central1. You mention the GCS egress costs, but those are hard to predict without knowing how those transfers will happen.

    (I just got surprised to find out that 300 GiB of egress between Australia and the US cost $50 for a non-workspace bucket I used for input data.)

  • Comment author
    Jason Cerrato

    Hi Leonhard Gruenschloss,

    The default region for workflows if none is specified is indeed us-central1. You can read more on the defaults and setting a specific zone for your runtime here: https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/#runtime-attribute-descriptions

    For additional information, Terra workspace buckets are located in multiple regions in the United States. You should be able to run workflows in non-US regions, but you would have to take any associated charges writing from/to the US bucket into consideration.

    I hope this helps!

    Kind regards,

    Jason

  • Comment author
    Leonhard Gruenschloss

    Thanks a lot, Jason, that's very helpful!

  • Comment author
    Mark Godek
    • Edited

    Hi, I'm interested in reducing egress costs by using preemptible VMs.

    How is it so cheap to move the files using a VM? Aren't there egress charges when moving the files on to the VM, if the data is in US multi-region and the VM is in US-central1?

    Edit: I've read a little more and ingress costs are free, which is why using the VM to transfer the data is free. Is my understanding accurate?

    Thanks.

  • Comment author
    Allie Cliffe

    Mark Godek Egress costs are separate from compute costs. The cost savings from using preemptible VMs comes from the run-time fees (they cost up to 80% less to run). You will pay less for the time the VM doing the transfer is running by using preemptibles, but the egress costs are fixed. 

    Currently there are no GCP egress costs when moving files from a US-multi-region bucket to a regional (i.e. us-central1) VM. However, GCP pricing will be changing in October (see Google's announcement of the changes here). At that time, it will cost $0.02/GB to transfer data between us-multi-region and a particular region. 

    To minimize egress costs, Terra is currently shifting all default workspace storage from multi-regional to us-central1 (transfer between the same region will remain free). The default regions for all VMs will also be us-central1. To learn more about minimizing data egress costs, see Data Submitters Resources in Terra Support. Hopefully this helps with your question!

  • Comment author
    Mark Godek

    Thanks for the advice.

    I think it's really great Terra is moving the default storage location to US-Central1. Our lab has been using the default US multi-region for years and I'm trying to migrate as much as I can to US-central1 before October.

  • Comment author
    Jason Cerrato

    Hey Mark Godek,

    Just so you know, we are working on a plan to migrate all existing multi-regional workspace buckets to single-region workspace buckets (with an option to opt out) prior to the price change. As such, you may not necessarily need to migrate the data yourself—Terra Team will be reaching out sometime in the next couple of months with more details about the migration plan.

    Kind regards,

    Jason

  • Comment author
    Mark Godek

    Thanks Jason, that's really useful information. I'll keep it in mind while implementing my lab's new data plan.

  • Comment author
    Mark Godek

    Jason Cerrato I was wondering if there was an update on the migration plan since October 1st is a couple weeks away. I don't recall seeing anything in the Terra newsletter. Thanks.

  • Comment author
    Jason Cerrato

    Hey Mark Godek,

    Thanks for checking in on this! I've received word that the Broad Institute has an 18 month grace period for the pricing change. If your Billing Account is through Broad, you should not see any change in your expenses related to egress.

    Many other institutions who have Google Billing Accounts in use for Terra also have this grace period, but you can contact your institutional Google representative if you aren't sure. The migration is not yet scheduled as the team is focused on the current migration of older workspaces to the current project-per-workspace model. We will definitely send out communications about any migrations to come in advance of them actually taking place.

    If you have any questions about this, please let me know!

    Kind regards,

    Jason

