The Terra platform is free to use. However, operations in Terra, such as running workflows, running Jupyter Notebooks, and accessing and storing data, may incur Google Cloud charges. These are billed directly to your Terra billing account.
This document provides information on key cloud services (Google Cloud Storage, Google Compute Engine, and Google BigQuery) and examples to help you make informed decisions around controlling costs on Terra. For up-to-date billing information, see the documentation for GCP Pricing.
Cloud costs fall into the following general categories:
- Compute and disks
- Query processing
- Data egress
- Data retrieval
Common use cases that would incur charges on Terra include:
- Running a workflow
- Converting a CRAM to a BAM
- Aligning a genomic sample to a reference and performing variant calling using GATK Best Practices WDLs
- Aligning a transcriptomic sample to a reference using STAR
- Running a notebook
- Performing quality control checks on genomic data
- Analyzing genomic variants
- Visualization and analysis (in R or Python) on outputs from running a workflow
- Storing files in Cloud Storage (Google buckets or BigQuery)
- Clinical Data
- Genomic Data
- Transcriptomics Data
Content for this article was contributed by Matt Bookman from Verily Life Sciences based on work done in Terra for AMP PD, a public/private partnership collaborating toward biomarker discovery to advance the development of Parkinson’s Disease therapies.
- Storage costs
- Storage classes
- Egress Costs
- Retrieval costs
- Compute and disks
- Compute costs
- Disk costs
- Egress costs
- Storage costs
- Egress costs
- Query costs
- Controlling storage costs
- Controlling compute costs
- Controlling egress costs
Google Cloud Storage
Google Cloud Storage (GCS) is an "object store" where "objects" are stored in "buckets". You can think of it as a place to store files in a structure similar to "folders" or "directories". For more details, see How subdirectories work.
Storage has a cost associated with it. Additionally, when you access data in GCS, you will want to consider where the data will be accessed from, as moving data out of GCS may incur charges.
The primary drivers that affect storage costs are:
- How much are you storing?
- Where are you storing it?
- How frequently is it accessed?
How much are you storing?
A key cloud concept is that you only pay for what you use. Thus you don't need to pre-allocate storage in GCS (like buying an array of disks); you simply pay for what you store.
Where are you storing it?
Google Cloud Storage provides several different storage classes, each with different pricing. The options are primarily based around where you want to store the data and how frequently you will access the data.
Storing data in more locations (multiple "regions") is more expensive than storing data in fewer locations ("regional"). For more information, you can read about Google Cloud regions and bucket locations.
How frequently are you accessing it?
Data that you need to access most frequently should be stored in a more expensive storage tier. Data that you will access infrequently can be stored in less expensive "cold" storage.
Multi-regional versus regional storage
Multi-Regional storage is most appropriate for data that needs to be accessed quickly and frequently from many locations (for a web site or for gaming, for example).
This is not typically the case for genomic or transcriptomic research data. With these data types, overall access frequency is low and emphasis is on managing storage costs.
Nearline and Coldline Storage
For data that will be accessed very infrequently, Google Cloud offers Nearline and Coldline storage. These storage classes offer significantly reduced costs for storage ($0.010 per GB per month for Nearline and $0.007 per GB per month for Coldline), but add a retrieval charge ($0.01 per GB for Nearline and $0.05 per GB for Coldline).
These storage classes are most appropriate for archive data, for example, after processing FASTQs into BAMs or CRAMs.
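To see how the lower storage price trades off against the retrieval charge, here is a minimal sketch that uses only the prices quoted above. The function name and the 10 TB example are illustrative assumptions, and the prices may be out of date, so check current GCP pricing before relying on the numbers.

```python
# Prices quoted in this article (USD); they may no longer be current.
STORAGE_PER_GB_MONTH = {"nearline": 0.010, "coldline": 0.007}
RETRIEVAL_PER_GB = {"nearline": 0.01, "coldline": 0.05}


def cold_storage_cost(storage_class, size_gb, months, full_reads):
    """Storage plus retrieval cost for keeping size_gb for `months`,
    reading the whole data set back `full_reads` times in that period."""
    storage = STORAGE_PER_GB_MONTH[storage_class] * size_gb * months
    retrieval = RETRIEVAL_PER_GB[storage_class] * size_gb * full_reads
    return storage + retrieval


# Hypothetical example: 10 TB of archived FASTQs kept for a year, read back once.
for storage_class in ("nearline", "coldline"):
    cost = cold_storage_cost(storage_class, size_gb=10_000, months=12, full_reads=1)
    print(f"{storage_class}: ${cost:,.2f}")
```

With these figures, a single full read in the year is enough to make Coldline slightly more expensive than Nearline, which is why expected access frequency matters when picking a class.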
Egress charges apply when copying GCS data out of the region(s) that the data is stored in. For example:
- Downloading data to your workstation or laptop
- Copying data stored in one region to a compute engine VM in another region
- Copying data stored in one region to a GCS bucket in another region
- Copying data stored in a multi-region bucket to a regional GCS bucket
Network egress charges vary, but within the United States, the cost is typically $0.01 per GB.
Accessing GCS data from within the same Cloud region where the data are stored incurs no egress charges.
Retrieval costs apply only to the "cold storage" classes, Nearline and Coldline. Note that retrieval applies to:
- Copying data from a cold storage bucket
- Moving data within a cold storage bucket (a move is a copy followed by a deletion)
Google Compute Engine
Google Compute Engine (GCE) provides virtual machines (VMs) and block storage (disks) which can be used for running analyses such as converting a CRAM file to a BAM file or running a Jupyter Notebook to transform and visualize data.
Compute and disks concepts
GCE allows you to create and destroy VMs as you need them. You can create VMs of different shapes and sizes (CPU and memory) for different workloads.
GCE follows the cloud philosophy that you only pay for what you use: you are billed for VMs and disks only from the time you create them to the time you destroy them. To be clear, however, you are "using" (i.e. building up charges for) your CPU, memory, and disk space while your VM is running, even if it is sitting idle.
GCE's virtualization offers additional flexibility in that you can "stop" a running VM (at which point you stop being charged for the CPU and memory, but continue accruing charges for the disk) and "start" it again later. You can even change the amount of CPU and memory when you restart the VM.
GCE offers significantly reduced costs for using preemptible VMs. If you have a workflow that will run in fewer than 24 hours, you can save up to 80% by using preemptible VMs.
Detailing GCE pricing flexibility is beyond the scope of this document. You are encouraged to get pricing details from the GCE Pricing documentation. Your main questions to ask are:
- How many CPUs does my compute task require?
- How much memory does my compute task require?
- How much disk does my compute task require?
- Can my compute task finish in fewer than 24 hours?
If your compute need is for a long running compute node, you should use a "full priced VM", since a preemptible VM lasts at most 24 hours. If your compute need is for fewer than 24 hours, and you can manage the complexity of preemption at any time within that 24 hours, a preemptible VM will cost almost 80% less. For more information, see the documentation on Preemption selection.
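The decision described above can be sketched as a small helper. This is a simplified illustration: the function name and the hourly price in the example are hypothetical, the 80% discount is the upper bound quoted in this article, and the sketch ignores the cost of work lost to preemption.

```python
PREEMPTIBLE_MAX_HOURS = 24    # a preemptible VM lasts at most 24 hours
PREEMPTIBLE_DISCOUNT = 0.80   # "up to 80%" savings quoted in this article


def estimate_vm_cost(hours, full_price_per_hour, tolerates_preemption):
    """Return (vm_kind, estimated_cost) for a single compute task."""
    if hours < PREEMPTIBLE_MAX_HOURS and tolerates_preemption:
        rate = full_price_per_hour * (1 - PREEMPTIBLE_DISCOUNT)
        return "preemptible", hours * rate
    return "full-priced", hours * full_price_per_hour


# Hypothetical full price of $0.05 per hour.
print(estimate_vm_cost(6, 0.05, tolerates_preemption=True))
print(estimate_vm_cost(48, 0.05, tolerates_preemption=True))
```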
GCE offers a range of disk types, including:
- Network-attached magnetic disks (persistent disk standard)
- Network-attached solid state disks (persistent disk SSD)
- Locally-attached solid state disks (local SSD)
In general, you pay more for larger disks and for higher-performance disks. Most life sciences workflows are not I/O bound and so the least expensive disk (Persistent Disk Standard) is typically the best choice. If your workflow is I/O bound, however, you may find that using Local SSDs on a preemptible instance is the best choice.
Egress charges apply when copying data out of the zone that a compute engine VM is running in. For example:
- Downloading data to your workstation or laptop
- Copying data from a VM in one zone to a VM in another zone
- Copying data from a VM in one region to a GCS bucket in another region
No egress charges accrue for data copied between VMs in the same zone, or between a VM and a GCS bucket in the same region.
Google BigQuery (BQ) is a database where "tables" are stored in "datasets," including both tabular data and nested data. You can issue SQL queries to filter and retrieve data in BigQuery.
Storage has a cost associated with it. When you query data in BigQuery, you want to consider just how much data your query "touches," as BigQuery query billing is based on the amount of data that the query engine "looks at" to satisfy the request.
BigQuery storage costs are $0.02 per GB per month for the first 90 days after table creation and $0.01 per GB per month from then on.
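As a rough illustration of the two storage rates, the sketch below approximates the 90-day window as three billing months and assumes the table is left untouched after creation. The function name and the 1 TB example are hypothetical, and the prices are the ones quoted in this article.

```python
ACTIVE_PRICE_PER_GB_MONTH = 0.02     # first 90 days
LONG_TERM_PRICE_PER_GB_MONTH = 0.01  # after 90 days
ACTIVE_MONTHS = 3                    # assumption: 90 days treated as 3 months


def bigquery_storage_cost(size_gb, months):
    """Approximate cost of storing a table of size_gb for `months` months."""
    active = min(months, ACTIVE_MONTHS)
    long_term = max(months - ACTIVE_MONTHS, 0)
    per_gb = (ACTIVE_PRICE_PER_GB_MONTH * active
              + LONG_TERM_PRICE_PER_GB_MONTH * long_term)
    return size_gb * per_gb


# Hypothetical 1 TB table kept for one year.
print(f"${bigquery_storage_cost(1000, 12):,.2f}")
```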
When you run a query, you're charged according to the number of bytes processed in the columns you select or filter on, even if you set an explicit limit on the number of records returned. This means that you want to be careful about which columns you put in your SELECT lists and WHERE clauses.
BigQuery query costs are $5.00 per TB, with the first 1 TB per month free.
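The effect of column selection on query cost can be sketched with a toy model. The table, its column sizes, and the function names below are hypothetical; the model bills the full size of every column referenced, ignores any savings from partitioned or clustered tables, and uses the price and free tier quoted above.

```python
PRICE_PER_TB = 5.00       # query price quoted in this article
FREE_TB_PER_MONTH = 1.0   # first 1 TB per month is free

# Hypothetical table: stored size of each column, in GB.
COLUMN_SIZES_GB = {
    "sample_id": 2,
    "chromosome": 1,
    "position": 8,
    "genotype": 400,
    "annotations": 1500,
}


def gb_scanned(referenced_columns, column_sizes_gb=COLUMN_SIZES_GB):
    """GB processed by one query: every column named in SELECT or WHERE
    is read in full, no matter what LIMIT is applied."""
    return sum(column_sizes_gb[name] for name in set(referenced_columns))


def monthly_query_cost(gb_per_query, queries_per_month):
    """Monthly charge after subtracting the free tier (1 TB = 1000 GB here)."""
    total_tb = gb_per_query * queries_per_month / 1000
    billable_tb = max(total_tb - FREE_TB_PER_MONTH, 0)
    return billable_tb * PRICE_PER_TB


narrow = gb_scanned(["sample_id", "chromosome", "position"])
wide = gb_scanned(COLUMN_SIZES_GB)  # equivalent to SELECT *
print(f"narrow: {narrow} GB per query, ${monthly_query_cost(narrow, 100):,.2f} per month")
print(f"wide:   {wide} GB per query, ${monthly_query_cost(wide, 100):,.2f} per month")
```

In this toy example, running the same 100 queries per month costs well under a dollar when only the small columns are referenced, and several hundred dollars when every column is selected.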
Resources for controlling query costs
BigQuery offers a number of features to help control query costs. See:
- BigQuery best practices: Controlling costs
- Estimating query costs
- Partitioned Tables
- Clustered Tables
BigQuery does not include explicit network egress charges. However, BigQuery limits the amount of data that a query can return: the maximum response size is 10 GB compressed.
Helpful hint: When issuing a query that returns a large amount of data, you may write the results to another BigQuery table or to a GCS bucket.
General cost control advice
The following is general advice for controlling costs when using Google Cloud for typical life sciences work. These "quick tips" are explained in more detail below.
To control storage costs
Keep storage costs of large data under control
- Use Regional storage
- Compress large data
- Move data to cold storage (Nearline or Coldline)
- Clean up workflow intermediate files promptly
To control compute costs
- Use preemptible VMs (saves up to 80%)
- Use fewer cores
- Use less memory
- Use less disk
- Monitor your workflows
To control egress costs
- Use preemptible VMs to copy or move from a multi-regional bucket to a regional bucket
Controlling storage costs
While many life sciences projects commit a lot of time and energy to optimizing their data processing workflows, it is often long-term storage costs that dominate the budget. The reason for the high storage costs is the huge amount of data generated in the life sciences, such as genomic and transcriptomic data. The following sections provide tips for keeping storage costs of large data under control.
Use regional storage
For life sciences data, there is rarely a reason to make the data available in multiple Google Cloud regions. The cost of regional storage is 77% of that for multi-regional storage. The easiest way to save your project 23% is to put your data and compute in a single region.
Compress large data
Compression rates vary, but some common options are:
- Compress STAR-generated BAMs (and index them) with samtools (discussed above)
- Convert WGS BAMs to CRAMs (and index them) with samtools
- Compress VCFs with bgzip (and index them with tabix)
Move data to cold storage (Nearline or Coldline)
Determining whether you can move large files to cold storage can be tricky. If you move files that are accessed frequently, the retrieval charges can wipe out the storage savings. However, much life science data goes through a life cycle of:
- Source data is generated
- Source data is processed into smaller summary information
- Summary information is used extensively
- Source data is used rarely
FASTQ files for genomics and transcriptomics fit this model and are large. Moving these files to Nearline after initial processing can save a project a lot of money on its largest data.
Clean up workflow intermediate files promptly
WDL-based workflows on large files, such as FASTQs, BAMs, and gVCFs, often have intermediate stages where large files are sharded or converted to different formats, creating many artifacts that get stored in Google Cloud Storage. Leaving these files in Cloud Storage can add significantly to the cost of running workflows. If the workflow succeeds, clean up the intermediate files, especially the large ones.
Controlling compute costs
Many people in the life sciences are familiar with working in an HPC environment, where they have a compute cluster available to them. This cluster is typically a modest fixed size and is often shared with other researchers and departments. The primary goals are to have jobs finish quickly and to minimize the compute resources (CPUs, memory, and disk) that each job uses.
In this environment, if you have 1,000 samples to process, each sample takes a day to process, and the cluster can run 100 samples concurrently, then processing will finish in 10 days (if all goes well). If you can reduce the time to process a single sample by 30%, you'll finish your processing in a week.
With cloud computing, you are generally not constrained by resources in the same way. If you want to run 1,000 samples concurrently, you can generally do that (just be sure to request more Compute Engine Quota; if working in Terra, see this article). Reducing runtimes and compute resources will save you money, but you have other knobs to turn to save money, notably preemptible VMs. Life science workflow runners, like Cromwell, are designed to take advantage of preemptible VMs.
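The arithmetic behind the HPC and cloud comparison can be sketched as follows. The function names are illustrative, and the model assumes samples run in equal-length waves with no failures or queueing delays.

```python
import math


def wall_clock_days(samples, concurrent_slots, days_per_sample):
    """Days to finish when samples run in waves of `concurrent_slots`."""
    waves = math.ceil(samples / concurrent_slots)
    return waves * days_per_sample


def total_vm_days(samples, days_per_sample):
    """Total machine time consumed, independent of concurrency."""
    return samples * days_per_sample


print(wall_clock_days(1000, 100, 1.0))    # fixed cluster, 100 samples at a time
print(wall_clock_days(1000, 100, 0.7))    # same cluster, 30% faster per sample
print(wall_clock_days(1000, 1000, 1.0))   # cloud, all samples at once
print(total_vm_days(1000, 1.0))           # machine time at one day per sample
```

Running everything at once shortens the wall-clock time, but the total machine time you pay for does not change with concurrency. That is why cost savings in the cloud come from cheaper or smaller VMs rather than from fitting into a fixed cluster.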
As a general approach to saving on compute costs, approach optimization in the following order:
- Use preemptible VMs
- Reduce the number of CPUs (they are the most expensive resource)
- Reduce the amount of memory (add monitoring to your workflows)
- Reduce the amount of disk used (add monitoring to your workflows)
Below are some specific suggestions around preemptible VMs and monitoring.
Cromwell has the ability to use preemptible VMs and for each task, you can set a number of automatic retries before falling back to a full priced VM. Some additional details to know about using preemptible VMs:
- Smaller VMs are less likely to be preempted than large VMs
- Preemption rates are lower during nights and weekends
- IO-bound workflows may benefit from using Local SSDs on preemptible instances
- Preemptions tend to happen early in a VM's lifetime
This last bullet point is important to understand. It is explained further in Google's documentation:
Generally, Compute Engine avoids preempting too many instances from a single customer and will preempt instances that were launched most recently. This might be a bit frustrating at first, but in the long run, this strategy helps minimize lost work across your cluster. Compute Engine does not charge you for instances if they are preempted in the first minute after they start running.
So while running on a preemptible VM and getting preempted adds cost overhead (cutting into your savings), such preemptions tend to happen early and the additional cost is modest.
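A rough model shows why early preemptions cost little. It assumes the 80% discount quoted earlier, that the task eventually completes on a preemptible VM, and that attempts preempted within the first minute are free, as the Google documentation quoted above states. The function name and the example numbers are hypothetical.

```python
FIRST_MINUTE_HOURS = 1 / 60  # attempts preempted within the first minute are not charged


def preemptible_run_cost(task_hours, lost_hours_per_preemption,
                         full_price_per_hour, discount=0.80):
    """Cost of a task on preemptible VMs, including time lost to preemptions.

    lost_hours_per_preemption lists how long each preempted attempt ran
    before it was stopped; the final, successful attempt runs task_hours.
    """
    rate = full_price_per_hour * (1 - discount)
    billable_lost = sum(h for h in lost_hours_per_preemption
                        if h > FIRST_MINUTE_HOURS)
    return rate * (task_hours + billable_lost)


# Hypothetical 10-hour task at $1.00 per hour full price, preempted three times.
with_preemptions = preemptible_run_cost(10, [0.01, 0.5, 1.0], 1.00)
full_price = 10 * 1.00
print(f"preemptible: ${with_preemptions:.2f}  full-priced: ${full_price:.2f}")
```

Under these assumptions, the lost work would have to add up to about four times the task's own runtime before the preemptible run cost as much as a full-priced one.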
It is difficult to save on CPUs, memory, and disk if you don't know your peak usage while workflows are running. Adding a little bit of monitoring can go a long way to help understand these usage requirements.
Observations you may make about a workflow stage, once you have added monitoring:
This workflow stage, for the largest sample, uses:
- about the same <cpu, memory, disk> as for the smallest sample
- much more <cpu, memory, disk> than for the smallest sample
With this information, you can decide whether it is worthwhile to adjust cpu, memory, or disk on a per-sample basis.
You might also observe:
- This workflow stage runs a sequence of commands and the disk usage never goes down. If we cleaned up intermediate files while running, we could allocate less disk for each workflow.
- This workflow stage runs a sequence of commands; some are multi-threaded and take advantage of more CPUs and some commands are single-threaded. If we made this a multi-stage workflow, we could use a single CPU VM for some steps and reduce total CPU cost.
- This workflow runs on an n1-standard machine, but we never use all of the memory. We could change to an n1-highcpu machine (or a custom VM).
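Once monitoring gives you a peak memory figure, the last observation can be turned into a simple check. This sketch assumes the memory-per-vCPU ratios published for the N1 machine families (verify them against current GCE documentation) and a hypothetical 20% headroom; the function name is illustrative. Note that the n1-highcpu family starts at 2 vCPUs.

```python
# Assumed memory per vCPU (GB) for predefined N1 machine families.
N1_GB_PER_VCPU = {"n1-highcpu": 0.9, "n1-standard": 3.75, "n1-highmem": 6.5}


def suggest_n1_family(vcpus, peak_memory_gb, headroom=1.2):
    """Return the leanest N1 family whose memory covers the observed peak.

    headroom is a hypothetical safety margin applied to the observed peak.
    Returns "custom" when no predefined family with `vcpus` has enough memory.
    """
    needed_gb = peak_memory_gb * headroom
    by_memory = sorted(N1_GB_PER_VCPU.items(), key=lambda item: item[1])
    for family, gb_per_vcpu in by_memory:
        if gb_per_vcpu * vcpus >= needed_gb:
            return family
    return "custom"


print(suggest_n1_family(vcpus=4, peak_memory_gb=2.5))
print(suggest_n1_family(vcpus=4, peak_memory_gb=10))
print(suggest_n1_family(vcpus=4, peak_memory_gb=30))
```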
Controlling egress costs
Moving data from a multi-regional bucket to a regional bucket incurs egress charges at a rate of $0.01/GB. This means, for example, that moving 100 TB of data from a Terra workspace bucket (multi-regional US) to your own regional bucket will cost $1,000.
Suppose that 100 TB of data is made up of one thousand 100 GB files. You could create a workflow on Terra that runs 1000 concurrent n1-standard-1 preemptible VMs, each with a 200 GB disk to:
- Copy file from multi-regional bucket to VM
- Copy file from VM to Regional bucket
- Remove the file from the multi-regional bucket
Each VM + disk would cost approximately $0.02 per hour and would finish in less than 1 hour. Your cost for transfer is thus on the order of $20.
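The two estimates in this section can be reproduced with a few lines of arithmetic. The function names are illustrative, the rates are the approximate figures quoted above, and the one-hour runtime per file is the upper bound from the example.

```python
EGRESS_PER_GB = 0.01          # multi-regional to regional egress rate quoted above
VM_AND_DISK_PER_HOUR = 0.02   # approximate cost of one preemptible VM plus disk


def direct_copy_cost(total_gb, egress_per_gb=EGRESS_PER_GB):
    """Cost of the bucket-to-bucket copy at the quoted egress rate."""
    return total_gb * egress_per_gb


def vm_copy_cost(n_files, hours_per_file=1.0, rate=VM_AND_DISK_PER_HOUR):
    """Cost of copying each file through its own preemptible VM."""
    return n_files * hours_per_file * rate


# 100 TB made up of one thousand 100 GB files.
print(f"direct copy: ${direct_copy_cost(1000 * 100):,.2f}")
print(f"via VMs:     ${vm_copy_cost(1000):,.2f}")
```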
For examples of how to control costs in specific use cases, see this article.