Controlling Cloud costs - sample use cases

Allie Hajian

Building on the key Cloud costs described in this article, we can examine some use cases and provide a framework for estimating their costs. Specific costs will vary based on software versions, data sizes, and storage and access locations.

This document provides information on key cloud services (Google Cloud Storage, Google Compute Engine, and Google BigQuery) and examples so that you can make informed decisions about controlling costs on Terra. For up-to-date billing information, see the documentation for GCP Pricing.

Content for this article was contributed by Matt Bookman from Verily Life Sciences based on work done in Terra for AMP PD, a public/private partnership collaborating toward biomarker discovery to advance the development of Parkinson’s Disease therapies.

Running a workflow 

The Terra environment provides a workflow engine called Cromwell. The following discussion assumes that a workflow has been crafted to run using Cromwell. Many workflows in cloud, independent of the workflow engine, will follow a similar model.

A typical single-stage workflow follows the model:

  • Create Compute Engine VM
  • Copy inputs from GCS to the VM
  • Run code on the VM
  • Copy outputs from the VM to GCS
  • Delete the VM

A typical multi-stage workflow is a sequence of single-stage workflows. A more complex multi-stage workflow may run intermediate stages in parallel.

The core cost considerations for running all such workflows are the same:

  • How much compute do you need (CPUs, memory, and disk)?
  • Can you run on preemptible VMs?
  • Can you run your compute nodes in the same region as your data in GCS (to avoid egress charges)?

Whenever possible, run your VMs in the same region as your data. VMs in the same region as the data incur no network egress charges; VMs in a different region do.

Converting a CRAM to a BAM 

A typical 30x WGS CRAM file is about 17.5 GB in size. Let's look at the cost of converting a CRAM file to a BAM file.

For this example, we used a workflow published in Dockstore (seq-format-conversion/CRAM-to-BAM) and available via the help-gatk/Seq-Format-Conversion workspace.

The workflow has two stages:

  1. CramToBamTask (Convert CRAM to BAM)
  2. ValidateSamFile (Validate the BAM)

Let's focus on CramToBamTask. Converting a CRAM to a BAM is a single-threaded operation, and there is likely little to no advantage to allocating more CPUs. There could be advantages to adding more memory, but at certain memory sizes, GCE requires you to add more CPUs, which increases cost.

There is also cost for the disk associated with the VM. For this operation, a 200 GB disk was allocated, which at $0.04 / GB / month is $8 per month or $0.01 per hour.

Let's compare a few configurations:

| Machine | Hourly cost (preemptible / full price) | Runtime | Cost (preemptible VM) | Cost (full-priced VM) |
|---|---|---|---|---|
| 4 CPUs, 15 GB (n1-standard-4) | $0.04 / $0.19 | 4h 37m | $0.051 * 4.62 = $0.24 | $0.20 * 4.62 = $0.92 |
| 1 CPU, 3.75 GB (n1-standard-1) | $0.01 / $0.0475 | 5h 51m | $0.02 * 5.85 = $0.12 | $0.0575 * 5.85 = $0.34 |
| 1 CPU, 6.5 GB (custom) | $0.012 / $0.059 | 5h 26m | $0.022 * 5.43 = $0.12 | $0.069 * 5.43 = $0.37 |

Notice the smallest VM shape had the highest runtime, but the least cost.
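
The per-run figures can be reproduced with a short calculation. The following is a minimal sketch, not an official pricing tool: the function names are hypothetical, the 730-hour month is an assumption, and the rates and runtimes are the ones quoted in this article. Results may differ from the table by about a cent because the table rounds the hourly rates and runtimes before multiplying.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def disk_hourly(size_gb, price_per_gb_month=0.04):
    """Hourly cost of a persistent disk that is billed per GB per month."""
    return size_gb * price_per_gb_month / HOURS_PER_MONTH


def run_cost(vm_hourly, runtime_hours, disk_gb=200):
    """Cost of one run: (VM rate + disk rate) * runtime."""
    return (vm_hourly + disk_hourly(disk_gb)) * runtime_hours


# (preemptible $/hr, full price $/hr, runtime in hours), as quoted in this article
configs = {
    "n1-standard-4": (0.04, 0.19, 4 + 37 / 60),
    "n1-standard-1": (0.01, 0.0475, 5 + 51 / 60),
    "custom 1 CPU, 6.5 GB": (0.012, 0.059, 5 + 26 / 60),
}

costs = {}
for name, (preemptible, full, hours) in configs.items():
    costs[name] = (run_cost(preemptible, hours), run_cost(full, hours))
    print(f"{name}: preemptible ${costs[name][0]:.2f}, "
          f"full price ${costs[name][1]:.2f}")
```

Multiplying any of these per-run figures by the number of samples gives a rough estimate for a larger batch.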

There are many things to highlight from this example:

  • Preemptible VMs make a significant price difference
  • Between CPUs and memory, adding more memory is much less expensive
  • Adding more CPUs can decrease runtimes, but at a significant cost multiple
  • Increasing CPU or memory allocations can reduce the amount spent on disk (by shortening runtime), but disk is the least expensive of the three resource types

Note that at 17.5 GB, downloading the CRAM file to convert to BAM on your own workstation would cost $0.175.

While the total computational costs in this example are all very small, let's be sure to look at what happens when you scale up the number of samples to 1000:

| Operation | Number of Samples | Estimated Cost |
|---|---|---|
| CRAMtoBAM (1 CPU, 3.75 GB, preemptible) | 1000 | $120 |
| CRAMtoBAM (4 CPUs, 15 GB, full price) | 1000 | $920 |
| Download | 1000 | $175 |

Aligning to a reference and variant calling (GATK Best practices)

The Broad Institute has published the five-dollar-genome-analysis-pipeline to Dockstore and made it available in the help-gatk/five-dollar-genome-analysis-pipeline workspace. Read through the workspace description for example costs of running the workflow.

Be aware that the "five dollar genome" is named for the typical amount of Compute Engine charges generated during processing of a 30x WGS sample. As important (if not more important) are the costs associated with file storage!

Long Term Storage

A typical 30x WGS sample produces a 17.5 GB CRAM file and a 6.5 GB gVCF file. Long term storage of these outputs in a Regional bucket ($0.02 / GB / month) would be:

| Number of Samples | Monthly Cost | Annual Cost |
|---|---|---|
| 1 | $0.48 | $5.76 |
| 100 | $48.00 | $576.00 |
| 1000 | $480.00 | $5,760.00 |
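
The storage figures follow from size times price times sample count. As a minimal sketch (the function and constant names are hypothetical; the sizes and the Regional price are the ones quoted in this article):

```python
def storage_cost(size_gb, price_per_gb_month, samples=1):
    """Return (monthly, annual) storage cost in dollars."""
    monthly = size_gb * price_per_gb_month * samples
    return monthly, monthly * 12


CRAM_GB, GVCF_GB = 17.5, 6.5  # typical 30x WGS outputs quoted in this article
REGIONAL = 0.02               # $ / GB / month, as quoted in this article

for n in (1, 100, 1000):
    monthly, annual = storage_cost(CRAM_GB + GVCF_GB, REGIONAL, n)
    print(f"{n:>5} samples: ${monthly:,.2f} / month, ${annual:,.2f} / year")
```

The same function can be reused for other file sizes or storage classes by changing the size and price arguments.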

Short Term Storage

Multi-stage workflows like the "five dollar genome" store intermediate results in Google Cloud Storage. Such workflows store large interim results, such as complete BAM files or shards of FASTQ files. It is very important to clean up the interim results when you are done with the workflow. Inattention to cleaning up this storage can significantly increase your per-sample costs.

Processing a single typical 30x WGS sample can produce more than 300 GB of interim data files. This is more than 12x the size of the final outputs! Storing these interim results for a month (in a Multi-regional bucket at $0.026 / GB / month) would cost:

| Number of Samples | Monthly Cost |
|---|---|
| 1 | $7.80 |
| 100 | $780.00 |
| 1000 | $7,800.00 |

Aligning a transcriptomic sample to a reference using STAR 

The following estimates used STAR version 2.6.1d. Processing cost for most samples was between $1.50 and $2.50. As a single-stage workflow, there were no intermediate results to clean up. There are a few details to take away from this workflow:

  • The per-sample cost was kept lower primarily by using preemptible VMs. Using a full priced VM would be more than 4 times as expensive.
  • Having a separate workflow using samtools to sort and compress the BAM shortened total runtimes and allowed more samples to be processed with preemptible VMs.
  • Having a separate workflow using samtools to sort allowed us to reduce the disk size from 1 TB down to 200 GB.
  • Using a higher compression level (samtools defaults to 6, STAR defaults to 1) can save significantly on long term storage costs.

Compute Analysis

The STAR alignReads workflow used 16 threads (--runThreadN 16), and we used a VM defined as custom (16 vCPUs, 80 GB memory) and (initially) 1 TB of persistent disk. At preemptible rates, this VM is approximately $0.178/hr; at full price, it is approximately $0.844/hr.

Preemptible VMs must finish their work within 24 hours. We observed that large samples and samples with high multi-mapping rates could take 24 hours or more. The impact of this is that a sample that could take 23 hours on a preemptible VM would cost $4.09 for compute, while a sample that took just over 24 hours on a full priced VM would cost $20.26 for compute.
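
The cost cliff at the 24-hour mark can be sketched as follows. This is a simplified model under stated assumptions: it uses the hourly rates quoted in this article, ignores preemptions and retries, and the function name is hypothetical.

```python
PREEMPTIBLE_HOURLY = 0.178    # $/hr, as quoted in this article
FULL_PRICE_HOURLY = 0.844     # $/hr, as quoted in this article
PREEMPTIBLE_LIMIT_HOURS = 24  # preemptible VMs must finish within 24 hours


def star_compute_cost(runtime_hours):
    """Return (cost, vm_type), assuming the run is never preempted.

    Runs that fit under the 24-hour limit use a preemptible VM;
    anything longer has to fall back to a full-priced VM.
    """
    if runtime_hours < PREEMPTIBLE_LIMIT_HOURS:
        return runtime_hours * PREEMPTIBLE_HOURLY, "preemptible"
    return runtime_hours * FULL_PRICE_HOURLY, "full price"


for hours in (23, 24):
    cost, vm_type = star_compute_cost(hours)
    print(f"{hours} h on a {vm_type} VM: ${cost:.2f}")
```

One extra hour of runtime therefore multiplies the compute bill by roughly five, which is why shaving an hour or two off the workflow matters.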

To get more samples to complete in less than 24 hours, we changed the workflow so that STAR would not sort the BAM. This saved between 1 and 2 hours per run.

This did mean that we needed to create another workflow to sort and compress the BAM. However, we already needed to generate a BAM index file, so we let samtools compress and re-index the BAM. We found that we could index BAMs on a small, single-core VM (n1-standard-1) with a 200 GB disk. At preemptible rates of $0.01 per hour for the VM and $0.01 per hour for the disk, these workflows cost pennies to complete for each sample.

Storage Analysis

Note that the STAR default compression level is 1, while the samtools default compression level is 6. Thus when we had samtools recompress the BAM files, we saw a 25% reduction in size. This has a tremendous long term cost benefit. In one dataset we looked at, an average sized RNASeq BAM file was 20 GB (level 6 compression) vs 27 GB (level 1 compression).

The cost of storing these BAMs in Regional storage would be:

| Number of BAMs | Level 1 compression (mo/yr) | Level 6 compression (mo/yr) |
|---|---|---|
| 1 | $0.54 / $6.48 | $0.40 / $4.80 |
| 100 | $54.00 / $648.00 | $40.00 / $480.00 |
| 1000 | $540.00 / $6,480.00 | $400.00 / $4,800.00 |

Running a notebook 

The Terra environment provides the ability to run analyses using Jupyter notebooks. In this section, we look at costs around using the Jupyter notebook service, along with costs for running a couple of example notebooks.

To fully understand the notebook environment on Terra, you can read these articles:

When breaking down the costs for using the notebook service, there are two broad categories to consider:

  • Compute costs
  • Egress and Query costs

Compute costs

Your compute costs are based on the VM that is allocated for you, whether that VM is doing any computation or not. Consider:

  • A GCE VM is created for you when you open your first notebook for editing
    • While the VM is running, you will be charged for the allocated CPUs, memory, and disk
  • You can pause the VM
    • While the VM is paused, you will only be charged for the disk
  • You can delete the VM
    • When the VM is deleted, you will not be charged.

Example runtime costs

By default, your Notebook Runtime allocates a VM with 4 cores, 15 GB of memory (n1-standard-4), and a 500 GB disk.

The costs for this VM and its associated disk are:

  • VM: $0.190 / hour
  • Disk: $0.027 / hour (approximate)

While the VM is running, you will be charged about $0.217 per hour. When your VM is paused, you will be charged $0.027 / hour.

If you were to have a notebook VM running for 20 hours per week, your weekly charges would be:

($0.217 / hr * 20 hr) + ($0.027 / hr * 148 hr) = $8.34

or monthly costs of about $33.34. 
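
The weekly estimate can be generalized to other usage patterns. This is a minimal sketch using the default-runtime rates quoted in this article; the function name is hypothetical, and it assumes the VM is paused (not deleted) whenever it is not running.

```python
HOURS_PER_WEEK = 168


def weekly_notebook_cost(running_hours, vm_hourly=0.190, disk_hourly=0.027):
    """Weekly cost of a notebook VM that is paused when not running.

    While running, both the VM and the disk are billed;
    while paused, only the disk is billed.
    """
    paused_hours = HOURS_PER_WEEK - running_hours
    return running_hours * (vm_hourly + disk_hourly) + paused_hours * disk_hourly


part_time = weekly_notebook_cost(20)               # 20 hours of use per week
always_on = weekly_notebook_cost(HOURS_PER_WEEK)   # never paused
print(f"20 h/week: ${part_time:.2f}, always on: ${always_on:.2f}")
```

Comparing the two values shows how much of the bill comes from leaving a VM running when it is not in use.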

Egress and Query costs

You incur the same compute charges whether your notebook is running or sitting idle. How your notebook accesses available data determines any additional charges. For this section, we look at whether your data is in Google Cloud Storage or in BigQuery to assess any additional charges.

Google Cloud Storage 

If your data is in GCS and is in the same region as your notebook VM, then you pay no access charges. As of the time of writing, Terra notebook VMs run in zones in us-central1. If your data is published in us-central1, then no egress charges are incurred for accessing this data from your notebooks.

Google BigQuery

If your data is in BigQuery and the data is "small", then you are likely to incur no additional costs for accessing the data. BigQuery query pricing is "$5.00 per TB" and "First 1 TB per month is free".

Suppose your clinical data is less than 100 MB. You would have to query all of this data more than 10,000 times in a single month before you would incur charges. After that, you would have to query all of your clinical data another 20 times (2 GB at $5.00 per TB) before you are charged $0.01.
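
The free-tier arithmetic can be checked with a short calculation. This is a sketch under stated assumptions: it uses decimal units (1 TB = 1,000,000 MB), the on-demand price and free tier quoted in this article, and a hypothetical function name. It also assumes every query scans the full 100 MB.

```python
PRICE_PER_TB = 5.00      # $ per TB queried, as quoted in this article
FREE_TB_PER_MONTH = 1.0  # first 1 TB per month is free
MB_PER_TB = 1_000_000    # assumption: decimal units


def monthly_query_cost(mb_per_query, queries):
    """On-demand query cost for a month, after the free tier."""
    total_tb = mb_per_query * queries / MB_PER_TB
    billable_tb = max(0.0, total_tb - FREE_TB_PER_MONTH)
    return billable_tb * PRICE_PER_TB


within_free_tier = monthly_query_cost(100, 10_000)  # exactly 1 TB scanned
just_past = monthly_query_cost(100, 10_020)         # 2 GB beyond the free tier
print(f"10,000 queries: ${within_free_tier:.2f}; 10,020 queries: ${just_past:.2f}")
```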

If your data is in BigQuery and the data is "large", then you will want to pay close attention to how you query the data. Review the discussion above on BigQuery Query costs.

Example notebooks 

Quality Control checks on genomic data

A notebook that performs quality control checks on genomic data is typically driven by "analysis ready" data, such as QC metrics emitted by the Picard set of tools. In such cases, if the metrics are aggregated in BigQuery, the data is very small (on the order of MB) and thus is virtually free to query.

If the notebook also queries a large table such as a table of variants, you may begin to generate notable charges. We looked at a 2 TB table containing 73 million variants. A query that selects all of the values from this variants table will cost over $10, while a much more compact and targeted query can be much less.

Analyzing genomic variants

If you are looking at a targeted region of the genome, querying the typical _variants table can be fairly inexpensive. Some variant tables take advantage of BigQuery clustering, for example by clustering on the reference_name (the chromosome), the start_position, and the end_position. In such a case, when you know the region of interest, you can direct BigQuery to look only at the cluster that holds your data of interest.

For example, a query that includes in the WHERE clause:

reference_name = 'chr4'

will look only at the records for chromosome 4 (which is less than 7% of the genome), so the query will cost less than 7% of what querying the entire table would.
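
To put rough numbers on this, the sketch below estimates the on-demand cost of a full scan versus a chromosome-restricted scan of a 2 TB variants table. It is illustrative only: the table name in the query string is hypothetical, the 7% figure is the upper bound quoted above, the free tier is ignored, and the estimate assumes the table is clustered so that BigQuery scans only the matching blocks.

```python
PRICE_PER_TB = 5.00  # $ per TB queried, as quoted in this article

# Hypothetical project, dataset, and table names; only the WHERE clause
# comes from this article.
QUERY = """
SELECT reference_name, start_position, end_position
FROM `my-project.my_dataset.my_variants`
WHERE reference_name = 'chr4'
"""


def scan_cost(table_tb, fraction_scanned=1.0):
    """Estimated on-demand cost, ignoring the monthly free tier."""
    return table_tb * fraction_scanned * PRICE_PER_TB


full_scan = scan_cost(2.0)                         # every row and column
chr4_only = scan_cost(2.0, fraction_scanned=0.07)  # upper bound for chr4
print(f"full table: ${full_scan:.2f}, chr4 only: at most ${chr4_only:.2f}")
```

Selecting only the columns you need, as the query string does, reduces the bytes scanned further, so the real cost of such a query would likely be lower than this bound.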

If your analysis looks at the entire genome, be mindful of your queries, as you can generate meaningful charges. It may be worth evaluating other options, such as processing VCF files on a VM. Copying the VCF(s) from GCS eliminates the query charges. This is not always the better solution, as you may only be trading query charges for increased compute time. These types of trade-offs require deeper analysis of the specific use case.

Storing files in Cloud 

In this section, we provide a quick look at what it costs to store certain types of data, on average. For your own data, you are encouraged to use the Cloud Storage Pricing Guide.

Notebooks

Notebook files typically store very little data, and notebook file contents (code and text) are quite small. An analysis of Jupyter notebooks in GitHub indicates an average size of 600 KB. At $0.026 per GB per month (multi-regional), 1000 notebooks (600 MB) will cost about $0.016 per month.

Clinical Data

Clinical data is typically small. For example, 19MB of clinical data in GCS at $0.02 per GB per month (regional) costs $0.00038 per month.

Genomic Data

Genomic data is large. A typical 30x whole genome will have files of size:

  • FASTQs: 75 GB
  • CRAM: 17.5 GB
  • gVCF: 6.5 GB

Depending on your access patterns, you should consider storing these files in Regional ($0.02 / GB / month) or Nearline ($0.01 / GB / month) storage.

FASTQ 

| Number of Samples | Monthly Cost (Regional) | Annual Cost (Regional) | Monthly Cost (Nearline) | Annual Cost (Nearline) |
|---|---|---|---|---|
| 1 | $1.50 | $18.00 | $0.75 | $9.00 |
| 100 | $150.00 | $1,800.00 | $75.00 | $900.00 |
| 1000 | $1,500.00 | $18,000.00 | $750.00 | $9,000.00 |

CRAM

| Number of Samples | Monthly Cost (Regional) | Annual Cost (Regional) | Monthly Cost (Nearline) | Annual Cost (Nearline) |
|---|---|---|---|---|
| 1 | $0.35 | $4.20 | $0.175 | $2.10 |
| 100 | $35.00 | $420.00 | $17.50 | $210.00 |
| 1000 | $350.00 | $4,200.00 | $175.00 | $2,100.00 |

gVCF 

| Number of Samples | Monthly Cost (Regional) | Annual Cost (Regional) | Monthly Cost (Nearline) | Annual Cost (Nearline) |
|---|---|---|---|---|
| 1 | $0.13 | $1.56 | $0.065 | $0.78 |
| 100 | $13.00 | $156.00 | $6.50 | $78.00 |
| 1000 | $130.00 | $1,560.00 | $65.00 | $780.00 |

Transcriptomics Data 

Transcriptomic data is large. A typical 100 million read RNA seq will have FASTQs and BAMs, each around 15 GB. Depending on your access patterns, you should consider storing these files in Regional ($0.02 / GB / month) or Nearline ($0.01 / GB / month) storage.

Storing either of these will cost approximately:

| Number of Samples | Monthly Cost (Regional) | Annual Cost (Regional) | Monthly Cost (Nearline) | Annual Cost (Nearline) |
|---|---|---|---|---|
| 1 | $0.30 | $3.60 | $0.15 | $1.80 |
| 100 | $30.00 | $360.00 | $15.00 | $180.00 |
| 1000 | $300.00 | $3,600.00 | $150.00 | $1,800.00 |
