Understanding Google Cloud costs - sample use cases

Allie Hajian

To make the costs described in Overview: Controlling Google Cloud costs on Terra more concrete, we break down some examples and work through their costs, including charges for key cloud services such as Google Cloud Storage, Google Compute Engine, and Google BigQuery.

Note that specific costs will vary based on software versions, data sizes, and storage and access locations.

Content for this article was contributed by Matt Bookman of Verily Life Sciences, based on work done in Terra for AMP PD, a public/private partnership collaborating toward biomarker discovery to advance the development of Parkinson's Disease therapies.

Workflow costs

Overview: running a workflow 

The Terra environment provides a workflow engine called Cromwell. The following discussion assumes that a workflow has been crafted to run using Cromwell. Many workflows in the cloud, independent of the workflow engine, follow a similar model.

Typical single-stage workflow procedure

Note: A typical multi-stage workflow is a sequence of single-stage workflows. A more complex multi-stage workflow may run intermediate stages in parallel.

  1. Create Compute Engine virtual machine (VM)
  2. Copy inputs from GCS to the VM
  3. Run code on the VM
  4. Copy outputs from the VM to GCS
  5. Delete the VM
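On Terra, Cromwell performs all five of these steps for you. To make the cost model concrete, here is a minimal, hypothetical sketch of the same lifecycle driven by hand with standard gcloud and gsutil commands (wrapped in Python). The VM name, bucket paths, and convert.sh script are illustrative placeholders, not part of any Terra workflow.

import subprocess

def run(cmd):
    # Run a shell command and fail loudly if it errors.
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# Hypothetical names and paths, for illustration only.
vm = "cram-to-bam-worker"
zone = "us-central1-a"
bucket = "gs://my-example-bucket"

# 1. Create a Compute Engine VM (here: 1 vCPU, 3.75 GB memory, 200 GB boot disk).
#    Add --preemptible to use a preemptible VM (see cost considerations below).
run(f"gcloud compute instances create {vm} --zone={zone} "
    "--machine-type=n1-standard-1 --boot-disk-size=200GB")

# 2. Copy inputs from GCS to the VM.
run(f"gcloud compute ssh {vm} --zone={zone} "
    f"--command='gsutil cp {bucket}/inputs/sample.cram .'")

# 3. Run code on the VM (convert.sh stands in for whatever the workflow stage runs).
run(f"gcloud compute ssh {vm} --zone={zone} --command='./convert.sh sample.cram'")

# 4. Copy outputs from the VM back to GCS.
run(f"gcloud compute ssh {vm} --zone={zone} "
    f"--command='gsutil cp sample.bam {bucket}/outputs/'")

# 5. Delete the VM so you stop paying for it.
run(f"gcloud compute instances delete {vm} --zone={zone} --quiet")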

Core cost considerations

  • How much compute do you need (CPUs, memory, and disk)?
  • Can you run on preemptible VMs?
  • Can you run your compute nodes in the same region as your data in GCS (to avoid data transfer charges)?

Whenever possible, run your VMs in the same region as your data. If you run your VMs in a different region, you will incur network data transfer charges; if you run them in the same region as your data, you will not. To learn more, see Customizing where your data are stored and analyzed.

Workflow example 1: Converting a CRAM to a BAM

A typical 30x WGS CRAM file is about 17.5 GB in size. Let's look at the cost of converting a CRAM file to a BAM file. For this example, we used a workflow published in Dockstore (seq-format-conversion/CRAM-to-BAM) and available via the help-gatk/Seq-Format-Conversion workspace.

The workflow has two stages:

  1. CramToBamTask (Convert CRAM to BAM)
  2. ValidateSamFile (Validate the BAM)

Let's focus on CramToBamTask. Converting a CRAM to a BAM is a single-threaded operation, so there is likely little to no advantage to allocating more CPUs. There could be advantages to adding more memory, but beyond certain memory sizes, Google Compute Engine (GCE) requires you to add more CPUs, which increases cost.

There is also a cost for the disk associated with the VM. For this operation, a 200 GB disk was allocated, which at $0.04 / GB / month is $8 per month, or about $0.01 per hour.

Let's compare a few configurations:


| Machine | Hourly cost (preemptible / full price) | Runtime | Cost (preemptible VM) | Cost (full-priced VM) |
|---|---|---|---|---|
| 4 CPUs, 15 GB (n1-standard-4) | $0.04 / $0.19 | 4h 37m | $0.051 * 4.62 = $0.24 | $0.21 * 4.62 = $0.97 |
| 1 CPU, 3.75 GB (n1-standard-1) | $0.01 / $0.0475 | 5h 51m | $0.02 * 5.85 = $0.12 | $0.0604 * 5.85 = $0.35 |
| 1 CPU, 6.5 GB (custom) | $0.012 / $0.059 | 5h 26m | $0.022 * 5.43 = $0.12 | $0.072 * 5.43 = $0.39 |


Notice that the smallest VM shape had the longest runtime but the lowest cost.
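The per-run costs in the table are essentially (hourly VM price + hourly disk price) multiplied by runtime; the published figures fold in slightly different rounding of the hourly rates. A minimal sketch of that arithmetic, using the prices quoted in this example:

# Rough per-run cost model: (VM hourly price + disk hourly price) * runtime hours.
# Prices below are the example's quoted rates, not current Google Cloud list prices.

DISK_PER_HOUR = 200 * 0.04 / 730     # 200 GB at $0.04/GB/month, ~730 hours/month, ~$0.011/hr

def run_cost(vm_per_hour, runtime_hours):
    return (vm_per_hour + DISK_PER_HOUR) * runtime_hours

# n1-standard-4 (4 CPUs, 15 GB), 4h 37m = 4.62 hours
print(round(run_cost(0.04, 4.62), 2))    # preemptible -> ~$0.24
print(round(run_cost(0.19, 4.62), 2))    # full price  -> ~$0.93 (the table uses a slightly higher rate and shows $0.97)

# n1-standard-1 (1 CPU, 3.75 GB), 5h 51m = 5.85 hours
print(round(run_cost(0.01, 5.85), 2))    # preemptible -> ~$0.12
print(round(run_cost(0.0475, 5.85), 2))  # full price  -> ~$0.34 (the table shows $0.35)

# Scaling is linear: 1000 samples on a preemptible n1-standard-1 is about 1000 * $0.12 = $120.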

Workflow example 1 highlights

There are many things to highlight from this example:

- Preemptible VMs are significantly cheaper.

- Between CPUs and memory, adding more memory is much less expensive.

- Adding more CPUs can decrease runtimes, but at a significant cost multiple.

- Increasing CPU or memory allocations can reduce the amount spent on disk (by shortening runtime), but disk is the least expensive of the three resource types.

Note: At 17.5 GB, downloading the CRAM file to convert it to a BAM on your own workstation would cost $0.175.

While the total computational costs in this example are all very small, let's look at what happens when you scale up the number of samples to 1000:


| Operation | Number of Samples | Estimated Cost |
|---|---|---|
| CRAMtoBAM (1 CPU, 3.75 GB, preemptible) | 1000 | $120 |
| CRAMtoBAM (4 CPUs, 15 GB, full price) | 1000 | $966 |
| Download | 1000 | $175 |

Workflow example 2: Aligning to a reference and variant calling (GATK Best practices)

The Broad Institute has published the five-dollar-genome-analysis-pipeline to Dockstore and made it available in the help-gatk/five-dollar-genome-analysis-pipeline workspace. Read through the workspace description for example costs of running the workflow.

Be aware that the "five dollar genome" is named for the typical amount of Compute Engine charges generated while processing a 30x WGS sample. Just as important (if not more so) are the costs associated with file storage!

Long-Term Storage

A typical 30x WGS sample produces a 17.5 GB CRAM file and a 6.5 GB gVCF file. Long-term storage of these outputs in a Regional bucket ($0.02 / GB / month) would be:

| Number of Samples | Monthly Cost | Annual Cost |
|---|---|---|
| 1 | $0.48 | $5.76 |
| 100 | $48.00 | $576.00 |
| 1000 | $480.00 | $5,760.00 |
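These figures are straightforward multiplication: GB stored times the per-GB monthly rate, times 12 for the annual cost. A minimal sketch using the file sizes and Regional rate quoted above:

# Long-term storage cost: GB stored * $/GB/month, times 12 for the annual figure.
CRAM_GB, GVCF_GB = 17.5, 6.5
REGIONAL_RATE = 0.02            # $ per GB per month

def monthly_storage(samples, rate=REGIONAL_RATE):
    return samples * (CRAM_GB + GVCF_GB) * rate

for n in (1, 100, 1000):
    m = monthly_storage(n)
    print(f"{n:>5} samples: ${m:,.2f}/month, ${m * 12:,.2f}/year")
# 1 sample -> $0.48/month, $5.76/year; 1000 samples -> $480.00/month, $5,760.00/year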

Short-Term Storage

Multistage workflows like the "five dollar genome" store intermediate results in Google Cloud Storage. These interim results can be large - for example, complete BAM files or shards of FASTQ files. It's very important to clean up the interim results when you are done with the workflow; inattention to this storage can significantly increase your per-sample costs.

A typical single-sample processing run for a 30x WGS sample can produce more than 300 GB of interim data files, more than 12x the size of the final outputs! Storing these interim results for a month in a multi-regional bucket ($0.026 / GB / month) would cost:

| Number of Samples | Monthly Cost |
|---|---|
| 1 | $7.80 |
| 100 | $780.00 |
| 1000 | $7,800.00 |
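The same arithmetic applies to interim storage, and the cleanup itself is typically a single recursive delete once the final outputs are verified. In the sketch below, the bucket path is a hypothetical placeholder for wherever your workflow wrote its intermediates, not a specific Terra path:

import subprocess

INTERIM_GB_PER_SAMPLE = 300
MULTI_REGIONAL_RATE = 0.026     # $ per GB per month

# 1000 samples' interim files left in place for a month: ~ $7,800
print(1000 * INTERIM_GB_PER_SAMPLE * MULTI_REGIONAL_RATE)

# Once the final outputs are verified, remove the interim files
# (placeholder path; point this at your workflow's intermediate directory).
subprocess.run("gsutil -m rm -r gs://my-workspace-bucket/intermediates/",
               shell=True, check=True)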

Workflow example 3: Aligning a transcriptomic sample to a reference using STAR 

The following estimates used STAR version 2.6.1d. Processing cost for most samples was between $1.50 and $2.50. Because this is a single-stage workflow, there were no intermediate results to clean up. There are a few details to take away from this workflow:

Workflow example 3 highlights

- The per-sample cost was kept lower primarily by using preemptible VMs. Using a full-priced VM would be more than 4 times as expensive.

- Having a separate workflow using samtools to sort and compress the BAM shortened total runtimes and allowed more samples to be processed with preemptible VMs.

- Having a separate workflow using samtools to sort allowed us to reduce the disk size from 1 TB down to 200 GB.

- Using a higher compression level (samtools defaults to 6, STAR defaults to 1) can save significantly on long-term storage costs.

Compute Analysis

The STAR alignReads workflow used 16 threads (--runThreadN 16), and we used a custom VM (16 vCPUs, 80 GB memory) with (initially) 1 TB of persistent disk. At preemptible rates, this VM is approximately $0.178/hr; at full price, approximately $0.886/hr.

Preemptible VMs must finish their work within 24 hours. We observed that large samples and samples with high multi-mapping rates could take 24 hours or more. The impact: a sample that takes 23 hours on a preemptible VM costs $4.09 in compute, while a sample that takes just over 24 hours, and therefore must run on a full-priced VM, costs $21.27.

To get more samples to complete in less than 24 hours, we changed the workflow so that STAR would not sort the BAM. This saved 1-2 hours per run.

This meant we needed another workflow to compress the BAM. However, we already needed to generate a BAM index file, so we let samtools compress and re-index the BAM. We found we could index BAMs on a small, single-core VM (n1-standard-1) with a 200 GB disk. At preemptible rates of $0.01 per hour for the VM and about $0.01 per hour for the disk, these workflows cost pennies per sample.
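As a hedged sketch of that follow-up step (standard samtools options with placeholder file names, not the exact workflow used for this example): samtools sorts and recompresses the unsorted BAM emitted by STAR, then indexes it.

import subprocess

def run(cmd):
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# Sort and recompress the unsorted BAM produced by STAR.
# -l 6 explicitly requests gzip level 6 (the samtools default), roughly 25%
# smaller than STAR's level-1 output; -@ 1 uses a single compression thread.
run("samtools sort -@ 1 -l 6 -o sample.sorted.bam sample.unsorted.bam")

# Index the sorted BAM; this is the step that required a BAM index anyway.
run("samtools index sample.sorted.bam")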

Storage Analysis

Note: The STAR default compression level is 1, while the samtools default compression level is 6. When we had samtools recompress the BAM files, we saw a roughly 25% reduction in size, which has a tremendous long-term cost benefit. In one dataset we looked at, an average-sized RNA-seq BAM file was 20 GB at level 6 compression vs. 27 GB at level 1 compression.

The cost of storing these BAMs in Regional storage would be:

| Number of BAMs | Level 1 compression (monthly / annual) | Level 6 compression (monthly / annual) |
|---|---|---|
| 1 | $0.54 / $6.48 | $0.40 / $4.80 |
| 100 | $54.00 / $648.00 | $40.00 / $480.00 |
| 1000 | $540.00 / $6,480.00 | $400.00 / $4,800.00 |

Notebook costs

Overview: Running a notebook

The Terra environment provides the ability to run analyses using Jupyter Notebooks. In this section, we look at costs around using the Jupyter Notebook service, along with costs for running a couple of example notebooks.

To fully understand the notebook environment on Terra, see the related documentation on Cloud Environments in Terra.

When breaking down the costs for using the notebook service, there are two broad categories to consider:

  • Compute costs
  • Data transfer and Query costs

Notebook Compute costs

Your compute costs are based on the VM that's allocated for you, whether that VM is doing any computation or not.

Notebook VM costs highlights

- A Google Compute Engine virtual machine (VM) is created for you when you start a Cloud Environment in your workspace.

- While the VM is running, you're charged for the allocated CPUs, memory, and disk.

- While the VM is paused, you're charged only for the disk.

- You're not charged once you've deleted the VM.

Example runtime costs

By default, your Notebook Runtime allocates a VM with 4 cores, 15 GB of memory (n1-standard-4), and a 500 GB disk.

The cost for this VM and associated disk are:

  • VM: $0.190 / hour
  • Disk: $0.027 / hour (approximate)

While the VM is running, you're charged about $0.217 per hour. When your VM is paused, you are charged $0.027 / hour.

If you were to have a notebook VM running for 20 hours per week (and paused for the remaining 148 hours), your weekly charges would be:

($0.217 / hr * 20 hr) + ($0.027 / hr * 148 hr) = $8.34

or about $33.34 per month.
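A minimal sketch of that weekly calculation, using the default n1-standard-4 rates listed above:

# Default notebook Cloud Environment: n1-standard-4 VM + 500 GB disk.
RUNNING_PER_HOUR = 0.190 + 0.027   # VM + disk
PAUSED_PER_HOUR = 0.027            # disk only

hours_running = 20
hours_paused = 7 * 24 - hours_running   # the rest of the 168-hour week

weekly = RUNNING_PER_HOUR * hours_running + PAUSED_PER_HOUR * hours_paused
print(round(weekly, 2))        # ~8.34
print(round(weekly * 4, 2))    # ~33.34 for a 4-week month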

Data transfer and Query costs

You incur the same compute charges whether your notebook is running or sitting idle. Additional charges are based on whether your notebook accesses your data using Google Cloud Storage or Google BigQuery. 

  • If your data is in GCS and is in the same region as your notebook VM, you pay no data access charges. As of this writing, Terra notebook VMs run in zones in us-central1; if your data is published in us-central1, no data transfer charges are incurred for accessing it from your notebooks.
  • If your data are in BigQuery and the data are "small", you are likely to incur no additional costs for accessing them. BigQuery query pricing is $5.00 per TB scanned, and the first 1 TB per month is free.

    Suppose your clinical data total less than 100 MB. You would have to query this data more than 10,000 times in a single month before you incur any charges. Beyond the free tier, each full query would cost about $0.0005, so you would have to query all of your clinical data roughly 20 more times before accruing even $0.01.

    If your data are in BigQuery and the data are "large", pay close attention to how you query them. Review the discussion of BigQuery query costs in Overview: Controlling Google Cloud costs on Terra.
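One practical way to stay ahead of BigQuery charges from a notebook is a dry run, which reports how many bytes a query would scan without executing it (and without cost). A minimal sketch using the google-cloud-bigquery client; the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table name; substitute your own dataset.
sql = "SELECT * FROM `my-project.my_dataset.clinical_measurements`"

# A dry run estimates the bytes the query would scan, without running it or incurring cost.
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=config)

tb = job.total_bytes_processed / 1e12
print(f"Would scan ~{tb:.6f} TB, about ${tb * 5:.4f} at $5.00/TB (first 1 TB per month is free)")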

Notebook example 1: Quality Control checks on genomic data

A notebook that performs quality control checks on genomic data is typically driven by "analysis ready" data, such as QC metrics emitted by the Picard set of tools. In such cases, if the metrics are aggregated in BigQuery, the data are very small (on the order of megabytes) and thus are virtually free to query.

If the notebook also queries a large table, such as a table of variants, you may begin to generate notable charges. We looked at a 2 TB table containing 73 million variants. A query that selects all of the values from this variants table will cost over $10, while a much more compact and targeted query can cost far less.

Notebook example 2: Analyzing genomic variants

If you are looking at a targeted region of the genome, querying the typical _variants table can be fairly inexpensive. Some variant tables take advantage of BigQuery clustering - for example, they may be clustered on reference_name (the chromosome), start_position, and end_position. In that case, once you know your region of interest, you can direct BigQuery to read only the clustered blocks that contain it.

For example, a query that includes in the WHERE clause:

reference_name = 'chr4'

will look only at the records for chromosome 4 (which is less than 7% of the genome), so the cost of the query will be less than 7% of the cost of querying the entire table.
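A hedged sketch of checking this from a notebook; the table name is a placeholder and the columns follow the clustering scheme described above. Note that a dry-run estimate does not account for cluster pruning, so the actual bytes billed on the completed query job are the number to watch:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table name; columns follow the clustering described above.
sql = """
    SELECT reference_name, start_position, end_position
    FROM `my-project.my_dataset.sample_variants`
    WHERE reference_name = 'chr4'
"""

# Run the targeted query, then inspect what was actually billed. On a table
# clustered by reference_name, block pruning keeps the billed bytes to a small
# fraction of the full table size.
job = client.query(sql)
job.result()  # wait for the query to finish
billed_tb = job.total_bytes_billed / 1e12
print(f"Billed ~{billed_tb:.3f} TB, about ${billed_tb * 5:.2f} at $5/TB")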

If your analysis covers the entire genome, be mindful of your queries, as you can generate meaningful charges. It may be worth evaluating other options, such as processing VCF files on a VM. Copying the VCF(s) from GCS eliminates the query charges, but this is not always the better solution: you may simply be trading query charges for increased compute time. These trade-offs require deeper analysis of the specific use case.

File storage costs

In this section, we provide a quick look at what it costs to store certain types of data, on average. For your own data, you are encouraged to use the Cloud Storage Pricing Guide.

Notebooks

Notebook files typically store very little data; their contents (code and text) are quite small. An analysis of Jupyter Notebooks on GitHub indicates an average size of about 600 KB. At $0.026 per GB per month (multi-regional), 1000 notebooks (600 MB) cost about $0.016 per month.

Clinical Data

Clinical data files are typically small. For example, 19 MB of clinical data in GCS at $0.02 per GB per month (regional) costs $0.00038 per month.

Genomic Data

Genomic data files are large. Below are example file sizes and storage costs for a typical 30x whole genome. Depending on your access patterns, consider storing these files in Regional ($0.02 / GB / month) or Nearline ($0.01 / GB / month) storage.
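The tables below are again straightforward multiplication of file size by the per-GB rate. A compact sketch that reproduces the per-sample rows for both storage classes:

# Storage cost per month = GB * rate; annual = monthly * 12.
RATES = {"Regional": 0.02, "Nearline": 0.01}      # $ per GB per month
FILES = {"FASTQ": 75, "CRAM": 17.5, "gVCF": 6.5}  # GB per sample

for name, gb in FILES.items():
    for storage_class, rate in RATES.items():
        monthly = gb * rate
        print(f"{name} ({gb} GB), {storage_class}: "
              f"${monthly:g}/month, ${monthly * 12:g}/year per sample")
# e.g. FASTQ Regional: $1.5/month, $18/year; CRAM Nearline: $0.175/month, $2.1/year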

FASTQ (75 GB)

| Samples | Monthly Cost (Regional) | Annual Cost (Regional) | Monthly Cost (Nearline) | Annual Cost (Nearline) |
|---|---|---|---|---|
| 1 | $1.50 | $18.00 | $0.75 | $9.00 |
| 100 | $150.00 | $1,800.00 | $75.00 | $900.00 |
| 1000 | $1,500.00 | $18,000.00 | $750.00 | $9,000.00 |

CRAM (17.5 GB)

| Samples | Monthly Cost (Regional) | Annual Cost (Regional) | Monthly Cost (Nearline) | Annual Cost (Nearline) |
|---|---|---|---|---|
| 1 | $0.35 | $4.20 | $0.175 | $2.10 |
| 100 | $35.00 | $420.00 | $17.50 | $210.00 |
| 1000 | $350.00 | $4,200.00 | $175.00 | $2,100.00 |

gVCF (6.5 GB)

| Samples | Monthly Cost (Regional) | Annual Cost (Regional) | Monthly Cost (Nearline) | Annual Cost (Nearline) |
|---|---|---|---|---|
| 1 | $0.13 | $1.56 | $0.065 | $0.78 |
| 100 | $13.00 | $156.00 | $6.50 | $78.00 |
| 1000 | $130.00 | $1,560.00 | $65.00 | $780.00 |

Transcriptomics Data 

Transcriptomic data files are large. A typical 100-million-read RNA-seq sample will have FASTQs and BAMs, each around 15 GB. Depending on your access patterns, consider storing these files in Regional ($0.02 / GB / month) or Nearline ($0.01 / GB / month) storage.

Storing either of these will cost approximately:


| Samples | Monthly Cost (Regional) | Annual Cost (Regional) | Monthly Cost (Nearline) | Annual Cost (Nearline) |
|---|---|---|---|---|
| 1 | $0.30 | $3.60 | $0.15 | $1.80 |
| 100 | $30.00 | $360.00 | $15.00 | $180.00 |
| 1000 | $300.00 | $3,600.00 | $150.00 | $1,800.00 |
