Costs of selected featured workflows

Yashasvika Duggal
  • Updated

Terra users often ask whether they can estimate the cost of using Terra or of running a particular workflow. This article lists selected featured workflows with their estimated run times and costs on sample data. Actual time and cost will vary with the size of your dataset and with the use of preemptible instances.
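
The tables in this article report run times in several notations (for example "3 hr 46 min", "3:05:00", and "1h 22m"). To compare entries, it can help to convert them to minutes first. The Python sketch below is illustrative only: the function name to_minutes is hypothetical and is not part of Terra. Two-part clock values such as "0:21" are deliberately rejected, because the article does not say whether they mean hours:minutes or minutes:seconds.

```python
import re

def to_minutes(text):
    """Convert a runtime such as '3 hr 46 min', '1h 22m', '26 hrs' or '3:05:00' to minutes.

    Hypothetical helper for illustration; not a Terra function.
    """
    text = text.strip().lower()
    # Clock notation with three parts is read as H:MM:SS.
    m = re.fullmatch(r"(\d+):(\d{2}):(\d{2})", text)
    if m:
        hours, minutes, seconds = map(int, m.groups())
        return hours * 60 + minutes + seconds / 60
    # Unit notation: an optional hours part followed by an optional minutes part.
    m = re.fullmatch(
        r"(?:(\d+)\s*h(?:rs?|ours?)?)?[\s,]*(?:(\d+)\s*m(?:in(?:utes?)?)?)?", text
    )
    if m and (m.group(1) or m.group(2)):
        return int(m.group(1) or 0) * 60 + int(m.group(2) or 0)
    # Anything else (including ambiguous values like '0:21') is rejected.
    raise ValueError(f"unrecognized runtime: {text!r}")

for sample in ["3 hr 46 min", "3:05:00", "1h 22m", "26 hrs"]:
    print(sample, "->", to_minutes(sample), "minutes")
```

Rejecting ambiguous input rather than guessing keeps the comparison honest; check the workspace itself when a value's unit is unclear.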

Ultima Genomics whole genome germline

This workspace contains a fully reproducible example workflow for pre-processing germline whole-genome sequence data derived from the Ultima Genomics Platform.

Estimated time and cost to run on sample data

| Workflow Configuration | Sample Name | Sample Size | Time | Cost ($) |
| --- | --- | --- | --- | --- |
| Ultima_Genomics | downsampled_NA12878 | ~3.00 GB | 3 hr 46 min | 0.93 |
| Ultima_Genomics | 004731-UGAv3-30-CTGCCAGACTGTGA | 55.62 GB | 26 hrs | 14.82 |
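
As a rough illustration of how the two rows above can be used, the Python sketch below computes cost per GB and cost per hour for each run, then interpolates linearly between them for an intermediate sample size. Linear scaling is an assumption made for illustration, not a Terra pricing rule, and the function names are hypothetical. The 30 GB input is an arbitrary example.

```python
# Figures copied from the table above: (sample size in GB, runtime in hours, cost in USD).
runs = {
    "downsampled_NA12878": (3.00, 3 + 46 / 60, 0.93),
    "004731-UGAv3-30-CTGCCAGACTGTGA": (55.62, 26.0, 14.82),
}

def cost_per_gb(size_gb, cost_usd):
    """Cost in USD per GB of input."""
    return cost_usd / size_gb

def interpolate_cost(size_gb, small, large):
    """Linear interpolation between two (size_gb, cost_usd) points.

    Assumption for illustration only: real cost need not scale linearly.
    """
    (s0, c0), (s1, c1) = small, large
    return c0 + (c1 - c0) * (size_gb - s0) / (s1 - s0)

for name, (size, hours, cost) in runs.items():
    print(f"{name}: {cost_per_gb(size, cost):.3f} USD/GB, {cost / hours:.3f} USD/hour")

# Arbitrary example input size of 30 GB.
estimate = interpolate_cost(30.0, (3.00, 0.93), (55.62, 14.82))
print(f"Rough estimate for a 30 GB sample: {estimate:.2f} USD")
```

Treat the result as an order-of-magnitude guide only; preemptible instances and dataset characteristics can change the actual cost.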

CNest - Terra

This workspace runs CNest, a copy number estimator and variant caller developed specifically for large-scale analysis of copy number from NGS data.

It primarily uses read depth information to generate robust copy number estimates for individual samples and is most appropriate for use in very large cohorts (minimum of 1000 samples).

Estimated time and cost to run on sample data

| Sample Name | Sample Size | Time | Cost ($) |
| --- | --- | --- | --- |
| NA12878 | 64.89 GB | 3:05:00 | 0.65 |

GATK4 Germline Preprocessing Variant Calling Joint Calling

This workspace contains tutorial notebooks and workflows that cover pre-processing, SNP and Indel variant calling.

Estimated time and cost to run on sample data

| Sample Name | Sample Size | Time | Cost ($) |
| --- | --- | --- | --- |
| NA12878_24RG_small | 3.11 GB | 1:28:00 | 0.18 |
| NA12878 | 64.89 GB | 22:35:00 | 4.98 |
| downsampled-1kgp-50-exomes | 32.13 GB | 02:07:00 | 7.29 |

Human-Pangenome-Giraffe-DeepVariant-AnVIL-ASHG-Jan22

This workspace demonstrates variant calling using the Human Pangenome Reference Consortium's (HPRC) year 1 pangenome with the Giraffe/DeepVariant pipeline for calling germline variants. This workspace is intended to be a demonstration of utilizing a pangenome from the HPRC in AnVIL and Terra.

Estimated time and cost to run on sample data

| Input Coverage | Time | Cost ($) |
| --- | --- | --- |
| 35X | 10 hours | 15 |

GEM Showcase

This workspace demonstrates a gene-environment interaction analysis pipeline using Terra.

Specifically, we will use the software program GEM (Gene-Environment interaction analysis for Millions of samples).

Estimated time and cost to run on sample data

| Analysis | Sample size | # variants | Time (CPU hrs) | Cost ($) |
| --- | --- | --- | --- | --- |
| 1KG genome-wide interaction study | 1656 | 13.5M | 1.94 | 0.38 |

DRAGEN-GATK whole genome germline pipeline

This workspace contains a fully reproducible example workflow for whole-genome germline sequence data pre-processing using the DRAGEN-GATK mode of the Whole Genome Germline Single Sample (WGS) Pipeline.

Estimated time and cost to run on sample data

| Workflow Configuration | Sample Name | Number of Entities | Sample Size | Time | Cost ($) |
| --- | --- | --- | --- | --- | --- |
| Functional Equivalence | NA12878 | 24 | ~3.00 GB | 4 h 12 min | 0.86 |
| Maximum Quality | NA12878 | 24 | ~3.00 GB | 4 h 7 min | 0.86 |

Functional Equivalence

This workflow evaluates functional equivalence. The concept arose from the scientific need to combine results from multiple sources into larger datasets: functional equivalence addresses how to ensure that genomic data from different sources, processed with different pipelines, can be used interchangeably without risking batch effects.

Estimated time and cost to run on sample data

| Sample set | No. Replicates | Time | Cost ($) |
| --- | --- | --- | --- |
| HG002 | 3 | 90 min | 2.88 |

GATK4 RNA Germline Variant Calling

This workspace demonstrates how to call germline short variants (SNPs/Indels) from RNAseq data using GATK v4.1 and related tools. 

Estimated time and cost to run on sample data

| Sample Name | Sample Size | Time | Cost ($) |
| --- | --- | --- | --- |
| NA12878 | 3.09 GB | 9:32:00 | 0.47 |

TRUST4

Tcr Receptor Utilities for Solid Tissue (TRUST) is a computational tool to analyze TCR and BCR sequences using unselected RNA sequencing data, profiled from solid tissues, including tumors. TRUST4 performs de novo assembly on V, J, C genes including the hypervariable complementarity-determining region 3 (CDR3) and reports consensus of BCR/TCR sequences. 

Estimated time and cost to run on sample data

| Sample | Format | Read Pairs | Time | Cost ($) |
| --- | --- | --- | --- | --- |
| FZ-116 | BAM | 86M | 47m | 0.05 |
| FZ-116 | FASTQ | 86M | 1h 22m | 0.09 |

Peat-Demo

Demo of how to use Peat to save overhead by grouping jobs into fewer WDL scatter branches. To compare scatter with and without Peat, this workspace has two simple demo workflows using WDL scatter: one that uses Peat and one that does not.

Scatter Without Peat

Performs a simple job (writing a line to a file) many times via simple WDL scatter, then additionally concatenates all files into a single output file.

Estimated time and cost to run on sample data

| n_jobs | Time | Cost ($) |
| --- | --- | --- |
| 1000 | 0:30 | 2.85 |
| 1200 | 0:57 | 3.63 |
| 1500 | 1:00 | 4.84 |
| 2000 | 1:14 | 6.20 |

Scatter With Peat

Performs the same job, but using Peat to run multiple jobs on each WDL scatter branch, then additionally concatenates all files into a single output file.

Estimated time and cost to run on sample data

| n_jobs | n_groups | Time | Cost ($) |
| --- | --- | --- | --- |
| 1000 | 50 | 0:11 | 0.15 |
| 1200 | 50 | 0:32 | 0.23 |
| 1500 | 50 | 0:33 | 0.15 |
| 2000 | 50 | 0:29 | 0.15 |
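
To make the comparison between the two Peat tables concrete, the Python sketch below computes, for each job count, the absolute saving and the cost ratio when Peat is used. The figures are copied from the tables above; the function name savings is hypothetical and used only for illustration.

```python
# Cost in USD by number of jobs, copied from the two tables above.
without_peat = {1000: 2.85, 1200: 3.63, 1500: 4.84, 2000: 6.20}
with_peat = {1000: 0.15, 1200: 0.23, 1500: 0.15, 2000: 0.15}
N_GROUPS = 50  # n_groups value reported for every run with Peat

def savings(n_jobs):
    """Return (USD saved, cost ratio without/with Peat) for a given job count."""
    before, after = without_peat[n_jobs], with_peat[n_jobs]
    return before - after, before / after

for n in sorted(without_peat):
    saved, ratio = savings(n)
    print(
        f"{n} jobs ({n // N_GROUPS} per group): "
        f"saves {saved:.2f} USD, about {ratio:.0f}x cheaper with Peat"
    )
```

On these sample runs the cost with Peat stays nearly flat as the job count grows, while the cost without Peat rises with the number of scatter branches.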

Intro to HCA data on Terra

This tutorial workspace is a step-by-step guide to importing, accessing, and analyzing standardized cell-by-gene count matrices (Loom format) from the Human Cell Atlas (HCA) Data Portal using community-supported single-cell analysis tools.

Estimated time and cost to run on sample data

| Workflow/Notebook | Timing | Notes | Cost ($) |
| --- | --- | --- | --- |
| Cumulus workflow | 18 min | Runs on entire matrix (5 donors) | 0.15 |
| Bioconductor notebook | ~ 28 min | Runs on matrix subset (1 donor) | 0.09 |
| Pegasus notebook | ~ 5 min | Runs on matrix subset (1 donor) | 0.02 |
| Scanpy notebook | ~ 8 min | Runs on matrix subset (1 donor) | 0.03 |
| Seurat notebook | ~ 21 min | Runs on matrix subset (1 donor) | 0.07 |

InferCNV

A fully reproducible example workflow for inferring copy number from single-cell RNA sequencing data.

Estimated time and cost to run on sample data

| Time | Cost ($) |
| --- | --- |
| 30 minutes | < 0.01 |

CRDC-Dynamic-Queries-for-NIH-Genomic-Data-Commons-Projects

This workspace shows you how to take a query result from the NCI Genomic Data Commons (GDC) data portal and use it as the input to a workflow (or Notebook) in FireCloud.

Estimated time and cost to run on sample data

| File Name | Time | Cost ($) |
| --- | --- | --- |
| htseq_counts.txt.gz | 4m | < 0.01 |

CTAT mutations

A fully reproducible example workflow for detecting variants from RNA sequencing data.

Estimated time and cost to run on sample data

| Sample Name | Time | Cost ($) |
| --- | --- | --- |
| test | 3 hours, 17 minutes | 0.17 |

ANVIL T2T (missing info)

GATK Structural Variation on Single Samples

An integrated structural variation detection and resolution pipeline designed to call many forms of structural variation in whole genome sequencing data obtained from a single sample. The pipeline will identify, genotype, and annotate structural variation.

Estimated time and cost to run on sample data

| Sample Name | Sample Size | Time | Cost ($) |
| --- | --- | --- | --- |
| NA12878 | 18.17 GiB | 23hrs | ~7.34 |

Whole Genome Analysis Pipeline

This workspace contains fully reproducible example workflows for whole genome sequence data pre-processing, germline short variant discovery, and joint variant calling, as used for production by the Genomics Platform at the Broad Institute and recommended for research purposes.

Estimated time and cost to run on sample data

| Sample Name | Time | Cost ($) |
| --- | --- | --- |
| WGS_JointGenotyping | 04:05:00 | 7.55 |

NHGRI AnVIL Notebooks Collection (missing info)

This workspace is a collection of notebooks that are helpful for AnVIL users in Terra. The notebooks in this collection can be copied to a different workspace to aid users in their individual research needs.

CHIP Detection Mutect2

This workspace aims to build upon the GATK4 somatic variant WDL workflow, Mutect2, enabling investigators to perform variant calling and filtering for CHIP data in a consistent and reproducible manner. Users who would benefit from this workspace include investigators interested in the biological implications of CHIP including its role in both malignant and non-malignant disease.

Estimated time and cost to run on sample data

| Sample Name | Sample Size | Time | Cost ($) |
| --- | --- | --- | --- |
| SRS000030 (Mutated) | 59.84 GB | 1:50:00 | 0.10 |
| SRS000035 | 37.53 GB | 1:49:00 | 0.10 |

bisulfite-seq-tools-grch38

Methods from this workspace can be used for alignment and quality control analysis for various DNA methylation protocols including Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS) and Hybrid Selection Bisulfite Sequencing (HSBS).

Note: This workspace is pre-configured for GRCh38. 

ml4h-toolkit-for-machine-learning-on-clinical-data (improper format)

This workspace demonstrates the notebooks used by clinicians and researchers to review and annotate both clinical data model inputs, such as phenotypes, ECGs and MRIs, and model outputs such as predicted left ventricular mass.

Working with gnomAD in Terra (missing costs)

This workspace demonstrates several options for working with cloud-hosted gnomAD data from within Terra.

  • Running a workflow on the callset VCFs
  • Exploring the callset using Hail in a notebook
  • Exploring the callset using BigQuery in a notebook

Viral Insertion Detection

This workspace demonstrates a proof-of-concept for an approach to viral insertion detection. It includes a pipeline for identifying viral reads found in a host organism, and detects potential insertion sites in a host's genome.

Estimated time and cost to run on sample data

| Sample Name | Sample Size | Time | Cost ($) |
| --- | --- | --- | --- |
| 1 | 753 KB | 0:21 | 0.03 |

Exome Analysis Pipeline

This workspace contains fully reproducible example workflows for exome sequence data pre-processing, germline short variant discovery, and joint variant calling, as used for production by the Genomics Platform at the Broad Institute and recommended for research purposes.

Estimated time and cost to run on sample data

| Sample Name | Number of Entities | Sample Size | Time | Cost ($) |
| --- | --- | --- | --- | --- |
| NA12878 | 2 | 8.08 GB | 06:21:00 | ~0.61 |

Terra Data Tables Quickstart (improper format)

This workspace demonstrates how to use workspace data tables to organize, access, and analyze data, including sets of data, in the cloud.

Trinity

A fully reproducible example workflow for RNA-Seq de novo assembly using Trinity.

Estimated time and cost to run on sample data

| Number of reads | Time | Cost ($) |
| --- | --- | --- |
| 10 million | 130 minutes | 1.33 |
| 50 million | 360 minutes | 4.34 |
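
Dividing the Trinity figures above by the number of reads gives a per-million-reads rate for each run. The Python sketch below does that arithmetic; the function name per_million is hypothetical. With only two data points, the apparent drop in per-read time and cost at the larger input size should be read as a hint, not as a scaling law.

```python
# (reads in millions, runtime in minutes, cost in USD), copied from the table above.
trinity_runs = [(10, 130, 1.33), (50, 360, 4.34)]

def per_million(reads_millions, minutes, cost_usd):
    """Return (minutes per million reads, USD per million reads)."""
    return minutes / reads_millions, cost_usd / reads_millions

for reads, minutes, cost in trinity_runs:
    min_rate, usd_rate = per_million(reads, minutes, cost)
    print(
        f"{reads} million reads: {min_rate:.1f} min and "
        f"{usd_rate:.3f} USD per million reads"
    )
```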

HCA_Optimus_Pipeline

The Optimus pipeline, developed in collaboration with the Human Cell Atlas Data Coordination Platform (HCA DCP) and the BRAIN Initiative Cell Census Network (BICCN), processes 3 prime single-cell or single-nucleus transcriptome data from the 10x Genomics v2 or v3 assay. This workspace currently describes v5.5.0 of the Optimus pipeline and provides fully reproducible examples of the workflow.

Estimated time and cost to run on sample data

| Sample Set Name | Set Size | Sample Set R1.fastq Size | Sample Set R2.fastq Size | Time | Cost ($) |
| --- | --- | --- | --- | --- | --- |
| neurons2k_mouse | 6 entities | 88.26 MB | 277.58 MB | 1:22:00 | 0.09 |
| pbmc4k_human | 2 entities | 26.84 MB | 59.58 MB | 1:14:00 | 0.15 |
| pbmc_human_v3 | 2 entities | 106.95 MB | 220.04 MB | 1:36:00 | 0.11 |

ENCODE-Tutorial-May-2020

Learn how to search, analyze, and visualize ENCyclopedia Of DNA Elements (ENCODE) data. The resources in this workspace cover binning ENCODE ChIP-seq datasets into non-overlapping 5 kb bins and determining the signal enrichment in each bin. More information about the ENCODE project can be found here: https://www.encodeproject.org/.

Estimated time and cost to run on sample data

| Workflow Name | Time to Run 1 file | Time to Run 100 files | Cost, 1 file (range) | Cost, 100 files |
| --- | --- | --- | --- | --- |
| PBS-bam | 10-15 minutes | 15-30 minutes | $0.03 | < $3.00 |

COVID-19_Broad_Viral_NGS (missing info)

Cumulus

This workspace is a showcase of Cumulus, a cloud-based single-cell/single-nucleus data analysis framework. It uses a large-scale single-cell dataset and demonstrates Cumulus in both workflow and interactive analysis.

Estimated time and cost to run on sample data

| Step | CPU | Memory | Time | Cost ($) |
| --- | --- | --- | --- | --- |
| cellranger_workflow | 32 * 8 | 120 GB * 8 | 1h34min | 2.52 |
| cumulus | 32 | 200 GB | 22min | 0.16 |

Bioconductor (missing info)

Genomics-in-the-Cloud-v1 (missing info)

BioData Catalyst Collection (missing info)

COVID-19_cross_tissue_analysis (missing info)

Germline-CNVs-GATK4 (missing info)

2019_ASHG_Reproducible_GWAS-V2

This workspace reproduces the fundamental steps in a genome wide association study (GWAS), using 1,000 Genomes Project¹ (phase 3) genotypes and simulated phenotypes.

The analysis is structured in two parts:

  1. Explore phenotypes and population structure (Jupyter Notebook - Hail/Python)
  2. Test for genetic associations using mixed-models and generate summary visualizations (WDL workflow)

Estimated time and cost to run on sample data

| Sample Size | # Variants | Time | Cost ($) |
| --- | --- | --- | --- |
| 2,500 samples | 22,000 | 8m | 0.47 |
