Costs of selected featured workflows

This article includes data storage and analysis cost estimates for selected featured workspaces. Actual time and cost will vary depending on the size of your dataset and whether you use preemptible (spot) VMs.

See also How much did my workflow cost? and how to estimate costs using a Workflows cost-estimating notebook. Note that setting workflow cost thresholds is available in Preview (see the linked article for how to enable the feature), and workflows cost reporting includes cost estimates for workflows that are currently running (see How to set up workflows cost reporting).

Ultima Genomics whole genome germline

This workspace contains a fully reproducible example workflow for pre-processing germline whole-genome sequence data derived from the Ultima Genomics Platform.

Estimated time and cost to run on sample data

Workflow Configuration	Sample Name	Sample Size	Time	Cost $
Ultima_Genomics	downsampled_NA12878	~3.00 GB	3 hr 46 min	0.98
Ultima_Genomics	004731-UGAv3-30-CTGCCAGACTGTGA	55.62 GB	26 hrs	15.56
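
To compare the two runs above on a common footing, the figures in the table can be normalized by input size. The following is a minimal sketch in Python. The variable names are illustrative, and the ~3.00 GB size is approximate, so the result for the downsampled sample is only indicative.

```python
# Sample size (GB) and cost ($) copied from the table above.
runs = {
    "downsampled_NA12878": (3.00, 0.98),  # size is approximate (~3.00 GB)
    "004731-UGAv3-30-CTGCCAGACTGTGA": (55.62, 15.56),
}

# Cost per GB of input for each run.
cost_per_gb = {name: cost / size_gb for name, (size_gb, cost) in runs.items()}

for name, value in cost_per_gb.items():
    print(f"{name}: ${value:.2f} per GB")
```

Both runs come out at roughly $0.30 per GB, but the two rates are not identical, so treat a per-GB figure as a rough guide rather than a predictor for your own data.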

CNest - Terra

This workspace runs CNest, a copy number estimator and variant caller developed for large scale analysis of copy number from NGS data.

It primarily uses read depth information to generate robust copy number estimates for individual samples and is most appropriate for use in very large cohorts (minimum of 1000 samples).

Estimated time and cost to run on sample data

Sample Name	Sample Size	Time	Cost $
NA12878	64.89 GB	3:05:00	0.68

GATK4 Germline Preprocessing Variant Calling Joint Calling

This tutorial workspace contains notebooks and workflows for pre-processing and for calling SNP and indel variants.

Estimated time and cost to run on sample data

Sample Name	Sample Size	Time	Cost $
NA12878_24RG_small	3.11 GB	1:28:00	0.19
NA12878	64.89 GB	22:35:00	5.23
downsampled-1kgp-50-exomes	32.13 GB	02:07:00	7.65

Human-Pangenome-Giraffe-DeepVariant-AnVIL-ASHG-Jan22

This workspace demonstrates the Giraffe/DeepVariant pipeline for calling germline variants using the Human Pangenome Reference Consortium's (HPRC) year one pangenome, and shows how to use an HPRC pangenome in AnVIL and Terra.

Estimated time and cost to run on sample data

Input Coverage	Time	Cost $
35X	10 hours	$15.75

GEM Showcase

This workspace demonstrates a gene-environment interaction analysis pipeline on Terra using the software program GEM (Gene-Environment interaction analysis for Millions of samples).

Estimated time and cost to run on sample data

Analysis	Sample size	# variants	Time (CPU hrs)	Cost $
1KG genome-wide interaction study	1656	13.5M	1.94	0.40

DRAGEN-GATK whole genome germline pipeline

This workspace contains a fully reproducible example workflow for whole-genome germline sequence data pre-processing using the DRAGEN-GATK mode of the Whole Genome Germline Single Sample (WGS) Pipeline.

Estimated time and cost to run on sample data

Workflow Configuration	Sample Name	Number of Entities	Sample Size	Time	Cost $
Functional Equivalence	NA12878	24	~3.00 GB	4 h 12 min	0.90
Maximum Quality	NA12878	24	~3.00 GB	4 h 7 min	0.90

Functional Equivalence

This workflow evaluates functional equivalence so that researchers can combine results from multiple sources into larger datasets. Functional equivalence ensures that genomic data from different sources, processed with different pipelines, can be used interchangeably without risking batch effects.

Estimated time and cost to run on sample data

Sample set	No. Replicates	Time	Cost $
HG002	3	90 min	3.03

GATK4 RNA Germline Variant Calling

This workspace demonstrates how to call germline short variants (SNPs/Indels) from RNAseq data using GATK v4.1 and related tools.

Estimated time and cost to run on sample data

Sample Name	Sample Size	Time	Cost $
NA12878	3.09 GB	9:32:00	0.49

TRUST4

Tcr Receptor Utilities for Solid Tissue (TRUST) is a computational tool to analyze TCR and BCR sequences using unselected RNA sequencing data, profiled from solid tissues, including tumors. TRUST4 performs de novo assembly on V, J, C genes including the hypervariable complementarity-determining region 3 (CDR3) and reports consensus of BCR/TCR sequences. See the TRUST4 workspace.

Estimated time and cost to run on sample data

Sample	Format	Read Pairs	Time	Cost $
FZ-116	BAM	86M	47m	$0.05
FZ-116	FASTQ	86M	1h 22m	$0.09

Peat-Demo

This workspace demonstrates how to use Peat (external link) to save overhead by grouping jobs into fewer WDL scatter branches. To compare scatter with and without Peat, it includes two simple demo workflows that use WDL scatter: one with Peat and one without.

Scatter Without Peat

Performs a simple job (writing a line to a file) many times via simple WDL scatter, then additionally concatenates all files into a single output file.

Estimated time and cost to run on sample data (without Peat)

n_jobs	time	link	cost $
1000	0:30	link	2.99
1200	0:57	link	3.81
1500	1:00	link	5.08
2000	1:14	link	6.51

Scatter With Peat

Performs the same job, but using Peat to run multiple jobs on each WDL scatter branch, then additionally concatenates all files into a single output file.
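
The grouping idea can be illustrated with a short Python sketch. This is not Peat's implementation. The function name `group_jobs` and the contiguous, near-equal split are assumptions made only to show how many jobs end up on each scatter branch.

```python
def group_jobs(n_jobs, n_groups):
    """Split job indices 0..n_jobs-1 into n_groups contiguous, near-equal groups.

    Hypothetical helper for illustration; Peat may assign jobs differently.
    """
    base, extra = divmod(n_jobs, n_groups)
    groups = []
    start = 0
    for g in range(n_groups):
        # The first `extra` groups take one additional job each.
        size = base + (1 if g < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


groups = group_jobs(1000, 50)
print(len(groups), "scatter branches,", len(groups[0]), "jobs on the first branch")
```

Under this assumption, 1000 jobs in 50 groups means 50 scatter branches with 20 jobs each, instead of 1000 branches with one job each.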

Estimated time and cost to run on sample data (with Peat)

n_jobs	n_groups	time	link	cost $
1000	50	0:11	link	0.16
1200	50	0:32	link	0.24
1500	50	0:33	link	0.16
2000	50	0:29	link	0.16
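
To put the two Peat tables side by side, the sketch below divides the cost without Peat by the cost with Peat for each job count. It uses only the figures listed above; the variable names are illustrative.

```python
# Cost ($) by n_jobs, copied from the two tables above.
cost_without_peat = {1000: 2.99, 1200: 3.81, 1500: 5.08, 2000: 6.51}
cost_with_peat = {1000: 0.16, 1200: 0.24, 1500: 0.16, 2000: 0.16}

# How many times cheaper each run was with Peat.
savings_factor = {
    n: cost_without_peat[n] / cost_with_peat[n] for n in cost_without_peat
}

for n, factor in savings_factor.items():
    print(f"{n} jobs: about {factor:.1f}x cheaper with Peat")
```

For these sample runs, the grouped workflow costs roughly 16 to 41 times less, and the gap widens as the number of jobs grows.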

Intro to HCA data on Terra

This tutorial workspace is a step-by-step guide to importing, accessing, and analyzing standardized cell-by-gene count matrices (Loom format) from the Human Cell Atlas (HCA) Data Portal using community-supported single-cell analysis tools.

Estimated time and cost to run on sample data

Workflow/Notebook	Timing	Notes	Cost ($)
Cumulus workflow	18 min	Runs on entire matrix (5 donors)	0.16
Bioconductor notebook	~ 28 min	Runs on matrix subset (1 donor)	0.09
Pegasus notebook	~ 5 min	Runs on matrix subset (1 donor)	0.02
Scanpy notebook	~ 8 min	Runs on matrix subset (1 donor)	0.03
Seurat notebook	~ 21 min	Runs on matrix subset (1 donor)	0.07

InferCNV

A fully reproducible example workflow for inferring copy number from single-cell RNA sequencing data. See the InferCNV workspace.

Estimated time and cost to run on sample data

Time	Cost ($)
30 minutes	< $0.01

CRDC-Dynamic-Queries-for-NIH-Genomic-Data-Commons-Projects

This workspace shows you how to take a query result from the NCI Genomic Data Commons (GDC) data portal and use it as the input to a workflow (or Notebook) in Terra.

Estimated time and cost to run on sample data

File Name	Time	Cost $
htseq_counts.txt.gz	4m	<0.01

CTAT mutations

A fully reproducible example workflow for detecting variants from RNA sequencing data. Go to the workspace.

Estimated time and cost to run on sample data

Sample Name	Time	Cost $
test	3 hours, 17 minutes	$0.18

GATK Structural Variation on Single Samples

This integrated structural variation detection and resolution pipeline calls many forms of structural variation in whole genome sequencing data obtained from a single sample. The pipeline will identify, genotype, and annotate structural variation. Go to the workspace.

Estimated time and cost to run on sample data

Sample Name	Sample Size	Time	Cost $
NA12878	18.17 GiB	23hrs	~$7.71

Whole Genome Analysis Pipeline

This workspace contains fully reproducible example workflows for whole genome sequence data pre-processing, germline short variant discovery, and joint variant calling, as used for production by the Genomics Platform at the Broad Institute and recommended for research purposes.

Estimated time and cost to run on sample data

Sample Name	Time	Cost $
WGS_JointGenotyping	04:05:00	$7.93

CHIP Detection Mutect2

This workspace builds on the GATK4 somatic variant WDL workflow, Mutect2, enabling investigators to perform variant calling and filtering for CHIP data in a consistent and reproducible manner. Users who would benefit from this workspace include investigators interested in the biological implications of CHIP including its role in both malignant and non-malignant disease.

Estimated time and cost to run on sample data

Sample Name	Sample Size	Time	Cost $
SRS000030 (Mutated)	59.84 GB	1:50:00	$0.10
SRS000035	37.53 GB	1:49:00	$0.10

bisulfite-seq-tools-grch38

Workflows in this workspace can be used for alignment and quality control analysis for DNA methylation protocols including Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS), and Hybrid Selection Bisulfite Sequencing (HSBS).

Note: This workspace is pre-configured for GRCh38.

Viral Insertion Detection

This workspace demonstrates a proof-of-concept approach to viral insertion detection. It includes a pipeline for identifying viral reads found in a host organism and detecting potential insertion sites in a host's genome.

Estimated time and cost to run on sample data

Sample Name	Sample Size	Time	Cost $
1	753 KB	0:21	0.03

Exome Analysis Pipeline

This workspace contains fully reproducible example workflows for exome sequence data pre-processing, germline short variant discovery, and joint variant calling, as used for production by the Genomics Platform at the Broad Institute and recommended for research purposes.

Estimated time and cost to run on sample data

Sample Name	Number of Entities	Sample Size	Time	Cost $
NA12878	2	8.08 GB	06:21:00	~$0.64

Trinity

A fully reproducible example workflow for RNA-Seq de-novo assembly using Trinity. Go to the workspace.

Estimated time and cost to run on sample data

Number of reads	Time	Cost $
10 million	130 minutes	$1.40
50 million	360 minutes	$4.56

HCA_Optimus_Pipeline

The Optimus pipeline, developed in collaboration with the Human Cell Atlas Data Coordination Platform (HCA DCP) and the BRAIN Initiative Cell Census Network (BICCN), processes 3 prime single-cell or single-nucleus transcriptome data from the 10x Genomics v2 or v3 assay. This workspace currently describes v5.5.0 of the Optimus pipeline and provides fully reproducible examples of the workflow.

Estimated time and cost to run on sample data

Sample Set Name	Set Size	Sample Set R1.fastq Size	Sample Set R2.fastq Size	Time	Cost $
neurons2k_mouse	6 entities	88.26 MB	277.58 MB	1:22:00	0.09
pbmc4k_human	2 entities	26.84 MB	59.58 MB	1:14:00	0.16
pbmc_human_v3	2 entities	106.95 MB	220.04 MB	1:36:00	0.11

ENCODE-Tutorial-May-2020

Learn how to search, analyze, and visualize ENCyclopedia Of DNA Elements (ENCODE) data. The resources in this workspace cover binning ENCODE ChIP-seq datasets into non-overlapping 5 kb bins and determining the signal enrichment in each bin. More information about the ENCODE project can be found at https://www.encodeproject.org (external link).
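
As a rough illustration of non-overlapping binning, the Python sketch below assigns read start positions to fixed-width bins and counts reads per bin. It is not the workspace's workflow. The bin width of 5,000 bp, the 0-based positions, and the function name `bin_counts` are assumptions, and it stops at raw counts rather than computing signal enrichment.

```python
from collections import Counter

BIN_SIZE = 5_000  # assumed bin width in base pairs


def bin_counts(positions, bin_size=BIN_SIZE):
    """Count 0-based positions per non-overlapping bin.

    Bin i covers positions [i * bin_size, (i + 1) * bin_size).
    """
    counts = Counter(pos // bin_size for pos in positions)
    return dict(sorted(counts.items()))


# Made-up read start positions on a single chromosome.
positions = [120, 4_999, 5_000, 7_250, 14_999, 15_000]
print(bin_counts(positions))  # {0: 2, 1: 2, 2: 1, 3: 1}
```

Integer division places each position in exactly one bin, which is what makes the bins non-overlapping.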

Estimated time and cost to run on sample data

Workflow Name	Time to Run 1 File	Time to Run 100 Files	Cost for 1 File	Cost for 100 Files
PBS-bam	10-15 minutes	15-30 minutes	$0.03	< $3.15

Cumulus

This workspace is a showcase of Cumulus (external link), a cloud-based single-cell/single-nucleus data analysis framework. It uses a large-scale single-cell dataset and demonstrates Cumulus in both workflow and interactive analysis.

Estimated time and cost to run on sample data

Step	CPU	Memory	Time	Cost $
cellranger_workflow	32 * 8	120 GB * 8	1h34min	$2.65
cumulus	32	200 GB	22min	$0.17

2019 ASHG Reproducible GWAS (v2)

This workspace reproduces the steps in a genome-wide association study (GWAS), using 1,000 Genomes Project (phase 3) genotypes and simulated phenotypes.

The analysis is structured in two parts:

Explore phenotypes and population structure (Jupyter Notebook - Hail/Python)
Test for genetic associations using mixed-models and generate summary visualizations (WDL workflow)

Estimated time and cost to run on sample data

Sample Size	# Variants	Time	Cost $
2,500 samples	22,000	8m	$0.49
