Costs of selected featured workflows

Yashasvika Duggal

This article includes data storage and analysis cost estimates for selected featured workspaces. Actual time and cost will vary depending on the size of your dataset and whether you use preemptible VMs.  

Ultima Genomics whole genome germline

This workspace contains a fully reproducible example workflow for pre-processing germline whole-genome sequence data derived from the Ultima Genomics Platform.

Estimated time and cost to run on sample data

Workflow Configuration | Sample Name | Sample Size | Time | Cost ($)
Ultima_Genomics | downsampled_NA12878 | ~3.00 GB | 3 hr 46 min | 0.98
Ultima_Genomics | 004731-UGAv3-30-CTGCCAGACTGTGA | 55.62 GB | 26 hr | 15.56
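As a rough illustration of how cost tracks dataset size, the two runs above can be reduced to a cost per GB of input. This is a minimal sketch: the figures are copied from the table, the helper names `cost_per_gb` and `rough_estimate` are hypothetical, and the linear scaling in `rough_estimate` is an assumption rather than a documented pricing model.

```python
# (sample size in GB, cost in USD), copied from the Ultima Genomics table above.
# The first size is listed as approximate (~3.00 GB) in the table.
runs = {
    "downsampled_NA12878": (3.00, 0.98),
    "004731-UGAv3-30-CTGCCAGACTGTGA": (55.62, 15.56),
}


def cost_per_gb(size_gb: float, cost_usd: float) -> float:
    """Return the observed cost per GB of input for one run."""
    return cost_usd / size_gb


def rough_estimate(size_gb: float, rate_usd_per_gb: float) -> float:
    """Scale a per-GB rate to a new input size (assumes cost grows linearly)."""
    return size_gb * rate_usd_per_gb


rates = {name: cost_per_gb(size, cost) for name, (size, cost) in runs.items()}
for name, rate in rates.items():
    print(f"{name}: ${rate:.2f} per GB")
```

The two rates differ, so treat any extrapolation as a ballpark figure; preemptible VM usage and workflow configuration also change the result.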

CNest - Terra

This workspace runs CNest, a copy number estimator and variant caller developed for large-scale analysis of copy number from NGS data.

It primarily uses read depth information to generate robust copy number estimates for individual samples and is most appropriate for use in very large cohorts (minimum of 1000 samples).

Estimated time and cost to run on sample data

Sample Name | Sample Size | Time | Cost ($)
NA12878 | 64.89 GB | 3:05:00 | 0.68

GATK4 Germline Preprocessing Variant Calling Joint Calling

This tutorial workspace contains notebooks and workflows for pre-processing and SNP and Indel variant calling.

Estimated time and cost to run on sample data

Sample Name | Sample Size | Time | Cost ($)
NA12878_24RG_small | 3.11 GB | 1:28:00 | 0.19
NA12878 | 64.89 GB | 22:35:00 | 5.23
downsampled-1kgp-50-exomes | 32.13 GB | 02:07:00 | 7.65

Human-Pangenome-Giraffe-DeepVariant-AnVIL-ASHG-Jan22

This workspace demonstrates the Giraffe/DeepVariant pipeline for calling germline variants using the Human Pangenome Reference Consortium's (HPRC) year-one pangenome. It shows how to use an HPRC pangenome in AnVIL and Terra.

Estimated time and cost to run on sample data

Input Coverage | Time | Cost ($)
35X | 10 hours | 15.75

GEM Showcase

This workspace demonstrates a gene-environment interaction analysis pipeline on Terra using the software program GEM (Gene-Environment interaction analysis for Millions of samples).

Estimated time and cost to run on sample data

Analysis | Sample Size | # Variants | Time (CPU hrs) | Cost ($)
1KG genome-wide interaction study | 1656 | 13.5M | 1.94 | 0.40

DRAGEN-GATK whole genome germline pipeline

This workspace contains a fully reproducible example workflow for whole-genome germline sequence data pre-processing using the DRAGEN-GATK mode of the Whole Genome Germline Single Sample (WGS) Pipeline.

Estimated time and cost to run on sample data

Workflow Configuration | Sample Name | Number of Entities | Sample Size | Time | Cost ($)
Functional Equivalence | NA12878 | 24 | ~3.00 GB | 4 h 12 min | 0.90
Maximum Quality | NA12878 | 24 | ~3.00 GB | 4 h 7 min | 0.90

Functional Equivalence

This workflow evaluates functional equivalence so that researchers can combine results from multiple sources into larger datasets. Functional equivalence ensures that genomic data from different sources, processed with different pipelines, can be used interchangeably without risking batch effects.

Estimated time and cost to run on sample data

Sample Set | No. Replicates | Time | Cost ($)
HG002 | 3 | 90 min | 3.03

GATK4 RNA Germline Variant Calling

This workspace demonstrates how to call germline short variants (SNPs/Indels) from RNA-seq data using GATK v4.1 and related tools.

Estimated time and cost to run on sample data

Sample Name | Sample Size | Time | Cost ($)
NA12878 | 3.09 GB | 9:32:00 | 0.49

TRUST4

TCR Receptor Utilities for Solid Tissue (TRUST) is a computational tool for analyzing TCR and BCR sequences using unselected RNA sequencing data profiled from solid tissues, including tumors. TRUST4 performs de novo assembly on V, J, and C genes, including the hypervariable complementarity-determining region 3 (CDR3), and reports consensus BCR/TCR sequences. See the TRUST4 workspace.

Estimated time and cost to run on sample data

Sample | Format | Read Pairs | Time | Cost ($)
FZ-116 | BAM | 86M | 47m | 0.05
FZ-116 | FASTQ | 86M | 1h 22m | 0.09

Peat-Demo

This workspace demonstrates how to use Peat (external link) to save overhead by grouping jobs into fewer WDL scatter branches. To compare scatter with and without Peat, it includes two simple demo workflows that use WDL scatter: one with Peat and one without.

Scatter Without Peat

Performs a simple job (writing a line to a file) many times via simple WDL scatter, then additionally concatenates all files into a single output file.

Estimated time and cost to run on sample data (without Peat)

n_jobs | Time | Link | Cost ($)
1000 | 0:30 | link | 2.99
1200 | 0:57 | link | 3.81
1500 | 1:00 | link | 5.08
2000 | 1:14 | link | 6.51

Scatter With Peat

Performs the same job, but using Peat to run multiple jobs on each WDL scatter branch, then additionally concatenates all files into a single output file.
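The grouping idea can be sketched in a few lines of Python. This is a conceptual illustration only, not Peat's implementation or API: the function name `group_jobs` is hypothetical, and the split into contiguous, near-equal groups is an assumed strategy.

```python
def group_jobs(n_jobs: int, n_groups: int) -> list[range]:
    """Split job indices 0..n_jobs-1 into n_groups contiguous, near-equal ranges."""
    base, extra = divmod(n_jobs, n_groups)
    groups = []
    start = 0
    for i in range(n_groups):
        # The first `extra` groups take one additional job each.
        size = base + (1 if i < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


# 1000 jobs in 50 groups: each scatter branch would handle 20 jobs.
groups = group_jobs(1000, 50)
print(len(groups), len(groups[0]))
```

With this layout the scatter runs over 50 branches instead of 1000, so the per-branch overhead is incurred far fewer times.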

Estimated time and cost to run on sample data (with Peat)

n_jobs | n_groups | Time | Link | Cost ($)
1000 | 50 | 0:11 | link | 0.16
1200 | 50 | 0:32 | link | 0.24
1500 | 50 | 0:33 | link | 0.16
2000 | 50 | 0:29 | link | 0.16
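To put the two Peat tables side by side, the short sketch below computes the relative cost reduction for each job count. The costs are copied from the tables above; the helper name `savings_percent` is hypothetical.

```python
# n_jobs -> cost in USD, copied from the two Peat-Demo tables above.
cost_without_peat = {1000: 2.99, 1200: 3.81, 1500: 5.08, 2000: 6.51}
cost_with_peat = {1000: 0.16, 1200: 0.24, 1500: 0.16, 2000: 0.16}


def savings_percent(before: float, after: float) -> float:
    """Return the cost reduction as a percentage of the original cost."""
    return 100.0 * (before - after) / before


savings = {
    n: savings_percent(cost_without_peat[n], cost_with_peat[n])
    for n in cost_without_peat
}
for n, pct in savings.items():
    print(f"{n} jobs: {pct:.1f}% lower cost with Peat")
```

For these sample runs the reduction is above 90% at every job count, although the exact figures will vary from run to run.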

Intro to HCA data on Terra

This tutorial workspace is a step-by-step guide to importing, accessing, and analyzing standardized cell-by-gene count matrices (Loom format) from the Human Cell Atlas (HCA) Data Portal using community-supported single-cell analysis tools.

Estimated time and cost to run on sample data

Workflow/Notebook | Timing | Notes | Cost ($)
Cumulus workflow | 18 min | Runs on entire matrix (5 donors) | 0.16
Bioconductor notebook | ~28 min | Runs on matrix subset (1 donor) | 0.09
Pegasus notebook | ~5 min | Runs on matrix subset (1 donor) | 0.02
Scanpy notebook | ~8 min | Runs on matrix subset (1 donor) | 0.03
Seurat notebook | ~21 min | Runs on matrix subset (1 donor) | 0.07

InferCNV

A fully reproducible example workflow for inferring copy number from single-cell RNA sequencing data. See the InferCNV workspace.

Estimated time and cost to run on sample data

Time | Cost ($)
30 minutes | < 0.01

CRDC-Dynamic-Queries-for-NIH-Genomic-Data-Commons-Projects

This workspace shows you how to take a query result from the NCI Genomic Data Commons (GDC) data portal and use it as the input to a workflow (or Notebook) in Terra.

Estimated time and cost to run on sample data

File Name | Time | Cost ($)
htseq_counts.txt.gz | 4m | < 0.01

CTAT mutations

A fully reproducible example workflow for detecting variants from RNA sequencing data. Go to the workspace.

Estimated time and cost to run on sample data

Sample Name | Time | Cost ($)
test | 3 hours, 17 minutes | 0.18

GATK Structural Variation on Single Samples

This integrated structural variation detection and resolution pipeline calls many forms of structural variation in whole genome sequencing data obtained from a single sample. The pipeline identifies, genotypes, and annotates structural variation. Go to the workspace.

Estimated time and cost to run on sample data

Sample Name | Sample Size | Time | Cost ($)
NA12878 | 18.17 GiB | 23 hrs | ~7.71

Whole Genome Analysis Pipeline

This workspace contains fully reproducible example workflows for whole genome sequence data pre-processing, germline short variant discovery, and joint variant calling, as used for production by the Genomics Platform at the Broad Institute and recommended for research purposes.

Estimated time and cost to run on sample data

Sample Name | Time | Cost ($)
WGS_JointGenotyping | 04:05:00 | 7.93

CHIP Detection Mutect2

This workspace builds on the GATK4 somatic variant WDL workflow, Mutect2, enabling investigators to perform variant calling and filtering for clonal hematopoiesis of indeterminate potential (CHIP) data in a consistent and reproducible manner. It is intended for investigators interested in the biological implications of CHIP, including its role in both malignant and non-malignant disease.

Estimated time and cost to run on sample data

Sample Name | Sample Size | Time | Cost ($)
SRS000030 (Mutated) | 59.84 GB | 1:50:00 | 0.10
SRS000035 | 37.53 GB | 1:49:00 | 0.10

bisulfite-seq-tools-grch38

Workflows in this workspace can be used for alignment and quality control analysis for DNA methylation protocols including Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS), and Hybrid Selection Bisulfite Sequencing (HSBS).

Note: This workspace is pre-configured for GRCh38. 

Viral Insertion Detection

This workspace demonstrates a proof-of-concept approach to viral insertion detection. It includes a pipeline for identifying viral reads found in a host organism and detecting potential insertion sites in a host's genome.

Estimated time and cost to run on sample data

Sample Name | Sample Size | Time | Cost ($)
1 | 753 KB | 0:21 | 0.03

Exome Analysis Pipeline

This workspace contains fully reproducible example workflows for exome sequence data pre-processing, germline short variant discovery, and joint variant calling, as used for production by the Genomics Platform at the Broad Institute and recommended for research purposes.

Estimated time and cost to run on sample data

Sample Name | Number of Entities | Sample Size | Time | Cost ($)
NA12878 | 2 | 8.08 GB | 06:21:00 | ~0.64

Trinity

A fully reproducible example workflow for RNA-Seq de novo assembly using Trinity. Go to the workspace.

Estimated time and cost to run on sample data

Number of Reads | Time | Cost ($)
10 million | 130 minutes | 1.40
50 million | 360 minutes | 4.56

HCA_Optimus_Pipeline

The Optimus pipeline, developed in collaboration with the Human Cell Atlas Data Coordination Platform (HCA DCP) and the BRAIN Initiative Cell Census Network (BICCN), processes 3 prime single-cell or single-nucleus transcriptome data from the 10x Genomics v2 or v3 assay. This workspace currently describes v5.5.0 of the Optimus pipeline and provides fully reproducible examples of the workflow.

Estimated time and cost to run on sample data

Sample Set Name | Set Size | Sample Set R1.fastq Size | Sample Set R2.fastq Size | Time | Cost ($)
neurons2k_mouse | 6 entities | 88.26 MB | 277.58 MB | 1:22:00 | 0.09
pbmc4k_human | 2 entities | 26.84 MB | 59.58 MB | 1:14:00 | 0.16
pbmc_human_v3 | 2 entities | 106.95 MB | 220.04 MB | 1:36:00 | 0.11

ENCODE-Tutorial-May-2020

Learn how to search, analyze, and visualize ENCyclopedia Of DNA Elements (ENCODE) data. The resources in this workspace cover binning ENCODE ChIP-seq datasets into non-overlapping 5 kb bins and determining the signal enrichment in each bin. More information about the ENCODE project can be found at https://www.encodeproject.org (external link).
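The binning step described above can be pictured with a small Python sketch. It is not the workspace's own workflow: the function names are hypothetical, the read positions are made up, and defining enrichment as a ChIP-to-control count ratio with a pseudocount is an assumption chosen for illustration.

```python
from collections import Counter

BIN_SIZE = 5_000  # 5,000-base bins, matching the bin width described above


def bin_index(position: int, bin_size: int = BIN_SIZE) -> int:
    """Map a 0-based genomic coordinate to its non-overlapping bin."""
    return position // bin_size


def count_reads_per_bin(read_starts: list[int], bin_size: int = BIN_SIZE) -> Counter:
    """Count how many read start positions fall in each bin."""
    return Counter(bin_index(p, bin_size) for p in read_starts)


def enrichment(chip: Counter, control: Counter, pseudocount: float = 1.0) -> dict[int, float]:
    """Per-bin ratio of ChIP to control counts; the pseudocount avoids division by zero."""
    bins = set(chip) | set(control)
    return {b: (chip[b] + pseudocount) / (control[b] + pseudocount) for b in sorted(bins)}


# Made-up read start positions on a single chromosome, for illustration only.
chip_counts = count_reads_per_bin([100, 4_999, 5_000, 5_200, 5_300, 12_000])
control_counts = count_reads_per_bin([200, 5_100, 12_500])
ratios = enrichment(chip_counts, control_counts)
print(ratios)
```

Because the bins do not overlap, integer division by the bin size assigns each position to exactly one bin.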

Estimated time and cost to run on sample data

Workflow Name | Time to Run 1 File | Time to Run 100 Files | Cost for 1 File (range) | Cost for 100 Files
PBS-bam | 10-15 minutes | 15-30 minutes | $0.03 | < $3.15

Cumulus

This workspace is a showcase of Cumulus (external link), a cloud-based single-cell/single-nucleus data analysis framework. It uses a large-scale single-cell dataset and demonstrates Cumulus for both workflow-based and interactive analysis.

Estimated time and cost to run on sample data

Step | CPU | Memory | Time | Cost ($)
cellranger_workflow | 32 * 8 | 120 GB * 8 | 1h34min | 2.65
cumulus | 32 | 200 GB | 22min | 0.17

2019 ASHG Reproducible GWAS (v2)

This workspace reproduces the steps in a genome-wide association study (GWAS), using 1000 Genomes Project (phase 3) genotypes and simulated phenotypes.

The analysis is structured in two parts:

  1. Explore phenotypes and population structure (Jupyter Notebook - Hail/Python)
  2. Test for genetic associations using mixed-models and generate summary visualizations (WDL workflow)

Estimated time and cost to run on sample data

Sample Size | # Variants | Time | Cost ($)
2,500 samples | 22,000 | 8m | 0.49
