Start with curated sample workspaces
One of the best ways to get started in Terra is to explore Featured Workspaces: curated templates that span a variety of use cases. Standardized for completeness and ease of use, they work well as templates, and they can help you reproduce instructive results and learn established methodologies. Each workspace description should give you enough detail to analyze the included sample data, and cost and time estimates give you the confidence to run on your own data, if you want.
You'll find these resources in the Library - Showcase & Tutorials (access using the navigation menu at the top left of any screen in Terra).
Below is a list of available Featured Workspaces grouped by subject. Workspaces that include notebooks indicate the language they use in parentheses (e.g., Py3). If a title doesn't reveal enough, you can scroll down for a short description.
Contents
Intro or tutorial workspaces
Analysis-focused workspaces
Data-focused workspaces (more coming soon!)
Workflow (pipeline)-focused workspaces
GATK best practices showcase workspaces
GATK Workshop Tutorials
Intro or tutorial workspaces
Need to get up to speed on particular analysis packages or fundamentals? You'll find lots of useful, primarily notebook-based analysis tools in these workspaces. QuickStart tutorial workspaces include hands-on exercises to get you started, wrapped as complete packages with sample data and information on cost.
- Terra Workflows Quickstart
- Terra Notebooks QuickStart (does not require R or Python coding experience; includes an optional Intro to Jupyter notebooks)
- Jupyter Notebooks 101 (does not require R or Python coding)
- Terra Notebooks Playground (includes both R and Py3 notebooks)
- PyGMQL playground
- GMQL 101
- Hail Notebooks Tutorial (Py3 notebook)
- Reproducing the paper Tetralogy of Fallot (includes a cluster analysis in an R-based notebook)
Terra Notebooks QuickStart: In this tutorial workspace, you'll get hands-on practice accessing and analyzing data in the Data Library:
- Browse 1000 Genomes data in the Data Library and define a subset of data (a cohort) for analysis
- Import the cohort from the Terra Data Library to the workspace
- Set up a Jupyter notebook virtual application to analyze the data
- Analyze the cohort of data in an Interactive Jupyter notebook
Jupyter Notebooks 101: Maybe you have heard of Jupyter notebooks and you're interested in using them for interactive analysis on large amounts of data. This workspace explains 1) what notebooks are and how to use them in biomedical research; 2) the relationship between a notebook and a workspace; and 3) Jupyter notebook basics: how to use a notebook, install packages, and import modules.
Terra Notebooks Playground: This workspace contains a set of Jupyter notebooks that let users play with the interactive functionality of Jupyter, a web-based application that supports code in a variety of languages (R and Python, among others). The notebooks are organized into two categories, R and Python, and streamline interaction with cloud-based resources.
PyGMQL Playground: An introduction to the GenoMetric Query Language (GMQL) and its engine. GMQL is a query language designed to handle tertiary genomic data. It operates downstream of raw-data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples. Built on the Hadoop framework and the Apache Spark platform, GMQL offers high scalability, expressivity, flexibility, and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets.
GMQL 101: An introduction to the GenoMetric Query Language and its integration inside the FireCloud infrastructure and pipelines. The workspace will offer three different methods of increasing complexity.
Hail notebook tutorial: Hail provides an open-source framework to analyze the largest genetic data sets in existence and to meet the exploding needs of hospitals, diagnostic labs, and industry. Hail's efficient and scalable framework currently powers dozens of major academic studies. This workspace is an introductory tutorial that demonstrates the basics of using the Hail package. It does not currently support Hail’s high-performance features and should not be used to attempt to run Hail at large scale.
Reproducing the paper: Variant analysis of Tetralogy of Fallot: This workspace reproduces a classic example of a study to understand the genetics that underlie a particular phenotype, described by Matthieu Miossec and collaborators in the bioRxiv preprint "Deleterious genetic variants in NOTCH1 are a major contributor to the incidence of non-syndromic Tetralogy of Fallot (ToF)". The workspace reproduces all steps in the study as closely as possible, from processing the raw data (BAM) files, to calling variants, to the clustering analysis that led to the final result. It serves as a template of best practices for making your own work easily reproducible, with a detailed explanation of how the ToF study was reproduced on a cloud-based analysis platform. Sample data and notebooks allow users to reproduce the process themselves.
Analysis-focused workspaces
These workspaces focus on a particular analysis, including published papers.
- ASHG 2019 - Reproducible GWAS v2
- Waddington-OT
- TOSC19 - Variant Spark
- idap - SNVs and Indels on tumor-normal samples
- Cloud-based RNA sequencing (scRNA-Seq)
- STAR Fusion Transcript Detection
ASHG 2019 - Reproducible GWAS v2: The analysis is structured in two parts:
- Explore phenotypes and population structure (Jupyter Notebook - Hail/Python)
- Generate mixed-models for genetic association tests and visualizations (WDL workflow)
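To give a feel for the second step, here is a minimal, purely illustrative sketch of a per-variant association test. This is not the workspace's actual code: the real analysis uses mixed models (via a WDL workflow) that also correct for relatedness and population structure, and all names and data below are made up.

```python
# Illustrative sketch only: compare mean phenotype between carriers and
# non-carriers of a variant. Real GWAS mixed models do much more.

def simple_association(genotypes, phenotypes):
    """genotypes: list of 0/1/2 allele counts, one per sample.
    phenotypes: list of quantitative trait values, same order.
    Returns the difference in mean phenotype, carriers vs. non-carriers."""
    carriers = [p for g, p in zip(genotypes, phenotypes) if g > 0]
    noncarriers = [p for g, p in zip(genotypes, phenotypes) if g == 0]
    if not carriers or not noncarriers:
        return 0.0  # variant is monomorphic in this sample
    return (sum(carriers) / len(carriers)
            - sum(noncarriers) / len(noncarriers))

# Hypothetical toy data: 6 samples, one variant
genos = [0, 0, 1, 1, 2, 2]
phenos = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
effect = simple_association(genos, phenos)
```

A large effect size here would flag the variant for closer inspection; the workspace's Hail/Python notebook and WDL workflow handle this at genome scale.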
Waddington-OT: This tutorial provides a practical, hands-on introduction to inferring developmental trajectories with Waddington-OT. Single cell RNA-sequencing allows us to profile the diversity of cells along a developmental time-course by recording static snapshots at different time points. However, we cannot directly observe the progression of any individual cell over time because the measurement process is destructive... Waddington-OT is designed to infer the temporal couplings of a developmental stochastic process from samples collected independently at various time-points.
TOSC19 Variant Spark: VariantSpark is a machine learning library for real-time genomic data analysis (for thousands of samples and millions of variants). VariantSpark is:
- Built on top of Apache Spark and written in Scala
- Authored by the team at CSIRO Bioinformatics in Australia
- Based on a custom random forest implementation that finds the variants contributing most strongly to a phenotype of interest
idap - SNVs and Indels in tumor-normal samples: A reproducible workflow integrating the analysis results from Mutect2 and VarScan2 to detect single-nucleotide mutations and small insertions and deletions in paired tumor-normal samples.
Cloud-based scRNA Sequencing: Reproduces major steps in the published analysis "Nuclei multiplexing with barcoded antibodies for single-nucleus genomics" (Nature Communications). Three workflows:
- Generate RNA gene-count and hashtag count matrices (Cellranger Count)
- Demultiplex nucleus-hashing data based on the hashtag count matrix (demuxEM)
- Process the demultiplexed singlets for single-nucleus RNA-Seq analysis (including quality-control, dimension reduction, clustering analysis, and visualization) (cumulus)
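The demultiplexing step above can be pictured with a simplified sketch. demuxEM actually fits an EM model, but the core idea is assigning each nucleus to the sample whose hashtag dominates its count vector; the function name, sample labels, and threshold below are invented for illustration.

```python
# Illustrative sketch only -- not the demuxEM algorithm. Assigns a
# nucleus to a sample when one hashtag clearly dominates its counts.

def assign_sample(hashtag_counts, min_fraction=0.8):
    """hashtag_counts: dict mapping sample hashtag -> UMI count for one
    nucleus. Returns the dominant sample, or None if the signal is
    ambiguous (e.g., a doublet or background noise)."""
    total = sum(hashtag_counts.values())
    if total == 0:
        return None
    tag, count = max(hashtag_counts.items(), key=lambda kv: kv[1])
    return tag if count / total >= min_fraction else None

singlet = assign_sample({"sampleA": 95, "sampleB": 5})   # dominated by A
doublet = assign_sample({"sampleA": 50, "sampleB": 50})  # ambiguous
```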
STAR Fusion: A fully reproducible example workflow for fusion transcript detection. STAR-Fusion, a component of the Trinity Cancer Transcriptome Analysis Toolkit (CTAT), uses the STAR aligner to identify candidate fusion transcripts supported by Illumina reads. STAR-Fusion further processes the output generated by the STAR aligner to map junction reads and spanning reads to a reference annotation set.
Data-focused workspaces (more coming soon!)
These workspaces highlight specific data sets and include examples of how to access and process data in those workspaces. Data sets can include public-access data with a broad range of audiences and use cases and restricted-access data for specific research groups.
- Introduction to the TARGET dataset
- ENCODE tutorial (includes R notebook)
- Introduction to TCGA dataset
- HCA Optimus pipeline
ENCODE tutorial: Learn how to search, analyze, and visualize ENCyclopedia Of DNA Elements (ENCODE) data. The resources in this workspace cover binning ENCODE ChIP-seq datasets into non-overlapping 5 kb bins and determining the signal enrichment in each bin. More information about the ENCODE project is available at https://www.encodeproject.org/.
From the ENCODE Terra tutorial given in May 2019, the workspace includes the following steps:
- How to access and import selected ENCODE data from the Data Explorer
- How to use a workflow tool to calculate the Probability of Being Signal (PBS) to indicate the presence of H3K27ac histone marks
- How to identify regions of interest by plotting a comparison between BED files generated by the PBS Tool in a Jupyter Notebook
- How to zero in on regions of interest by visualizing tracks in IGV in the web browser
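The binning idea behind the enrichment step can be sketched in a few lines. This is an assumed simplification, not the workspace's PBS Tool: it assigns each signal interval to the 5 kb bin containing its midpoint and sums the signal per bin.

```python
# Illustrative sketch: bin ChIP-seq signal intervals on one chromosome
# into non-overlapping 5 kb bins (midpoint assignment is a simplification;
# real tools typically split overlapping intervals proportionally).

BIN_SIZE = 5_000

def bin_signal(intervals):
    """intervals: list of (start, end, signal) tuples on one chromosome.
    Returns {bin_start: total_signal}."""
    bins = {}
    for start, end, signal in intervals:
        mid = (start + end) // 2
        bin_start = (mid // BIN_SIZE) * BIN_SIZE
        bins[bin_start] = bins.get(bin_start, 0.0) + signal
    return bins

# Hypothetical intervals: two fall into the 5,000-9,999 bin
bins = bin_signal([(100, 300, 1.5), (4_900, 5_200, 2.0), (7_000, 7_400, 1.0)])
```

Bins with unusually high totals relative to background would then be candidates for H3K27ac enrichment, which is what the PBS comparison above quantifies.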
Intro to TCGA data: Practice accessing and analyzing TCGA data with basic analysis tools. Data-processing workflows let you create a panel-of-normals VCF (1-Mutect2_PON) and then use that VCF to perform somatic SNP and indel calling (Mutect2_GATK4).
HCA Optimus Pipeline: The Optimus pipeline, developed by the Data Coordination Platform of the Human Cell Atlas (HCA DCP), processes 3' single-cell transcriptome data from the 10X Genomics v2 (and v3) assay. This workspace describes the pipeline and provides a fully reproducible example of the workflow.
Workflow (pipeline)-focused workspaces
These workspaces showcase workflows and tools for general use by the genomics community. Many contain tools developed at and supported by Broad.
- Sequence format conversion
- HCA Optimus pipeline
- DNA methylation pre-processing
- InferCNV SCP scRNA seq
HCA Optimus Pipeline: See the summary under Data-focused workspaces above.
DNA methylation pre-processing: A suite of tools for methylation data analysis. Methods from this workspace can be used for alignment and quality-control analysis for various protocols, including Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS), and Hybrid Selection Bisulfite Sequencing (HSBS).
InferCNV SCP scRNAseq: The purpose of inferCNV is to explore tumor single-cell RNA-seq data for evidence of copy number variations, such as deletion or gain of large segments (or the entirety) of chromosomes. inferCNV compares the expression intensity of genes across positions of the tumor genome with that of reference 'normal' cells. The resulting heatmap illustrates relative expression intensities across each chromosome, depicting regions of the tumor genome with higher or lower expression than in 'normal' cells...
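The core comparison can be illustrated with a toy sketch. This is not the inferCNV implementation (which smooths over sliding windows of genes and applies further normalization); it just shows the basic quantity: each tumor cell's expression relative to the mean of reference normal cells, gene by gene along the chromosome.

```python
# Illustrative sketch only: tumor expression minus the per-gene mean of
# reference "normal" cells. inferCNV adds windowing and normalization.

def relative_expression(tumor, normals):
    """tumor: per-gene expression values for one tumor cell, ordered by
    genomic position. normals: list of such lists for reference cells.
    Returns tumor minus the per-gene normal mean."""
    n_genes = len(tumor)
    normal_mean = [sum(cell[i] for cell in normals) / len(normals)
                   for i in range(n_genes)]
    return [t - m for t, m in zip(tumor, normal_mean)]

# Hypothetical data: 3 genes, 1 tumor cell, 2 normal cells.
# A run of consistently positive values along a chromosome would
# suggest a copy-number gain in that region.
rel = relative_expression([5.0, 1.0, 3.0], [[2.0, 1.0, 3.0], [4.0, 1.0, 3.0]])
```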
Variant Calling with Spark Multicore: This workspace highlights a pipeline for calling variants from aligned input data on a single multicore machine. The pipeline, ReadsPipelineSpark, includes the following GATK "Best Practices" tools:
- Mark Duplicates
- BQSR
- Haplotype Caller
GATK4 Best Practices showcase workspaces
Working examples of best practices GATK workflows, along with sample data and time and cost estimates.
- Variant Calling Spark Multicore
- Germline SNPs and Indels GATK4 hg38
- Exome analysis pipeline
- Somatic CNV discovery - GATK4
- Somatic SNVs and Indels discovery - GATK4
- RNA-Germline variant calling
- Whole Genome Analysis pipeline
- CNN variant filter
- Germline SNPs and Indels GATK4 hg37
- SNP/Indel calling in mitochondria
SNPs & Indels GATK4 hg38: Processing For Variant Discovery, HaplotypeCallerGVCF, and Joint Discovery workflows. These workflows make up the Pre-processing and Variant Discovery portions of the Best Practices for Germline SNP & Indel Discovery. The reference genome is hg38.
Exome analysis pipeline: A fully reproducible example workflow for exome sequence data pre-processing and germline short variant discovery.
Somatic CNV discovery - GATK4: The variant discovery portion of GATK CNV; one workflow creates a panel of normals and a second runs the GATK CNV pipeline on a matched pair with Oncotator. Detailed descriptions of the workflows are available in GATK's Best Practices Document.
Single-sample Somatic SNVs and Indels: A fully reproducible example of somatic SNV and indel variant discovery using the Mutect2 workflow. Also includes a validation WDL for users planning to edit the WDL or workflows. A detailed description of the workflow is available in GATK's Best Practices Document.
Germline variant calling in RNAseq: Best Practices WDL workflow calls germline short variants (SNPs/Indels) from RNAseq data using GATK v4.1 and related tools.
Whole Genome Germline SNPs and Indels: A fully reproducible example of data pre-processing and germline short variant discovery. This is the production version of the pipeline, which contains several quality-control tasks within the workflow in addition to the regular data processing. The workflow takes unmapped paired-end sequencing data (unmapped BAM format) and returns a GVCF and other metrics ready for joint genotyping.
CNN and variant filter: A fully reproducible example of a workflow filtering variants using GATK CNN. Supplemental workflows have been added for advanced users looking to generate and evaluate their own training model. Please read the following discussion to learn more about the CNN tool: Deep Learning in GATK4.
Germline SNPs & Indels GATK4 hg37: Processing For Variant Discovery, HaplotypeCallerGVCF, and Joint Discovery workflows. These workflows make up the Pre-processing and Variant Discovery portions of the Best Practices for Germline SNP & Indel Discovery. The reference genome is hg37.
SNP/Indel calling in mitochondria - hg38: A fully reproducible example of mitochondrial SNP and indel variant calling (including low allele frequencies of 1-5%) from whole-genome sequencing data. The pipeline takes into consideration inherent mitochondrial characteristics, such as the circular DNA structure and the presence of NuMTs (nuclear mitochondrial DNA segments that have transposed into the nuclear genome of eukaryotic organisms).
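To make the "low allele frequency" idea concrete, here is a toy sketch, not the pipeline's Mutect2-based logic: it flags sites whose alternate-allele read fraction falls in the 1-5% heteroplasmy range mentioned above. The thresholds and data are assumptions for illustration only.

```python
# Illustrative sketch only: flag mitochondrial sites with a low but
# non-trivial alternate-allele fraction (possible low-level heteroplasmy).
# Real callers also model strand bias, base quality, and NuMT artifacts.

def low_heteroplasmy_sites(site_counts, lo=0.01, hi=0.05):
    """site_counts: dict of position -> (ref_reads, alt_reads).
    Returns positions whose alt-allele fraction lies within [lo, hi]."""
    flagged = []
    for pos, (ref, alt) in site_counts.items():
        depth = ref + alt
        if depth == 0:
            continue  # no coverage, nothing to call
        af = alt / depth
        if lo <= af <= hi:
            flagged.append(pos)
    return flagged

# Hypothetical read counts at three mtDNA positions: only position 302
# (allele fraction 2%) falls in the low-heteroplasmy range.
sites = low_heteroplasmy_sites({302: (980, 20), 750: (500, 500), 1438: (999, 1)})
```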
GATK Workshop Tutorials
These are the workspaces we use in our popular 4-day GATK Bootcamp Workshops. Updated with interactive Jupyter notebooks, they are intended to include enough documentation that you can run them on your own, or recommend them to friends and colleagues who weren't able to attend a workshop.
GATK Tutorials - Somatic: Day 3 of the Genome Analysis Toolkit (GATK) workshop focuses on somatic variant discovery. This workspace covers two forms of somatic analysis: one comparing tumor and normal samples with the Mutect2 workflow to find variant differences, and another using the Copy Number Alterations (CNA) workflow for copy-number variations.
GATK Tutorials - Pipelining: Day 4 of the Genome Analysis Toolkit (GATK) workshop. After an introduction to WDL and Cromwell, you'll practice building a workspace from scratch using this empty workspace.