One of the best ways to get started in Terra is to explore Featured Workspaces: curated workspaces that span a variety of use cases and are designed for replication. They are standardized for completeness and ease of use, and they enable users to reproduce instructive results and learn established methodologies. Each workspace description should have enough detail to let you run the analysis on the included sample data, and cost and time estimates give you the confidence to run it on your own data.
You will find these resources in the Library - Showcase & Tutorials (access using the hamburger menu at the top left of your screen in Terra).
Below is a list of available Featured Workspaces, grouped by subject. Workspaces that include template or tutorial notebooks are noted in parentheses, along with the notebook language (e.g. Py3 notebook).
- Jupyter Notebooks 101 (does not require R or Python coding)
- Terra Notebooks Playground (includes both R and Py3)
- GMQL 101
- Hail tutorial (Py3 notebook)
- Reproducing the paper Tetralogy of Fallot (Includes a cluster analysis in an R-based notebook)
- Terra BigQuery hands on (R)
- Probability of Being Signal comparison notebook (R)
Data-focused workspaces (more coming soon!)
- Introduction to the TARGET dataset
- ENCODE tutorial (includes R notebook)
Workflow (pipeline)-focused workspaces
- Seq format conversion
- HCA Optimus pipeline
- DNA methylation pre-processing
- InferCNV SCP scRNA seq
GATK4 showcase workspaces
- Germline SNPs and Indels GATK4 hg38
- Somatic CNV discovery
- GATK Best Practices for Single Tumor-Normal Pair or Single Tumor Sample
- CNN variant filter
- Pre-processing b37 v3
- SNP/Indel calling in mitochondria
- Five dollar genome analysis pipeline
- Pre-processing hg38 v2
Notebooks-based analysis and intro or tutorial workspaces
Need to get up to speed on particular analysis packages or fundamentals? You'll find lots of useful (mostly notebooks-based) analysis tools in these workspaces. You'll also find hands-on exercises to get you started, wrapped as complete packages with sample data and information on runtime cost.
Jupyter Notebooks 101: Maybe you have heard of Jupyter notebooks and you're interested in using them to do interactive analysis on large amounts of data. This workspace will help explain 1) what notebooks are and how to use them in biomedical research; 2) the relationship between a notebook and a workspace; 3) Jupyter Notebook basics: how to use a notebook, install packages, and import modules.
Terra Notebooks Playground: This workspace contains a set of Jupyter Notebooks that allow users to play with the interactive functionality of Jupyter notebooks, a web-based application that supports code in a variety of languages (R and Python, among others). Notebooks are organized into two categories, R and Python, and streamline interaction with cloud-based resources.
GMQL 101: An introduction to the GenoMetric Query Language and its engine. GMQL is a query language designed to handle tertiary genomic data. It operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples. Based on the Hadoop framework and the Apache Spark platform, GMQL ensures high scalability, expressivity, flexibility, and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets.
Hail notebook tutorial: Hail provides an open-source framework to analyze the largest genetic data sets in existence and to meet the exploding needs of hospitals, diagnostic labs, and industry. Hail's efficient and scalable framework currently powers dozens of major academic studies. This workspace is an introductory tutorial that demonstrates the basics of using the Hail package. It does not currently support Hail’s high-performance features and should not be used to attempt to run Hail at large scale.
Terra BigQuery hands-on: A hands-on notebook to generate a cohort in Data Explorer, export it to a workspace, and access and analyze it in a Jupyter notebook. A second notebook walks through two ways of importing a query to a workspace for downstream analysis, demonstrating how to analyze data in real time using BigQuery and Jupyter notebooks.
Reproducing the paper: Variant analysis of Tetralogy of Fallot: This workspace reproduces a classic example of a study to understand the genetics that underlie a particular phenotype, described by Matthieu Miossec and collaborators in the bioRxiv preprint "Deleterious genetic variants in NOTCH1 are a major contributor to the incidence of non-syndromic Tetralogy of Fallot (ToF)". The workspace reproduces all steps in the study as closely as possible, from processing the raw data (BAM) files, to calling variants, to the clustering analysis that led to the final result. The workspace serves as a template of best practices for making your own work easily reproducible, with a detailed explanation of how we reproduced the ToF study using a cloud-based analysis platform. Sample data and notebooks allow users to reproduce the process themselves.
PBS comparison analysis (ENCODE data): Learn how to search, analyze, and visualize ENCyclopedia Of DNA Elements (ENCODE) data. The resources in this workspace cover binning ENCODE ChIP-seq datasets into non-overlapping 5 kb bins and determining the signal enrichment in each bin. In addition to workflows that calculate the Probability of Being Signal (PBS), a notebook analyzes PBS workflow output to identify regions of interest.
Data-focused workspaces (more coming soon!)
These highlight specific data sets and include examples of how to access and process data in those workspaces. Data sets can include public-access data with a broad range of audiences and use cases, as well as restricted-access data for specific research groups.
Introduction to TARGET dataset: Practice retrieving data from the Genomic Data Commons Data Portal and running tools that use the TARGET dataset to create a panel-of-normals VCF (2-CNV_Somatic_Panel), then use that VCF to identify somatic copy number variations (3-CNV_Somatic_Pair).
TCGA: Practice accessing and analyzing TCGA data with basic analysis tools. Data processing workflows allow you to create a panel-of-normals VCF (1-Mutect2_PON) and then use that VCF to perform somatic SNP and indel calling (Mutect2_GATK4).
ENCODE: This workspace outlines how to access and analyze ENCODE ChIP-seq data. Steps include 1) exploring and exporting data in the ENCODE data explorer, 2) binning the data into non-overlapping 5 kb bins and determining the signal enrichment in each bin using a WDL tool, 3) using the output from step 2 to identify 5 kb regions of interest, and 4) zeroing in on these regions using IGV in a web browser.
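The workspace performs the binning step with a WDL workflow; as a rough illustration of the underlying idea, here is a minimal pure-Python sketch. The function name and the proportional-splitting rule are my own simplifications, not the workspace's actual implementation.

```python
from collections import defaultdict

BIN_SIZE = 5_000  # non-overlapping 5 kb bins, as in the workspace description

def bin_signal(intervals):
    """Sum ChIP-seq signal into non-overlapping 5 kb genomic bins.

    `intervals` is an iterable of (chrom, start, end, signal) tuples,
    e.g. rows parsed from a bedGraph file. Signal from an interval that
    spans a bin boundary is split proportionally to the overlap.
    """
    bins = defaultdict(float)
    for chrom, start, end, signal in intervals:
        per_base = signal / (end - start)  # spread signal evenly over the interval
        pos = start
        while pos < end:
            bin_start = (pos // BIN_SIZE) * BIN_SIZE
            chunk_end = min(end, bin_start + BIN_SIZE)
            bins[(chrom, bin_start)] += per_base * (chunk_end - pos)
            pos = chunk_end
    return dict(bins)

# Toy example: one interval entirely inside a bin, one spanning a bin boundary.
binned = bin_signal([
    ("chr1", 1_000, 2_000, 10.0),   # falls entirely in bin [0, 5000)
    ("chr1", 4_000, 6_000, 20.0),   # split across bins [0, 5000) and [5000, 10000)
])
```

Bins with high summed signal are the candidates for "regions of interest" that step 4 inspects in IGV.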
Workflow (pipeline)-focused workspaces
These workspaces showcase workflows and tools for general use by the genomics community. Many contain tools developed at and supported by the Broad Institute.
Seq format conversion: Provides users with example WDLs for converting sequence data file formats for downstream analysis. Conversion methods include Paired FASTQ to Unmapped BAM, BAM to Unmapped BAM, and CRAM to BAM. The Validate BAM method is also included for confirming that SAM or BAM files are properly formatted.
HCA Optimus Pipeline: The Optimus pipeline, developed by the Data Coordination Platform of the Human Cell Atlas (HCA DCP), processes 3' single-cell transcriptome data from the 10X Genomics v2 (and v3) assay. This workspace describes the pipeline and provides a fully reproducible example of the workflow.
DNA methylation pre-processing: Suite of tools to conduct methylation data analysis. Methods from this workspace can be used for alignment and quality control analysis for various protocols, including Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS), and Hybrid Selection Bisulfite Sequencing (HSBS).
InferCNV SCP scRNAseq: The purpose of inferCNV is to explore tumor single-cell RNA-seq data to identify evidence for copy number variations, such as deletion and gain of entire and/or large segments of chromosomes. inferCNV compares the expression intensity of genes across positions of the tumor genome to reference 'normal' cells. A resulting heatmap illustrates the relative expression intensities across each chromosome, depicting higher or lower expression of regions of the tumor genome compared to expression in 'normal' cells.
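The core quantity inferCNV visualizes is expression relative to reference cells, with genes ordered along the genome. The sketch below shows only that relative-expression step in plain Python; inferCNV's actual implementation adds smoothing along chromosomes and noise filtering, and the gene names and pseudocount here are illustrative assumptions.

```python
import math

def relative_expression(tumor, normals):
    """For each gene (assumed pre-sorted by chromosomal position), compute
    log2 expression in a tumor cell relative to the mean of reference
    'normal' cells. Positive values suggest gain, negative values loss.

    `tumor` maps gene -> expression count; `normals` is a list of such
    dicts for reference cells. A pseudocount of 1 avoids log of zero.
    """
    rel = {}
    for gene, t in tumor.items():
        ref_mean = sum(n[gene] for n in normals) / len(normals)
        rel[gene] = math.log2(t + 1) - math.log2(ref_mean + 1)
    return rel

# Hypothetical genes: GENE_A looks amplified in the tumor cell, GENE_B does not.
rel = relative_expression(
    tumor={"GENE_A": 30, "GENE_B": 7},
    normals=[{"GENE_A": 7, "GENE_B": 7}, {"GENE_A": 9, "GENE_B": 7}],
)
```

Runs of adjacent genes with consistently elevated (or depressed) relative expression along a chromosome are what the heatmap renders as candidate copy number gains or losses.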
GATK4 showcase workspaces
Working examples of best practices GATK workflows, along with sample data and time and cost estimates.
Germline SNPs & Indels GATK4 hg38: Includes the Processing For Variant Discovery, HaplotypeCallerGVCF, and Joint Discovery workflows. These workflows make up the Pre-processing and Variant Discovery portions of the Best Practices for Germline SNP & Indel Discovery.
Somatic CNV discovery: A fully reproducible example of the somatic copy number variation workflow, which makes up the Variant Discovery portion of the Best Practices for Somatic CNV Discovery.
GATK Best Practices for Single Tumor-Normal Pair or Single Tumor Sample: A fully reproducible example of the Mutect2 workflow, which makes up the Variant Discovery portion of the Best Practices for Somatic SNV and Indel Discovery.
CNN variant filter: A fully reproducible example of a workflow filtering variants using GATK CNN. Supplemental workflows have been added for advanced users looking to generate and evaluate their own training model. Please read the following discussion to learn more about the CNN tool: Deep Learning in GATK4.
Pre-processing b37 v3: A fully reproducible example of data Pre-processing. These workflows make up the Pre-processing portion of the Best Practices for Germline SNP & Indel Discovery.
SNP/Indel calling in mitochondria: A fully reproducible example of mitochondrial SNP and Indel variant calling (including low allele frequencies of 1-5%) from whole-genome sequencing data. The pipeline takes into consideration inherent mitochondrial characteristics, such as the circular DNA structure and the presence of NuMTs (nuclear mitochondrial DNA segments, mitochondrial sequences that have transposed into the nuclear genome of eukaryotic organisms).
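One consequence of the circular DNA structure is that reads spanning the artificial start/end junction of the linearized reference align poorly there. A common remedy in mitochondrial pipelines is to also call variants against a rotated ("shifted") copy of the reference and map the resulting coordinates back. A minimal sketch of that coordinate mapping, where the shift value is illustrative rather than the pipeline's actual parameter:

```python
MT_LENGTH = 16_569  # length of the human mitochondrial genome (rCRS)
SHIFT = 8_000       # illustrative rotation used to build the shifted reference

def to_shifted(pos):
    """Map a 0-based position on the original circular contig to its
    position on a reference rotated left by SHIFT bases."""
    return (pos - SHIFT) % MT_LENGTH

def to_original(shifted_pos):
    """Map a position on the shifted reference back to original coordinates,
    e.g. to lift over variant calls made near the junction."""
    return (shifted_pos + SHIFT) % MT_LENGTH
```

Calls made near the original junction fall in the middle of the shifted contig, where alignment is reliable, and are then lifted back with `to_original`.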
Five dollar genome analysis pipeline: A fully reproducible example of the GATK Best Practices workflows for Data Pre-processing and Germline Short Variant Discovery. A scientific description of the workflow is available in GATK's Best Practices documentation. The "$5 Genome Analysis Pipeline" name refers to the cost of running the full pipeline (with all options turned on to do the maximum amount of work) on a typical whole genome dataset on the Google Cloud Platform, as explained on the Broad site in a GATK blog post.
Pre-processing hg38 v2: A fully reproducible example of data Pre-processing. These workflows make up the Pre-processing portion of the Best Practices for Germline SNP & Indel Discovery. A detailed description of the workflow is available in GATK's Best Practices documentation.