GATK workshop at BroadE [March 2019]

Robert Majovski

On March 21, 22, 26, and 27, 2019, members of the Broad Institute community participated in a Genome Analysis Toolkit (GATK) workshop as part of the BroadE workshop series. The workshop focused on the core steps involved in calling variants with the Broad’s GATK, using the “Best Practices” developed by the GATK team. Participants learned why each step is essential to the variant discovery process, the operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of their dataset.

This workshop is notable because it was the first time that the GATK workshop was conducted on Terra!

Workshop synopsis

Best Practices for variant calling with the Genome Analysis Toolkit

This workshop focuses on calling germline short variants, as well as somatic short variants and copy number alterations, with Broad's Genome Analysis Toolkit (GATK), using best practices developed by the DSP Methods development team, the team that builds GATK. The developers will give talks explaining the rationale, theory, and real-world applications of the GATK Best Practices. You will learn why each step is essential to the variant-calling process, what key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset. If you are an experienced GATK user, you will gain a deeper understanding of how GATK works under the hood and how to improve your results further, especially with respect to the latest innovations.

The hands-on GATK tutorials in this workshop will be conducted on Terra, a new platform developed at Broad in collaboration with Verily Life Sciences for accessing data, running analysis tools and collaborating securely and seamlessly.


Workshop sessions and materials

Day 1: Introduction to GATK Best Practices

0. Introduction to the workshop
    Geraldine van der Auwera, Associate Director, Outreach & Communications
    Materials: Slides; Video

1. Introduction to high-throughput sequencing data: Understanding the origin and shape of the data
    Mark Fleharty, Computational Scientist
    Materials: Slides; Video

2. Introduction to data preprocessing: Mapping and cleaning up sequencing data
    Yossi Farjoun, Associate Director, Computational Research Methods
    Materials: Slides; Video

3. Introduction to variant discovery: Basic concepts, variant types, and their respective workflows
    Megan Shand, Senior Computational Associate
    Materials: Slides; Video

4. Introduction to pipelining platforms: How we run workflows
    Ruchi Munshi, Senior Software Product Manager
    Materials: Slides; Video

5. Introductory case study: Tetralogy of Fallot
    Anton Kovalsky, Science Writer
    Materials: Slides; Video; Terra workspace


Day 2: Germline short variant discovery

0. Introduction to germline short variant discovery: Key considerations and workflow logic
    Laura Gauthier, Associate Director, Germline Computational Methods
    Materials: Slides; Video

1. Variant calling with HaplotypeCaller: Basic operation and algorithm
    James Emery, Software Engineer
    Materials: Slides; Video

2. Joint variant calling: GVCF-based workflow using GenomicsDB and GenotypeGVCFs
    Geraldine van der Auwera, Associate Director, Outreach & Communications
    Materials: Slides; Video

3. Germline variant discovery tutorial 
    Kate Noblett, Senior Project Coordinator
    Materials: Germline variant discovery tutorial workspace; Video

4. Variant filtering by Variant Quality Score Recalibration: Assigning accurate confidence scores to each putative mutation call
    Sam Friedman, Machine Learning Scientist
    Materials: Slides; Video

5. Genotype refinement workflow: Using additional data to improve genotype calls and likelihoods
    Takuto Sato, Senior Computational Associate
    Materials: Slides; Video

6. Callset evaluation: Comparing statistics between your callset and external resources
    Rori Cremer, Software Engineer
    Materials: Slides; Video


Day 3: Somatic variant discovery

0. Introduction to somatic variant discovery: Key considerations and workflow logic
    Lee Lichtenstein, Associate Director, Somatic Computational Methods
    Materials: Slides; Video

1. Somatic SNV and indel variant discovery
    Andrey Smirnov, Software Engineer
    Materials: Slides; Video

2. GATK Mutect2 tutorial
    Adelaide Rhodes, Senior Computational Associate
    Materials: Somatic variant discovery tutorial workspace

3. Somatic copy number alterations
    Steve Huang, Computational Scientist
    Materials: Slides; Video


Day 4: Additional hands-on practice workspaces

0. Pipelining with WDL and Cromwell (Using this empty workspace, you'll practice starting a workspace from scratch)
    Dan Billings, Principal Software Engineer 
    Materials: Workspace; Worksheet; Slides; Video

1. WDL puzzles
    Kate Noblett, Senior Project Coordinator
    Materials: Worksheet 

2. How to access and analyze genomics data in real time with BigQuery and a Jupyter Notebook
    Allie Hajian, Science Writer
    Materials: Workspace; Slides; Video

3. Understanding and using Docker containers
    Adelaide Rhodes, Senior Computational Associate
    Materials: Slides; Video; Worksheet (note: due to a technical issue, the video from this session is unavailable; the video provided is from a presentation Adelaide gave on the same topic a few weeks later)



Additional resources

The Data Biosphere
A Data Biosphere for Biomedical Research

Terra Resources
     - Documentation
     - Ask questions through the button in the upper left hamburger menu, or on the community forum
     - Make a feature request here

Running workflows on Terra
     - Configure a Tool to run on your data

Terra's Jupyter Notebooks Environment
     - Part I - Key Components
     - Part II - Key Operations
     - Dos and Don'ts - How not to lose data output files or collaborator edits in a notebook

Jupyter Notebooks Resources
     - Jupyter Notebooks 101
     - Jupyter Notebooks for Data Science (extensions, widgets, and more!)
     - Jupyter Notebooks cheat sheet
     - Mastering markdown
          - Markdown cheat sheet

R Resources
     - Data wrangling, visualization, and analysis
          - R for Data Science
          - Cheat Sheets for commonly used R packages
     - Developing (and finding the best) R packages
          - Advanced R
          - R packages
          - Finding the best R package amongst the available options:

BigQuery Resources
     - Comprehensive BigQuery documentation
     - BigQuery best practices (controlling costs, optimizing query performance, optimizing storage)
     - See the giant list of analytical functions on the right-hand side nav bar here
     - Using client libraries (and your favorite programming language) with BigQuery
     - BigQuery YouTube videos (from the Google Cloud Platform developers)

Google Cloud Platform   
     - Understanding and controlling cloud costs  
     - Controlling Cloud costs - sample use cases   
     - Google Cloud Platform (GCP) for Bioinformatics

     - Setting up Chrome Profiles
