Getting started with GATK workflows in the cloud FAQs

Allie Hajian

If you're new to running GATK on a cloud-based platform, or new to Terra, this information will help get you started. From preprocessing raw sequencing data through variant calling and joint calling, showcase workspaces provide fully reproducible workflows for critical use cases and include extensive documentation and sample data to practice on. 

What are GATK Best Practice showcase workspaces?

They are curated templates that include all the components of a complete project workspace, so you can run the GATK Best Practices pipelines without building a workspace from scratch.

Workspaces include

  • Fully reproducible GATK workflows
  • Sample data
  • Extensive documentation

Use case examples

  • Preprocessing genomics data
  • Variant discovery for germline and somatic SNPs and Indels
  • Copy number and structural variations

You can explore the featured workspaces in read-only mode, or clone a copy to your own billing project (funded by $300 GCP getting started credits). Then, try running the workflows on the included sample data or on your own data.

To learn more about the tools in these workspaces, see GATK's best practices documentation.

GATK Best Practices Showcase workspaces

For the most up-to-date list, be sure to check the Showcase Workspaces Library (filter by GATK under analysis tools). 

What input file types does a workflow accept?

Most of the Broad's GATK workflows accept unmapped BAM (uBAM) files. Read each workspace's dashboard to learn more about the format requirements for its input files, and see GATK's best practices documentation for the exact specifications.

Data files not in unmapped BAM (uBAM) format?

If your data files are not in unmapped BAM format, check out this sequence file conversion workspace. It contains workflows for converting formats for use in GATK analysis tools.

How does the workflow get the input data?

Your workflows need to know where to find the input data stored in the cloud. You can enter the complete file path for each input directly in the workflow configuration form, or store the paths and other metadata for your input files in a data table. We recommend organizing data with tables. To understand why, watch Why use data tables (6:35 minutes on YouTube).

For more information about the steps to use controlled-access input data, see How to access controlled data on external servers (such as Gen3).  

For step-by-step instructions on how to populate the workspace data table, a video, and practice exercises, see Managing data with workspace tables.    
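As a rough illustration of the data-table approach described above, the following Python sketch builds a tab-separated load file whose rows point at input files in cloud storage. It assumes the Terra load-file convention that the first column header is `entity:<table_name>_id`. The function name `build_load_file`, the column name `unmapped_bam`, the sample IDs, and the `gs://my-bucket/...` paths are all hypothetical and only stand in for your own values.

```python
import csv
import io


def build_load_file(table_name, rows):
    """Return TSV text for a data table load file.

    Each row is a dict with an "id" key plus one key per column.
    The first column header is 'entity:<table_name>_id'.
    """
    # Collect the attribute columns in first-seen order.
    columns = []
    for row in rows:
        for key in row:
            if key != "id" and key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([f"entity:{table_name}_id"] + columns)
    for row in rows:
        # Missing attributes are written as empty cells.
        writer.writerow([row["id"]] + [row.get(col, "") for col in columns])
    return buffer.getvalue()


# Hypothetical sample IDs and bucket paths, for illustration only.
samples = [
    {"id": "sample_01", "unmapped_bam": "gs://my-bucket/sample_01.unmapped.bam"},
    {"id": "sample_02", "unmapped_bam": "gs://my-bucket/sample_02.unmapped.bam"},
]

tsv_text = build_load_file("sample", samples)
print(tsv_text)
```

Saving `tsv_text` to a `.tsv` file gives one row per sample, with each file path stored as a column value rather than typed into the configuration form for every run.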

How to run a workflow on inputs from a data table

To run a workflow on data in the data table, first select the data table with the data you want to use from the left side of the Data page.

Screenshot of the sample table in the Data tab of a GATK workspace. The image is annotated with an orange box to highlight the name of the table in the Tables sidebar.

Select the row with the data to analyze (highlighted with an arrow in the screenshot below), click Open with above the table (in the highlighted rectangle below), and select Workflow (in the highlighted box at the far right).

Screenshot of the Data tab of a GATK workspace in Terra. The image is annotated with an orange arrow to highlight the checkbox that is used to select a row of the sample data table and an orange box highlighting the Open With option at the top of the table. The image also shows the menu that opens after selecting Open with, and the workflow option is highlighted with an orange box.

Where is the generated data stored?

Data from a workflow analysis is stored in the workspace cloud storage (Google Bucket or Azure blob storage container) by default. To streamline downstream analysis, Showcase workflows are preconfigured to write the URI for output files to the same data table that contains the input files.
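To make the write-back idea concrete, here is a minimal Python sketch of how output URIs can end up in the same table row that supplied the inputs. Terra's workflow configuration form refers to columns of the selected row with expressions of the form `this.column_name`; the sketch only models that idea and is not Terra's implementation. The function names, column names, and `gs://` paths are hypothetical.

```python
PREFIX = "this."


def resolve_input(expression, row):
    """Resolve a 'this.<column>' expression against one table row.

    Anything without the 'this.' prefix is treated as a literal value.
    """
    if not expression.startswith(PREFIX):
        return expression
    return row[expression[len(PREFIX):]]


def write_outputs(output_expressions, workflow_outputs, row):
    """Return a copy of the row with output URIs stored in the target columns."""
    updated = dict(row)
    for output_name, expression in output_expressions.items():
        if expression.startswith(PREFIX):
            updated[expression[len(PREFIX):]] = workflow_outputs[output_name]
    return updated


# Hypothetical row, configuration, and output location, for illustration only.
row = {
    "sample_id": "sample_01",
    "unmapped_bam": "gs://my-bucket/sample_01.unmapped.bam",
}
output_expressions = {"output_bam": "this.analysis_ready_bam"}
workflow_outputs = {"output_bam": "gs://my-bucket/outputs/sample_01.bam"}

input_path = resolve_input("this.unmapped_bam", row)
updated_row = write_outputs(output_expressions, workflow_outputs, row)
print(input_path)
print(updated_row)
```

Because the output URI lands in a new column of the same row, a downstream workflow can read it with another `this.` expression instead of a hand-copied path.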

How do I change the input/output files a workflow uses?

Showcase workspaces are preconfigured to run on sample data included in the workspace. To run a showcase workspace's workflows on your own data, you need to update the workspace data table to include links to your own data. To learn how to modify, add, or delete a data table, see How to add a table to a Terra workspace. You can use the Terra interface to change input or output file names or locations. See Workflow setup: VM and other options for step-by-step instructions, a video tutorial, and a practice exercise. 

Screenshot of the Workflow configuration page of a workflow in a GATK Featured Workspace showing where the inputs and outputs of a workflow are specified in Terra.

Additional GATK resources

 
