Currently, Terra supports two modes of analysis:
1. Batch processing with workflows, which includes but is not limited to:
- Read alignment
- Variant calling
- Joint filtering
2. Interactive analysis, which includes but is not limited to:
- R/Python-based downstream analysis
This article describes the two analysis modes and links to hands-on workspaces for practice.
1. Reads-to-variants workflows - i.e. batch processing with GATK Best Practices workflows
Note: "Raw" data comes in different forms, and which steps you need to run will depend on how much your data has already been processed.
The GATK Best Practices provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. There are several different GATK Best Practices workflows tailored to particular applications depending on the type of variation of interest and the technology employed. The Best Practices documentation attempts to describe in detail the key principles of the processing and analysis steps required to go from raw reads coming off the sequencing machine, all the way to an appropriately filtered variant callset that can be used in downstream analyses.
For a list of curated GATK Best Practices showcase workspaces, see this page.
The first phase in all cases involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This involves alignment to a reference genome, as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis. Note that your data may already be analysis-ready BAMs, in which case you may skip this step. More information about our pre-processing best practices can be found here, and a featured workspace containing example workflows can be found here.
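Conceptually, the pre-processing phase chains a handful of command-line tools. The sketch below lists the typical stages as illustrative command strings in Python; the tool names (BWA-MEM for alignment, GATK MarkDuplicates and BQSR for cleanup) follow the GATK Best Practices, but the exact flags and file names are invented for the example and are not Terra's actual workflow.

```python
# A minimal sketch of the pre-processing stages. Commands, flags, and
# file names are illustrative placeholders, not a runnable pipeline.
preprocessing_steps = [
    # 1. Align raw reads to the reference genome
    "bwa mem ref.fasta sample.fastq > sample.sam",
    # 2. Mark duplicate reads (a common technical-bias cleanup step)
    "gatk MarkDuplicates -I sample.bam -O sample.dedup.bam -M metrics.txt",
    # 3. Recalibrate base quality scores (BQSR)
    "gatk BaseRecalibrator -I sample.dedup.bam -R ref.fasta "
    "--known-sites known.vcf -O recal.table",
    "gatk ApplyBQSR -I sample.dedup.bam -R ref.fasta "
    "--bqsr-recal-file recal.table -O sample.analysis_ready.bam",
]

# Print the tool invoked at each stage
for step in preprocessing_steps:
    print(step.split()[0])
```

In a Terra workflow, each of these stages would be a task in a WDL workflow rather than commands run by hand; the point here is only the order of operations, ending in an analysis-ready BAM.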
The next step proceeds from analysis-ready BAM files and produces variant calls. This involves identifying genomic variation in one or more individuals and applying filtering methods appropriate to the experimental design. The output is typically in VCF format although some classes of variants (such as CNVs) are difficult to represent in VCF and may therefore be represented in other structured text-based formats.
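To make the VCF output concrete, here is a minimal Python sketch that splits one made-up VCF data line into its standard tab-separated columns. A real analysis would use a dedicated library (e.g. pysam) rather than hand-parsing; this is only to show what a variant record carries.

```python
# One invented VCF data record: CHROM POS ID REF ALT QUAL FILTER INFO
record = "chr1\t12345\trs123\tA\tG\t50\tPASS\tDP=100"

fields = record.split("\t")
chrom, pos, var_id = fields[0], int(fields[1]), fields[2]
ref, alt = fields[3], fields[4]          # reference and alternate alleles
qual, filt, info = float(fields[5]), fields[6], fields[7]

# A human-readable summary of the variant call
print(f"{chrom}:{pos} {ref}>{alt} (filter={filt}, {info})")
```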
Depending on the application, additional steps such as filtering and annotation may be required to produce a callset ready for downstream genetic analysis. This typically involves using resources of known variation, truthsets and other metadata to assess and improve the accuracy of the results as well as attach additional information.
Hands-on practice: Workflows-QuickStart
Try this tutorial workspace to practice setting up, launching, and monitoring workflows in Terra. Three hands-on exercises let you experience increasing amounts of complexity, from pre-packaged samples to a process more like a real-world analysis. The tutorial uses two file format conversion workflows, which run quickly and inexpensively.
How long will it take to run? How much will it cost?
If you use the suggested data samples for analysis, it should take around 15-30 minutes per exercise. Total workflow runtime charges (Google Cloud service costs) for all three exercises should be much less than $1 USD.
2. Statistical analysis and visualization in real time
The platform's integration of Jupyter notebooks expands analysis options in Terra. Notebooks are applications that contain code cells to run interactive analysis (in R or Python) and documentation cells written in flexible Markdown. Notebooks enable you to run complex statistics and visualizations interactively on large amounts of data, including tabular data (think medical records or wearables data). Instead of programming an analysis or visualization to run, going away while it runs, and returning to see the results, you can run the cells in your notebook and see the results immediately. And because every code cell can be documented, notebooks are an ideal way to collaborate on and share your analysis.
Terra-based Jupyter Notebooks are enabled with a variety of additional useful tools. Some examples of tasks you can do in a Jupyter notebook on Terra include:
- Using Hail - Python-based library for interacting with genomic data
- Using BigQuery - Cloud data warehouse with built-in machine learning
- Using Bioconductor - R-based packages for analysis and visualization of genomic data
- Interacting with public datasets
You can find featured tutorial workspaces in our showcase section, such as this workspace explaining Hail analysis, or this workspace containing a case study with an example of downstream clustering analysis. If you are new to Jupyter notebooks, see this Intro to Jupyter Notebooks article.
Some benefits of analysis in a Jupyter notebook
Notebooks make it easy to record and reproduce data analysis steps
Insights in biomedical research require data analysis, but complex analysis is hard to document, share, and reproduce. Notebooks enable researchers to quickly develop a rich scientific document that conducts an analysis, shows the results, and explains the scientific context. Each code cell of a notebook executes commands to manipulate and explore your data. Code cells can be written in Python, R, or other languages already familiar to the researcher. It is straightforward to expand the functionality of the source code by installing pre-existing libraries, packages, or modules of code in a variety of languages. Markdown cells contain formatted explanatory text, links, and images to complement code cells. Better than "notes", Jupyter notebooks mean you will never have questions you can't answer because you forgot your exact analysis steps from eight years ago.
The notebook's linear structure records each step you take in order. When shared, someone else can see how you manipulated the data and can execute the cells in order to reproduce your analysis.
They enable interactive analysis
When you “run” a code cell, output displays right away in a new cell directly underneath the original cell. Working in a notebook, it is possible to run an analysis, observe the result, then change the parameters and re-run the analysis step by step, in real time.
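That run-observe-tweak loop can be illustrated with a toy Python cell: define an analysis step as a function, run it, inspect the result, then change a parameter and run it again. The data and thresholds below are invented for the example.

```python
# Invented measurements standing in for a real dataset
measurements = [0.2, 1.5, 3.1, 0.9, 2.4, 4.0]

def count_above(values, threshold):
    """Analysis step: how many measurements exceed the threshold?"""
    return sum(1 for v in values if v > threshold)

# Run the cell once, look at the output...
first = count_above(measurements, threshold=1.0)

# ...then tweak the parameter and re-run, all in real time
second = count_above(measurements, threshold=3.0)

print(first, second)  # → 4 2
```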
Notebooks extend the information content of published articles
Today, researchers can add detail about how they derived their results, and make it easy for others to reproduce or replicate their analysis, by publishing notebooks as an addendum to a traditional publication. Traditional scientific journals can only capture so much detail. Most of the critical data-analysis process happens under the hood and is missing from a summary section. Seeing and executing the actual code reveals much more.
Further, when others open your notebook, they can explore your work and even build on your findings. They can easily access your methods and apply them to other populations.
They make collaborating and sharing seamless
Because notebooks are self-contained and easy to share, collaborating on and sharing work in progress and results is as simple as sharing a workspace.
Learn more about interactive analysis on Terra here.
Hands-on practice: Terra-Notebooks-QuickStart
In this tutorial workspace, you'll get hands-on practice with interactive analysis on Terra:
- Browsing and accessing a subset of data (cohort) in the Terra Data Library
- Importing the cohort to the workspace
- Analyzing the data in an interactive Jupyter notebook
How long will it take to run? How much will it cost?
It will take around 5-10 minutes to explore and access data and 15-30 minutes to run each notebook. Using the default notebook configuration, the Terra notebook runtime charges are $0.19/hour for Google Cloud service costs. It should cost much less than a dollar to run the notebooks.