In our last blog post, we featured a public workspace containing best-practices workflows for viral genome analysis, developed by Dr. Danny Park's Viral Genomics group and used to process COVID-19 research data. We have been working with the viral genomics team to make additional improvements to the workspace, and I’m excited to tell you about two major updates:
- Updated WDL for reference-based viral genome assembly
- Addition of a WDL to run Augur, a NextStrain tool for phylogenetic analysis and visualization
These additions mean that you can now go all the way from sequencing data to an interactive phylogenetic tree in the same workspace. Let's dive into the specifics for each of these important updates.
New reference-based viral genome assembly workflow: simpler and more efficient
The viral genomics team previously provided an “assisted de novo” viral assembly pipeline that has been refined over the past decade of use and validated on metagenomic Illumina data from diverse viral taxa, including Lassa, Ebola, Zika, Mumps, Influenza A, HIV, Rabies, Hepatitis A, and several herpes viruses (HHV 1, 2, 3, and 5). It was designed to assemble contigs, scaffold, and polish assemblies for viruses that may exhibit up to 30% nucleotide divergence from available reference databases. This approach is robust for a wide range of viral taxa, but may be more computationally intensive and complex than necessary for viruses that exhibit very limited diversity—such as those involved in single-origin disease outbreaks.
In recent months, the scientific and public health community tackling SARS-CoV-2 genomics has increasingly favored simplified approaches for both data generation and data analysis, many of which are documented by the CDC. Simple align-to-reference approaches for consensus calling (similar to those used in the study of non-diverse genomes, such as humans) make analysis more efficient and results easier to interpret. Additionally, the popular PCR tiled amplicon-based data generation approaches (such as ARTIC) frequently require specialized filtering steps to remove primer artifacts during analysis (the iVar trimming tool from Scripps being one of the more popular tools for ARTIC+Illumina data).
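To make the idea of align-to-reference consensus calling concrete, here is a toy Python sketch. It is not the method used by the workflow: it assumes reads are already aligned without insertions or deletions, and the function name `call_consensus` and the `min_depth` parameter are made up for illustration. Each position takes the most common base among the reads covering it, and positions with too little coverage are reported as `N`.

```python
from collections import Counter


def call_consensus(reference, aligned_reads, min_depth=2):
    """Toy majority-vote consensus (illustrative only).

    aligned_reads is a list of (start, sequence) tuples, where start is the
    0-based reference position of the first base. Indels are not handled.
    """
    # One base counter per reference position.
    piles = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                piles[pos][base] += 1

    consensus = []
    for counts in piles:
        depth = sum(counts.values())
        if depth < min_depth:
            # Not enough reads to make a call at this position.
            consensus.append("N")
        else:
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Made-up reference and reads, for demonstration only.
reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (0, "ACTTAC"), (2, "TTACGT"), (4, "ACGTAC")]
print(call_consensus(reference, reads))
```

Real pipelines also account for base quality, indels, and primer-derived bases, which is where the specialized trimming steps mentioned above come in.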
In this update, we are adding the viral genomics team's reference-based viral assembly workflow (assemble_refbased.wdl), which they've updated to reflect these best practices and which is appropriate for use on any Illumina data generated from SARS-CoV-2. In particular, it takes an optional input parameter, a BED file, that describes any PCR amplicon primers used in the process of data generation (this can be omitted if no such primers were used). The original de novo assembly workflow is still provided in this workspace; although it is not necessary for SARS-CoV-2, it is applicable to a much broader range of viruses than the reference-based workflow.
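If you have not worked with primer BED files before, the following Python sketch shows what one typically contains. It assumes the standard tab-separated BED layout (sequence name, 0-based start, exclusive end, name, score, strand); the primer names and coordinates in `EXAMPLE_BED` are illustrative placeholders rather than a real primer scheme, and the helper functions are hypothetical, not part of the workflow.

```python
from typing import NamedTuple


class Primer(NamedTuple):
    chrom: str   # reference sequence name
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    strand: str


def parse_primer_bed(text):
    """Parse BED-formatted text into a list of Primer records."""
    primers = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        name = fields[3] if len(fields) > 3 else ""
        strand = fields[5] if len(fields) > 5 else "."
        primers.append(Primer(fields[0], int(fields[1]), int(fields[2]), name, strand))
    return primers


def overlaps_primer(primers, chrom, pos):
    """Return True if the 0-based position falls inside any primer interval."""
    return any(p.chrom == chrom and p.start <= pos < p.end for p in primers)


# Illustrative placeholder primers on a SARS-CoV-2 reference accession.
EXAMPLE_BED = (
    "MN908947.3\t30\t54\tprimer_1_LEFT\t60\t+\n"
    "MN908947.3\t385\t410\tprimer_1_RIGHT\t60\t-\n"
)

primers = parse_primer_bed(EXAMPLE_BED)
for p in primers:
    print(p.name, p.start, p.end, p.strand)
```

Because BED intervals are half-open, a primer listed as 30 to 54 covers positions 30 through 53; off-by-one mistakes here are a common source of incomplete primer trimming.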
A workflow for phylogenetic analysis with Augur (NextStrain)
Working with the viral genomics team, DSP created a WDL that runs the Augur tool from NextStrain, which we added to the Terra COVID-19 workspace. This allows you to run a phylogenetic analysis on a set of assembled viral genomes (files that are output by the assembly workflow described above) and visualize the resulting tree. The workspace we provide includes a set of publicly available genomes imported from the NCBI SRA repository, but you can import your own data as well.
What are Augur and NextStrain?
NextStrain is a collection of open source tools that help scientists, epidemiologists and public health officials in their understanding of pathogen spread and evolution, especially in outbreak scenarios. Augur is one of those tools, developed for tracking pathogen evolution from sequencing data. With this particular tool, epidemiologists can build trees to analyze the evolutionary relationships between viral strains isolated from cases of COVID-19, which helps map the initial emergence and sustained transmission of the virus.
The nextstrain.org portal provides analysis results produced by running Augur on all publicly available datasets for SARS-CoV-2/COVID-19. However, researchers may need to perform "community builds" on defined subsets of data, or provide previews of data prior to public release. These “community builds” allow them to create their own analyses and store their results on GitHub.
Running the Augur workflow in Terra: configuration notes
The workflow that we provide (augur_build_tree.wdl) is configured to run on the collection of assembled FASTA files generated by the viral genome assembly workflow described above (assemble_refbased). You can, however, change the configuration to run on any set of assembled viral genomes in FASTA format.
By default, the workflow uses a set of resources and parameters that are appropriate for SARS-CoV-2 genomes. The reference FASTA and GenBank (.gb) files are SARS-CoV-2-specific references. The default auspice_config.json file represents the metadata we have modeled in the data tab and was curated to include the metadata available from SRA (output from SRA_to_uBAM). If you plan to run the workflow on your own data, make sure to prepare your metadata file according to these directions.
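As a quick sanity check before running on your own data, you may want to confirm that every assembled genome has a matching metadata row. The Python sketch below assumes a tab-separated metadata file whose identifier column is named `strain`, which is the usual Augur convention; the column names, sample names, and helper functions here are illustrative assumptions, so defer to the linked directions for the exact fields this workflow expects.

```python
import csv
import io


def fasta_names(fasta_text):
    """Return the sequence names (first word of each '>' header line)."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]


def missing_metadata(fasta_text, metadata_tsv, id_column="strain"):
    """List FASTA sequence names that have no row in the metadata table."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    if id_column not in (reader.fieldnames or []):
        raise ValueError(f"metadata is missing the '{id_column}' column")
    known = {row[id_column] for row in reader}
    return [name for name in fasta_names(fasta_text) if name not in known]


# Hypothetical sample names and metadata, for demonstration only.
EXAMPLE_FASTA = ">sample_A\nACGT\n>sample_B some description\nACGA\n>sample_C\nACGG\n"
EXAMPLE_METADATA = (
    "strain\tdate\tcountry\n"
    "sample_A\t2020-03-01\tUSA\n"
    "sample_B\t2020-03-05\tUSA\n"
)

print(missing_metadata(EXAMPLE_FASTA, EXAMPLE_METADATA))
```

Any name the check reports is a sequence that would otherwise appear in the analysis without dates or locations attached, so it is worth resolving these before launching the workflow.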
We hope that these new resources will prove useful to you; as always, we welcome your feedback and suggestions for improving them.