This tutorial workspace showcases tools used by the community for COVID-19 Surveillance. The workspace contains example SARS-CoV-2 genomic data and workflows that enable you to pull in SARS-CoV-2 data from NCBI's Sequence Read Archive (SRA), perform reference-based assembly, and create visualizations using Nextstrain. The tools available in this workspace were created by the Viral Genomics Group at the Broad Institute.
Featured workspace overview
The COVID-19 Surveillance workspace includes workflows for assembling viral genomes and for building trees to analyze phylogeny and evolutionary relationships. The tree-building workflow uses the Nextstrain tool Augur to produce a phylogenomic tree that you can render in the Auspice visualization tool.
1. Clone and set up workspace
You will first set up your workspace, the computational sandbox within Terra where you will store, organize, and analyze data in the cloud. When you create a new workspace or clone an existing one, Terra sets up Azure resources within your subscription to power the workspace infrastructure.
- You must have a Terra account (either Google or Microsoft ID).
- You must have access to an Azure-backed Terra billing project (so you can clone the read-only workspace).
1.1. Go to app.terra.bio and log in with your Microsoft or Google ID.
1.2. From the welcome page, navigate to Workspaces by clicking on the My Workspaces card.
1.3. Select the COVID-19-Surveillance workspace from the Featured Workspaces list.
1.4. Click the three-dot action icon at the far right and select Clone from the menu to make your own copy of the workspace.
1.5. Give the workspace a unique name (we suggest adding your initials and/or the date) and choose an Azure billing project from the dropdown. All cloud charges associated with the workspace will be covered by the Azure subscription linked to the Terra Billing project.
1.6. Click Clone Workspace to create your own copy.
What to expect
Initially, your workspace will not have any data or analysis tools. Behind the scenes, Terra will set up the cloud resources for data tables and workflows in your copy of the workspace under the Terra Billing Project you assign. Note that this process will take a few minutes.
Launch Cromwell (the workflows application)
1.7. Go to the Workflows page and click the blue Launch Workflows App button.
It may take several minutes to set up the cloud resources for tables and workflows. Both data tables and Cromwell must be ready before you can move on to the next step. See Data tables: Additional resources for more details about the Workspace Data Services that power data tables.
When data tables are ready
Once data tables are launched, you’ll see two tables, nextstrain and sra, and the active import data button in the top left section of the Data page.
When workflows are ready
After a few minutes, you will see the workflows navigation on the left and three workflows in the main section of the Workflows page.
2. Confirm data tables
The workflows in this tutorial are set up to pull inputs (URIs for data files in open-access Azure blob storage containers) from the data table.
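To make the idea of a data-file URI more concrete, here is a minimal Python sketch that splits an Azure Blob Storage URL into its storage account, container, and blob path. The example URL is hypothetical and is not one of the URIs in the workspace tables; the sketch only assumes the standard https://<account>.blob.core.windows.net/<container>/<blob> layout.

```python
from urllib.parse import urlparse

BLOB_SUFFIX = ".blob.core.windows.net"

def split_blob_url(url):
    """Split an Azure Blob Storage URL into (account, container, blob_path)."""
    parsed = urlparse(url)
    host = parsed.netloc
    if not host.endswith(BLOB_SUFFIX):
        raise ValueError(f"not an Azure Blob Storage URL: {url}")
    account = host[: -len(BLOB_SUFFIX)]
    # The first path segment is the container; the remainder is the blob path.
    container, _, blob_path = parsed.path.lstrip("/").partition("/")
    return account, container, blob_path

# Hypothetical URL, for illustration only.
example_url = "https://exampleaccount.blob.core.windows.net/examplecontainer/reads/sample1.bam"
print(split_blob_url(example_url))
```

A helper like this can be handy for spotting a table cell that points at the wrong container or is not a blob URL at all.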
What's in the sample TSV?
This example TSV contains accession IDs that reference SARS-CoV-2 samples in the NCBI Sequence Read Archive (SRA). These examples were selected because they represent diverse geographies and diverse sequencing platforms, including Illumina and Oxford Nanopore.
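If you later prepare a TSV of your own accessions, a quick format check can catch typos before you upload it. The sketch below assumes the accession column is named sra_id and that the IDs are SRA run accessions (SRR, ERR, or DRR followed by digits). The rows shown are made up and are not the tutorial's samples.

```python
import csv
import io
import re

# Run accessions start with SRR, ERR, or DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")

def invalid_accessions(tsv_text, column="sra_id"):
    """Return the values in `column` that do not look like SRA run accessions."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[column] for row in reader if not RUN_ACCESSION.match(row[column])]

# Made-up rows, for illustration only.
example_tsv = "sra_id\nSRR0000001\nERR0000002\nnot_an_accession\n"
print(invalid_accessions(example_tsv))
```

This only checks the shape of each ID; it does not confirm that the accession actually exists in SRA.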
3. Run fetch_sra_to_bam to import SARS-CoV-2 data
This workflow downloads sequence files from the Sequence Read Archive (SRA), given an SRA_ID as input. The workflow will then create a new column in the data table with links to the unaligned BAM file that is needed for viral assembly.
- An SRA Accession number
3.1. Navigate to the Workflows tab of your workspace, where you will see three preconfigured WDL workflows.
3.2. Start the first part of the analysis by clicking on the fetch_sra_to_bam workflow.
What to expect
You’ll see a configuration screen where you can set the input data table, select the data to analyze, and configure workflow inputs and outputs.
3.3. Select the sample data table from the dropdown. You’ll see the table contents at the bottom of the form (below the Select Data tab).
3.4. The table includes six samples. Select all six from the Select Data tab by checking the boxes.
Running the workflow will launch six separate workflow jobs, one per sample selected.
3.5. Next, click Inputs to verify which table columns to use for required inputs.
The workflow variable SRA_ID should have Fetch from Data Table as the input source and sra_id as the input attribute. The other two variables are optional.
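Conceptually, Fetch from Data Table means that each selected row supplies the value for the workflow variable, with one job launched per row. The following Python sketch models that idea with made-up accession values. It is an illustration of the mapping only, not a description of how Terra or Cromwell implement it.

```python
def inputs_per_row(rows, mapping):
    """Build one inputs dict per selected row.

    `mapping` maps a workflow variable name to the table column it reads from.
    """
    return [{var: row[col] for var, col in mapping.items()} for row in rows]

# Made-up rows, for illustration only; the tutorial table has six samples.
selected_rows = [{"sra_id": "SRR0000001"}, {"sra_id": "SRR0000002"}]

# SRA_ID is fetched from the sra_id column, as configured in the Inputs tab.
jobs = inputs_per_row(selected_rows, {"SRA_ID": "sra_id"})
print(len(jobs))  # one job per selected row
print(jobs[0])
```

Selecting all six rows in the tutorial table would, by the same logic, produce six sets of inputs.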
3.6. Click Outputs to configure the workflow to write a new column to the sample data table for each output variable. In this tutorial, the outputs are pre-configured, so you can just verify the values.
fetch_sra_to_bam output variables
Terra will create a reads_ubam column to hold the URL for the output BAM file. The data file, which is in workspace cloud storage by default, is used as input for the next analysis. Other workflow outputs include metadata about these samples, such as the sequencing technology used to create them and their geographic origin.
3.7. Now you’re ready to click SUBMIT to launch the job with the Cromwell execution engine. Note that you can name your submission and add notes in the Send Submission popup.
3.8. You can check the job status in the Submission History page. Each submission moves through several statuses, such as initializing and running, before ending in either error or succeeded.
3.9. This workflow generally takes about 15 minutes to complete. You may want to step away and return to the job submission page to check the status of your job.
When your job has completed successfully, you will see a screen similar to this.
Congratulations! You have completed your first analysis step.
You can check the outputs of your workflow by navigating back to the workspace Data page and clicking on your sample data table. You will see several new columns with the outputs of your workflow. For example, the reads_ubam column includes links to the sequence file (circled below) that now lives in your workspace’s cloud storage.
You will use these sequence files in your next analysis step, where you will assemble your viral sequences to a known reference.
4. Assemble viral sequences
The assemble_refbased workflow takes a raw read file (uBAM) and assembles a viral genome by aligning it to a known reference genome.
- reads_ubam (an output from fetch_sra_to_bam or your own data)
- reference_fasta (We provide a URL for this reference from an open-access Azure storage container)
- sample_name (the sra_id for your sample)
The workflow outputs viral genome assemblies for all input files, along with quality metadata that can be used to build a Nextstrain analysis. See the section below, "Using sarscov2_nextstrain.wdl to create a tree with your data," for documentation on developing inputs for a Nextstrain analysis using your own data.
4.1. Go to the Workflows page.
4.2. In the assemble_refbased workflow card, click the blue Configure button to reveal the Submission Configuration pane.
4.3. Select Data
The input data table for this workflow is the sample data table and you will again run on all six samples in the table.
4.4. Specify Inputs
Select reads_ubam (generated in the prior workflow) for the variable reads_unmapped_bams and sra_id for the sample_name variable.
The preconfigured workflow inputs also include a direct value - a URL to a SARS-CoV-2 reference assembly in an open-access Azure storage container - for the reference_fasta variable. You’ll use this reference to align your input samples.
4.5. Specify Outputs
Outputs of this workflow that will be written to the sample data table are already filled in.
4.6. Repeat steps 3.7 - 3.9 to submit the workflow, monitor its progress, and view the output results in the data table.
This workflow takes about an hour and 20 minutes to complete.
5. Create a phylogenomic tree with Nextstrain
In this step, you'll run the sarscov2_nextstrain workflow to align assemblies, build a phylogenomic tree, and produce a JSON file for Nextstrain visualization.
Where's the input data?
The Featured Workspace's cloud storage includes a concatenated FASTA file and the corresponding metadata file from NCBI Virus. The URIs of these files in the Featured Workspace storage are referenced in the nextstrain table you uploaded in step 2. These files can be used as input to the sarscov2_nextstrain workflow to generate phylogenomic trees.
- an aligned fasta
- (optional) auspice_json
This workflow uses the Nextstrain data table, which includes a larger example dataset than the one generated by the previous two workflows. This example will show population-level resolution for COVID-19 surveillance.
To generate your own required inputs for this workflow, you can leverage the prior two workflows and the following Nextstrain resources.
- This document outlines the requirements for the metadata.tsv file: (https://github.com/nextstrain/augur/blob/master/docs/faq/metadata.md#parsing-from-the-header). The metadata.tsv file has to be curated manually with any metadata that you want to use for your tree. The strain column must match the FASTA headers to work.
- There are several ways that you can filter and configure how you view your data. For example, the following parameters are available for filtering data (please see https://nextstrain-augur.readthedocs.io/en/stable/usage/cli/cli.html for details on each parameter)
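Returning to the metadata requirement above, that the strain column must match the FASTA headers, a small check like the following can catch mismatches before you run the workflow. This is a minimal sketch in Python with made-up sample names. It assumes the sequence ID is the first whitespace-delimited word after ">" in each FASTA header and that the metadata is tab-separated with a column named strain.

```python
import csv
import io

def fasta_ids(fasta_text):
    """Return the sequence IDs (first word after '>') found in FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids

def strain_mismatches(fasta_text, metadata_tsv_text):
    """Return (FASTA IDs missing from metadata, strains missing from the FASTA)."""
    reader = csv.DictReader(io.StringIO(metadata_tsv_text), delimiter="\t")
    strains = {row["strain"] for row in reader}
    ids = fasta_ids(fasta_text)
    return sorted(ids - strains), sorted(strains - ids)

# Made-up data, for illustration only.
example_fasta = ">sampleA\nACGT\n>sampleB\nACGT\n"
example_metadata = "strain\tdate\nsampleA\t2021-01-01\nsampleC\t2021-01-02\n"
print(strain_mismatches(example_fasta, example_metadata))
```

Two empty lists mean every sequence has a metadata row and every metadata row has a sequence.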
5.1. Go to the Workflows page with the three workflows.
5.2. Select the sarscov2_nextstrain workflow and click the blue Configure button.
5.3. In the submission configuration form, select the data table nextstrain as input and check the box beside the single row (ma-omicron-light) of data.
Input data details
The column `assembly_fastas` contains many sequence files representing population-level resolution for the Nextstrain visualization.
5.4. The three required inputs will be pulled from the nextstrain data table: build_name, build_yaml, and assembly_fastas. The attributes have been pre-configured.
5.5. The following outputs will be written back to the nextstrain data table.
5.6. Once all inputs and outputs are set, click Submit to launch your workflow job to Cromwell. You can monitor the status of your submission in the Submission History page.
What to expect
This workflow takes approximately one day and three hours to complete. You'll need the generated data for the next step, so don't move on until this workflow is complete.
To check the status of the workflow
- Click the Cromwell icon in the right sidebar from any page
- Click Open to open the Cromwell environment in a new tab
- Click the submission history button
6. Download the Auspice input JSON file
In the next analysis step, you'll use the generated output file (found in the column auspice_input_json in the nextstrain data table) with the phylogenomic visualization tool available at http://auspice-us.herokuapp.com/.
6.1. Go to your workspace data tables and click on the Nextstrain data table.
6.2. In the current example, this table has only one row, called ma-omicron-light. In the auspice_input_json column of that row, click the link titled auspice-mass-omicron_auspice.json to open a modal with metadata about the file.
6.3. To save this file to your local machine, click the Download button and choose the Save Link As option.
6.4. Verify that the file was saved locally.
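Beyond confirming that the file exists, you may want to check that it is a complete, parseable dataset. The Python sketch below assumes the file follows the Auspice v2 dataset layout, with top-level version, meta, and tree keys and a single root node whose descendants are nested under children. It runs on a tiny made-up JSON string rather than the real output.

```python
import json

def count_tips(root):
    """Count leaf nodes in a tree whose child nodes are nested under 'children'."""
    tips = 0
    stack = [root]  # iterative walk avoids recursion limits on deep trees
    while stack:
        node = stack.pop()
        children = node.get("children") or []
        if children:
            stack.extend(children)
        else:
            tips += 1
    return tips

def summarize_auspice_json(text):
    """Parse dataset JSON text and report its version and number of tips."""
    data = json.loads(text)
    missing = [key for key in ("version", "meta", "tree") if key not in data]
    if missing:
        raise ValueError(f"missing top-level keys: {missing}")
    return {"version": data["version"], "tips": count_tips(data["tree"])}

# Tiny made-up dataset, for illustration only.
tiny = json.dumps({
    "version": "v2",
    "meta": {},
    "tree": {"name": "root", "children": [
        {"name": "a"},
        {"name": "b", "children": [{"name": "c"}, {"name": "d"}]},
    ]},
})
print(summarize_auspice_json(tiny))
```

To check the downloaded file, read its contents (for example, with open("auspice-mass-omicron_auspice.json").read()) and pass the text to summarize_auspice_json. A JSON parsing error usually indicates a truncated download.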
7. Upload the input file to the Auspice browser
In this step, you’ll navigate away from Terra to view and interact with the phylogenomic tree using an online visualization tool called Auspice from the Nextstrain collection.
7.1. In a new tab, go to http://auspice-us.herokuapp.com/.
7.2. Drag and drop your local file auspice-mass-omicron_auspice.json onto the area of the page labeled "Drag & Drop a dataset on here to view."
7.3. This will create an interactive phylogenomic tree with population-level resolution of SARS-CoV-2 samples collected in the Northeastern United States.
Congratulations! You have now run three workflows on Terra to create a high-resolution analysis for conducting COVID-19 surveillance.
- Learn how to interact with this example using the Auspice documentation at https://docs.nextstrain.org/projects/auspice/en/stable/.
- Read about next steps, such as bringing your own data into this analysis, on the Featured Workspace Dashboard.