Covid-19 Surveillance tutorial guide

Allie Cliffe

This tutorial workspace showcases tools used by the community for COVID-19 surveillance. The workspace contains example SARS-CoV-2 genomic data and workflows that enable you to pull in SARS-CoV-2 data from NCBI's Sequence Read Archive (SRA), perform reference-based assembly, and create visualizations using Nextstrain. The tools available in this workspace were created by the Viral Genomics Group at the Broad Institute.

Featured workspace overview

The COVID-19 Surveillance workspace includes workflows for assembling viral genomes and for building trees to analyze phylogeny and evolutionary relationships. The tree-building workflow uses the Nextstrain tool Augur to produce a phylogenomic tree that you can render in the Auspice visualization tool.

1. Set up workspace and cloud infrastructure

Prerequisites

  • You must have a Terra account (either Google or Microsoft ID).
  • You must have access to an Azure-backed Terra billing project (so you can clone the read-only workspace).
  • If you log in with a Microsoft account, you will need to be on the allow list.
    To be included on the allow list, submit your request by filling out the form here.

Step-by-step instructions

1.1. Go to app.terra.bio and log in with your Microsoft or Google ID.

1.2. From the welcome page, navigate to Workspaces by clicking on the My Workspaces card. 

1.3. Select the COVID-19-Surveillance workspace from the Featured Workspaces list. 

1.4. Click the three-dot action icon at the far right and select Clone from the menu to make your own copy of the workspace. 

Terra-on-Azure_Covid-19_Featured-workspace-on-workspaces-page_Screenshot.png

1.5. Give the workspace a unique name (we suggest adding your initials and/or the date) and choose an Azure billing project from the dropdown. All cloud charges associated with the workspace will be covered by the Azure subscription linked to the Terra Billing project. 

1.6. Click Clone Workspace to create your own copy. 

What to expect

Once you create a new workspace, Terra will automatically launch the cloud infrastructure to power data tables. You will need to take a few additional steps to launch the workflows application (Cromwell). 

Launch Cromwell (the workflows application)

1.7. Click on the cloud icon in the right sidebar. 

ToA_Launch-Cromwell-app_Cloud-icon-in-sidebar_Screenshot.png

1.8. Click the gear icon under the Cromwell logo in the Cloud Environment Details pop-up. 

ToA_Launch-Cromwell-app_Cloud-Environment-settings-popup_Screenshot.png

1.9. Click the blue Create button in the Cromwell Cloud Environment pop-up. 

ToA_Launch-Cromwell-app_Create-Cromwell-Environment-button_Screenshot.png

What to expect

It may take several minutes to requisition and set up the cloud infrastructure. Both data tables and the workflows application must be ready before you can move on to the next step. See Data tables: Additional resources for more details about the Workspace Data Services that power data tables.

When data tables are ready

Once data tables are launched, you’ll see the active Import Data button in the top left section of the Data page.

When workflows are ready

After a few minutes, you will see the little pig icon for Cromwell in the right-hand sidebar with a little green dot that shows it’s ready to use.

ToA_Cromwell-icon-in-Analyses-tab-sidebar_Screenshot.png

2. Set up data tables

The workflows in this tutorial are set up to pull inputs (URIs for data files in open-access Azure blob storage containers) from the data table. Because data tables are not currently copied over when you clone a workspace, you will first need to generate the input data table by downloading the example data from the featured workspace and uploading the pre-staged TSV.

2.1. Go back to the Data page of the read-only tutorial workspace.

2.2. Click on the three-dot action icon beside the sample table and select Download TSV.

ToA_Download-sample-TSV_Screenshot.png

2.3. Click the save button to download sample.tsv to local storage. 

ToA-Download-sample.tsv_Screenshot.png

What's in this TSV?

This example TSV contains accession IDs that reference SARS-CoV-2 samples in the NCBI Sequence Read Archive (SRA). These examples were selected because they represent diverse geographies and diverse sequencing platforms, including Illumina and Oxford Nanopore.
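If you want to look at the downloaded file before uploading it, a few lines of Python are enough. The following is a minimal sketch, not part of the tutorial workspace: the function name summarize_tsv is hypothetical, the column name sra_id is assumed from the input attribute used in step 3.7, and the accessions in the demo are placeholders rather than real samples.

```python
import csv
import io


def summarize_tsv(handle):
    """Return (column_names, rows) for a tab-separated file opened in text mode."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    return list(reader.fieldnames or []), rows


# Demo on in-memory text so the sketch runs without any local file.
# The accessions below are placeholders, not entries from the tutorial table.
demo = io.StringIO("sra_id\tnote\nSRR0000001\tplaceholder\nSRR0000002\tplaceholder\n")
columns, rows = summarize_tsv(demo)
print("columns:", columns)
print("row count:", len(rows))
for row in rows:
    # Assumption: the accession column is named sra_id.
    print(row.get("sra_id"))
```

To inspect the real file, pass an open file handle instead, for example open("sample.tsv", newline=""), assuming that is where you saved the download in step 2.3.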

2.4. Navigate back to the Data page of your own copy of the COVID-19 tutorial workspace.

2.5. Click the Import Data button (left side near the top) and select the Upload TSV option to create and populate the data table.

ToA-Covid-workspace_Import-data-Upload-TSV_Screenshot.png

2.6. In the Import Table Data popup, fill in "sample" for the table name and select the sample.tsv you just downloaded.  

ToA-Covid-19-workspace_Import-sample-table-popup_Screenshot.png

2.7. Click the Start Import Job button. 

What to expect

Your data should now be visible as a sample table in the tables section of the Data page.

ToA-Covid-19-workspace_Sample-table_Screenshot.png

2.8. Repeat steps 2.2 to 2.7 for the nextstrain table, downloading and uploading nextstrain.tsv. Make sure to name this data table "nextstrain" in step 2.6.

ToA_Import-table-popup_Screenshot.png

3. Run fetch_sra_to_bam to import SARS-CoV-2 data to your workspace

This workflow downloads sequence files from the Sequence Read Archive (SRA), given an SRA_ID as input. The workflow then produces the unaligned BAM (uBAM) file that is needed for viral assembly.

Required Inputs

  • An SRA Accession number
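If you later substitute your own accessions, a quick format check can catch typos before you submit a job. The sketch below is illustrative only: it assumes the workflow expects run-level accessions, which begin with SRR, ERR, or DRR followed by digits, and the function name is_run_accession is hypothetical. It checks the format only and does not confirm that the accession exists in the archive.

```python
import re

# Run accessions start with SRR (NCBI), ERR (ENA), or DRR (DDBJ),
# followed by one or more digits.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def is_run_accession(value):
    """Return True if value has the shape of an SRA run accession."""
    return RUN_ACCESSION.fullmatch(value.strip()) is not None


# Placeholder values for illustration; these are not tutorial samples.
for candidate in ["SRR0000001", "ERR0000001", "SAMN00000001", "SRR"]:
    print(candidate, is_run_accession(candidate))
```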

Outputs

  • reads_uBAM

Step-by-step instructions

3.1. Navigate to the Analyses tab of your workspace.

3.2. In the right side panel you should see the Cromwell Environment icon with a green circle showing it is running. Click on this icon.

ToA_Cromwell-icon-in-Analyses-tab-sidebar_Screenshot.png

3.3. From here you can launch the Cromwell environment by clicking Open. The Cromwell environment will open in a new browser tab.

3.4. In the new tab you will see three preloaded WDL workflows. Start the analysis by clicking on the fetch_sra_to_bam workflow.

What to expect

Selecting the workflow opens a configuration screen where you can set the input data table, select data to analyze, and configure workflow inputs and outputs.

ToA_Submission-configuration-pane_Screenshot.png

3.5. Select the “sample” data table, created in section 2, from the dropdown. You’ll see the table at the bottom of the form (below the Select Data tab).

3.6. The table includes six samples. Select all six from the Select Data tab by checking the boxes.

ToA_Select-data-tab-in-workflow-submission-configuration-pane_Screenshot.png

Running the workflow will launch six separate workflow jobs, one per sample selected.

3.7. Next, click Inputs to tell the workflow which table columns to use for required inputs.

For the workflow variable SRA_ID select Fetch from Data Table as the input source and sra_id as the input attribute from the dropdown menus. The other two variables are optional.

ToA_Configure-inputs-in-configuration-pane_Screenshot.png

3.8. Next, click Outputs to configure the workflow to write a new column to the sample data table for each output variable. In this tutorial, the outputs are pre-configured, so you can just verify the values.

ToA_Configure-outputs-in-submission-pane_Screenshot.png

fetch_sra_to_bam output variables

For example, the `reads_ubam` output will create a column of the same name that includes the sequence files (uBAM format) for the next analysis. Other workflow outputs include metadata about these samples, such as the sequencing technology used to create them and their geographic origin.

3.9. Now you’re ready to click SUBMIT to launch the job with the Cromwell execution engine. Note that you can name your submission and add notes in the Send Submission popup.

ToA_Send-workflow-submision-form_Screenshot.png

3.10. You can check the job status in the Submission History page. Each job will pass through a series of statuses, such as initializing and running, before ending in either error or succeeded.

ToA_Workflows-running_Screenshot.png

3.11. This workflow generally takes about 15 minutes to complete. You may want to step away and return to the Submission History page to check the status of your job.

When your job has completed successfully, you will see a screen similar to this.

ToA_Successful-submission-history_Screenshot.png

3.12. Congratulations! You have completed your first analysis step.

Check outputs

You can check the outputs of your workflow by navigating back to the workspace Data page and clicking on your sample data table. You will see several new columns that contain the outputs of your workflow. For example, the reads_ubam column includes links to the sequence files (circled below) that now live in your workspace’s cloud storage.

ToA_Data-table-with-new-columns-of-generated-data_Screenshot.png

You will use these sequence files in your next analysis step where you will assemble your viral sequences to a known reference.

4. Assemble viral sequences

The assemble_refbased workflow takes a raw read file (uBAM) and assembles a viral genome by aligning it to a known reference genome.

Required Inputs

  • reads_uBAM (an output from fetch_sra_to_bam or your own data)
  • reference_fasta (We provide a URL for this reference from an open-access Azure storage container)
  • sample_name (the sra_id for your sample)

Outputs

Viral genome assemblies for all input files, plus quality metadata that can be used to build a Nextstrain analysis. See the section below "Using sarscov2_nextstrain.wdl to create a tree with your data" for documentation on developing inputs for a Nextstrain analysis using your own data.

4.1. Repeat steps 3.1 through 3.3 to open the Submit a workflow page.

4.2. Select the assemble_refbased workflow to reveal the Submission configuration pane. 

ToA_assemble_refbased-configuration-pane_Screenshot.png

4.3. Select Data
The input data table for this workflow is the sample data table and you will again run on all six samples in the table.

4.4. Specify Inputs
Select reads_ubam (generated in the prior workflow) for the variable reads_unmapped_bams and sra_id for the sample_name variable.

The preconfigured workflow inputs also include a direct value as an input: a URL to a SARS-CoV-2 reference assembly in an open-access Azure storage container. You’ll use this reference to align your input samples.

ToA_assemble-refbased-worflow_Inputs_Screenshot.png

4.5. Specify Outputs
Outputs of this workflow that will be written to the Sample data table are already filled in.

4.6. Repeat steps 3.8 - 3.11 to submit the workflow, monitor its progress, and view the output results in the data table.

This workflow takes about an hour and 20 minutes to complete.

5. Create a phylogenomic tree with Nextstrain

Overview

In this step, you'll run the sarscov2_nextstrain workflow to align assemblies, build a phylogenomic tree, and produce a JSON file for Nextstrain visualization.

Where's the input data?

The Featured Workspace's cloud storage includes a concatenated FASTA file from NCBI Virus and the corresponding metadata file. The URIs of these files in the Featured Workspace storage are referenced in the nextstrain table you uploaded in step 2. These files can be used as input to the sarscov2_nextstrain workflow to generate phylogenomic trees.

Terra-on-Azure-Covid-19-workspace_Fasta-files-in-cloud-storage_Screenshot.png

Required Inputs

  • an aligned fasta
  • metadata.tsv
  • build.yaml
  • (optional) auspice_json

Outputs

auspice_input_json

Workflow details

This workflow uses the nextstrain data table you created in step 2.8. That table references a larger example dataset than the one generated by the previous two workflows, so this example shows population-level resolution for COVID-19 surveillance.

To generate your own required inputs for this workflow, you can leverage the prior two workflows and the following Nextstrain resources.

Nextstrain resources

  • This document outlines the requirements for the metadata.tsv file: (https://github.com/nextstrain/augur/blob/master/docs/faq/metadata.md#parsing-from-the-header). The metadata.tsv file has to be curated manually with any metadata that you want to use for your tree. The strain column must match the FASTA headers to work.
  • There are several ways that you can filter and configure how you view your data. For example, the following parameters are available for filtering data (please see https://nextstrain-augur.readthedocs.io/en/stable/usage/cli/cli.html for details on each parameter)
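Because the strain column must match the FASTA headers, it can help to compare the two files before launching a long-running job. The following is a minimal sketch under stated assumptions: the sequence name is taken to be the first whitespace-delimited token after ">" in each header, the metadata file is tab-separated with a column named strain, and the function names are hypothetical. The demo uses made-up names, not data from the tutorial.

```python
import csv
import io


def fasta_names(handle):
    """Collect sequence names: the first token after '>' on each header line."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def metadata_strains(handle):
    """Collect non-empty values of the strain column from a metadata TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["strain"] for row in reader if row.get("strain")}


def compare_names(fasta_handle, metadata_handle):
    """Return (sequences missing metadata, metadata rows missing sequences)."""
    sequences = fasta_names(fasta_handle)
    strains = metadata_strains(metadata_handle)
    return sorted(sequences - strains), sorted(strains - sequences)


# Demo with made-up names; replace the StringIO objects with open files.
fasta = io.StringIO(">sampleA\nACGT\n>sampleB extra text\nACGT\n")
metadata = io.StringIO("strain\tdate\nsampleA\t2022-01-01\nsampleC\t2022-01-02\n")
missing_metadata, missing_sequence = compare_names(fasta, metadata)
print("in FASTA but not in metadata:", missing_metadata)
print("in metadata but not in FASTA:", missing_sequence)
```

Both lists should be empty when the files agree. Any name that appears in only one file is worth resolving before you submit the workflow.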

5.1. Repeat steps 3.1 through 3.3 to open the Submit a workflow page with the three workflows.

5.2. Select the sarscov2_nextstrain workflow.

5.3. In the submission configuration form, select the data table nextstrain as input and check the box beside the single row (ma-omicron-light) of data.

ToA_Covid-tutorial_sars-covi-2-nextstrain_Select-data_Screenshot.png

Input data details

The column `assembly_fastas` contains many sequence files representing population-level resolution for the Nextstrain visualization.

5.4. The three required inputs will be pulled from the nextstrain data table: build_name, build_yaml, and assembly_fastas. The attributes have been pre-configured. 

ToA_Covid-tutorial_sars-covi-2-nextstrain-Inputs_Screenshot.png

5.5. The following outputs will be written back to the nextstrain data table.

ToA_Covid-tutorial_sars-covi-2-nextstrain-Outputs_Screenshot.png

5.6. Once all inputs and outputs are set, click Submit to launch your workflow job to Cromwell. You can monitor the status of your submission in the Submission History page.

What to expect

This workflow takes approximately one day and three hours (about 27 hours) to complete. You'll need the generated data for the next step, so don't move on until this workflow is complete.

To check the status of the workflow

  1. Click the Cromwell icon in the right sidebar from any page
  2. Click Open to open the Cromwell environment in a new tab
  3. Click the submission history button 

6. Download the Auspice input JSON file

In the next analysis step, you'll use the generated output file (found in the column auspice_input_json in the nextstrain data table) with the phylogenomic visualization tool available at http://auspice-us.herokuapp.com/.

6.1. Go to your workspace data tables and click on the nextstrain data table.

ToA-Covid-19-workspace_Nextstrain-data-table_Screenshot.png

6.2. In the current example, this table has only one row, called ma-omicron-light. In the auspice_input_json column of that row, click the link titled auspice-mass-omicron_auspice.json to open a modal with metadata about the file.

ToA-Covid-19-workspace_Download-auspice-mass-omicron-json_Screenshot.png

6.3. To save this file to your local machine, click the Download button and select the Save Link As option.

ToA-Covid-19-workspace_Save-auspice-json-link-as_Screenshot.png

6.4. Check to verify that you saved the file locally.
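One way to confirm that the download is intact is to parse it as JSON and look at its overall shape. The sketch below is not part of the tutorial: it assumes the file follows the common Auspice dataset layout with a top-level "tree" object whose nodes nest under "children", and the function names are hypothetical. The demo parses a tiny made-up document rather than the real file.

```python
import json


def count_tips(root):
    """Count nodes without children, using a stack to avoid deep recursion."""
    tips = 0
    stack = [root]
    while stack:
        node = stack.pop()
        children = node.get("children") or []
        if children:
            stack.extend(children)
        else:
            tips += 1
    return tips


def describe_dataset(text):
    """Parse JSON text and report its top-level keys and number of tree tips."""
    data = json.loads(text)
    tree = data.get("tree")
    return {
        "top_level_keys": sorted(data),
        # Assumption: "tree" is a single nested object; otherwise report None.
        "tips": count_tips(tree) if isinstance(tree, dict) else None,
    }


# Demo on a made-up document; for the real file, read its contents with
# open("auspice-mass-omicron_auspice.json").read() and pass that text in.
toy = json.dumps({
    "version": "v2",
    "meta": {},
    "tree": {"name": "root", "children": [
        {"name": "a"},
        {"name": "b", "children": [{"name": "c"}, {"name": "d"}]},
    ]},
})
print(describe_dataset(toy))
```

If json.loads raises an error on the real file, the download was probably truncated or saved as an HTML page instead of the JSON itself.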

7. Upload the input file to the Auspice browser

In this step, you’ll navigate away from Terra to view and interact with the phylogenomic tree using Auspice, an online visualization tool from the Nextstrain collection.

7.1. In a new tab, go to http://auspice-us.herokuapp.com/.

7.2. Drag and drop your local file auspice-mass-omicron_auspice.json onto the area of the page labeled "Drag & Drop a dataset on here to view".

ToA-Covid-tutorial_Auspice-landing-page_Screenshot.png

7.3. This will create an interactive phylogenomic tree with population-level resolution of SARS-CoV-2 samples collected in the Northeastern United States.

ToA-Covid-tutorial_Interactive-phylogenomic-tree-on-auspice_Screenshot.png

Congratulations! You have now run three workflows on Terra to create a high-resolution analysis for conducting COVID-19 surveillance.
