This article walks through the setup process for running a workflow in Terra on Azure when using inputs (data samples) from a data table. Please note that this document only includes functionality available for public preview.
Overview: Workflows in public preview (current functionality)
For the public preview, you can try out running a workflow by creating a workspace. All Terra on Azure workspaces include three pre-staged workflows.
For step-by-step instructions to run the three workflows as a complete analysis, see the Covid-19 Surveillance Featured Workspace and accompanying Guide.
Step 1: Launch data tables and Cromwell infrastructure
1.1. Go to Your Workspaces (select from the main navigation menu at the top left of any page).
1.2. Create a workspace by clicking on the Create Workspace button at the top of the page.
What to expect
Once you create a new workspace, Terra will automatically launch the cloud infrastructure to power data tables. You will need to take a few additional steps to launch the workflows application (Cromwell).
Launch Cromwell (the workflows application)
1.7. Click on the cloud icon in the right sidebar.
1.8. Click the gear icon under the Cromwell logo in the Cloud Environment Details pop-up.
1.9. Click the blue Create button in the Cromwell Cloud Environment pop-up.
What to expect
It may take several minutes to requisition and set up the cloud infrastructure. Both the data tables and the workflows application (Cromwell) must be ready before you can move on to the next step. See Data tables: Additional resources for more details about the Workspace Data Services that power data tables.
When data tables are ready
Once data tables are launched, you’ll see that the Import Data button in the top left section of the Data page is active.
When workflows are ready
After a few minutes, you will see the little pig icon for Cromwell in the right-hand sidebar with a little green dot that shows it’s ready to use.
Step 2: Upload the data table
The workflows are set up to pull inputs (URIs for data files in open-access Azure blob storage containers) from the data table. Because data tables are not currently copied over when you clone a workspace, you will first need to generate the input data table by uploading a pre-staged TSV.
2.1. Go back to the Data page of the read-only tutorial workspace.
2.2. Click on the three-dot action icon beside the sample table and select Download TSV.
2.3. Click the save button to download sample.tsv to local storage.
What's in this TSV?
This example TSV contains accession IDs that reference six SARS-CoV-2 samples in the NCBI Sequence Read Archive (SRA). These examples were selected because they represent diverse geographies and diverse sequencing platforms, including Illumina and Oxford Nanopore.
2.4. Navigate back to the Data page of your own copy of the Covid-19 tutorial workspace.
2.5. Click the Import Data button (left side near the top) and select the Upload TSV option to create and populate the data table.
2.6. In the Import Table Data popup, fill in "sample" for the table name and select the sample.tsv you just downloaded.
2.7. Click the Start Import Job button.
What to expect
You should see your data is now visible as a sample table in the tables section of the Data page.
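If the import fails or the table looks different from what you expected, a quick local check of the TSV can help. The sketch below is not part of Terra; it only assumes that the file is tab-separated with a header row. The function name `summarize_tsv` and the two-column example data are hypothetical, and the real sample.tsv has its own columns.

```python
import csv
import io


def summarize_tsv(text):
    """Return (column_names, row_count) for tab-separated text.

    Raises ValueError if the text is empty or if any data row has a
    different number of fields than the header row.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader, None)
    if header is None:
        raise ValueError("TSV is empty")
    row_count = 0
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore completely blank lines
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        row_count += 1
    return header, row_count


# Hypothetical stand-in for the contents of sample.tsv.
example = "sample_id\taccession\nrow1\tACC_PLACEHOLDER_1\nrow2\tACC_PLACEHOLDER_2\n"
columns, rows = summarize_tsv(example)
print(columns, rows)
```

To check the downloaded file instead of the inline example, read it first, for instance with `open("sample.tsv", newline="").read()`, and pass the result to `summarize_tsv`.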
Step 3: Launch Cromwell (workflow engine)
Once the workflows infrastructure is running, you will see the Cromwell icon on the right sidebar with a small green dot. Follow the steps below to launch the workflow engine (Cromwell) and choose the workflow.
3.1. Click the Cromwell icon (right sidebar).
3.2. Click Open to open a new Batch Analysis window in a separate tab.
You’ll see cards for three pre-staged workflows.
What do the three pre-staged workflows do?
- Pull in SARS-CoV-2 data from NCBI’s Sequence Read Archive (SRA)
- Perform reference-based assembly
- Create visualizations using Nextstrain
Can I run other workflows?
At this time, it is not possible to run workflows other than the three that are pre-staged in the featured workspace. That is why the “Find a workflow” button is inactive. We are working on it though!
For more detailed instructions on how to upload data and run the workflows, see the Covid-19 Surveillance Featured Workspace and step-by-step guide.
Step 4: Select data and set up the workflow
Click on the workflow card fetch_sra_to_bam (one of the three pre-staged workflows) to access the submission configuration form.
What to expect
The configuration form includes useful information like the workflow version and source URL link. It is also where you will set up the workflow to run on specific data from the input table. All three workflows are pre-configured. The steps below walk through the process, so you can understand and verify the configuration.
4.1. Select the input data table
For this pre-configured workflow, the sample table, which includes URIs for the input data files (samples), is already selected.
4.2. Select data to run on
Next, you'll select which specific records (rows) in the data table to run on.
1. Navigate to Select Data at the bottom of the form to see the data table.
2. Select the samples to analyze by clicking the checkbox at the left of the sample row.
4.3. Specify input variable attributes
Go to the Inputs tab to specify data table columns (attributes) for each variable. For the pre-staged workflows, you’ll see the variable attributes are pre-configured.
Note on configuring inputs
You can select an input source (hard-coded or from the data table) for each variable. If you choose Fetch from table, the attribute column dropdown will display all columns in the data table.
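Conceptually, each input variable is either given a fixed value or looked up in a column of the selected row. The sketch below illustrates that idea only. It is not Terra's configuration format, and the dictionary layout, the variable names, the column names, and the function `resolve_inputs` are all hypothetical. It also assumes one workflow run per selected row, which is what the per-sample list on the submission details page suggests.

```python
def resolve_inputs(input_config, row):
    """Build the inputs for one workflow run from one data table row.

    input_config maps a variable name to a source description:
      {"type": "literal", "value": ...}  -> the same hard-coded value for every row
      {"type": "table", "column": ...}   -> the value in that column of this row
    """
    resolved = {}
    for variable, source in input_config.items():
        if source["type"] == "literal":
            resolved[variable] = source["value"]
        elif source["type"] == "table":
            resolved[variable] = row[source["column"]]
        else:
            raise ValueError(f"unknown source type for {variable}: {source['type']}")
    return resolved


# Hypothetical configuration: one input fetched from the table, one hard-coded.
config = {
    "example_workflow.accession": {"type": "table", "column": "accession"},
    "example_workflow.threads": {"type": "literal", "value": 4},
}

# Hypothetical selected rows of the sample table.
selected_rows = [
    {"sample_id": "row1", "accession": "ACC_PLACEHOLDER_1"},
    {"sample_id": "row2", "accession": "ACC_PLACEHOLDER_2"},
]

# One set of inputs per selected row.
runs = [resolve_inputs(config, row) for row in selected_rows]
for run in runs:
    print(run)
```

The hard-coded value is identical in every run, while the table-sourced value changes with the row, which is why selecting more rows produces more workflows in a submission.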
4.4. Write outputs to the data table
Next, click Outputs to configure the workflow to write a new column to the sample data table for each output variable. You can enter a new name in the attribute column to make a new column. In this tutorial, the outputs have been pre-filled to write to the data table.
Where is generated data stored?
Generated files will be stored in the workspace blob storage container by default. When the output attribute is filled in, Terra will write the file locations (URIs) of generated files in a new column in the input data table.
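The effect of filling in an output attribute can be pictured as adding one column to the input table, with a file location in each row that was run. The sketch below is an illustration under that assumption, not Terra's implementation. The function `write_output_column`, the column names, and the placeholder URIs are hypothetical.

```python
def write_output_column(rows, key_column, attribute, uris_by_key):
    """Return copies of rows with a new column holding each row's output URI.

    Rows whose key is missing from uris_by_key (for example, rows that were
    not selected for the run) are copied unchanged.
    """
    updated = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        key = row[key_column]
        if key in uris_by_key:
            new_row[attribute] = uris_by_key[key]
        updated.append(new_row)
    return updated


# Hypothetical table with two rows, only one of which was run.
table = [
    {"sample_id": "row1", "accession": "ACC_PLACEHOLDER_1"},
    {"sample_id": "row2", "accession": "ACC_PLACEHOLDER_2"},
]
outputs = {"row1": "https://example.invalid/container/path/to/output_file"}

result = write_output_column(table, "sample_id", "output_uri", outputs)
for row in result:
    print(row)
```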
Step 5: Submit the workflow
5.1. When ready, select Submit to open a popup window where you can name and enter comments about the submission.
Your submission has a pre-populated name that includes the workflow name, input data table, and date and time of submission. You can change this to be meaningful to you.
The popup includes how many workflows will be submitted in this submission.
5.2. To confirm and launch the workflow submission, click the Submit button again.
What to expect
Once you submit, Terra will get to work setting up and deploying the cloud resources to run your workflow. You will automatically be directed to the submission details page.
Next steps: Monitor workflow submission status
The submission details page of the Cromwell environment includes the workflow name, submission date, and duration.
How to find the Submission History
Note that the Submission History and submission details pages are in the Cromwell environment, which is distinct from your workspace pages. If you cannot see the submission history option, make sure you have Cromwell running (green icon in the right sidebar from any workspace page) and are in the Cromwell environment (separate tab from the workspace).
At the bottom of the page is a list of all the workflows in the submission (running in parallel) along with the sample ID, workflow ID, status, and duration.
To see the status of a workflow, its start and end time, and sub-workflow and task failures, click on an individual workflow ID to view the workflow details page.
Use the breadcrumb at the top of the page to navigate back and forth between the submission history (the list of previous submissions), the submission details page, and the workflow details page.
What to expect (completed workflows)
When the workflows are done and you see a green check in the Job History, you can verify that the generated output files are in your workspace blob storage container by clicking on the Files icon in the right sidebar. This will open the directory of your workspace storage.
To access the generated data files
Click in the left-hand column to open the subdirectories cromwell-executions > workflow-name > submission-ID > taskname > execution > outputfile. You will see a list of all the generated files.
Note that you may need to go down several levels in the file directory to find the data files.
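The directory levels listed above can be joined into a single relative path, which is handy when you want to note where a particular file should be. This is a minimal sketch that follows the level order given in this article. The function `output_path` and the example values are hypothetical, and the actual directory names in your storage container may differ.

```python
from pathlib import PurePosixPath


def output_path(workflow_name, submission_id, task_name, output_file):
    """Join the directory levels described above into one relative path."""
    return PurePosixPath(
        "cromwell-executions",
        workflow_name,
        submission_id,
        task_name,
        "execution",
        output_file,
    )


# Hypothetical values; substitute the names shown in your own workspace storage.
path = output_path("my_workflow", "submission-123", "my_task", "output.txt")
print(path)
```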
FYI I ran into a broken link in this guide.
For Step 2.1, when I try to click the link to download the sample TSV file, I'm met with an authentication error and no TSV file is downloaded.
When I navigate to the featured workspace to try to download the TSV, it looks like the data table service is not up and running:
Perhaps this is why I'm unable to download the sample TSV file?
I'm so sorry you couldn't download the TSV, Curtis Kapsak! The data table status in the Featured Workspace doesn't affect whether you can access the file. The table status error you saw was because tables were single-user and only viewable to workspace creators until last week.
But I'm glad you commented since you revealed a problem with our instructions (and I have updated them, so they should be current). You should be able to see the sample table in the Featured Workspace now. If you click on the three-dot action icon to the right of the sample table, you can download the sample.tsv to your local machine, then upload it to your own workspace copy.