Workflows Quickstart Part 2 - Configure workflow to run on data in a table

Allie Hajian
  • Updated

Welcome to part 2 of the Workflows Quickstart. You'll learn how to set up the workflow in a blank configuration form and you'll analyze two single entities of genomic data from the data table. When you run the workflow, Terra will generate a set of the two samples to use for further back-to-back analysis in Part 3.

The tutorial uses the same file format conversion workflow from Part 1, where everything was already set up to run. In Part 2, we'll walk through the process of setting up the workflow in Terra in more detail.

Learning objectives - Time and cost to complete

What you will learn
You'll learn the parts of a workflow configuration form, including general options for running the workflow. You'll learn how to set up the workflow to read input data from the table and write links to generated data files to the same table. You'll see how Terra generates a set of input data you can use to run downstream analysis. 

How much will it cost? How long will it take? 
The exercise should take no more than fifteen minutes (unless your submission is in the queue a long time) and cost a few pennies.

HINT: Right click to open the tutorial demo in a new tab

Step 1: Select workflow and input data

Overview

For this second Workflows Quickstart exercise, you'll process two samples from the input table with the same CRAM-to-BAM workflow as in part 1. In this case, the form will be mostly blank, and you will go through the setup process from scratch.

Terra will run the two workflows in parallel and generate a set of those two samples, which you will use later to run back-to-back workflows (i.e., to turn workflows into pipelines).

Step-by-step instructions

1.1. Go to the Workflows page and select the Part-2_CRAM-to-BAM workflow card.

1.2. Confirm that the "Run workflow(s) with inputs defined by data table" radio button is selected and the root entity type is sample.

Workflows-Quickstart-Part2_Confirm-run-on-data-table_Screen_shot.png

1.3. Choose the Select Data button. You will be directed to the Select Data form (below).

Workflows-Quickstart-Part-2_Select-data-form_Screen_shot.png

1.4. Click the Choose specific samples to process radio button.

1.5. Check the boxes next to the samples NA12878 and my_sample.

Notice that when you choose more than one sample to process, Terra automatically generates a set that includes the subset you've chosen. This makes it easier to repeat an analysis, or run back-to-back analyses on the same subset.

1.6. To change the name Terra gives the set, type an easy-to-remember name in the field.

1.7. Confirm your selection by clicking the blue OK button.

Step 2. Specify workflow inputs (attributes) from data table

What are input attributes?

Attributes are the integers, strings, or files that correspond to input variables in the workflow. These were pre-configured in the form for Part 1.

How do you specify input data?

You'll specify input data by filling in the Attributes fields in the setup form for all required variables from the input (root entity) table or the workspace data table.  

Set-up-workflow_Specify-attributes_Screen_shot.png

Step-by-step instructions

2.1. Go to the first required variable that is blank - "InputCram" - and click inside the attribute field. You'll see a dropdown menu of all the inputs available in both the sample table and the workspace data table.

The dropdown menu lists all the columns in the "sample" data table as well as all the workspace-level resource files in the workspace data table. You can usually figure out which one to choose from the variable name (second column).

How to specify input data from the root entity table

The this.something format tells the workflow "go to the root entity type table and look in the 'something' column to find the input for this variable."

For example, this.CRAM tells the WDL two important bits of information about the input files.
a. this. means "go to the root entity table" (the sample table, in this case).
b. CRAM after the period means "go to the CRAM column in the table for this file."

2.2. Select this.CRAM from the dropdown menu (i.e., the data file in the CRAM column of the input table).
Workflows-Quickstart_Part-2_Select-this.CRAM-from-dropdown_Screen_shot.png

2.3. Go to the next blank variable, SampleName, and follow the same process. See the hint below for help with choosing from the dropdown.

  • Sometimes, especially if you didn't write the WDL yourself, you will have to make an educated guess at the attribute that matches the variable name.

    The sample_id is the unique ID (name) for each sample in the sample table.

    Thus, this.sample_id is the correct attribute to use for this variable. The "this.sample_id" format tells the WDL to find the value for the SampleName variable in the sample_id column of the root entity table.

2.4. Repeat for each variable with a blank attribute field. Use the variable name (second column) to help figure out what attribute to choose from the dropdown. 

Some variables will be from the input table (these start with "this.") and some will be global variables from the workspace data table (these start with "workspace.").  

Workspace-level resources (i.e. reference files, Docker images, etc.)

Attributes in the dropdown that begin with workspace. are from the workspace data table. These workspace-level resources can include Docker images, reference FASTA files or other inputs that are used for analyzing any entities.     

Even if you aren't familiar with the reference files in the dropdown, you can usually guess the right one to select from the variable name. See the example below.

  • Sometimes, especially if you didn't write the WDL yourself, you will have to make an educated guess at the attribute that matches the variable name. You usually don't have to know exactly what the file is to do this!

    For example, RefDict is a kind of reference file used for converting file formats.

    Even if you don't know exactly what it is, looking at the dropdown, it's a good bet that workspace.ref_dict is the correct attribute to use for this variable. The "workspace.ref_dict" format tells the WDL to find the value for the RefDict variable in the ref_dict column of the workspace data table.

2.5. Save your inputs by clicking the blue "Save" button at the top right of the form. 
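
The this. and workspace. prefixes used in Step 2 can be pictured as simple lookups. The sketch below is only an illustration in Python, not Terra's implementation: the function names, the dictionary layout, and the gs://example-bucket paths are all hypothetical. It shows how one saved input configuration yields a separate set of inputs for each selected sample.

```python
def resolve_attribute(expression, entity_row, workspace_data):
    """Resolve a 'this.<column>' or 'workspace.<key>' expression to a value."""
    prefix, _, name = expression.partition(".")
    if prefix == "this":
        # Look in the named column of the root entity (sample) row.
        return entity_row[name]
    if prefix == "workspace":
        # Look in the workspace data table, shared by every sample.
        return workspace_data[name]
    raise ValueError(f"Unsupported expression: {expression!r}")


# Hypothetical tables; the bucket paths are placeholders, not real files.
sample_table = {
    "NA12878": {"sample_id": "NA12878", "CRAM": "gs://example-bucket/NA12878.cram"},
    "my_sample": {"sample_id": "my_sample", "CRAM": "gs://example-bucket/my_sample.cram"},
}
workspace_data = {"ref_dict": "gs://example-bucket/reference.dict"}

# Workflow variable -> attribute expression, as entered in the Inputs tab.
input_config = {
    "InputCram": "this.CRAM",
    "SampleName": "this.sample_id",
    "RefDict": "workspace.ref_dict",
}


def build_inputs(sample_id):
    """Build the inputs for one workflow run on one sample."""
    row = sample_table[sample_id]
    return {
        variable: resolve_attribute(expression, row, workspace_data)
        for variable, expression in input_config.items()
    }


for sample_id in sample_table:
    print(sample_id, build_inputs(sample_id))
```

In this sketch, the this. values differ from sample to sample, while the workspace. value is the same for every run.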

Step 3: Write output file paths to the data table

Generated data files are stored in the workspace bucket by default. You have the option in the configuration form to write links to the files back to the same table that contains the input data. Writing to the data table keeps generated data organized and associated with the input data.

Formatting requirement for attributes from a table

You'll use the same "this.something" formatting (from the dropdown) as you did for inputs. Note that if the "something" column does not exist in the data table, the workflow will create one.
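
As a rough mental model of writing outputs back to the table, here is a small Python sketch. It is an assumption-laden illustration, not how Terra is implemented: the write_outputs function, the part2_bai column name, and the bucket paths are hypothetical. The point it shows is that an output expression such as this.part2_bai adds a new column when none with that name exists.

```python
def write_outputs(table, entity_id, output_config, workflow_outputs):
    """Store each workflow output under the column named by its 'this.<column>' expression."""
    row = table[entity_id]
    for variable, expression in output_config.items():
        prefix, _, column = expression.partition(".")
        if prefix != "this" or not column:
            raise ValueError(f"Expected 'this.<column>', got {expression!r}")
        # Assigning to a missing key adds it, mirroring the creation of a new column.
        row[column] = workflow_outputs[variable]
    return row


# Hypothetical sample table with a single row; paths are placeholders.
sample_table = {
    "NA12878": {"sample_id": "NA12878", "CRAM": "gs://example-bucket/NA12878.cram"},
}

# Output variable -> attribute expression, as typed in the Outputs tab.
output_config = {"outputBai": "this.part2_bai"}

# What a finished run might report for this sample (placeholder path).
workflow_outputs = {"outputBai": "gs://example-bucket/outputs/NA12878.bai"}

updated = write_outputs(sample_table, "NA12878", output_config, workflow_outputs)
print(sorted(updated))
```

Note that in this sketch, reusing an existing column name would overwrite the value already stored there, which is one reason to type a new name for a new run.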

3.1. Start in the Outputs tab of the setup form.

3.2. For the first output variable, "outputBai", go to the attribute field and type in "this." + a column name for your output files in the table.

You may not want to choose from the dropdown

When you click into the blank field, you will see a dropdown with all the columns that exist in the "sample" data table. This includes the original columns in the data table (i.e., the ones you used for input variables) as well as the columns the workflow created for the outputs generated in Part 1.

To create a new column in the data table for the generated output, type the new column name into the attribute field rather than picking one from the dropdown!

3.3. For this run, type in an output name that is different from the Part 1 default.

3.4. Save your output attributes by selecting the blue SAVE button at the top right.

Workflows-Quickstart-Part-2_Write-outputs-to-table_Screen_shot.png

The Run Analysis button should turn blue (if it doesn't, you might need to go back and fill in an attribute or click the Save button).

Global and cost-saving runtime options (optional)

There are several other options you can configure on the setup form. Note that the default options for anything else on the setup form are fine when running the quickstart. To learn more about these, see Workflow setup: VM and other options.

Step 4. Launch and monitor analysis

Now that you have everything set up, you are ready to submit the job and let Terra take care of the details of running the workflow in the background on a cloud VM. 

4.1. Click on Run Analysis.

4.2. In the new form, click Launch to finalize your submission.
Workflows-Quickstart-Part-2_Launch-workflow_Screen_shot.png

Notice that this will launch two analyses in parallel, one for each sample.

When your jobs are submitted, you'll be redirected to the Job History page to monitor your submission.

To see job status updates, refresh the page.

Step 5: What to expect when your workflow completes

Hopefully your submission will succeed. If so, congratulations on setting up the workflow and running it on input data in a table. If it failed (and especially if it failed immediately), check to make sure you selected the right input attributes from the dropdown.

5.1. Check (refresh) your Job History page. It should look like the screenshot below.

Workflows-Quickstart_Part2_Completed-workflow-in-Job-History_Screenshot.png

5.2. Check the sample table in your Data page. It should look similar to this, with additional columns for the generated BAM and BAM index files.

Workflows-Quickstart-Part2_BAI-BAM-in-data-table_Screenshot.png

5.3. Click on the sample_set table. This new table is a set that includes the two samples you ran from the sample table.

Workflows-Quickstart-Part2_Sample_set-table-expanded_Screenshot.png

Now that you've defined a set, you can choose to run this analysis again or to run a downstream analysis on this group by choosing the set when you choose the data to run on (we'll do that in Part 3!).

What's in the sample set? Where's the data?

If you click on "2 samples", you will see that this column includes only the names of the samples, not the data files associated with those samples.

Links to the data files are in the "sample" table. When you run an analysis on a set of single entities, you need to set up the workflow to find the data (Part 3).
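
To make the relationship between the two tables concrete, here is a minimal Python sketch. It rests on assumptions: the set name my_set, the samples column name, the files_for_set helper, and the bucket paths are hypothetical and only stand in for the tables in your workspace. It shows that the set row stores sample names only, so finding a data file takes a second lookup in the sample table.

```python
# Hypothetical sample table: each row holds links to that sample's data files.
sample_table = {
    "NA12878": {"CRAM": "gs://example-bucket/NA12878.cram"},
    "my_sample": {"CRAM": "gs://example-bucket/my_sample.cram"},
}

# Hypothetical sample_set table: the set row lists member names, not files.
sample_set_table = {
    "my_set": {"samples": ["NA12878", "my_sample"]},
}


def files_for_set(set_id, column):
    """Follow each member name back to the sample table to find its file link."""
    members = sample_set_table[set_id]["samples"]
    return [sample_table[sample_id][column] for sample_id in members]


print(sample_set_table["my_set"]["samples"])  # names only
print(files_for_set("my_set", "CRAM"))        # links found via the sample table
```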

Congratulations! You've completed Part 2 of the Workflows Quickstart!
