Data Tables Quickstart Part 4: Sets again! WDLs that take sets (arrays) as inputs

Allie Hajian

Some analyses combine several input files to generate a single output. In this last part of the Data Tables Quickstart tutorial, you'll learn how to recognize and work with this sort of workflow.

Data-QuickStart_Part3_Array-inputs.png

Workflows that accept arrays (sets) as inputs

The Optimus pipeline, for example, takes in multiple lanes of a sample but outputs one file that corresponds to the sample, not to the individual read lanes.

Another example, familiar to cancer researchers, is the CNV_Somatic_Panel_workflow, which generates a single Panel of Normals (PoN) from a list of normal samples. The PoN is used during variant calling to filter out systematic errors that occur when reads are processed.

Identifying a workflow that takes a set (array) as input 

You can tell that a workflow takes a set of inputs (rather than a single file) from the input variable's Type in the workflow configuration card.

Input is a single entity: Input variable file type = File 
Data-QuickStart-Part4_Input-type-file.png

Input is a set of entities: Input variable file type = Array[File]
Data-QuickStart-Part4-Input-type-array.png

1. Run a workflow with a set (array of entities) as input

Once you have identified that a workflow takes a set as input, how do you set up and run it? The process is slightly different from running a single-input workflow on a set of single entities!

Step 1: Set up the workflow configuration form

1.1. Open the "2-Sets-as-Input-Workflow" from the Workflows page.

1.2. Select the root entity type "specimen_set" from the dropdown.

1.3. Confirm the input attribute formatting in the Inputs.
Remember that the links to the data files are in the "specimen" table, not in the "specimen_set" table! You need to tell the WDL to

a) Go to the specimens column of the specimen_set table to get the IDs of the specimens in the array

b) Then go to the r1_fastq column of each specimen in the set to get the data.

You'll specify this with the format this.specimens.r1_fastq (already filled in).

1.4. Click the blue "Select Data" button.

Notice the extra "s" in the entity attribute! Even though the data is in the "specimen" table, the formatting to read/write to the table is this.specimens.r1_fastq. That's right, it's specimens, not specimen! The extra "s" at the end of the entity name is built into the platform, and it has been known to trip people up.
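The two-step lookup behind this.specimens.r1_fastq (steps a and b above) can be pictured with a small Python sketch. This is only an illustration, not how the platform is implemented: the tables are plain dictionaries, and the specimen IDs, the set name "set_A", the bucket paths, and the helper resolve_set_attribute are all hypothetical.

```python
# Hypothetical, simplified stand-ins for the two data tables (not real data).
specimen_table = {
    "specimen_1": {"r1_fastq": "gs://example-bucket/specimen_1_r1.fastq"},
    "specimen_2": {"r1_fastq": "gs://example-bucket/specimen_2_r1.fastq"},
    "specimen_3": {"r1_fastq": "gs://example-bucket/specimen_3_r1.fastq"},
}
specimen_set_table = {
    "set_A": {"specimens": ["specimen_1", "specimen_3"]},
}


def resolve_set_attribute(set_id, members_column, attribute):
    """Mimic this.<members_column>.<attribute> for one row of the set table."""
    # a) Read the member IDs from the set's membership column.
    member_ids = specimen_set_table[set_id][members_column]
    # b) Look up the requested attribute for each member, in order.
    return [specimen_table[member_id][attribute] for member_id in member_ids]


# Equivalent in spirit to this.specimens.r1_fastq with "set_A" as the root entity.
r1_fastqs = resolve_set_attribute("set_A", "specimens", "r1_fastq")
print(r1_fastqs)
```

In this sketch, asking for the column "specimen" (without the "s") fails with a KeyError, because the set table has no column of that name. That mirrors why the extra "s" matters in the real attribute.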

Step 2: Select the set of input data files

2.1. Click the "Choose existing sets" radio button  

2.2. Select one of your sets from the available options 

Data-QuickStart-Part4_Choose-existing-set_Screen_Shot.png

Step 3: Confirm launch to run the workflow

The workflow runs a single analysis, even though it takes a set as input!

Data-QuickStart-Part4_Confirm-launch_Screen_Shot.png

This differs from Parts 1, 2, and 3, where the workflows were designed to run a separate analysis for each sample. In those cases, running on a set of inputs launched as many analyses as there were entities in the set.
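The difference in the number of analyses can be sketched in Python. This is a conceptual illustration under stated assumptions, not the platform's scheduling logic: the function plan_analyses, the flag root_is_set, and the entity names are hypothetical.

```python
def plan_analyses(root_is_set, set_id, member_ids):
    """Return the (root entity, inputs) pairs that would be launched."""
    if root_is_set:
        # Array[File]-style input: a single analysis receives every member.
        return [(set_id, list(member_ids))]
    # File-style input: one analysis is launched per member of the set.
    return [(member_id, [member_id]) for member_id in member_ids]


members = ["specimen_1", "specimen_2", "specimen_3"]

per_specimen = plan_analyses(False, "set_A", members)  # as in Parts 1 to 3
per_set = plan_analyses(True, "set_A", members)        # as in this part

print(len(per_specimen), "analyses when the root entity is a single specimen")
print(len(per_set), "analysis when the root entity is the set")
```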

2. Examine the output

Once your workflow completes successfully, you'll want to take a peek at the output. HINT: Look in the table corresponding to the root entity. 

  • The output data file is stored in the workspace bucket. However, you also set up the workflow to write to a data table. But which table?

    The root entity for this workflow was specimen_set, so the specimen_set table is where you will find the data:
    Data-QuickStart-Part4_Outputs-in-specimen-set-table_Screen_Shot.png
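As a rough mental model of where the output lands, the sketch below writes an output to the row of the root entity and nowhere else. It is illustrative only: the dictionary layout, the helper record_output, the column name "combined_output", and the bucket path are hypothetical, not names from this workspace.

```python
# Hypothetical, simplified data tables keyed by entity type, then entity ID.
tables = {
    "specimen": {"specimen_1": {}, "specimen_2": {}},
    "specimen_set": {"set_A": {"specimens": ["specimen_1", "specimen_2"]}},
}


def record_output(tables, root_entity_type, root_entity_id, column, value):
    """Write one workflow output to the row of the root entity."""
    tables[root_entity_type][root_entity_id][column] = value


# The root entity type was specimen_set, so the output goes in that table.
record_output(
    tables,
    "specimen_set",
    "set_A",
    "combined_output",  # hypothetical output column name
    "gs://example-bucket/combined_output.txt",  # hypothetical bucket path
)
print(tables["specimen_set"]["set_A"])
```

The rows of the specimen table are left untouched, which is why looking there for the output of a set-based workflow turns up nothing.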

Congratulations! You've finished the Data Tables Quickstart!

Hopefully you have a better understanding of how you can use integrated tables to organize and manage your data in Terra! 
