Data Tables Quickstart Part 4: Sets again! WDLs that take sets (arrays) as inputs

Allie Hajian

Some analyses take several input files and generate a single output. In this last part of the Data Tables Quickstart tutorial, you'll learn how to recognize and work with this sort of workflow.

Data-QuickStart_Part3_Array-inputs.png

Workflows that accept arrays (sets) as inputs

- The Optimus pipeline takes in multiple lanes of a sample, but outputs one file that corresponds to the sample, not the read lanes.

- Another example, familiar to cancer researchers, is the CNV_Somatic_Panel_workflow, which generates a single Panel of Normals (PoN) from a list of normal samples. The PoN is used during variant calling to filter out systematic errors that occur when reads are processed.

Identifying a workflow that takes a set (array) as input 

You can tell that a workflow takes a set of inputs (rather than a single file) from the input variable type shown in the workflow configuration card.

Input is a single entity: variable type = "File" 
Data-QuickStart-Part4_Input-type-file.png

Input is a set of entities: variable type = "Array[File]"
Data-QuickStart-Part4-Input-type-array.png

4.1. Run a workflow with a set (array of entities) as input

Once you identify that a workflow takes a set as input, how do you set up and run it? The process is slightly different from running a single-input workflow on a set of single entities!

Set up the workflow configuration form

1. Open the "2-Sets-as-Input-Workflow" from the Workflows page.

2. Select the root entity type "specimen_set" from the dropdown.

3. Confirm the input attribute formatting under Inputs.
Remember, the links to the data files are in the "specimen" table, not in the "specimen_set" table! You need to tell the WDL to a) first go to the specimens column of the specimen_set table to get the IDs of the specimens in the array, then b) go to the r1_fastq column of each specimen in the set to get the data.
You'll specify this with the format: this.specimens.r1_fastq (already filled in).

4. Click the blue "Select Data" button.

Notice the extra "s" in the entity attribute!!

Even though the data are in the "specimen" table, the formatting to read/write to the table is
this.specimens.r1_fastq

Remember to add an "s" at the end of the entity name. That's right, it's specimens, not specimen! That extra "s" is just built into the platform, but has been known to trip people up!
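
To make the two-step lookup concrete, here is a minimal Python sketch of how an expression like this.specimens.r1_fastq could be resolved. It is only an illustration, not how the platform is implemented: the set name, specimen IDs, and bucket paths are invented, and resolve() is a hypothetical helper.

```python
# Hypothetical miniature versions of the two data tables.
# All IDs and gs:// paths below are invented for illustration.
specimen = {
    "specimen_1": {"r1_fastq": "gs://example-bucket/specimen_1_R1.fastq"},
    "specimen_2": {"r1_fastq": "gs://example-bucket/specimen_2_R1.fastq"},
}
specimen_set = {
    "my_set": {"specimens": ["specimen_1", "specimen_2"]},
}


def resolve(set_id, expression):
    """Resolve 'this.<members_column>.<attribute>' for one row of specimen_set."""
    root, members_column, attribute = expression.split(".")
    if root != "this":
        raise ValueError("expression must start with 'this'")
    # Step a: read the member IDs from the set table (note the plural column name).
    member_ids = specimen_set[set_id][members_column]
    # Step b: look up the attribute for each member in the specimen table.
    return [specimen[member_id][attribute] for member_id in member_ids]


r1_files = resolve("my_set", "this.specimens.r1_fastq")
print(r1_files)
```

In this sketch, leaving off the "s" (this.specimen.r1_fastq) fails with a KeyError, because the set table has a column named specimens but none named specimen.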

Select the set of input data files

1. Click the "Choose existing sets" radio button  

2. Select one of your sets from the available options 

Data-QuickStart-Part4_Choose-existing-set_Screen_Shot.png

Last, confirm the launch to run the workflow.

The workflow runs a single analysis, even though it takes a set as input!

 

Data-QuickStart-Part4_Confirm-launch_Screen_Shot.png

This differs from Parts 1, 2, and 3, where the workflows were designed to analyze one entity at a time. In those cases, running on a set of inputs launched as many analyses as there were entities in the set.
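
The difference in the number of analyses can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not platform code: plan_runs() is a hypothetical helper, and the three file paths are invented.

```python
def plan_runs(files, input_type):
    """Return one list of input files per analysis that would be launched."""
    if input_type == "Array[File]":
        # The whole set is handed to a single analysis as one array.
        return [list(files)]
    if input_type == "File":
        # Each entity in the set gets its own analysis.
        return [[f] for f in files]
    raise ValueError(f"unsupported input type: {input_type}")


# Invented paths standing in for the files of a three-specimen set.
set_files = [
    "gs://example-bucket/specimen_1_R1.fastq",
    "gs://example-bucket/specimen_2_R1.fastq",
    "gs://example-bucket/specimen_3_R1.fastq",
]

array_runs = plan_runs(set_files, "Array[File]")
single_runs = plan_runs(set_files, "File")
print(len(array_runs), "analysis launched for the Array[File] workflow")
print(len(single_runs), "analyses launched for the File workflow")
```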

4.2. Examine the output

Once your workflow completes successfully, you'll want to take a peek at the output!

Question: Where (what table) is the output data file?

The output data file is stored in the workspace bucket, but you set up the workflow to write its location to a data table. But which table?

The root entity for this workflow was specimen_set, and thus the specimen_set table is where you will find the data:

Data-QuickStart-Part4_Outputs-in-specimen-set-table_Screen_Shot.png
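
As a rough Python sketch of why the output lands there: the output is recorded on the row of the root entity the workflow ran on. The output column name (output_file), the set name, and the bucket path are all assumptions made up for this example, and write_output() is a hypothetical helper rather than anything the platform exposes.

```python
# Hypothetical specimen_set table with one set (IDs invented for illustration).
specimen_set = {
    "my_set": {"specimens": ["specimen_1", "specimen_2"]},
}


def write_output(root_table, row_id, output_expression, value):
    """Store an output on the root entity row named by 'this.<column>'."""
    root, column = output_expression.split(".")
    if root != "this":
        raise ValueError("output expression must start with 'this'")
    root_table[row_id][column] = value


# The root entity is specimen_set, so the output becomes a new column there.
write_output(
    specimen_set,
    "my_set",
    "this.output_file",  # assumed column name
    "gs://example-bucket/outputs/my_set_output.txt",  # assumed path
)
print(specimen_set["my_set"])
```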


Congratulations! You've finished the Data Quickstart!!

Hopefully you have a better understanding of how you can use integrated tables to organize and manage your data in Terra! 
