Some analyses combine several input files to generate a single output. In this last part of the Workflows Quickstart tutorial, you'll learn how to recognize and work with this sort of workflow.
Workflows that accept arrays (sets) as inputs
The Optimus pipeline takes in multiple lanes of a sample, but outputs one file that corresponds to the sample, not the read lanes.
Another example, familiar to cancer researchers, is the CNV_Somatic_Panel_workflow, which generates a single Panel of Normals (PoN) from a list of normal samples. The PoN is used during variant calling to filter out systematic errors that occur when reads are processed.
Identifying a workflow that takes a set (array) as input
You can tell that a workflow takes a set of inputs (rather than a single file) from the input variable's Type in the workflow configuration card.
Input is a single entity: Input variable file type = File
Input is a set of entities: Input variable file type = Array[File]
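As a rough illustration, the same check can be expressed on the type string shown in the configuration card. The helper name `takes_set` is hypothetical (it is not part of Terra or WDL) and it only inspects the text of the type.

```python
def takes_set(wdl_type: str) -> bool:
    """Return True if a WDL type string describes an array (set) input.

    Hypothetical helper for illustration only; it looks at the type name
    and nothing else.
    """
    return wdl_type.strip().startswith("Array[")


print(takes_set("File"))         # input is a single entity
print(takes_set("Array[File]"))  # input is a set of entities
```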
1. Run a workflow with a set (array of entities) as input
Once you identify a workflow that takes a set as input, how do you set up and run it? The process is slightly different from running a single-input workflow on a set of single entities!
Step 1: Set up the workflow configuration form
1.1. Open the "2-Sets-as-Input-Workflow" from the Workflows page.
1.2. Select the root entity type "specimen_set" from the dropdown.
1.3. Confirm the input attribute formatting in the Inputs.
Remember, the links to the data files are in the "specimen" table, not in the "specimen_set" table! You need to tell the WDL to:
a) Go to the specimens column of the specimen_set table to get the IDs of the specimens in the array
b) Then go to the r1_fastq column of each specimen in the set to get the data.
You'll specify this with the format:
this.specimens.r1_fastq (already filled in)
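To make the two-step lookup concrete, here is a minimal sketch that mimics it with plain Python dictionaries. The table contents (IDs such as `spec_1`, the set name `my_set`, and the `gs://example-bucket/...` paths) are made up for illustration, and `resolve_r1_fastq` is a hypothetical helper, not how the platform is actually implemented.

```python
from typing import List

# Toy stand-ins for the two data tables (made-up IDs and paths).
specimen = {
    "spec_1": {"r1_fastq": "gs://example-bucket/spec_1_R1.fastq"},
    "spec_2": {"r1_fastq": "gs://example-bucket/spec_2_R1.fastq"},
}
specimen_set = {
    "my_set": {"specimens": ["spec_1", "spec_2"]},
}


def resolve_r1_fastq(set_id: str) -> List[str]:
    """Mimic this.specimens.r1_fastq for one row of the specimen_set table."""
    # a) Read the specimens column of the set to get the member IDs.
    member_ids = specimen_set[set_id]["specimens"]
    # b) Read the r1_fastq column of each member in the specimen table.
    return [specimen[member]["r1_fastq"] for member in member_ids]


print(resolve_r1_fastq("my_set"))
```

The result is a list with one file path per specimen, which is the shape an Array[File] input expects.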
1.4. Click the blue "Select Data" button.
Notice the extra "s" in the entity attribute! Even though the data is in the "specimen" table, the expression uses specimens, not specimen. That extra "s" is just built into the platform, but it has been known to trip people up!
Step 2: Select the set of input data files
1. Click the "Choose existing sets" radio button
2. Select one of your sets from the available options
Step 3: Confirm launch to run the workflow
The workflow runs a single analysis, even though it takes a set as input!
This differs from Parts 1, 2, and 3, where the workflows were designed to run a separate analysis for each sample. In that case, running on a set of inputs launched as many analyses as there were entities in the set.
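The difference in how many analyses get launched can be sketched as follows. `plan_runs` is a hypothetical function written for this illustration; it only models the counting described above and does not call any platform API.

```python
from typing import List


def plan_runs(members: List[str], input_is_array: bool) -> List[List[str]]:
    """Return the inputs handed to each analysis that would be launched."""
    if input_is_array:
        # Set-as-input workflow: one analysis receives the whole set.
        return [list(members)]
    # Single-input workflow run on a set: one analysis per entity.
    return [[member] for member in members]


members = ["spec_1", "spec_2", "spec_3"]  # made-up entity IDs
print(len(plan_runs(members, input_is_array=True)))
print(len(plan_runs(members, input_is_array=False)))
```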
2. Examine the output
Once your workflow completes successfully, you'll want to take a peek at the output. HINT: Look in the table corresponding to the root entity.
The output data file is stored in the workspace bucket. However, you set up the workflow to write to the data table. But which table?
The root entity for this workflow was specimen_set, and thus the specimen_set table is where you will find the data.
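As a small sketch of where the result lands, the snippet below models the specimen_set table as a dictionary and attaches the output to the row of the set that was run. The attribute name `output_file`, the set name `my_set`, and the bucket path are assumptions made up for this example; the real column name depends on how the workflow outputs are configured.

```python
# Toy stand-in for the specimen_set table (made-up set name and members).
specimen_set = {"my_set": {"specimens": ["spec_1", "spec_2"]}}


def record_output(table: dict, root_id: str, attribute: str, value: str) -> None:
    """Write a workflow output onto the row of the root entity."""
    table[root_id][attribute] = value


# "output_file" is a hypothetical attribute name chosen for illustration.
record_output(specimen_set, "my_set", "output_file",
              "gs://example-bucket/my_set_output.txt")
print(specimen_set["my_set"])
```

Note that the individual rows of the specimen table are untouched; only the set's row gains a new attribute.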
Congratulations! You've finished the Data Tables Quickstart!
Hopefully you have a better understanding of how you can use integrated tables to organize and manage your data in Terra!