Need help mapping output arrays
Hello Everyone,
We have a workflow in WDL that consumes an array of files corresponding to a set of samples, performs a joint calling task, and returns another array of files, each of which corresponds to one input file, like this:
```
workflow Workflow {
  input {
    Array[File] gvcfs
  }
  call JointTask {
    input:
      gvcfs = gvcfs
  }
  output {
    Array[File] result_files = JointTask.output_array
  }
}
```
Here, `result_files[i]` corresponds to `gvcfs[i]` for the i-th sample. My question is about how to make this work with the data model. In Terra, if `gvcfs` is provided from a sample set, e.g. `this.samples.gvcf`, can `result_files` be assigned back to the individual sample entities? That is, can I just specify the output as `this.samples.result_file`?
Thanks in advance.
Regards,
Diksha
Comments
Hello Diksha,
Thank you for your inquiry. We'll take a look and get back to you as soon as we can!
Kind regards,
Jason
Hi Diksha,
Unfortunately, you cannot write outputs one level down from the entity the workflow ran on: a workflow run on a sample set can write to that sample set's row, but not to the rows of its member samples. This design helps protect against ambiguity. For example, suppose you had 50 samples and two sample sets, A and B, where sample set A contained samples 1-30 and sample set B contained samples 20-50. If you ran a workflow on sample set A and wrote to sample 25's row, nothing in that row would indicate which sample set the file came from. It gets worse if you run joint calling on sample set A and then again on sample set B: if writing to the underlying sample table were allowed, the second run could overwrite the first run's results for the overlapping samples.
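To make the overlap problem concrete, here is a small Python sketch of the scenario described above. It models the sample table as a plain dictionary and is not Terra's API: the function names, the `result_file` column, and the `gs://bucket/...` paths are all made up for illustration.

```python
# Toy model of a sample table: one row per sample.
# Everything here is illustrative; none of it is Terra's API.
sample_table = {f"sample_{i}": {} for i in range(1, 51)}

sample_sets = {
    "A": [f"sample_{i}" for i in range(1, 31)],   # samples 1-30
    "B": [f"sample_{i}" for i in range(20, 51)],  # samples 20-50
}


def run_joint_calling(set_name):
    """Pretend joint calling: one hypothetical output path per sample in the set."""
    return [f"gs://bucket/{set_name}/{s}.vcf.gz" for s in sample_sets[set_name]]


def write_down_to_samples(set_name):
    """The disallowed behaviour: write each result into the sample's own row."""
    members = sample_sets[set_name]
    for sample, path in zip(members, run_joint_calling(set_name)):
        sample_table[sample]["result_file"] = path


write_down_to_samples("A")
after_a = sample_table["sample_25"]["result_file"]

write_down_to_samples("B")
after_b = sample_table["sample_25"]["result_file"]

# Samples that belong to both sets are the ones whose cell gets overwritten.
overlap = set(sample_sets["A"]) & set(sample_sets["B"])

print(after_a)       # value in sample 25's row after the run on set A
print(after_b)       # value in the same cell after the run on set B
print(len(overlap))  # number of samples affected by the overwrite
```

After the second run, sample 25's row holds only the set B result, and nothing in the row records that a set A result was ever there.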
Because workflows run on sample sets create outputs that specifically pertain to the samples defined by the set, you will want to associate the outputs with that set.
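As a rough sketch of what associating the outputs with the set looks like, the Python below stores the whole result array on the set's row and looks up one sample's file by position. It assumes, as in your description, that the workflow returns results in the same order as the input array, and that this order matches the set's member list. The table layout and helper names are hypothetical, not Terra's API.

```python
# Illustrative only: a plain dict standing in for the sample_set table.
sample_set_table = {
    "A": {"samples": ["sample_1", "sample_2", "sample_3"]},
}


def store_on_set(set_name, result_files):
    """Attach the whole output array to the set's own row."""
    members = sample_set_table[set_name]["samples"]
    if len(result_files) != len(members):
        raise ValueError("expected one result file per sample in the set")
    sample_set_table[set_name]["result_files"] = list(result_files)


def result_for(set_name, sample):
    """Find a sample's file by its position in the set (assumes order is preserved)."""
    row = sample_set_table[set_name]
    return row["result_files"][row["samples"].index(sample)]


# Hypothetical output paths, in the same order as the set's members.
store_on_set("A", ["a/1.vcf", "a/2.vcf", "a/3.vcf"])
print(result_for("A", "sample_2"))
```

Because the array lives on the set's row, a second run on a different set writes to a different row, so the two runs cannot overwrite each other.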
I'm curious to know more about why the joint calling method produces individual output files rather than a single multisample VCF. We have a best practices workflow for joint calling that you are free to use, found in the featured workspace GATK4-Germline-Preprocessing-VariantCalling-JointCalling. This workflow takes in an array of gvcfs, combines them, performs joint calling on the samples, runs some QC on the result, and finally outputs a single multisample VCF. Would this work for your needs, or is there a particular reason you need one output file per sample in the set?
Kind regards,
Jason