Scatter: over chunks rather than for each
Hello!
I have a specific question about scattering in WDL. What I want to do is a lot like this:
scatter (file in file_array) {
  call my_task { input: file = file }
}
except that, in my case, it is horribly inefficient to spin up a separate VM for every single file. What I want to do is spin up a given number of VMs (a workflow input Int) and pass each VM as many files as needed so that every file is assigned to exactly one VM.
What I'd like is to do something like this:
input {
  Int n_machines = 3
}
scatter (files in [[file1, file2], [file3, file4], [file5, file6]]) {
  call my_task { input: files = files }
}
Here the first VM runs [file1, file2], the second VM runs [file3, file4], and the third VM runs [file5, file6].
My question is: how do I do this programmatically in WDL, in a nice clean way? Something like the equivalent of the following Python:
import math

file_array = [file1, file2, file3, file4, file5, file6]
n_machines = 3
# files per machine
n = math.ceil(len(file_array) / n_machines)
scatter_iterator = [file_array[i:i + n] for i in range(0, len(file_array), n)]
Then `scatter_iterator` will be `[[file1, file2], [file3, file4], [file5, file6]]`.
Two questions really:
1. How can I do this in WDL?
2. Where does this compute happen? Can it happen within the `workflow` block? I would guess not.
Do I really need to make a separate task that runs on its own VM to compute the array I'm interested in? That seems like a waste...
Thanks so much!
Stephen
-
Hi Stephen,
We believe you should be able to do this in WDL, though you may need to consult the WDL specification to find the standard library functions equivalent to the Python functions you mentioned. For example, math.ceil(float) in Python corresponds to ceil(Float) in WDL.
If you can find the WDL equivalents, you may be able to run the calculation right from your workflow block. Here's an example of how you can set up a calculation in a workflow block: https://github.com/broadinstitute/warp/blob/6a6cce619db2aa597927e625483fca6f83e90663/pipelines/broad/dna_seq/germline/joint_genotyping/JointGenotyping.wdl#L86
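As a rough illustration of the idea, here is a minimal sketch of chunking done entirely in the workflow block. It is not taken from the linked pipeline. WDL has no list slicing, so the sketch swaps in index arithmetic with a nested scatter, and it uses integer division in place of ceil() to avoid Int/Float conversion. The names `process_chunk`, `scatter_over_chunks`, `chunk_member` and the Docker image are hypothetical placeholders. It assumes WDL 1.0, an engine that supports nested scatters (Cromwell does), at least one input file, and n_machines of 1 or more.

```wdl
version 1.0

# Hypothetical placeholder task: receives one chunk of files per VM.
task process_chunk {
  input {
    Array[File] files
  }
  command <<<
    # Placeholder work: print the localized path of every file in this chunk.
    for f in ~{sep=" " files}; do
      echo "$f"
    done
  >>>
  output {
    Array[String] seen = read_lines(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}

workflow scatter_over_chunks {
  input {
    Array[File] file_array
    Int n_machines = 3
  }

  Int n_files = length(file_array)
  # Files per machine, rounded up using integer arithmetic.
  # Assumes n_files >= 1 and n_machines >= 1.
  Int chunk_size = (n_files + n_machines - 1) / n_machines
  # Number of non-empty chunks; never larger than n_machines.
  Int n_chunks = (n_files + chunk_size - 1) / chunk_size

  scatter (i in range(n_chunks)) {
    # Visit indices i * chunk_size up to (i + 1) * chunk_size - 1.
    scatter (j in range(chunk_size)) {
      Int idx = i * chunk_size + j
      if (idx < n_files) {
        File chunk_member = file_array[idx]
      }
    }
    # The inner scatter yields Array[File?]; drop the undefined entries.
    Array[File] chunk = select_all(chunk_member)

    call process_chunk { input: files = chunk }
  }
}
```

With six files and n_machines = 3, this should give three calls with two files each. With five files and n_machines = 4, chunk_size is 2 and only three calls are made, which matches the behavior of the Python list comprehension in the question.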
If for whatever reason you can’t obtain a complete translation to WDL, you can indeed create a task, paste your Python script into its command block, and call that task before the scatter to produce an Array[Array[File]]. This would result in a separate task, as you said. You could probably allocate a very small machine to this task if needed, though.
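A minimal sketch of such a helper task, under several assumptions: the task name `chunk_paths`, the output file name `chunks.tsv`, the Docker image, and the runtime values are all hypothetical. Paths are passed as Array[String] rather than Array[File] so that the helper VM does not have to localize the files, and the chunks are written as one tab-separated line per chunk so that read_tsv() can return them as Array[Array[String]].

```wdl
version 1.0

# Hypothetical helper task: splits a list of paths into chunks with Python.
task chunk_paths {
  input {
    # Paths are passed as strings so the files themselves are not localized.
    Array[String] paths
    Int n_machines = 3
  }
  command <<<
    python3 <<CODE
    import math

    # write_lines() puts one path per line into a temporary file.
    with open("~{write_lines(paths)}") as handle:
        paths = [line.strip() for line in handle if line.strip()]

    # Files per machine, rounded up (at least 1 so range() gets a valid step).
    n = max(1, math.ceil(len(paths) / ~{n_machines}))
    chunks = [paths[i:i + n] for i in range(0, len(paths), n)]

    # One chunk per line, paths separated by tabs.
    with open("chunks.tsv", "w") as out:
        for chunk in chunks:
            out.write("\t".join(chunk) + "\n")
    CODE
  >>>
  output {
    Array[Array[String]] chunks = read_tsv("chunks.tsv")
  }
  runtime {
    # Assumed image and sizes; any small machine with python3 should do.
    docker: "python:3.11-slim"
    memory: "1 GB"
    cpu: 1
  }
}
```

The workflow could then scatter over `chunk_paths.chunks` and pass each inner array to a task whose input is declared as Array[File], relying on WDL's String-to-File coercion. Whether that coercion resolves cloud paths as expected depends on the backend, so it is worth checking on a small input first.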
If you have any further questions, please don't hesitate to let us know!
Kind regards,
Jason
-
Hi Jason,
Okay, the example WDL you sent is very helpful. I wasn't sure whether any computation could be done in the workflow block, but the example shows that WDL standard library functions can be called there.
And I also realized that my particular use case actually requires a more complicated scatter anyway, so I will definitely have to use a separate task.
Thanks for the information!
Stephen