Scatter: over chunks rather than for each

March 18, 2021 23:01
4 comments

Hello!

I have a specific question about scattering in WDL. What I want to do is a lot like this:

scatter (file in file_array) {
  # my_task is a placeholder name ("task" itself is a reserved word in WDL)
  call my_task { input: file = file }
}

except that, in my case, it is horribly inefficient to spin up a separate VM for every single file. Instead, I want to spin up a fixed number of VMs (given as a workflow input Int) and pass each VM as many files as needed so that every file is assigned to some VM.

What I'd like to do is something like this:

input {
  # A declaration with a default is already optional to supply, so Int (not Int?) is enough
  Int n_machines = 3
}

scatter (files in [[file1, file2], [file3, file4], [file5, file6]]) {
  # my_task is a placeholder name ("task" itself is a reserved word in WDL)
  call my_task { input: files = files }
}

Here the first VM runs [file1, file2], the second VM runs [file3, file4], and the third VM runs [file5, file6].

My question is: how do I do this programmatically in WDL? In a nice clean way? Something like the equivalent of the following python:

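import math

# placeholder file names standing in for real paths
file_array = ["file1", "file2", "file3", "file4", "file5", "file6"]
n_machines = 3

# files per machine (rounded up so that no file is left over)
n = math.ceil(len(file_array) / n_machines)

# consecutive chunks of at most n files each
scatter_iterator = [file_array[i:i + n] for i in range(0, len(file_array), n)]

Then `scatter_iterator` will be `[["file1", "file2"], ["file3", "file4"], ["file5", "file6"]]`.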
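
For reference, the chunking above can be wrapped in a reusable function. This is only a sketch: the function names `chunk_by_ceiling` and `chunk_balanced` are made up for illustration and are not part of WDL or any library. One thing it makes visible is that the ceiling-based split can produce fewer chunks than `n_machines` (5 files over 4 machines gives 3 chunks), so a balanced variant is included for comparison, assuming that using every requested machine matters.

```python
import math


def chunk_by_ceiling(files, n_machines):
    """Split files into consecutive chunks of ceil(len(files) / n_machines) items."""
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    if not files:
        return []
    n = math.ceil(len(files) / n_machines)
    return [files[i:i + n] for i in range(0, len(files), n)]


def chunk_balanced(files, n_machines):
    """Split files into min(n_machines, len(files)) chunks whose sizes differ by at most one."""
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    n_chunks = min(n_machines, len(files))
    if n_chunks == 0:
        return []
    # The first `extra` chunks get one additional file each.
    base, extra = divmod(len(files), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        size = base + (1 if k < extra else 0)
        chunks.append(files[start:start + size])
        start += size
    return chunks
```

With six files and three machines both functions give the same three pairs; they only differ when the file count does not divide evenly.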

Two questions really:
1. How can I do this in WDL?

2. Where does this compute happen? Can it happen within the `workflow` block? I would guess not.

Do I really need to make a separate task that runs on its own VM to compute the array I'm interested in? That seems like a waste...

Thanks so much!

Stephen

Comments

Jason Cerrato
- March 19, 2021 15:37
Hi Stephen,

Thank you for your question. We'll take a look and get back to you as soon as we can!

Kind regards,

Jason

Jason Cerrato
- March 22, 2021 19:39
Hi Stephen,

We believe you should be able to do this in WDL, though you may need to check the WDL spec to find the WDL functions equivalent to the Python functions you mentioned. For example, math.ceil(float) in Python would be ceil(Float) in WDL.

If you can find the WDL equivalents, you may be able to run the calculation right from your workflow block. Here's an example of how you can set up a calculation in a workflow block: https://github.com/broadinstitute/warp/blob/6a6cce619db2aa597927e625483fca6f83e90663/pipelines/broad/dna_seq/germline/joint_genotyping/JointGenotyping.wdl#L86

If for whatever reason you can’t obtain a complete translation to WDL, you can indeed create a task, paste your Python script into its command block, and call that task before the scatter to produce the Array[Array[File]] of chunks. This would result in a separate task, as you said, though you could probably allocate a very small machine to it.
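
As a rough sketch of what such a script might look like (not a tested Terra or Cromwell recipe): it assumes the task receives a text file with one path per line, for example one written by WDL's write_lines, and emits one tab-separated chunk per line so that the task output could be read back with read_tsv. The function name `chunks_to_tsv` and the script name `chunk.py` are hypothetical. read_tsv returns strings, so whether they can then be used as File values depends on your engine and is an assumption here.

```python
import math
import sys


def chunks_to_tsv(paths, n_machines):
    """Return TSV text with one chunk of paths per line."""
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    if not paths:
        return ""
    # Same ceiling-based chunk size as in the original question.
    n = math.ceil(len(paths) / n_machines)
    chunks = [paths[i:i + n] for i in range(0, len(paths), n)]
    return "".join("\t".join(chunk) + "\n" for chunk in chunks)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python chunk.py file_list.txt 3 > chunks.tsv
    with open(sys.argv[1]) as handle:
        paths = [line.strip() for line in handle if line.strip()]
    sys.stdout.write(chunks_to_tsv(paths, int(sys.argv[2])))
```

Because the script only handles path strings and never opens the files themselves, the task running it does not need the data localized in principle, which is why a very small machine should be enough.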

If you have any further questions, please don't hesitate to let us know!

Kind regards,

Jason

Stephen Fleming
- March 22, 2021 20:31
Hi Jason,

Okay, the example WDL you sent is very helpful. I wasn't sure whether any computation could be done in the workflow block, but the example shows that WDL standard library functions can be called there.

And I also realized that my particular use case actually requires a more complicated scatter anyway, so I will definitely have to use a separate task.

Thanks for the information!
Stephen

Jason Cerrato
- March 23, 2021 13:32
Hi Stephen,

Glad to hear it helped! If we can assist with anything else, please let us know. Best wishes!

Kind regards,

Jason
