scatter over optional array
I have a case where I want to iterate over two arrays, one of which may or may not be present. I created the test WDL below. With create_inner_array = false it throws the error
Failed to evaluate input 'CreateArray.echo' (reason 1 of 1): Failed to lookup input value for required input echo
With create_inner_array = true, however, it throws the error
assertion failed: base member type WomMaybeEmptyArrayType(WomOptionalType(WomAnyType)) and womtype WomMaybeEmptyArrayType(WomOptionalType(WomStringType)) are not compatible
I've also given GROUP_FireCloud-Support@firecloud.org access to the workspace https://app.terra.bio/#workspaces/carterlabtest/test . I'm referring to the last two job submissions.
(EDIT: submissions 1c7b62af-aa35-46e7-abac-e69ad18a4386 and 44dafd7e-a356-4ae5-93cb-99fa99d20c7f )
Is there a way to set up this double (scattered) loop where one of the arrays is of Array[String]? type?
version development

workflow runIterationWithNone {
    input {
        Array[String] outer_array = ["1", "2"]
        Boolean create_inner_array = false
    }

    if (create_inner_array) {
        scatter (count in ["one", "two"]) {
            call Echo as CreateArray {
                input:
                    non_optional_string=count
            }
        }
    }

    scatter (non_optional_string in outer_array) {
        scatter (optional_string in select_first([CreateArray.echo, [None]])) {
            call Echo as GetEcho {
                input:
                    non_optional_string=non_optional_string,
                    optional_string=optional_string
            }
        }
    }

    output {
        Array[String] echo = flatten(GetEcho.echo)
    }
}

task Echo {
    input {
        String non_optional_string
        String? optional_string

        # runtime
        String gatk_docker = "broadinstitute/gatk:4.0.0.0"
        Int? boot_disk_sizeGB = 12
        Int disk_spaceGB = 1
        Int memoryGB = 1
        Int command_memGB = 1
        Int? preemptible = 0
        Int? max_retries = 0
        Int cpu = 1
    }

    String string = non_optional_string + select_first([optional_string, ""])

    command <<<
        echo ~{string}
    >>>

    output {
        String echo = read_string(stdout())
    }

    runtime {
        docker: gatk_docker
        bootDiskSizeGb: boot_disk_sizeGB
        memory: memoryGB + " GB"
        disks: "local-disk " + disk_spaceGB + " HDD"
        preemptible: preemptible
        maxRetries: max_retries
        cpu: cpu
    }
}
Comments
Hey Philipp,
Thanks for writing in. We'll take a closer look at this and get back to you as soon as we can.
Kind regards,
Jason
Hi Philipp,
Apologies for the delay here! I was out sick on Friday but I will look into getting you an answer here as quickly as I can.
Kind regards,
Jason
Hi Philipp,
It looks like, with create_inner_array = false, the error "Failed to lookup input value for required input echo" is thrown because CreateArray.echo is evaluated for your scatter even though it isn't defined when create_inner_array = false.
I'm going to talk with my colleagues to see what the best recommendation would be for re-writing this so that it can work regardless of whether create_inner_array is true or false.
You mentioned you were ultimately interested in setting up a double (scattered) loop where one of the arrays is of type Array[String]?. Can you give us a little more detail, with specifics if possible, about what you're looking to achieve with this setup?
Kind regards,
Jason
Thanks, Jason.
The ultimate goal is to set up a proper multisample mutect2 pipeline. GATK currently does "multisample" by just running M2 in single tumor-normal pair mode for all pairs ... .
Input shall be an array of tumor bam files and an optional array of normal bam files. For the step of calculating the contaminations, I first get the pileup summaries for the tumor samples (which corresponds to the outer_array above) and, if present, for the normal samples (which corresponds to the inner array CreateArray.echo as the pileup summaries are only created if the normal bams are actually present). As the corresponding code excerpt this then reads
The optional normal_pileups argument is directly handed through to the GATK CalculateContamination as
But of course this doesn't work as I'm running into the problems described in the test case above.
Hope this helps!
Best,
Philipp
Hi Philipp,
Thanks for that - let me check with some colleagues more familiar with the Mutect2 pipeline to see if they have any recommendations here.
Kind regards,
Jason
Hi Philipp,
I've heard from the GATK team that the developer for mutect2 is largely preoccupied at this time, but if you were interested in discussing the methodology for the mutect2 pipeline you could post to the GATK forum and they can try to get back to you as soon as they're able.
Our team WDL expert is also very wrapped up at this time, but they offered the suggestion to check out the select_all() function to see if that would work for your purposes. If you're interested, I can file the request for them to take a closer look at this as soon as they're able and I can let you know when I hear from them.
You can alternatively try to see if anyone in the OpenWDL Slack has any suggestions.
Let me know what you think and what works for you.
Kind regards,
Jason
Dear Jason,
Thanks a lot for reaching out to your colleagues! I'd appreciate it if your WDL expert could take a closer look at this whenever it is convenient. I'll also try other channels and will poke around myself a bit more. Whenever I have an update on this I'll post it here. It might take a couple of days as I'm also wrapped up in things ...
Best,
Philipp
Hi Philipp,
Sounds good - I'll file this request with them and get back to you as soon as I hear back!
Kind regards,
Jason
All right, a workaround is to replace None with an empty string "" and then deal with the empty string in the task:
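A minimal sketch of how this workaround could look, based on the test WDL from the question. It is untested and not the snippet from the original post; the name inner_array is illustrative, and the Echo task is trimmed to the bare minimum.

```wdl
version development

workflow runIterationWithNone {
    input {
        Array[String] outer_array = ["1", "2"]
        Boolean create_inner_array = false
    }

    if (create_inner_array) {
        scatter (count in ["one", "two"]) {
            call Echo as CreateArray {
                input:
                    non_optional_string = count
            }
        }
    }

    # CreateArray.echo is Array[String]? out here. If the block above did
    # not run, fall back to a one-element array holding "" instead of None.
    Array[String] inner_array = select_first([CreateArray.echo, [""]])

    scatter (non_optional_string in outer_array) {
        scatter (optional_string in inner_array) {
            call Echo as GetEcho {
                input:
                    non_optional_string = non_optional_string,
                    optional_string = optional_string
            }
        }
    }

    output {
        Array[String] echo = flatten(GetEcho.echo)
    }
}

task Echo {
    input {
        String non_optional_string
        String? optional_string
    }

    # An empty string simply adds nothing to the echoed value.
    String string = non_optional_string + select_first([optional_string, ""])

    command <<<
        echo ~{string}
    >>>

    output {
        String echo = read_string(stdout())
    }

    runtime {
        docker: "broadinstitute/gatk:4.0.0.0"
    }
}
```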
If you want to check in the task if it's defined or not, then something like
would do.
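One way such a check might look inside the task, assuming "" is the agreed sentinel for "no value". The names has_optional and suffix and the " with " text are made up for illustration; this is a sketch, not the original snippet.

```wdl
version development

task Echo {
    input {
        String non_optional_string
        # "" is the sentinel meaning "nothing was supplied".
        String optional_string = ""
    }

    # True only when a real (non-empty) value came in.
    Boolean has_optional = optional_string != ""

    # Build the extra text only when there is something to add.
    String suffix = if has_optional then " with " + optional_string else ""

    command <<<
        echo ~{non_optional_string}~{suffix}
    >>>

    output {
        String echo = read_string(stdout())
    }

    runtime {
        docker: "broadinstitute/gatk:4.0.0.0"
    }
}
```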
I realized that if we don't work with Strings but Files, we have to work a bit harder:
I uploaded an empty file called "no_file" into the bucket which I now have to give as an additional input. This file needs to exist because the Echo task is localizing the file. If the command in Echo knows how to use non-localized files, then we can just define
somewhere in the workflow and write in the Echo task
The file doesn't need to exist, but the file structure needs to satisfy cromwell file patterns.
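A rough sketch of what the non-localized variant could look like. It assumes Cromwell's localization_optional hint in parameter_meta, which asks the backend to hand the task the remote path instead of copying the file. The bucket path, the inner_files input, and the workflow and task names are hypothetical placeholders, not taken from the original post.

```wdl
version development

workflow OptionalFileSketch {
    input {
        Array[File]? inner_files
    }

    # Hypothetical placeholder: nothing has to exist at this path, but it
    # must look like a path the backend accepts.
    File none_file = "gs://hypothetical-bucket/no_file"

    scatter (inner_file in select_first([inner_files, [none_file]])) {
        call EchoPath {
            input:
                optional_file = inner_file
        }
    }

    output {
        Array[String] echo = EchoPath.echo
    }
}

task EchoPath {
    input {
        File optional_file
    }

    parameter_meta {
        # Pass the path through as-is rather than localizing the file.
        optional_file: { localization_optional: true }
    }

    command <<<
        echo ~{optional_file}
    >>>

    output {
        String echo = read_string(stdout())
    }

    runtime {
        docker: "broadinstitute/gatk:4.0.0.0"
    }
}
```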
All of that though would be SO MUCH EASIER if we could just pipe None. I'd still appreciate if someone could come up with a nicer answer!
Hi Philipp,
Happy New Year! I'll update my colleague on these findings to see if he has any suggestions once he has time to take a closer look here.
Kind regards,
Jason
Hey Jason,
I pondered a bit more about it and simplified it to some satisfactory state. I flattened the nested scatter and restricted the none_file hack to the scatter context:
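The original snippet is not shown here, so the following is only a guess at the general shape, written with Strings instead of Files to stay close to the test WDL. It assumes cross() to flatten the two loops and relies on the fact that a value declared inside an if block is optional outside of it. The sentinel "__none__" and the names none_marker, inner_array, and defined_string are hypothetical.

```wdl
version development

workflow runIterationFlattened {
    input {
        Array[String] outer_array = ["1", "2"]
        Boolean create_inner_array = false
    }

    if (create_inner_array) {
        scatter (count in ["one", "two"]) {
            call Echo as CreateArray {
                input:
                    non_optional_string = count
            }
        }
    }

    # Hypothetical sentinel that stands in for None inside the scatter only.
    String none_marker = "__none__"
    Array[String] inner_array = select_first([CreateArray.echo, [none_marker]])

    # cross() pairs every outer element with every inner element,
    # replacing the nested scatter with a single flat one.
    scatter (pair in cross(outer_array, inner_array)) {
        # defined_string has type String? outside this if block and stays
        # undefined whenever the sentinel shows up.
        if (pair.right != none_marker) {
            String defined_string = pair.right
        }

        call Echo as GetEcho {
            input:
                non_optional_string = pair.left,
                optional_string = defined_string
        }
    }

    output {
        Array[String] echo = GetEcho.echo
    }
}

task Echo {
    input {
        String non_optional_string
        String? optional_string
    }

    String string = non_optional_string + select_first([optional_string, ""])

    command <<<
        echo ~{string}
    >>>

    output {
        String echo = read_string(stdout())
    }

    runtime {
        docker: "broadinstitute/gatk:4.0.0.0"
    }
}
```

With this shape the task keeps a genuinely optional input, and the sentinel never leaves the workflow body.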
That way, the task Echo remains clear of any hacks and the dummy file is also removed from input arguments.
This should be as close to piping None as I can think of. #closed :)
Best,
Philipp
Hi Philipp,
This seems reasonable on a quick look! I'm glad to see you were able to find a suitable solution. If we can be of further assistance, please don't hesitate to reach out!
Kind regards,
Jason