scatter over optional array

Post author
Philipp Hahnel

I have a case where I want to iterate over two arrays, one of which is optional and may or may not be present. I created the test WDL below. With create_inner_array = false, it throws the error

Failed to evaluate input 'CreateArray.echo' (reason 1 of 1): Failed to lookup input value for required input echo

But in the case of create_inner_array = true it throws the error

assertion failed: base member type WomMaybeEmptyArrayType(WomOptionalType(WomAnyType)) and womtype WomMaybeEmptyArrayType(WomOptionalType(WomStringType)) are not compatible

I've also given GROUP_FireCloud-Support@firecloud.org access to the workspace https://app.terra.bio/#workspaces/carterlabtest/test . I'm referring to the last two job submissions. 

(EDIT: submissions 1c7b62af-aa35-46e7-abac-e69ad18a4386 and 44dafd7e-a356-4ae5-93cb-99fa99d20c7f )

Is there a way to set up this double (scattered) loop where one of the arrays is of Array[String]? type?

 

version development

workflow runIterationWithNone {
    input {
        Array[String] outer_array = ["1", "2"]
        Boolean create_inner_array = false
    }

    if (create_inner_array) {
        scatter (count in ["one", "two"]) {
            call Echo as CreateArray {
                input:
                    non_optional_string=count
            }
        }
    }

    scatter (non_optional_string in outer_array) {
        scatter (optional_string in select_first([CreateArray.echo, [None]])) {
            call Echo as GetEcho {
                input:
                    non_optional_string=non_optional_string,
                    optional_string=optional_string
            }
        }
    }

    output {
        Array[String] echo = flatten(GetEcho.echo)
    }
}

task Echo {
    input {
        String non_optional_string
        String? optional_string
        # runtime
        String gatk_docker = "broadinstitute/gatk:4.0.0.0"
        Int? boot_disk_sizeGB = 12
        Int disk_spaceGB = 1
        Int memoryGB = 1
        Int command_memGB = 1
        Int? preemptible = 0
        Int? max_retries = 0
        Int cpu = 1
    }

    String string = non_optional_string + select_first([optional_string, ""])

    command <<<
        echo ~{string}
    >>>

    output {
        String echo = read_string(stdout())
    }

    runtime {
        docker: gatk_docker
        bootDiskSizeGb: boot_disk_sizeGB
        memory: memoryGB + " GB"
        disks: "local-disk " + disk_spaceGB + " HDD"
        preemptible: preemptible
        maxRetries: max_retries
        cpu: cpu
    }
}

Comments

14 comments

  • Comment author
    Jason Cerrato

    Hey Philipp,

    Thanks for writing in. We'll take a closer look at this and get back to you as soon as we can.

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato

    Hi Philipp,

    Apologies for the delay here! I was out sick on Friday but I will look into getting you an answer here as quickly as I can.

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato

    Hi Philipp,

    It looks like, with create_inner_array = false, the error

    Failed to evaluate input 'CreateArray.echo' (reason 1 of 1): Failed to lookup input value for required input echo

    occurs because your scatter evaluates CreateArray.echo, which is not defined when create_inner_array = false.

    I'm going to talk with my colleagues to see what the best recommendation would be for re-writing this so that it can work regardless of whether create_inner_array is true or false.

     

    You mentioned you were ultimately interested in setting up a double (scattered) loop where one of the arrays is of type Array[String]?. Can you give us a little more detail, with specifics if possible, about what you're looking to achieve with this setup?

    Kind regards,

    Jason

  • Comment author
    Philipp Hahnel

    Thanks, Jason.

    The ultimate goal is to set up a proper multisample Mutect2 pipeline. GATK currently does "multisample" by just running M2 in single tumor-normal pair mode for all pairs ...

    The input is an array of tumor BAM files and an optional array of normal BAM files. To calculate the contaminations, I first get the pileup summaries for the tumor samples (these correspond to outer_array above) and, if present, for the normal samples (these correspond to the inner array CreateArray.echo, since the pileup summaries are only created if the normal BAMs are actually present). The corresponding code excerpt reads

    scatter (tumor_pileups in GetTumorPileupSummaries.pileup_summaries) {
        scatter (normal_pileups in select_first([GetNormalPileupSummaries.pileup_summaries, [None]])) {
            call CalculateContamination {
                input:
                    tumor_pileups = tumor_pileups,
                    normal_pileups = normal_pileups,
                    runtime_params = standard_runtime
            }
        }
    }

    The optional normal_pileups argument is passed straight through to GATK CalculateContamination as

    gatk --java-options "-Xmx~{runtime_params.command_mem}g" \
        CalculateContamination \
        --input ~{tumor_pileups} \
        --output ~{output_contamination} \
        --tumor-segmentation ~{output_segments} \
        ~{"--matched-normal " + normal_pileups}

    But of course this doesn't work, because I run into the problems described in the test case above.
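
    For readers unfamiliar with the ~{"--matched-normal " + normal_pileups} idiom: inside a placeholder, concatenating a string with an undefined optional evaluates to an empty string, so the flag disappears when no normal is supplied. Below is a minimal, untested sketch of that behavior. The task name OptionalFlagSketch and its inputs are hypothetical and not part of the pipeline; only the Docker image is taken from the thread.

    ```wdl
    version 1.0

    # Hypothetical task for illustration only; name and inputs are made up.
    task OptionalFlagSketch {
        input {
            String tumor
            String? normal  # left undefined when there is no matched normal
        }

        command <<<
            # If normal is undefined, the second placeholder expands to an
            # empty string, so no --matched-normal flag is printed.
            echo --input ~{tumor} ~{"--matched-normal " + normal}
        >>>

        output {
            String args = read_string(stdout())
        }

        runtime {
            docker: "broadinstitute/gatk:4.0.0.0"
        }
    }
    ```

    This is why the task itself needs no special handling for a missing normal; the difficulty lies only in getting an undefined value into the scatter.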

    Hope this helps!

    Best,

    Philipp

  • Comment author
    Jason Cerrato

    Hi Philipp,

    Thanks for that - let me check with some colleagues more familiar with the Mutect2 pipeline to see if they have any recommendations here.

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato

    Hi Philipp,

    I've heard from the GATK team that the developer for mutect2 is largely preoccupied at this time, but if you were interested in discussing the methodology for the mutect2 pipeline you could post to the GATK forum and they can try to get back to you as soon as they're able. 

    Our team's WDL expert is also very busy at this time, but they suggested checking out the select_all() function to see if that would work for your purposes. If you're interested, I can file a request for them to take a closer look at this as soon as they're able, and I can let you know when I hear from them.
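
    As a rough illustration of what select_all() does, here is a minimal, untested sketch. The workflow name SelectAllSketch and its inputs are hypothetical.

    ```wdl
    version 1.0

    # Hypothetical workflow for illustration only; names are made up.
    workflow SelectAllSketch {
        input {
            String? first
            String? second
        }

        # select_all drops undefined elements: Array[String?] -> Array[String].
        # If both inputs are undefined, the result is an empty array.
        Array[String] present = select_all([first, second])

        output {
            Array[String] values = present
        }
    }
    ```

    Note that a scatter over an empty array runs zero iterations, so on its own select_all() would skip the inner calls entirely rather than run them once without the optional value.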

    You can alternatively try to see if anyone in the OpenWDL Slack has any suggestions.

    Let me know what you think and what works for you.

    Kind regards,

    Jason

  • Comment author
    Philipp Hahnel

    Dear Jason,

    thanks a lot for reaching out to your colleagues! I'd appreciate it if your WDL expert could take a closer look at this whenever it is convenient. I'll also try other channels and poke around a bit more myself. Whenever I have an update on this, I'll post it here. It might take a couple of days as I'm also wrapped up in things ...

    Best,

    Philipp

  • Comment author
    Jason Cerrato

    Hi Philipp,

    Sounds good - I'll file this request with them and get back to you as soon as I hear back!

    Kind regards,

    Jason

  • Comment author
    Philipp Hahnel

    All right, a workaround is to replace None with an empty string "" and then deal with the empty string in the task:

    version development

    workflow runIterationWithNone {
        input {
            Boolean create_inner_array
        }

        Array[String] outer_array = ["1", "2"]

        if (create_inner_array) {
            scatter (count in ["one", "two"]) {
                call Echo as CreateArray {
                    input:
                        non_optional_string=count
                }
            }
        }

        scatter (non_optional_string in outer_array) {
            scatter (optional_string in select_first([CreateArray.echo, [""]])) {
                call Echo as GetEcho {
                    input:
                        non_optional_string=non_optional_string,
                        optional_string=optional_string
                }
            }
        }

        output {
            Array[String] echo = flatten(GetEcho.echo)
        }
    }

    task Echo {
        input {
            String non_optional_string
            String? optional_string
        }

        command <<<
            echo "~{non_optional_string}~{optional_string}"
        >>>

        output {
            String echo = read_string(stdout())
        }

        runtime {
            docker: "broadinstitute/gatk:4.0.0.0"
            bootDiskSizeGb: 12
            memory: "1 GB"
            disks: "local-disk 1 HDD"
        }
    }

    If you want to check in the task whether a real value was provided, then something like

    Boolean defined_optional_string = select_first([optional_string, ""]) != ""

    would do. It treats both an undefined value and the empty-string placeholder as "not provided".
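
    To show how such a check could drive the command line, here is a minimal, untested sketch. The task name SentinelSketch and the --extra flag are hypothetical, and unlike the task above it declares the input as a plain String with "" as its default.

    ```wdl
    version 1.0

    # Hypothetical task for illustration only; --extra is a made-up flag.
    task SentinelSketch {
        input {
            String non_optional_string
            String optional_string = ""  # "" stands in for "not provided"
        }

        # True only when the caller passed something other than the sentinel.
        Boolean has_optional = optional_string != ""
        String suffix = if has_optional then " --extra " + optional_string else ""

        command <<<
            echo "~{non_optional_string}~{suffix}"
        >>>

        output {
            String echo = read_string(stdout())
        }

        runtime {
            docker: "broadinstitute/gatk:4.0.0.0"
        }
    }
    ```

    The drawback of this design is that the task has to know about the sentinel, which is what the later comments in this thread try to avoid.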

  • Comment author
    Philipp Hahnel

    I realized that if we don't work with Strings but Files, we have to work a bit harder:

    version development

    workflow runIterationWithNone {
        input {
            Boolean create_inner_array
            File test_file
            File no_file # point to a file called "no_file"
        }

        Array[File] outer_array = [test_file, test_file]
        Array[File] inner_array = [test_file, test_file]

        if (create_inner_array) {
            scatter (file in inner_array) {
                call InnerArray {
                    input:
                        input_file=file
                }
            }
        }

        Array[File] inner_optional_array = select_first([InnerArray.file, [no_file]])

        scatter (non_optional_file in outer_array) {
            scatter (optional_file in inner_optional_array) {
                call Echo as GetEcho {
                    input:
                        non_optional_file=non_optional_file,
                        optional_file=optional_file
                }
            }
        }

        output {
            Array[String] echo = flatten(GetEcho.echo)
        }
    }

    task InnerArray {
        input {
            File input_file
        }

        command <<<
        >>>

        output {
            File file = input_file
        }

        runtime {
            docker: "broadinstitute/gatk:4.0.0.0"
            bootDiskSizeGb: 12
            memory: "1 GB"
            disks: "local-disk 1 HDD"
        }
    }

    task Echo {
        input {
            File non_optional_file
            File? optional_file
        }

        Boolean defined_optional_file = (
            if (!defined(optional_file) || defined(optional_file) && basename(select_first([optional_file])) == "no_file")
            then false else true
        )

        String non_optional_file_name = basename(non_optional_file)
        String optional_file_name = if defined_optional_file then basename(select_first([optional_file])) else ""

        command <<<
            echo "\n~{non_optional_file_name}~{" " + optional_file_name}\n"
        >>>

        output {
            String echo = read_string(stdout())
        }

        runtime {
            docker: "broadinstitute/gatk:4.0.0.0"
            bootDiskSizeGb: 12
            memory: "1 GB"
            disks: "local-disk 1 HDD"
        }
    }

    I uploaded an empty file called "no_file" into the bucket, which I now have to pass as an additional input. This file needs to exist because the Echo task localizes it. If the command in Echo can work with non-localized files, then we can just define

    File no_file = "gs://no/file"

    somewhere in the workflow and write in the Echo task

    parameter_meta {
        optional_file: {localization_optional: true}
    }

    Boolean defined_optional_file = (
        if (!defined(optional_file) || defined(optional_file) && optional_file == "gs://no/file")
        then false else true
    )

    The file doesn't need to exist, but its path still needs to match a file pattern that Cromwell accepts.

    All of that, though, would be SO MUCH EASIER if we could just pipe None. I'd still appreciate it if someone could come up with a nicer answer!

  • Comment author
    Jason Cerrato

    Hi Philipp,

    Happy New Year! I'll update my colleague on these findings to see if he has any suggestions once he has time to take a closer look here.

    Kind regards,

    Jason

  • Comment author
    Philipp Hahnel

    Hey Jason,

    I thought about it some more and simplified it to a satisfactory state. I flattened the nested scatter into a single scatter over cross() pairs and restricted the none_file hack to the scatter context:

    version development

    workflow runIterationWithNone {
        input {
            Boolean create_inner_array
            File test_file
        }

        Array[File] outer_array = [test_file, test_file]
        Array[File] inner_array = [test_file, test_file]

        if (create_inner_array) {
            scatter (file in inner_array) {
                call InnerArray {
                    input:
                        input_file = file
                }
            }
        }

        # Ideally, we would pipe None into the optional_file argument, but select_first
        # does not select None, so we circumvent this by introducing a dummy none_file.
        # This file just needs to have a valid cromwell file pattern.
        File none_file = "gs://none_file"
        Array[Pair[File, File]] pairs = cross(outer_array, select_first([InnerArray.file, [none_file]]))

        scatter (pair in pairs) {
            File non_optional_file = pair.left
            File optional_file = pair.right

            call Echo as GetEcho {
                input:
                    non_optional_file = non_optional_file,
                    optional_file = (if (optional_file == none_file) then None else optional_file)
            }
        }

        output {
            Array[String] echo = GetEcho.echo
        }
    }

    task InnerArray {
        input {
            File input_file
        }

        command <<<
        >>>

        output {
            File file = input_file
        }

        runtime {
            docker: "broadinstitute/gatk:4.0.0.0"
            bootDiskSizeGb: 12
            memory: "1 GB"
            disks: "local-disk 1 HDD"
        }
    }

    task Echo {
        input {
            File non_optional_file
            File? optional_file
        }

        command <<<
            echo "~{"\n1. " + non_optional_file + "\n"}~{"\n2. " + optional_file + "\n"}"
        >>>

        output {
            String echo = read_string(stdout())
        }

        runtime {
            docker: "broadinstitute/gatk:4.0.0.0"
            bootDiskSizeGb: 12
            memory: "1 GB"
            disks: "local-disk 1 HDD"
        }
    }

    That way, the task Echo remains clear of any hacks and the dummy file is also removed from input arguments.
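
    For anyone wondering why a single scatter over cross() can stand in for the nested scatter plus flatten(), here is a minimal, untested sketch using plain strings. The workflow name CrossSketch and its variable names are hypothetical.

    ```wdl
    version 1.0

    # Hypothetical workflow for illustration only; names are made up.
    workflow CrossSketch {
        Array[String] outer = ["1", "2"]
        Array[String] inner = ["one", "two"]

        # cross pairs every element of outer with every element of inner:
        # ("1", "one"), ("1", "two"), ("2", "one"), ("2", "two")
        Array[Pair[String, String]] pairs = cross(outer, inner)

        scatter (pair in pairs) {
            String joined = pair.left + pair.right
        }

        output {
            # Already a flat array, so no flatten() is needed:
            # ["1one", "1two", "2one", "2two"]
            Array[String] all_joined = joined
        }
    }
    ```

    When the inner array is the one-element fallback, cross() yields exactly one pair per outer element, which is the "run once without the optional value" behavior the nested version was after.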

    This should be as close to piping None as I can think of. #closed :)

    Best,

    Philipp

  • Comment author
    Jason Cerrato

    Hi Philipp,

    This seems reasonable on a quick look! I'm glad to see you were able to find a suitable solution. If we can be of further assistance, please don't hesitate to reach out!

    Kind regards,

    Jason

  • Comment author
    khawar sohail

    It is a process of developing new versions.

    By doing so, you will ensure that step Echo remains free from any hacks, as well as the dummy file being removed from input arguments.

    In my opinion, this should be as close to simply piping None as I can think of.
