Co-localising tools in the same task

Post author
Migwell

I am working on a workflow that involves very large files, so minimising the cost and time involved in transferring these files is critical. For each of these files, we need to run three different tools. The intuitive way to model this workflow in WDL is to write three different tasks, each of which uses a container with the appropriate tool installed, and to run all three in parallel on each file. Here is a simplified example of this model:

version 1.0

task A {
    input {
        File file
    }
    runtime {
        docker: "quay.io/biocontainers/tool_a"
    }
    command {
        run_a ~{file}
    }
}

task B {
    input {
        File file
    }
    runtime {
        docker: "quay.io/biocontainers/tool_b"
    }
    command {
        run_b ~{file}
    }
}

task C {
    input {
        File file
    }
    runtime {
        docker: "quay.io/biocontainers/tool_c"
    }
    command {
        run_c ~{file}
    }
}

workflow wf {
    input {
        Array[File] files
    }
    scatter (file in files) {
        call A { input: file = file }
        call B { input: file = file }
        call C { input: file = file }
    }
}

However, this approach seems to involve downloading each file from the source bucket to a VM three times, once for each task call. This is a problem for us, since it's effectively wasted VM time and therefore wasted cost.

Considering this repeated localisation, it seems more appropriate to combine all three tools into the same container, and model the workflow more like this:

version 1.0

task ABC {
    input {
        File file
    }
    runtime {
        docker: "custom_combined_image"
    }
    command {
        run_a ~{file}
        run_b ~{file}
        run_c ~{file}
    }
}

workflow wf {
    input {
        Array[File] files
    }
    scatter (file in files) {
        call ABC { input: file = file }
    }
}

This is much less elegant and modular, and can no longer use BioContainers directly, which is a shame, but it eliminates the two redundant file localisation steps, making it much more efficient.
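
In practice the combined task would also want explicit failure handling and one declared output per tool, so that downstream steps can still consume each result separately. A sketch of what that could look like (the output filenames are placeholders for illustration):

version 1.0

task ABC {
    input {
        File file
    }
    runtime {
        docker: "custom_combined_image"
    }
    command {
        # Fail the whole task as soon as any one tool fails
        set -euo pipefail
        run_a ~{file} > a.out
        run_b ~{file} > b.out
        run_c ~{file} > c.out
    }
    output {
        File a_result = "a.out"
        File b_result = "b.out"
        File c_result = "c.out"
    }
}

The set -euo pipefail line means the task stops as soon as any of the three tools fails, rather than silently running the remaining tools against partial results.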

My question is: is my understanding correct? If the files are large and localisation is expensive, is it recommended that we combine our tools into one container and one task in this way? Are we paying for the VM time involved in localising the files in each task? Do you have a recommendation on how best to do this on Terra?

Comments

2 comments

  • Comment author
    Pamela Bretscher

    Hi Migwell,

    Thanks for writing in with this question! A member of the Terra support team will follow up with you as soon as they are able.

    Kind regards,

    Pamela

  • Comment author
    Josh Evans

    Hi Migwell,

    Thanks again for writing in! You're correct: the first approach does require the file to be localized three times, once for each of the three Docker containers used, and you will have to pay for the compute time involved in each localization.

    However, there are some things to think about with the second approach. Because all three tools would need to run within the same VM, it may require more compute power or a larger disk size. Also, if one of the tools fails, the entire combined task would need to be re-run on that file, whereas with the first approach, only the failed task would need to be rerun if call caching were turned on. Because of these two trade-offs, it is theoretically possible that the first approach could actually save money in the long run.
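
    For example, the combined task's runtime block would need to be sized for the most demanding of the three tools, plus enough disk for the input file and all three sets of outputs. A rough sketch with placeholder numbers, slotting into the ABC task above:

        runtime {
            docker: "custom_combined_image"
            memory: "16 GB"
            cpu: 4
            disks: "local-disk 500 HDD"
        }

    With the first approach, each task can instead be sized for its own tool, which can itself reduce cost.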

    Our suggestion would be to try out both approaches on a smaller file if possible. That should help you get a better understanding of the cost difference between the two approaches.
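
    One more option that may be worth investigating, depending on your tools: Cromwell's Google backend supports a localization_optional hint in parameter_meta for tools that can stream their input directly from a gs:// path, which skips localization for that input altogether. Whether this applies is something you'd need to check per tool; a minimal sketch, assuming tool_a can read GCS paths natively:

        version 1.0

        task A {
            input {
                File file
            }
            parameter_meta {
                # Ask Cromwell not to download the file; the command then
                # receives the gs:// path itself. This only works if the
                # tool can stream directly from GCS.
                file: {
                    localization_optional: true
                }
            }
            runtime {
                docker: "quay.io/biocontainers/tool_a"
            }
            command {
                run_a ~{file}
            }
        }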

    Also, I'd like to provide a link to our documentation on how to reduce costs. There may be some other ideas in there that you might find interesting.

    Please let me know if that information was helpful or if you have any questions.

    Best,

    Josh

