Specify Array[File] output for a task in a WDL
I had a question about how to specify a `task` output that is an array of files `Array[File]`. Chris Llanwarne answered my question, and so I will post it here.
Short version:
---------------------------------------------------------------------------------------------------------
Question:
Can I specify the output files in the `output` section of my `task` using the following syntax?
output {
  Array[File] output_array = read_lines("file_containing_one_filename_per_line.txt")
}
Answer:
(Paraphrasing Chris Llanwarne.) There is nothing wrong with this from the perspective of WDL. However, Cromwell does not currently support this on the Google backend [GCP]. (That is to say, this will not work in Terra on GCP.) For GCP, you need to use `glob` in your WDL instead. Something like this, assuming all my outputs start with "out_" and end with ".h5ad":
output {
  Array[File] output_array = glob("out_*.h5ad")
}
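For context, here is a minimal sketch of a complete task built around that `glob` output. Everything except the `output` line is an assumption made for illustration: the task name `write_chunks`, the `n_chunks` input, the `touch` stand-in for the real work, and the `ubuntu:20.04` image are all hypothetical.

```wdl
version 1.0

# Hypothetical task: only the glob() output line comes from the answer above.
task write_chunks {
  input {
    Int n_chunks = 3  # hypothetical input controlling how many files are written
  }

  command <<<
    # Stand-in for the real work: create empty files named out_000.h5ad, out_001.h5ad, ...
    for i in $(seq 0 $((~{n_chunks} - 1))); do
      touch "$(printf 'out_%03d.h5ad' "$i")"
    done
  >>>

  output {
    # The pattern is known before the task runs, so the files can be collected afterwards.
    Array[File] output_array = glob("out_*.h5ad")
  }

  runtime {
    docker: "ubuntu:20.04"  # assumed image; any image with bash and coreutils should do
  }
}
```

As I understand the WDL spec, `glob` returns files in the order bash expands the pattern, which is alphabetical. That is why the sketch zero-pads the index: without padding, `out_10.h5ad` would sort before `out_2.h5ad`.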
-------------------------------------------------------------------------------------------------------------
Original question:
I have a task like this, where I'd like to specify the output as an array of files:
task split_anndata_file {
  input { ... }
  String output_filename = "anndata_fofn.txt"
  command {
    # creates the files and a file-of-filenames called "anndata_fofn.txt"
  }
  output {
    Array[File] anndata_array = read_lines(output_filename)
  }
}
This passes WDL checkers, but doesn't work in Terra on GCP. In the Cromwell "metadata" for the task, the outputs are correctly listed as
"outputs": {
  "split_anndata_file.anndata_array": [
    "gs://broad-methods-cromwell-exec-bucket-instance-8/split_anndata_file/d517b49f-67e0-4219-833c-d64e2da91d7c/call-split_anndata/chunk_0.h5ad",
    "gs://broad-methods-cromwell-exec-bucket-instance-8/split_anndata_file/d517b49f-67e0-4219-833c-d64e2da91d7c/call-split_anndata/chunk_1.h5ad",
    "gs://broad-methods-cromwell-exec-bucket-instance-8/split_anndata_file/d517b49f-67e0-4219-833c-d64e2da91d7c/call-split_anndata/chunk_2.h5ad",
    "gs://broad-methods-cromwell-exec-bucket-instance-8/split_anndata_file/d517b49f-67e0-4219-833c-d64e2da91d7c/call-split_anndata/chunk_3.h5ad",
    "gs://broad-methods-cromwell-exec-bucket-instance-8/split_anndata_file/d517b49f-67e0-4219-833c-d64e2da91d7c/call-split_anndata/chunk_4.h5ad"
  ]
}
but those files don't actually exist. They never get delocalized from the VM the task ran on. Instead, the only file that gets delocalized and copied out to the Cromwell execution directory (the Google bucket) is "anndata_fofn.txt". Very strange.
Chris Llanwarne's answer:
Historically this kind of thing couldn’t work in GCP because the Google cloud needed Cromwell to enumerate all of the files it was going to need to delocalize upfront (before the task even started). So ad-hoc enumeration of which files to save after the task completed was impossible.
...
I think this is technically a gap in Cromwell support in GCP, rather than a thing that WDL the language doesn’t allow, so where to throw the error is probably more at run-time than static analysis time. Definitely “continuing blindly on to the next task anyway, and then failing” feels like it could be better.
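Applied to the task from the original question, the workaround might look like the sketch below. The pattern `chunk_*.h5ad` is taken from the file names in the metadata listing. The command body is a stand-in (the real command and inputs are elided in the question), and the `anndata_fofn` output name and the `ubuntu:20.04` image are assumptions. Keeping the file-of-filenames as a plain `File` output is optional; it is the one file that was already being delocalized.

```wdl
version 1.0

task split_anndata_file {
  # The original inputs are elided in the question, so none are shown here.
  String output_filename = "anndata_fofn.txt"

  command <<<
    # Stand-in for the real splitting step: create empty chunk files...
    for i in 0 1 2 3 4; do
      touch "chunk_${i}.h5ad"
    done
    # ...and record their names, one per line, in the file-of-filenames.
    ls chunk_*.h5ad > ~{output_filename}
  >>>

  output {
    # glob() replaces read_lines(output_filename) so the chunks get delocalized on GCP.
    Array[File] anndata_array = glob("chunk_*.h5ad")
    # Hypothetical extra output: the file-of-filenames itself.
    File anndata_fofn = output_filename
  }

  runtime {
    docker: "ubuntu:20.04"  # assumed image, not from the original task
  }
}
```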
Comments
Hi Stephen,
Thanks for this wonderful write-up!
Best,
Josh