How do I pass an array(array(file)) to another task?

Post author
Ash O'Farrell

I have a task that takes in a bunch of RData files, matches them according to the chr number they are associated with, and should export them as an Array[Array[File]].

Something similar is done in this variant calling WDL, but what is being saved in that case are full gs paths to external files if I recall correctly, not files that only exist in the workspace like what I'm doing. Nevertheless I tried to follow its lead by putting these in my output section, hoping at least one would work.

  • Array[Array[String]] d_string = read_tsv("output_filenames.txt")
  • Array[Array[File]] d_files = read_tsv("output_filenames.txt")

output_filenames.txt looks something like this.

/cromwell_root/fc-e860f7d8-0013-41a0-b74a-5fd0c86a128b/e0129711-3789-42d3-b7f3-0056878941d2/assoc_agg/f6312cc4-8134-4f04-af79-b3f2e2077755/call-assoc_aggregate/shard-0/cacheCopy/glob-102d2e89c0518e38f77aa349b30c2214/position_chr1_seg1.RData /cromwell_root/fc-e860f7d8-0013-41a0-b74a-5fd0c86a128b/e0129711-3789-42d3-b7f3-0056878941d2/assoc_agg/f6312cc4-8134-4f04-af79-b3f2e2077755/call-assoc_aggregate/shard-1/cacheCopy/glob-102d2e89c0518e38f77aa349b30c2214/position_chr1_seg10.RData /cromwell_root/fc-e860f7d8-0013-41a0-b74a-5fd0c86a128b/e0129711-3789-42d3-b7f3-0056878941d2/assoc_agg/f6312cc4-8134-4f04-af79-b3f2e2077755/call-assoc_aggregate/shard-2/cacheCopy/glob-102d2e89c0518e38f77aa349b30c2214/position_chr1_seg11.RData

There’s a tab between each file, and if I had posted the whole thing there’d be a newline before the files associated with the next chromosome, i.e., file/seg1_chr1.RData [tab] file/seg2_chr1.RData [newline] file/seg3_chr2.RData [tab] file/seg4_chr2.RData
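
To make the layout concrete, here is a rough Python sketch of the nesting I expect read_tsv to produce from a file shaped like this: one inner list per line, split on tabs. It only illustrates the row/column structure and is not Cromwell's implementation; the function name parse_grouped_paths and the sample paths are made up for the example.

```python
import os
import tempfile


def parse_grouped_paths(tsv_path):
    """Return one list of paths per line, splitting each line on tabs."""
    groups = []
    with open(tsv_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                groups.append(line.split("\t"))
    return groups


# Hypothetical two-chromosome example, written to a temporary file.
sample = (
    "file/seg1_chr1.RData\tfile/seg2_chr1.RData\n"
    "file/seg3_chr2.RData\tfile/seg4_chr2.RData\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

grouped = parse_grouped_paths(tmp.name)
os.remove(tmp.name)
print(grouped)
```

Each inner list corresponds to one chromosome, which is the shape the next task expects as Array[Array[File]].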

Using d_files as the input to the next task

When I run this locally on Cromwell, this works. When I run this on Terra, I get errors like this in the next task:

Failed to evaluate input 'assoc_size' (reason 1 of 49): [Attempted 1 time(s)] - FileNotFoundException: gs://fc-e860f7d8-0013-41a0-b74a-5fd0c86a128b/e0129711-3789-42d3-b7f3-0056878941d2/assoc_agg/f6312cc4-8134-4f04-af79-b3f2e2077755/call-sbg_group_segments_1/call-assoc_aggregate/shard-0/cacheCopy/glob-102d2e89c0518e38f77aa349b30c2214/position_chr1_seg1.RData

Using d_string as the input to the next task

This time, the error is in the calculation of the disk size for this task.
Failed to evaluate input 'finalDiskSize' (reason 1 of 1): ValueEvaluator[IdentifierLookup]: No suitable input for 'assoc_size' amongst {addldisk, all_assoc_files, out_prefix, preempt, assoc_type, chr, conditional_variant_file, cpu, memory}
 
If I remove the disk size calculation, I get this error instead:
Could not build the path "/cromwell_root/fc-e860f7d8-0013-41a0-b74a-5fd0c86a128b/4adf53ae-56ec-488a-9419-4c3e3f8b570a/assoc_agg/971a76c7-959b-464f-aa37-5efa530c730b/call-assoc_aggregate/shard-0/cacheCopy/glob-102d2e89c0518e38f77aa349b30c2214/allele_chr1_seg1.RData". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage, DRS. Failures: Google Cloud Storage: Path "/cromwell_root/fc-e860f7d8-0013-41a0-b74a-5fd0c86a128b/4adf53ae-56ec-488a-9419-4c3e3f8b570a/assoc_agg/971a76c7-959b-464f-aa37-5efa530c730b/call-assoc_aggregate/shard-0/cacheCopy/glob-102d2e89c0518e38f77aa349b30c2214/allele_chr1_seg1.RData" does not have a gcs scheme (IllegalArgumentException) DRS: /cromwell_root/fc-e860f7d8-0013-41a0-b74a-5fd0c86a128b/4adf53ae-56ec-488a-9419-4c3e3f8b570a/assoc_agg/971a76c7-959b-464f-aa37-5efa530c730b/call-assoc_aggregate/shard-0/cacheCopy/glob-102d2e89c0518e38f77aa349b30c2214/allele_chr1_seg1.RData does not have a drs scheme. (IllegalArgumentException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems
 
---------------------
All in all, I haven’t been able to get this working. What is the correct way to do this on Terra?

Comments

13 comments

  • Comment author
    Samantha (she/her)

    Hi Ash O'Farrell,

     

    Thank you for writing in. Can you share your workspace with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.

    1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press enter on your keyboard.
    2. Click Save.

    Please provide us with

    1. A link to your workspace
    2. The relevant submission ID(s)
    3. The relevant workflow ID(s)

    If your WDL isn't public, please also share it with svelasqu@broadinstitute.org.

    We’ll be happy to take a closer look as soon as we can!

     

    Kind regards,

    Samantha

  • Comment author
    Ash O'Farrell

    Workspace link: https://terra.biodatacatalyst.nhlbi.nih.gov/#workspaces/anvil-stage-demo/assoc%20PUBLIC/, it's only got public data

    Submission IDs (each has only one workflow ID):

    WDL link: I don't know exactly what commits correlate to exactly what submissions, but the overall logic of the WDL is relatively unchanged other than what I made note of above. They are usually imported via Dockstore but in at least one instance I got hampered by GitHub's cache and used the broad methods repo instead. In any case, see the most recent commits on this branch: https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/assoc-agg-debugging/assoc-aggregate/assoc-aggregate.wdl

     

  • Comment author
    Jason Cerrato

    Hey Aisling,

    Thanks for providing us with that information. A member of our Batch team took a closer look and it seems that Cromwell doesn’t currently support delocalizing files in nested arrays. If you're interested, we're happy to write this up as a feature request for the Cromwell team. The reason you didn't get this issue on local Cromwell is that local Cromwell wouldn’t need to delocalize files; everything is on the local filesystem. Delocalization code is specific to the backend (in this case PAPI aka Google Cloud Life Sciences). 

    In the meantime, they have suggested a workaround of encoding the matrix coordinates into the filenames, if possible, and then reconstructing the matrix from the filenames in the next task.
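
    A minimal Python sketch of that workaround, as it might run inside the next task's command block: the task receives a flat list of files and rebuilds the per-chromosome grouping from the names. It assumes the chromosome and segment are embedded as _chr<N>_seg<M>, as in the position_chr1_seg1.RData names above. The function name, the regular expression, and the sample paths are illustrative and not part of the actual pipeline.

```python
import os
import re
from collections import defaultdict

# Assumed naming pattern: ..._chr<chromosome>_seg<segment>.RData
NAME_PATTERN = re.compile(r"_chr([0-9]+|X|Y|M)_seg([0-9]+)\.RData$")


def group_by_chromosome(paths):
    """Rebuild a nested list (one inner list per chromosome) from a flat list."""
    by_chr = defaultdict(list)
    for path in paths:
        match = NAME_PATTERN.search(os.path.basename(path))
        if match is None:
            raise ValueError(f"cannot parse chromosome/segment from {path!r}")
        chromosome, segment = match.group(1), int(match.group(2))
        by_chr[chromosome].append((segment, path))

    def chr_key(chromosome):
        # Numeric chromosomes first, in numeric order, then X, Y, M.
        if chromosome.isdigit():
            return (0, int(chromosome))
        return (1, "XYM".index(chromosome))

    # Within each chromosome, order files by segment number.
    return [
        [path for _, path in sorted(by_chr[chromosome])]
        for chromosome in sorted(by_chr, key=chr_key)
    ]


# Hypothetical flat input, deliberately out of order.
flat = [
    "inputs/position_chr2_seg12.RData",
    "inputs/position_chr1_seg10.RData",
    "inputs/position_chr1_seg1.RData",
]
regrouped = group_by_chromosome(flat)
print(regrouped)
```

    Because the grouping lives in the filenames, only a plain Array[File] has to cross the task boundary.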

    We have an idea to test delocalizing directories which might then allow an array of files (directories), which could mimic a nested array of files. If you're interested, we can try to test this next week. 

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato

    Hey Aisling,

    Just wanted to follow up to see if there is anything else we can do to assist. If you have any questions, please let us know!

    Kind regards,

    Jason

  • Comment author
    Ash O'Farrell

    Hey Jason,

    Sorry about my slow response, I have been OOO. I'm definitely interested in the delocalization of directories idea. Not sure if it would work for my use case, but it's worth a shot. I'm also interested in this being a feature request/bug report for Cromwell; as far as I'm aware, the current WDL 1.0 spec does not mention that delocalization isn't supported for nested arrays, so I think there are a few users who, like me, weren't aware of this limitation.

  • Comment author
    Jason Cerrato

    Hi Aisling,

    No problem at all! I will discuss this with the engineer to see if this test is something they can set up in the near future. I'll also verify that this information doesn't exist in the WDL docs and recommend the information be added.

    I'll follow up with you as soon as I get that information.

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato

    Hi Aisling,

    Thinking through the WDL side of the situation, it may be reasonable that the WDL spec does not have information about delocalizing nested arrays, as delocalization is a concept specific to the backend PAPI/Google Life Sciences. Since it isn't a relevant concept in the general use of WDL or local Cromwell, I can see why the engineers did not mention it in the spec. I noticed that there is no reference to delocalization at all in the WDL spec.

    That said, I noticed there is a reference to localization in the spec. I'm going to talk to the engineers about this to get their opinion and see if there's a best place for this information to be added.

    Kind regards,

    Jason

  • Comment author
    Ash O'Farrell

    I'm wondering if it might be a good idea to use this as an opportunity for a wider discussion. Based on my own experience and from talking with a few BDC fellows, I have found that there are some problems when going from local WDL development to developing for Terra. Terra, by the nature of its backend, has several considerations that sometimes trip developers up. I'm appreciative of the Terra Support documents hosted here (in fact I sometimes use their formatting as inspiration for my own documentation) but as far as I'm aware there isn't currently unified documentation that collects all of the differences between Cromwell's "Terra mode" and its "local mode." Even though most of those differences are the result of the backend, not so much Cromwell itself, they ultimately still cause some confusion.

    Does such documentation exist? If not, I think that the creation of such a list of differences would be a good place to put this information about arrays of arrays not supporting delocalization.

  • Comment author
    Jason Cerrato

    Hi Aisling,

    I think this is a fantastic idea. I'm not aware of any documents that outline the differences between working with Terra Cromwell vs local Cromwell, but I think its value is highly apparent. I'm going to see if anything of this nature exists now that can act as a starting point from which our User Education team can create a fully realized article.

    Are you okay with me and members of our User Education team following up with you if we have any specific questions about your experiences and thoughts around how to make the article truly effective?

    Kind regards,

    Jason

  • Comment author
    Ash O'Farrell

    Certainly! Feel free to contact me via the same email I use for these forums, or via the BDC/AnVIL Slacks. Looking forward to this!

  • Comment author
    Jason Cerrato

    Hi Aisling,

    If you're still interested, a member of our team will be giving that delocalization of directories idea a test try. Let us know if you would still like to see that test run happen and be made aware of the results.

    Kind regards,

    Jason

  • Comment author
    Ash O'Farrell

    I'm definitely interested, thanks for keeping me in the loop as this develops!

  • Comment author
    Jason Cerrato

    Hi Aisling,

    Our engineer had a chance to take a closer look at your original inquiry and had this suggestion:

    This is how the user is currently outputting their files
    File debug_output_filenames = "output_filenames.txt"
    File debug_output_chrs = "output_chromosomes.txt"
    Array[Array[String]] debug_grouped_string = read_tsv("output_filenames.txt")
    Array[Array[File]] debug_grouped_files = read_tsv("output_filenames.txt")
    But this doesn’t work because Cromwell doesn’t handle delocalizing Array[Array[File]]. However, coercion between String and File is possible: a variable that is supposed to be a File will accept a String that is a path to a file. So how about we have the task output the files as a simple array, just to delocalize the files from the VM.
    File debug_output_filenames = "output_filenames.txt"
    File debug_output_chrs = "output_chromosomes.txt"
    Array[Array[String]] debug_grouped_string = read_tsv("output_filenames.txt")
    Array[File] debug_files = flatten(read_tsv("output_filenames.txt"))
    Then in the next task they can use debug_grouped_string wherever debug_grouped_files is expected, or just create the variable right after the call using the string version.
    call sbg_group_segments_1 {
        input:
            assoc_files = flatten_array
    }
    Array[Array[File]] debug_grouped_files = debug_grouped_string
    This should allow us to delocalize the files as normal and still preserve the user's grouping of the files.
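
    If the downstream task ends up receiving both the flat Array[File] and the Array[Array[String]] grouping, the two have to be matched up again inside the task, since the localized paths may differ from the strings recorded in output_filenames.txt. One possible way to do that, sketched here in Python under the assumption that base filenames are unique across the whole set, is to look up each recorded string by its basename. The function name and the sample paths are hypothetical.

```python
import os


def regroup_localized(grouped_strings, localized_files):
    """Map each recorded path to the localized file with the same basename."""
    by_name = {}
    for path in localized_files:
        name = os.path.basename(path)
        if name in by_name:
            # The basename lookup is ambiguous if two files share a name.
            raise ValueError(f"duplicate basename: {name}")
        by_name[name] = path
    return [
        [by_name[os.path.basename(recorded)] for recorded in row]
        for row in grouped_strings
    ]


# Hypothetical grouping as recorded by the upstream task.
recorded_groups = [
    [
        "/cromwell_root/bucket/shard-0/position_chr1_seg1.RData",
        "/cromwell_root/bucket/shard-1/position_chr1_seg10.RData",
    ],
    ["/cromwell_root/bucket/shard-2/position_chr2_seg12.RData"],
]
# Hypothetical localized paths, in arbitrary order.
localized = [
    "/mnt/inputs/a/position_chr2_seg12.RData",
    "/mnt/inputs/b/position_chr1_seg1.RData",
    "/mnt/inputs/c/position_chr1_seg10.RData",
]
regrouped_files = regroup_localized(recorded_groups, localized)
print(regrouped_files)
```

    A recorded name with no localized counterpart raises a KeyError, which surfaces a missing file early instead of silently dropping it from its group.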
     
    I hope this all makes sense and is of help! If you have any questions about this, please don't hesitate to let us know.
     
    Kind regards,
    Jason