Terra fails to delocalize files listed through read_lines()

Comments

4 comments

  • Jason Cerrato

    Hi Giulio,

    I'll also take a closer look at this issue. Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace (see the icon with the three dots at the top-right)?

    1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press Enter on your keyboard
    2. Click Save

    Let me know the name of the workspace, or share a link. Please also add me to any authorization domains for the workspace if possible. If you are unable to do so, please let me know.

    Please also share the submission & workflow IDs where this issue was observed.

    Finally, between this issue and the other you posted about, which would you consider to be higher priority for quicker resolution?

    Kind regards,

    Jason

  • Giulio Genovese

    Hi Jason,

    Thank you for your prompt reply. I have found a workaround that is effectively a solution for me, but I would say this is still an issue for the community.

    It turns out this is a known problem that goes back years. What is going on is that Cromwell decides which files (and directories) to delocalize before the task is run. As such, this is an issue whenever the output is a list of files whose names are only generated at task runtime.

    One workaround is to use glob() which effectively creates a directory that will be delocalized at the end of the task. But this did not work for me as I need to control the order of the output files.
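
    For reference, the glob() approach looks roughly like this (the gtcs/*.gtc pattern is just my illustration, assuming the task writes its .gtc files into a gtcs/ subdirectory):

    output {
      # glob() matches files whose names are unknown before the task runs and
      # stages them into a directory that Cromwell delocalizes at the end of the
      # task, but the order of the resulting array is not something I can rely on
      Array[File] gtc_files = glob("gtcs/*.gtc")
    }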

    The other, better workaround was to imitate what glob() does. This means that instead of using:

    output {
      Array[File] gtc_files = read_lines("gtc_file_list.txt")
    }

    in the output section of my task, I have switched to:

    output {
      Directory gtcs = "gtcs"
      Array[File] gtc_files = read_lines("gtc_file_list.txt")
    }

    and I made sure that the array of files I needed to delocalize was written into the gtcs/ directory by the task. This causes the whole gtcs/ directory to be delocalized, so even though the second output line does not trigger a delocalization of its own, it piggybacks on the directory delocalization to get the files out of the Docker container.
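
    To make the shape of that concrete, here is a rough sketch of such a task (convert_idat_to_gtc and the file names are placeholders, not the actual tool I am running):

    task make_gtcs {
      input {
        Array[File] idats
      }
      command <<<
        mkdir gtcs
        for idat in ~{sep(" ", idats)}; do
          # placeholder for the real conversion step
          out="gtcs/$(basename "$idat" .idat).gtc"
          convert_idat_to_gtc "$idat" "$out"
          # record each output path so read_lines() can pick up the ordered list
          echo "$out" >> gtc_file_list.txt
        done
      >>>
      output {
        # the Directory output forces gtcs/ (and everything inside it) to be delocalized
        Directory gtcs = "gtcs"
        # this array piggybacks on the directory delocalization above
        Array[File] gtc_files = read_lines("gtc_file_list.txt")
      }
      runtime {
        docker: "ubuntu:20.04"
      }
    }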

    Now, Directory is a WDL type introduced in version development, and my WDL happens to target that version, so that is fine for me; but I don't know whether this solution is of any help for WDLs written against version 1.0 of the specification.

    I am still surprised that I got no warnings from Cromwell, though. I wrote the WDL on my laptop and tested it thoroughly locally. Then, when I tested it on Terra, I found several issues:

    (i) serialization of tables with write_lines()/write_map()/write_tsv()/write_json() only works within tasks, not at the workflow level

    (ii) optional files are not allowed in task outputs (see here and here; and why is that? Cromwell seems to allow optional delocalization of files such as /cromwell_root/memory_retry_rc; a sketch of what I mean follows this list)

    (iii) delocalization fails for files whose names cannot be determined ahead of task runtime (the issue described on this page)
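
    To illustrate (ii), this is the kind of output declaration I mean (run_tool and the file names are placeholders):

    task maybe_metrics {
      command <<<
        # placeholder tool; it may or may not write metrics.txt
        run_tool --out result.txt || true
      >>>
      output {
        File result = "result.txt"
        # optional output: valid per the WDL spec, but this is the pattern
        # that was being rejected when running on Terra (issue (ii) above)
        File? metrics = "metrics.txt"
      }
    }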

    I am glad I was able to find workarounds for each one of these issues, but it made for a lot of frustration over the last two days. I really hope other users will not have to go through the same experience.

    I think the most frustrating thing of all is that Cromwell's main page says: "Trivially transition between one off use cases to massive scale production environments". I think there could be a link explaining all the issues that make this transition not so trivial.

    Anyway, again, I am all set and I did get my workflow to fully run on Terra(!) and that was quite fulfilling. I hope this feedback is helpful.

    Giulio

    PS Unfortunately, after finding the workaround, I deleted the bucket with the data I referenced in my post.

  • Chris Llanwarne

    Hi Giulio,

    I'm glad you managed to get these working with the workarounds you mentioned. I can answer your questions about _why_ Cromwell ended up the way it is, but in general I agree: it would be really nice if we could be closer to the spec in all respects, and it would be very nice (and not too much overhead, I should think) to list any known gaps between openWDL and Terra somewhere easy to find, for people in the future who are trying to make the transition from local to cloud and running into problems.

    (i) Cromwell allows workflows to have multiple backends, and filesystems, configured at the same time. This means that within a task you have a single assigned backend and a single default filesystem, but at the workflow level there is no filesystem, so it's impossible to know where the file was intended to be stored. That's the theory, and why Cromwell struggles to calculate it. But (1) Terra only has a single backend, and (2) we do have a concept of a default backend even when multiple are available, so we probably could do more to use a sensible default filesystem in cases like yours, where it absolutely makes sense to be able to stage maps and other files into GCS at the workflow level before sending them to tasks.
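
    As a minimal sketch of what that could look like on a single-backend setup like Terra (the workflow and task names here are illustrative, not part of Cromwell or Terra):

    version development

    workflow stage_list {
      input {
        Array[String] samples
      }
      # At the workflow level there is no assigned backend or filesystem, so on
      # Terra a call like the following has nowhere obvious to write the file:
      #   File sample_list = write_lines(samples)
      # Wrapping the serialization in a trivial task gives it a backend and a
      # filesystem, and the file is then delocalized like any other task output.
      call write_list { input: lines = samples }
      output {
        File sample_list = write_list.out
      }
    }

    task write_list {
      input {
        Array[String] lines
      }
      command <<<
        cp ~{write_lines(lines)} out.txt
      >>>
      output {
        File out = "out.txt"
      }
      runtime {
        docker: "ubuntu:20.04"
      }
    }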

    (ii) and (iii) are side effects of Cromwell's first cloud backend being JES (now named PAPIv1, the Pipelines API on Google Cloud). In PAPIv1 the request to the API needs to specify the spec of the VM we want to run on, which files to localize at the start, which script to run, and which files to delocalize at the end. PAPIv1 then deletes the VM, along with any files we didn't record ahead of time as worth rescuing, before letting us know that it did what we asked. That model makes writing simple jobs in PAPIv1 easy, but it doesn't fit well with file outputs that are defined as functions of other file outputs (we can't predict the result ahead of time), nor with file outputs that are optional. However, now that we're operating on PAPIv2 we do have the opportunity to refactor some of that localization/delocalization logic to happen on the VM itself after the job completes, rather than having to predict it ahead of time in the Cromwell engine. We have tickets in our backlog to do just that.

    Just a note on timelines to manage expectations: there are potential answers, and tickets, for most of your "I wish this could be better" comments, and I completely agree with you that it would be awesome if we could get those done to make people's lives easier. Right now the team is focused more on scaling, and on addressing a couple of scaling issues we have with the PAPI backend to unblock greater throughput of jobs per user on Google Cloud, but I'll raise these user experience / quality-of-life issues with the team and our product owner to see whether they can be given any priority in the next couple of sprints. Ultimately, choosing which features to work on first is always a balancing act between developer time allocation and anticipated overall impact!

    Thanks,

    Chris

  • Giulio Genovese

    As it is relevant to this post, I will mention that another possible solution to the requirement that the list of output files be computable before the task starts is to implement the suffix() function. This is already part of the development specification of WDL, but it is not supported by Cromwell at the moment. The Directory solution is more general, but I think most software generates output with filenames that are deterministic functions of the input filenames. This alternative model for handling arrays of output files would probably also require basename() to accept both String and Array[String] inputs, similarly to the way size() does, so perhaps just having the Directory type is the more minimalist way to design the next version of WDL.
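
    If I read the development specification correctly, the idea would look roughly like this (sample_ids is an illustrative Array[String] task input):

    output {
      # output names are deterministic functions of the input sample names, so the
      # engine would know them before the task starts and could plan delocalization
      Array[File] gtc_files = suffix(".gtc", sample_ids)
    }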

