select_first() doesn't function as expected on optional outputs

Post author
Ash O'Farrell

I've been struggling to get a workflow to work in Terra. In attempts to narrow down the issue I've ended up with a workflow that looks completely arbitrary, but I promise I came across this issue in something that's actually productive.

First Attempt

The problematic task attempts to define finalDiskSize as a function of a File? input named test. finalDiskSize is used as a runtime attribute.

Locally, this works fine even if File? test does not exist, but on Terra it fails before task execution because Cromwell declares that test_size doesn't have a valid value. Using "if then else" logic in the disk size runtime attribute, instead of or in addition to "if then else" logic in the calculation of test_size, also fails with the same error.
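
For readers without the screenshot, the pattern described here looks roughly like the sketch below. It is a hypothetical reconstruction, not the exact task (that is linked under Source code). The Docker image, the command, and the 1 GB of padding are assumptions made for illustration.

```wdl
version 1.0

task filecheck {
  input {
    File truth
    File? test   # optional: may be undefined at call time
  }

  # Assumed form of the "if then else" logic: use 0 GB when test is absent.
  Float test_size = if defined(test) then size(test, "GB") else 0.0
  # Padding of 1 GB is an arbitrary illustrative choice.
  Int finalDiskSize = ceil(size(truth, "GB") + test_size) + 1

  command <<<
    md5sum ~{truth} ~{test}
  >>>

  runtime {
    docker: "ubuntu:20.04"   # assumed image
    disks: "local-disk " + finalDiskSize + " HDD"
  }
}
```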

Second Attempt Setup

Although I don't know the specifics, I'm aware localization on a Google backend is different than on a local execution, so I tried this workaround in the workflow section:

Here, run_example_wf.wf_magicword is type File, run_example_wf.wf_nonexistent is type File?, and fallback.bogus is type File. The fallback file is a blank text file whose only purpose is to prevent the task from erroring out. If it is used as the second element in a select_first() array, it acts as a fallback input should the first element not exist.

In the context of the "checker" workflow, run_example_wf.wf_nonexistent (a workflow level output of run_example_wf) never exists; it is a File? output of run_example_wf that would match the pattern zzyzx.txt, which does not get written at any point of run_example_wf.
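
A minimal sketch of this workaround follows, flattened into a single workflow for brevity (the real setup uses a separate checker workflow that calls run_example_wf). The task name one_is_missing is taken from the error log below. The magicword.txt filename, the Docker image, and the workflow-level fallback and truth inputs are assumptions.

```wdl
version 1.0

task one_is_missing {
  command <<<
    echo "magic" > magicword.txt
  >>>
  output {
    File magicword = "magicword.txt"
    # zzyzx.txt is never written, so this optional output is expected to be undefined.
    File? nonexistent = "zzyzx.txt"
  }
  runtime {
    docker: "ubuntu:20.04"   # assumed image
  }
}

task filecheck {
  input {
    File test
    File truth
  }
  command <<<
    md5sum ~{test} ~{truth}
  >>>
  runtime {
    docker: "ubuntu:20.04"   # assumed image
  }
}

workflow checker {
  input {
    File fallback   # blank placeholder file
    File truth
  }

  call one_is_missing

  call filecheck {
    input:
      # Intended behavior: skip the undefined File? and use the fallback.
      test = select_first([one_is_missing.nonexistent, fallback]),
      truth = truth
  }
}
```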

Second Attempt Results

What confuses me is that my filecheck task (see first screenshot) errors out again (only on Terra) with this setup, even with the fallback file in the select_first() array. I assume my first attempt, which did not use select_first(), may have failed as a side effect of how localization works, but as far as I can tell select_first() is not functioning correctly in my second attempt. It appears that Cromwell is attempting to localize a file that does not exist even though I am using select_first() to prevent this from happening.

Error attempting to localize file with command: 'mkdir -p '/cromwell_root/fc-4e8db524-9266-47eb-ad44-9b54fee6decd/3f07149a-33ba-4d82-9c36-89c3ec7aa699/checker/4e73aaf9-a1a7-4807-9e2e-c835c9da0953/call-run_example_wf/run_example_wf/385dde28-cf86-44cb-a91b-473fb1242e87/call-one_is_missing/' && rm -f /root/.config/gcloud/gce && gsutil -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=1' cp 'gs://fc-4e8db524-9266-47eb-ad44-9b54fee6decd/3f07149a-33ba-4d82-9c36-89c3ec7aa699/checker/4e73aaf9-a1a7-4807-9e2e-c835c9da0953/call-run_example_wf/run_example_wf/385dde28-cf86-44cb-a91b-473fb1242e87/call-one_is_missing/zzyzx.txt' '/cromwell_root/fc-4e8db524-9266-47eb-ad44-9b54fee6decd/3f07149a-33ba-4d82-9c36-89c3ec7aa699/checker/4e73aaf9-a1a7-4807-9e2e-c835c9da0953/call-run_example_wf/run_example_wf/385dde28-cf86-44cb-a91b-473fb1242e87/call-one_is_missing/''
CommandException: No URLs matched: gs://fc-4e8db524-9266-47eb-ad44-9b54fee6decd/3f07149a-33ba-4d82-9c36-89c3ec7aa699/checker/4e73aaf9-a1a7-4807-9e2e-c835c9da0953/call-run_example_wf/run_example_wf/385dde28-cf86-44cb-a91b-473fb1242e87/call-one_is_missing/zzyzx.txt
grep: /cromwell_root/stderr: No such file or directory

Third Attempt

So with that in mind I stopped accounting for the test file entirely in my disk size calculation. Instead, I just assume the test file is about the same size as the truth file and double the size of the truth file. I am still taking in the test file as an optional input because, if it does exist, I want to compare the truth file against it. Unfortunately, running this on Terra still attempts to localize the file that does not exist instead of falling back to the fallback file, and the task still fails.

select_first() can work as expected... sometimes?

Here's where things get interesting... I have a second file, wf_never, that does not get created because the task creating it does not ever get called. This workflow generates the workflow-level output wf_always, a File named foo.txt, and wf_never, a File? named bizz.txt... except, of course, wf_never will never be created as its task is never called.

Using select_first() in that instance passes, even on Terra.

So it seems that select_first() acts as expected (or at least my understanding of it, please correct me if I'm wrong) when the task that would create a File? is never called, but not if the task that would create a File? is called.
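
The passing case can be sketched as below. How the task is skipped is not stated here, so the conditional if block, the run_never flag, the task names, the Docker image, and the fallback input are all assumptions. Only foo.txt, bizz.txt, wf_always, and wf_never come from the description above.

```wdl
version 1.0

task always {
  command <<<
    echo "foo" > foo.txt
  >>>
  output {
    File out = "foo.txt"
  }
  runtime {
    docker: "ubuntu:20.04"   # assumed image
  }
}

task never {
  command <<<
    echo "bizz" > bizz.txt
  >>>
  output {
    File out = "bizz.txt"
  }
  runtime {
    docker: "ubuntu:20.04"   # assumed image
  }
}

task filecheck {
  input {
    File test
    File truth
  }
  command <<<
    md5sum ~{test} ~{truth}
  >>>
  runtime {
    docker: "ubuntu:20.04"   # assumed image
  }
}

workflow run_example_wf {
  input {
    File fallback
    Boolean run_never = false   # hypothetical switch; the task is skipped by default
  }

  call always

  # Because the call sits in an if block, never.out has type File? outside it.
  if (run_never) {
    call never
  }

  call filecheck {
    input:
      test = select_first([never.out, fallback]),
      truth = always.out
  }

  output {
    File wf_always = always.out
    File? wf_never = never.out
  }
}
```

The difference from the failing case is where the optionality comes from. Here the File? is undefined because the call never ran, whereas in the failing case the task ran and declared a File? output whose file was never written.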

Questions

Is there a way to estimate disk size at runtime when dealing with a File?, and is there a reason why two different File? values show different behavior with select_first()?

Source code

* Filechecker task: https://github.com/dockstore/checker-WDL-templates/blob/debug-terra/checker_tasks/filecheck_task.wdl

* Workflow: https://github.com/dockstore/checker-WDL-templates/blob/debug-terra/check_wf_outputs/outputs_some_optional/parent_opt.wdl

* Checker workflow: https://github.com/dockstore/checker-WDL-templates/blob/debug-terra/check_wf_outputs/outputs_some_optional/template_opt.wdl

Thanks in advance for your time and sorry for how long this post is. I'm hoping the explanations I gave will save some time in testing.

Comments

12 comments

  • Comment author
    Samantha (she/her)

    Hi Ash O'Farrell,

     

    Thank you for writing in about this issue. Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.

    1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press enter on your keyboard.
    2. Click Save.

     

    Please provide us with

    1. A link to your workspace
    2. The relevant submission ID(s)
    3. The relevant workflow ID(s)

     

    We’ll be happy to take a closer look as soon as we can!

     

    Kind regards,

    Samantha

  • Comment author
    Ash O'Farrell

    I'm so used to working with controlled data I didn't even think of that. It's all public data this time, and the support group has been added.

    Workspace link: https://terra.biodatacatalyst.nhlbi.nih.gov/#workspaces/anvil-stage-demo/checker/job_history/5bacba50-3672-4d09-af0c-992316b3f1f0
    Submission ID: 5bacba50-3672-4d09-af0c-992316b3f1f0

    All workflows within that submission are relevant either as "this didn't work" or "this similar thing did work."

  • Comment author
    Samantha (she/her)

    Hi Ash O'Farrell,

    Sorry for the delayed response. I brought this to our engineers to look into and while debugging, they determined that there was unexpected behavior where missing outputs are not returning NULL on some Cromwell backends (PAPIV2Beta and PAPIV2Alpha1), which probably explains the weird behavior you're seeing with the select_first() function in Terra vs locally. Our engineers have created a bug ticket to look into this. I'll be sure to let you know of any updates.

    Best,

    Samantha

  • Comment author
    Ash O'Farrell

    I've recently started seeing some odd behavior with select_all() as well, where null outputs appear to not be getting dropped. I'm wondering if that might be related to this bug. Has there been an update on this?

  • Comment author
    Jason Cerrato

    Hey Ash,

    Thanks for writing in about this. The original bug does seem to have been resolved - apologies for not successfully reporting back on that earlier as we said we would. We will work to improve our tracking process to avoid these situations falling through the cracks in the future.

    I will do some digging for select_all() to see if I can identify what's going on. Is this related to the other thread we already have going? I just want to make sure I already have the relevant information about select_all().

    Kind regards,

    Jason

  • Comment author
    Ash O'Farrell

    Yes, this is relevant to the other thread too. I again had select_all() not working as expected, this time on file outputs. I wrote a thread about it on the OpenWDL Slack, which I'll summarize here:

    I have a scattered task that runs fasterq-dump on some SRA accessions, returning Array[File]? fastqs (outside the scatter this becomes Array[Array[File]?]). Sometimes fasterq-dump returns only one fq file instead of paired reads. I can't make use of that, so I delete it and return nothing, hence why I'm using Array[File]? instead of Array[File] as my output. I don't want empty arrays in my subsequent tasks, so I want to use select_all() to drop the empty arrays. Unfortunately, select_all() does not drop the outputs as expected.

    To be specific, I think the issue might be with the globbing of Array[File]? itself, as that returns [] when it should be returning None. As a specific example, I ran on SRR1002694 and SRR11947402, globbing the output like this: Array[File]? fastqs = glob("*.fastq")

    This is the output of Array[Array[File]?] all_fastqs = select_all(pull.fastqs):
    "SRA_YOINK.all_fastqs": [["/private/var/folders/vp/327wktbj3wqb65q3v3n8qpxc0000gn/T/1666916T328389-0/dockstore-is-cool/SRA_YOINK/0c11bd0d-fc12-41b2-8371-db932b2ae91d/call-pull/shard-0/execution/glob-db248e3bce81b54f5ef521878fe9e9de/SRR1002694_1.fastq", "/private/var/folders/vp/327wktbj3wqb65q3v3n8qpxc0000gn/T/1666916328389-0/dockstore-is-cool/SRA_YOINK/0c11bd0d-fc12-41b2-8371-db932b2ae91d/call-pull/shard-0/execution/glob-db248e3bce81b54f5ef521878fe9e9de/SRR1002694_2.fastq"], []]

    That second empty array isn't getting dropped by select_all(). Daniel Park pointed out that glob will always return a defined array, even if that array is empty. His solution to use if(length(pull.fastqs)>1) did work, so this isn't a blocker, but there are two weird things going on here.

    * If a glob always returns a defined array, even if no files match the glob, that means it isn't really possible to return File?, Array[File]?, or Array[File?] unless you know the filename before runtime and/or can derive it using WDL built-ins under the inputs section. Is that an intentional limitation of WDL, or is this a bug?

    * That mysterious "zombie" workflow I have the other ticket open on was running such that tarballs were being set as an input to subsequent tasks, using just select_all(), rather than a nested if(length(pull.fastqs)>1) then select_all(). This could be a red herring, but there is a possibility the zombie state might have been caused by select_all() not dropping empty types... maybe?
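
    The scattered-glob setup described in this comment can be sketched as follows. This is an illustrative stand-in, not the real pull task: instead of running fasterq-dump, a hypothetical n_files input controls how many .fastq files get written, and the Docker image is assumed.

    ```wdl
    version 1.0

    task pull {
      input {
        Int n_files   # hypothetical stand-in for "how many fastqs survived"
      }
      command <<<
        for i in $(seq 1 ~{n_files}); do
          echo "read $i" > "sample_$i.fastq"
        done
      >>>
      output {
        # glob() yields a defined array even when nothing matches,
        # so this optional is reported above to come back as [] rather than undefined.
        Array[File]? fastqs = glob("*.fastq")
      }
      runtime {
        docker: "ubuntu:20.04"   # assumed image
      }
    }

    workflow select_all_on_globs {
      scatter (n in [2, 0]) {
        call pull { input: n_files = n }
      }

      output {
        # select_all() only drops undefined elements; an empty array is still defined.
        Array[Array[File]] all_fastqs = select_all(pull.fastqs)
      }
    }
    ```

    Going by the output quoted above, the second shard's empty array is kept, because select_all() filters on definedness rather than on length.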

  • Comment author
    Jason Cerrato

    Hey Ash,

    Thanks for the follow-up. It will be helpful for us and the engineers if we have the WDL in hand to examine for full context. Can you point us to the relevant, specific lines of the WDL you're working in?

    • The line(s) where globbing of Array[File]? happens
    • The line(s) where you attempt to use select_all() to drop empty arrays
    • The line(s) where you are trying to return File?, Array[File]?, and/or Array[File?]

    We'll be happy to dig into this and get back to you as soon as we can. Note that it may be in January 2023 as our office will be closed starting on December 23.

    Please also let me know if investigating this situation or what happened with the mysterious "zombie" workflow is more pressing for you. I will be sure to prioritize accordingly.

    Many thanks,

    Jason

  • Comment author
    Ash O'Farrell

    This isn't urgent, feel free to address this when you can.

    So I put together a flowchart explaining how the workflow works: https://github.com/aofarrel/myco/blob/main/doc/overview.md

    Note that the report task (#3 in the explanation) does not exist in the "segments" branch where the zombie-ness happened. Here's the segments branch that ran on Terra: https://github.com/aofarrel/myco/blob/segment/myco.wdl

    Lines where globbing of Array[File]? happens/The lines where you are trying to return File?, Array[File]?, and/or Array[File?]

    This happens when pulling fastqs from SRA. Sometimes, the output can't be used, so we delete it and try to return nothing at line 205 here: https://github.com/aofarrel/SRANWRP/blob/v1.1.1/tasks/pull_fastqs.wdl#L205

    Note that the Array[File]? at line 205 are pairs of fastqs (if any valid ones exist). The File? at line 206 is the same idea -- if it's invalid, it doesn't exist -- the only difference being it's a tarball of the paired fastqs. 

     

    Lines where attempting to use select_all() to drop empty arrays:

    This happens twice -- first, at line 34 (https://github.com/aofarrel/myco/blob/segment/myco.wdl#L34), nested in an if(length(pull.fastqs)>1) block. This select_all() seems to work for my purposes (i.e., dropping empty arrays) because of the length check it is nested in. In other words, this is operating on the Array[File]? output of line 205 of pull_fastqs.wdl.

    The second time is on https://github.com/aofarrel/myco/blob/segment/myco.wdl#L74 at line 74. This one operates on the File? tarballs at line 206 of pull_fastqs.wdl. This select_all() is not nested with anything about length, and I think that might be the problem. The "zombie" workflow was run such that less_scattering was true, and therefore those line-74-select_all() tarballs are going into subsequent tasks segfault (line 76), decontaminate_many_samples (line 83), etc -- not the line-34-nested-select_all() paired fastqs.

    Like I said, I don't have hard proof that this is definitely what caused that one workflow to go zombie mode. (Sorry, I don't know what else to call it. "Stalled" doesn't seem right since it said it was failing.) All I know is that, as is, it seems it isn't really possible to return File?, Array[File]?, or Array[File?] unless you know the filename before runtime and/or can derive it using WDL built-ins under the inputs section -- and by that I mean, yeah, you can technically return File?/Array[File]?/Array[File?], but they are functionally equivalent to File/Array[File] since they are never undefined when you use globbing, and that means select_all() doesn't work on them. That seems to be an oversight, even if it's not the cause of that one workflow being weird.

    Thanks for your time -- I really do appreciate it.

  • Comment author
    Jason Cerrato

    Hey Ash,

    Just wanted to let you know we haven't forgotten about this. Our engineers have been highly wrapped up in the work related to Terra on Azure, but we will be engaging with them on this as soon as they have some availability.

    Thank you for your patience!

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato
    • Edited

    Hey Ash,

    I've heard back from an engineer who had some time to take a closer look at this. Their impression is that your desire is to see Array[File]? be None if there are no results from the glob, but they aren't sure what the benefit is over [] since you would still need to handle it in your next step as "use if not empty", just with a different definition of empty. Does this sound accurate?

    One thing they're wondering is where the benefit for select_all comes from. Since glob doesn’t know how many files you expect and just lists what it finds, listing an array of [None, None] and then filtering that down to [] with select_all doesn’t seem any better than using glob and getting the [] upfront. Am I right that you are expecting that Array[File]? would return None rather than [None, None]?

    To take it way back to your original post's Question

    Is there a way to estimate disk size at runtime when dealing with a File?

    Getting a file size from a potentially empty input file is supposedly covered by the standard library, which claims that `size` works for optional files and automatically gives a 0 if they’re empty: https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#float-sizefile-string

    I'm not sure if that last point is still relevant to you, but we thought it might be helpful to share.
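
    A minimal sketch of what that spec section describes, applied to the disk size question. It assumes the backend implements size() on File? as the WDL 1.0 spec states, which this thread does not confirm for Terra. The Docker image, the command, and the 1 GB of padding are assumptions.

    ```wdl
    version 1.0

    task filecheck {
      input {
        File truth
        File? test
      }

      # Per the WDL 1.0 spec, size() of an undefined optional file is 0,
      # so no explicit defined() check should be needed here.
      Int finalDiskSize = ceil(size(truth, "GB") + size(test, "GB")) + 1

      command <<<
        md5sum ~{truth} ~{test}
      >>>

      runtime {
        docker: "ubuntu:20.04"   # assumed image
        disks: "local-disk " + finalDiskSize + " HDD"
      }
    }
    ```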

    Kind regards,

    Jason

  • Comment author
    Ash O'Farrell
    • Edited

    I essentially want to coerce Array[Array[File]?] created via a scattered task's glob() into Array[Array[File]], because trying to scatter on arrays with optional types is a headache and/or not possible (https://support.terra.bio/hc/en-us/requests/288912).

    In theory, if the task output of a scattered task is Array[File]?, you should be able to coerce that into an Array[Array[File]] by using select_all() outside of the scatter to drop the empty arrays. That appears to be the intended use case of select_all(). But in practice that doesn't seem to ever happen, whether or not you're using globs. (Earlier I incorrectly implied optionals created via globbing being [] instead of None is the main issue, but after more thorough testing I realize it's more complicated than that.) I wrote up an example WDL here showcasing this. In summary, it seems you actually need to either know exactly how many File?s to expect (if any) and their filenames, or you need to use Array[File?], length(), select_all(), and then select_all() a second time.

    To explain a little further: If you want to do a basic scatter on an Array[Array[File]] constructed from a previous scattered task without needlessly creating extra VMs that will just do nothing (or error out) because you're passing in empty arrays, it appears you only have two options:

    1. You declare one File? per expected output in your first scattered task, then create one Array[File] per File? output via select_all() outside of the scatter, and then combine those Array[File]s into Array[Array[File]]. This doesn't work for my purposes, as I might have 0 files, 2 files, 4 files, 6 files, or more coming out of my first scattered task, and I can't predict their filenames before runtime (ie I have to use glob()).
    2. You declare one Array[File?] output in the task via globbing. Within the scatter, you sort-of coerce that into an Array[File] by checking if the length of Array[File?] > 1, and if yes, declare an Array[File] via select_all(task's output Array[File?]). This is only a sort-of coercion because outside the scatter, you have to use select_all() a second time in order to get an Array[Array[File]].
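
    Option 2 can be sketched roughly as follows. The pull and consume tasks are hypothetical stand-ins (no real SRA download happens), the n_files input just controls how many files get written, and the Docker image is assumed.

    ```wdl
    version 1.0

    task pull {
      input {
        Int n_files   # hypothetical stand-in for the number of fastqs produced
      }
      command <<<
        for i in $(seq 1 ~{n_files}); do
          echo "read $i" > "sample_$i.fastq"
        done
      >>>
      output {
        Array[File?] fastqs = glob("*.fastq")
      }
      runtime {
        docker: "ubuntu:20.04"   # assumed image
      }
    }

    task consume {
      input {
        Array[File] fastqs
      }
      command <<<
        cat ~{sep=" " fastqs} | wc -l
      >>>
      runtime {
        docker: "ubuntu:20.04"   # assumed image
      }
    }

    workflow two_step_select_all {
      input {
        Array[Int] counts = [2, 0]
      }

      scatter (n in counts) {
        call pull { input: n_files = n }

        # First select_all(): Array[File?] -> Array[File], only for non-trivial shards.
        if (length(pull.fastqs) > 1) {
          Array[File] paired = select_all(pull.fastqs)
        }
      }

      # Outside the scatter, paired is Array[Array[File]?]; shards that skipped the
      # if block are undefined, so a second select_all() drops them.
      Array[Array[File]] all_paired = select_all(paired)

      scatter (pair in all_paired) {
        call consume { input: fastqs = pair }
      }
    }
    ```

    The second select_all() works here because the optionality comes from the if block, not from the glob: a shard that fails the length check leaves paired undefined rather than empty.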

    Neither of these is intuitive, least of all the more useful #2, which is why I'm thinking that select_all() is bugged... or I'm missing a hilariously obvious way to use it. I certainly have messed up WDLs before!

  • Comment author
    Jason Cerrato

    Hey Ash,

    I've heard back from one of the engineers about this. They would like to first confirm that you want to do something like this:

      task t1:
        - Generates (or sometimes doesn’t) some Files, based on some input value.
      task t2:
       - Processes an Array[File] as long as its length is 1 or more (if the length is 0, or it gets given an optional File? input, it fails)
       - No return value in the example... Let’s say they want to return String t2_out = stdout()?
      Workflow w1:
       - Receive input data as an array. Scatter over it and generate a few arrays of files using task t1
       - Process them through task t2 (if appropriate)
       - Return: not obvious from the example, maybe an Array[String?] representing the gathered outputs of many shards of t2?
    If so:
      My suggestion would be:
       - Scatter over the input array
         - Call t1. Use globs in t1 to generate an Array[File] rather than Array[File?]. (eg Array[File] glob_files = glob("*.txt") ). If the task generates no output files, this will be the empty array []
         - Use an if block inside the scatter to call t2, only if appropriate. (eg if (length(t1.glob_files) > 0) { call t2 } ). The output of t2 is normally a String, but from outside the if block it gets “gathered” into a String? to represent the possibility that the task doesn’t run.
       - Outside both the if and the scatter: gather results from t2. ( eg as a workflow output: output { Array[String?] t2_outs = t2.t2_out }   )
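
    Written out as WDL 1.0, the suggestion above might look like the sketch below. The task bodies are placeholders: the n_files input, the commands, and the Docker image are assumptions, and t2_out is read from stdout with read_string() rather than assigned from stdout() directly.

    ```wdl
    version 1.0

    task t1 {
      input {
        Int n_files   # hypothetical input controlling how many files are written
      }
      command <<<
        for i in $(seq 1 ~{n_files}); do
          echo "$i" > "out_$i.txt"
        done
      >>>
      output {
        # Non-optional: an empty glob simply gives [].
        Array[File] glob_files = glob("*.txt")
      }
      runtime {
        docker: "ubuntu:20.04"   # assumed image
      }
    }

    task t2 {
      input {
        Array[File] files
      }
      command <<<
        cat ~{sep=" " files} | wc -l
      >>>
      output {
        String t2_out = read_string(stdout())
      }
      runtime {
        docker: "ubuntu:20.04"   # assumed image
      }
    }

    workflow w1 {
      input {
        Array[Int] counts
      }

      scatter (n in counts) {
        call t1 { input: n_files = n }

        # Only spend a VM on t2 when t1 actually produced files.
        if (length(t1.glob_files) > 0) {
          call t2 { input: files = t1.glob_files }
        }
      }

      output {
        # One entry per shard; undefined where t2 was skipped.
        Array[String?] t2_outs = t2.t2_out
      }
    }
    ```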

    They also mentioned

    select_all() always turns Array[X?] into an Array[X]. But I don’t think it’s needed at all here if they use glob to generate Array[X] in the first place.

    They based their comments on the recent code example you provided. If this doesn't answer your question, it would be helpful if you can frame your question in terms of “I have this task and that task, and I want a workflow that does xyz." They also mentioned they would be happy to iterate with you based on the example you provided.

    I believe they are largely tackling this from 1. trying to understand your end goal, and then 2. providing a recommendation. If you actually just want to talk about select_all() or optional outputs specifically, please let us know.

    Kind regards,

    Jason

