Job Manager fails to provide more than one error message from the metadata

Post author
Giulio Genovese

Some of the users of my workflows are having a hard time debugging a workflow when providing incorrect inputs. I have managed to create a minimal workflow that exposes the problem:

version 1.0

workflow main {
  input {
    File file
  }
  call main {
    input:
      file = file
  }
}

task main {
  input {
    File file
    Int? disk_size_override
    Float? memory_override
  }
  Float filesize = size(file, "GiB")
  Int disk_size = ceil(10.0 + filesize)
  Float memory = 3.5 + filesize

  command <<<
  >>>

  runtime {
    docker: "debian:stable-slim"
    disks: "local-disk " + disk_size + " HDD"
    memory: memory + " GiB"
  }
}

The workflow requires a single file as input. Suppose that, after loading this workflow into a workspace, a user mistakenly provides the path to a file that does not exist. After running the workflow, they might (the behavior seems somewhat stochastic) see the following error message in the Job Manager:

Failed to evaluate input 'memory' (reason 1 of 1): ValueEvaluator[IdentifierLookup]: No suitable input for 'filesize' amongst {file, disk_size_override, memory_override}

This error message is hard to interpret and leaves the user with no idea what to do. However, retrieving the metadata using the following commands:

gcloud auth application-default login
curl -X GET "https://api.firecloud.org/api/workflows/v1/{workflow-id}/metadata" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" | jq

shows that there are actually three error messages:

Failed to evaluate input 'memory' (reason 1 of 1): ValueEvaluator[IdentifierLookup]: No suitable input for 'filesize' amongst {file, disk_size_override, memory_override}
Failed to evaluate input 'disk_size' (reason 1 of 1): ValueEvaluator[IdentifierLookup]: No suitable input for 'filesize' amongst {disk_size_override, memory_override, file}
Failed to evaluate input 'filesize' (reason 1 of 1): [Attempted 1 time(s)] - FileNotFoundException: {input-file}
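
For anyone who saves the metadata to a file, the messages can also be pulled out with a short script. The following is only a minimal Python sketch, not Terra's or Cromwell's own code. It assumes the metadata JSON has a top-level "failures" list whose entries carry a "message" string and an optional nested "causedBy" list. The function name collect_failure_messages and the SAMPLE_METADATA document are hypothetical and hand-written for illustration; the sample is not real API output.

```python
from typing import Any, Dict, List, Optional


def collect_failure_messages(metadata: Dict[str, Any]) -> List[str]:
    """Return every distinct failure message in a metadata document.

    Assumed layout: metadata["failures"] is a list of objects with a
    "message" string and an optional nested "causedBy" list.
    """
    messages: List[str] = []

    def walk(failures: Optional[List[Dict[str, Any]]]) -> None:
        for failure in failures or []:
            message = failure.get("message")
            if message and message not in messages:
                messages.append(message)
            # Descend so that nested causes are reported as well.
            walk(failure.get("causedBy"))

    walk(metadata.get("failures"))
    return messages


# Hand-written sample in the assumed shape, reusing the three messages above.
SAMPLE_METADATA = {
    "failures": [
        {
            "message": "Workflow failed",
            "causedBy": [
                {"message": "Failed to evaluate input 'memory' (reason 1 of 1): ValueEvaluator[IdentifierLookup]: No suitable input for 'filesize' amongst {file, disk_size_override, memory_override}", "causedBy": []},
                {"message": "Failed to evaluate input 'disk_size' (reason 1 of 1): ValueEvaluator[IdentifierLookup]: No suitable input for 'filesize' amongst {disk_size_override, memory_override, file}", "causedBy": []},
                {"message": "Failed to evaluate input 'filesize' (reason 1 of 1): [Attempted 1 time(s)] - FileNotFoundException: {input-file}", "causedBy": []},
            ],
        }
    ]
}

for line in collect_failure_messages(SAMPLE_METADATA):
    print(line)
```

To use it on a real run, save the curl response to a file, load it with json.load, and pass the resulting dictionary to collect_failure_messages.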

The third error message (the FileNotFoundException) is the one that would have clearly told the user what the problem is. However, when there are multiple error messages, the Terra Job Manager displays only the first. This is a serious limitation, compounded by the fact that the Terra UI offers no way to download the metadata. I am not sure what to suggest, but at a minimum, when there are multiple error messages, all of them should be reported to the user, since reporting only the first can make debugging impossible.
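
To illustrate what a more helpful report could look like, here is a rough Python sketch that separates the root cause from the errors that merely follow from it. It is an assumption-laden example, not how the Job Manager or Cromwell works: the two regular expressions are based only on the wording of the three messages quoted above, and the names split_root_and_derived and MESSAGES are hypothetical.

```python
import re
from typing import List, Tuple

# Patterns inferred from the messages in this post; the wording may differ elsewhere.
FAILED_INPUT = re.compile(r"Failed to evaluate input '([^']+)'")
MISSING_DEPENDENCY = re.compile(r"No suitable input for '([^']+)'")


def split_root_and_derived(messages: List[str]) -> Tuple[List[str], List[str]]:
    """Split messages into (root causes, derived errors).

    A message counts as derived when it says 'X' is unavailable and another
    message reports that evaluating 'X' itself failed.
    """
    failed_names = set()
    for message in messages:
        match = FAILED_INPUT.search(message)
        if match:
            failed_names.add(match.group(1))

    roots: List[str] = []
    derived: List[str] = []
    for message in messages:
        dependency = MISSING_DEPENDENCY.search(message)
        if dependency and dependency.group(1) in failed_names:
            derived.append(message)
        else:
            roots.append(message)
    return roots, derived


# The three messages from the metadata, copied from this post.
MESSAGES = [
    "Failed to evaluate input 'memory' (reason 1 of 1): ValueEvaluator[IdentifierLookup]: No suitable input for 'filesize' amongst {file, disk_size_override, memory_override}",
    "Failed to evaluate input 'disk_size' (reason 1 of 1): ValueEvaluator[IdentifierLookup]: No suitable input for 'filesize' amongst {disk_size_override, memory_override, file}",
    "Failed to evaluate input 'filesize' (reason 1 of 1): [Attempted 1 time(s)] - FileNotFoundException: {input-file}",
]

roots, derived = split_root_and_derived(MESSAGES)
print("Root causes:")
for message in roots:
    print("  " + message)
print("Follow-on errors:")
for message in derived:
    print("  " + message)
```

Listing the root causes first would put the FileNotFoundException in front of the user even if only one message can be displayed.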

Comments

8 comments

  • Comment author
    Josh Evans

    Hi Giulio,

    Thanks for writing in with this great explanation! I agree that providing all the errors in the metadata is something the Job Manager should do moving forward. To that end, I've moved this post to our Active Feature Requests forum, and have already sent this request to our development team for consideration. I'll be happy to follow up with you if this feature gets built.

    Please let me know if you have any questions.

    Best,

    Josh

  • Comment author
    Josh Evans

    Hi Giulio,

    Our engineering teams looked over this request, and while we're still considering building this feature, we did find a workaround for now. The Workflow Dashboard on the Job History page actually shows all of the error messages in the metadata. We ran a test and found that all three errors appear under Workflow Level Failures.

    I hope that information is helpful to you.

    Best,

    Josh

  • Comment author
    Giulio Genovese

    I did notice that the Workflow Dashboard reports all error messages, which is good. For my own personal use I also know how to download the metadata through the API. I am mostly worried about users that would not be aware of this. Is there a plan to remove the Job Manager in favor of the Workflow Dashboard and if so is there a timeline?

    This might be a separate issue and maybe it should be reported to the Cromwell forums rather than the Terra forums, but it also seems to me that some of the error messages reported by Cromwell are weird. The message:

    Failed to evaluate input 'memory' (reason 1 of 1): ValueEvaluator[IdentifierLookup]: No suitable input for 'filesize' amongst {file, disk_size_override, memory_override}

    seems like it should never have been reported as an error message at all. Alternatively, something like "filesize could not be computed" would have been more appropriate.

  • Comment author
    Josh Evans

    Hi Giulio,

    Thanks for getting back to us. As far as I'm aware, there's no plan to remove the Job Manager at this time. As for your Cromwell question, while those forums may be the best place to ask about that error message, I can see what information I can find and let you know here if I find anything.

    Thanks,

    Josh

  • Comment author
    Josh Evans

    Hi Giulio,

    I wanted to let you know that I've sent your request for clearer Cromwell error messages to our development team for consideration, and I'll be happy to follow up with you if this feature gets built.

    Thanks,

    Josh

  • Comment author
    Giulio Genovese

    Thank you! 🙏

  • Comment author
    Yossi Farjoun

    If I may, I'd like to add a slight modification/addition to this request:

    It would be very helpful if the Job Manager could show, on the summary page (the one that has the "Messages" column), the error message(s) of the failed jobs. That would enable the user to take action on all the jobs with the same error.

  • Comment author
    Shoaib Rakhangi

    Hi Yossi,

    I've gone ahead and added your addition to the feature request ticket. Thanks for your contribution!

    Kind regards,
    Shoaib

