
The PreparingJob state consumes most of a task's running time; how can I avoid this?

Comments

8 comments

  • Jason Cerrato

    Hi Giulio,

    Thank you for sharing those details. I'd be happy to take a closer look. Can you provide a link to the workspace? Since the email address you shared it with is a group, we don't get individual notifications when we're added.

    Many thanks,

    Jason

  • Giulio Genovese

    This is the URL I use to access the workspace.

  • Jason Cerrato

    Hi Giulio,

    I've sent a request to be added to this authorization domain for the workspace.

    In the meantime, I'll see whether I have any visibility into what happened with this job through other means.

    Kind regards,

    Jason

  • Jason Cerrato

    Hi Giulio,

    We believe the long PreparingJob segments are due to the fact that the calls are retrieving the sizes of hundreds of files. Key lines in the WDL:

      task gtc2vcf {
        input {
          Array[File]+ gtc_files
        }
        Float gtc_size = size(gtc_files, "GiB")
        Int disk_size = select_first([disk_size_override, ceil(10.0 + bpm_size + csv_size + egt_size + ref_size + 2.0 * gtc_size + sam_size)])
        runtime {
          disks: "local-disk " + disk_size + " HDD"
        }
      }

    Before Cromwell can tell the Pipelines API what size disk to use, it has to measure the size of every one of those files. We suspect this is a time-versus-money trade-off: you wait patiently while Cromwell retrieves the sizes, and hopefully save on disk costs when the job eventually runs. An alternative that would make the job start faster is to not measure the sizes of hundreds of inputs, and instead aim for the sweet spot of enough disk for the job to run but not so much that the bill becomes outrageous.
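
    For illustration, one way to skip the measurement entirely would be to let the caller supply the disk size as a task input and drop the size() call altogether. A minimal sketch of that idea (the input name and default value here are only examples, not the actual pipeline code):

      task gtc2vcf {
        input {
          Array[File]+ gtc_files
          # Hypothetical input: the caller supplies a disk estimate, so
          # Cromwell never has to measure the individual GTC files.
          Int disk_size = 500
        }
        runtime {
          disks: "local-disk " + disk_size + " HDD"
        }
      }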

    I hope this is helpful.

    Kind regards,

    Jason

     
  • Giulio Genovese

    If that's the case, there is an easy solution. Those files are all the same size anyway, so I can just change that code to:

      Float gtc_size = length(gtc_files) * size(gtc_files[0], "GiB")

    If this works and it removes the PreparingJob time from the workflow, it will have made my day!
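
    For reference, in the task above the change would look roughly like this (the other size terms from the original task are omitted for brevity, and this assumes the GTC files really are all about the same size):

      task gtc2vcf {
        input {
          Array[File]+ gtc_files
        }
        # Estimate the total size from the first file only, instead of
        # having Cromwell measure every file during PreparingJob.
        Float gtc_size = length(gtc_files) * size(gtc_files[0], "GiB")
        Int disk_size = ceil(10.0 + 2.0 * gtc_size)
        runtime {
          disks: "local-disk " + disk_size + " HDD"
        }
      }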

  • Jason Cerrato

    Hi Giulio,

    Glad to hear! Let us know if you find more success with this change.

    Kind regards,

    Jason

  • Giulio Genovese

    I am happy to say that the PreparingJob wait time is completely gone from my pipeline(!). I do have a lingering question, though: does it take longer to estimate the size of a file in a multi-regional bucket than in a bucket located in the same region where the computation is taking place?

  • Jason Cerrato

    Hi Giulio,

    This shouldn't have an impact on the time it takes to estimate the size, since the size is precalculated: it is read from the object's metadata rather than computed from the file contents.

    Kind regards,

    Jason

