Support workflows with more than 50,000 nodes (Completed)

Post author
Matt Bookman

When trying to run joint-discovery-gatk4.wdl over 4,000+ samples, the workflow errored out with:

2019-07-09 23:21:54,288 ERROR - WorkflowExecutionActor-54cf1196-53fa-48ab-8f96-b042abc85549 [UUID(54cf1196)]: Job BackendJobDescriptorKey_CommandCallNode_JointGenotyping.CollectMetricsSharded:671:1 failed to be created! Error: Root workflow tried creating 50043 jobs, which is more than 50000, the max cumulative jobs allowed per root workflow
 
I believe that this occurs because the intervals file contains 10,187 intervals and the workflow scatters 5 times over those 10,187 intervals.
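A quick sanity check of that arithmetic (a sketch using the numbers above; the error presumably fires as soon as the cumulative count crosses the cap, which would be why it reports 50,043 rather than the projected total):

```scala
// Back-of-the-envelope check with the numbers from the post; a sketch,
// not Cromwell's actual job accounting.
object JobCountCheck extends App {
  val intervals = 10187   // entries in the intervals file
  val scatterPasses = 5   // times the workflow scatters over the intervals
  val maxJobs = 50000     // per-root-workflow cap reported in the error
  val totalJobs = intervals * scatterPasses
  println(s"Projected jobs: $totalJobs (cap $maxJobs, exceeded: ${totalJobs > maxJobs})")
  // prints: Projected jobs: 50935 (cap 50000, exceeded: true)
}
```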
 
I have worked around this problem by commenting out the collection and gathering of metrics (CollectMetricsSharded, GatherMetrics, and the *_metrics_files workflow outputs) so that the scatter over ApplyRecalibration can run. I can then re-run the workflow with call caching enabled, this time commenting out ApplyRecalibration, in order to gather the metrics.
 
But it would be great if this maximum could be increased.

Comments

8 comments

  • Matt Bookman

    Two additional notes:

    1- My statement above was incorrect:

    I can then re-run the workflow with call caching enabled, commenting out the ApplyRecalibration, in order to gather the metrics.

CollectMetricsSharded needs the output of ApplyRecalibration as its input, so it isn't as simple as I indicated. We will need to craft a separate workflow that takes the ApplyRecalibration output as input and does the metrics collection and gathering.

    2- I also noticed that the maximum number of jobs is configurable in Cromwell and the default is 1,000,000:

    https://github.com/broadinstitute/cromwell/blob/9d0cf9d964ef1328f73b69da7e21f51f3b604bc4/engine/src/main/scala/cromwell/engine/workflow/lifecycle/execution/WorkflowExecutionActor.scala

    private val DefaultTotalMaxJobsPerRootWf = 1000000
    private val DefaultMaxScatterSize = 1000000
    private val TotalMaxJobsPerRootWf = params.rootConfig.getOrElse("system.total-max-jobs-per-root-workflow", DefaultTotalMaxJobsPerRootWf)
    private val MaxScatterWidth = params.rootConfig.getOrElse("system.max-scatter-width-per-scatter", DefaultMaxScatterSize)
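    Assuming those keys are read from the deployment's Cromwell configuration, the override would be a small HOCON fragment like the one below. The values are illustrative (60,000 matches the request here); they are not confirmed Terra settings.

    ```hocon
    # Hypothetical Cromwell config override; key names are the ones looked up
    # in WorkflowExecutionActor.scala above. Values are illustrative only.
    system {
      total-max-jobs-per-root-workflow = 60000
      max-scatter-width-per-scatter = 1000000
    }
    ```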

    If possible, please increase the Terra configuration to 60,000 so that the joint discovery workflow can run to completion.

  • Matt Bookman

Note that I have added a GitHub issue for the workflow itself:

    https://github.com/gatk-workflows/gatk4-germline-snps-indels/issues/40

    If this 50,000 limit is going to stay as a hard limit, there are options within the workflow to examine.

  • Matt Bookman

    According to:

    https://support.terra.bio/hc/en-us/articles/360033659472-September-23-2019

    In Terra, each batch analysis workflow is subject to a limit on the number of jobs it can launch. In this release, the limit is increasing from 50,000 to 200,000.

    So this issue looks to have been addressed.

  • Sushma Chaluvadi

    Hello Matt,

Just double-checked our internal ticket, and it does indeed look like this ticket was completed!

  • Giulio Genovese

    A job I submitted yesterday on Terra failed with the following message:

    Workflow has scatter width 38717, which is more than the max scatter width 35000 allowed per scatter!

The scatter was not calling any task, so I did not worry about this as a potential issue when I wrote the WDL. Thankfully it was easy to remove the scatter from the WDL and package it as a separate task. But I could not find this hard limit in the documentation. Where would developers learn about such limits?

    Giulio

  • Jason Cerrato

    Hi Giulio Genovese,

    Thanks for writing in. Let me check with our documentation team and Cromwell team to see if we have this documented anywhere. If we don't, I'll make sure we get it documented!

    Kind regards,

    Jason

  • Giulio Genovese

    I looked at the code and the error is generated from the ScatterKey.scala file:

    if (scatterSize > maxScatterWidth) {
      workflowExecutionActor ! JobFailedNonRetryableResponse(this, new Exception(s"Workflow has scatter width $scatterSize, which is more than the max scatter width $maxScatterWidth allowed per scatter!"), None)
      WorkflowExecutionDiff(Map(this -> ExecutionStatus.Failed))
    }

    The MaxScatterWidth is defined in file WorkflowExecutionActor.scala as follows:

    private val DefaultTotalMaxJobsPerRootWf = 1000000
    private val DefaultMaxScatterSize = 1000000
    private val TotalMaxJobsPerRootWf = params.rootConfig.getOrElse("system.total-max-jobs-per-root-workflow", DefaultTotalMaxJobsPerRootWf)
    private val MaxScatterWidth = params.rootConfig.getOrElse("system.max-scatter-width-per-scatter", DefaultMaxScatterSize)

    So it is my understanding that by default Cromwell allows scatters with a width of 1,000,000 but somehow in Terra it is configured with a more modest limit of 35,000. Is there a way to see what configuration file is used to run the Cromwell server behind Terra?
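    If that reading is right, Terra's deployment presumably overrides the scatter-width key in its Cromwell config. A fragment of that shape would look like the sketch below; the 35,000 value is inferred from the error message, not read from Terra's actual configuration.

    ```hocon
    # Hypothetical fragment matching the limit the error message reports;
    # the key name comes from WorkflowExecutionActor.scala above.
    system {
      max-scatter-width-per-scatter = 35000
    }
    ```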

  • Jason Cerrato

    Hi Giulio Genovese,

    You can find this max scatter definition for Terra here: https://github.com/broadinstitute/firecloud-develop/blob/dev/base-configs/cromwell/cromwell.conf.ctmpl#L134

Note that you need to be a member of the broadinstitute GitHub organization to access the file.

    I've added a note to our internal documentation about scatter so others are made aware of this limit! Thank you for flagging this up.

    Kind regards,

    Jason
