Support workflows with more than 50,000 nodes (Status: Completed)
Trying to run joint-discovery-gatk4.wdl over 4000+ samples, the workflow errored out with:
2019-07-09 23:21:54,288 ERROR - WorkflowExecutionActor-54cf1196-53fa-48ab-8f96-b042abc85549 [UUID(54cf1196)]: Job BackendJobDescriptorKey_CommandCallNode_JointGenotyping.CollectMetricsSharded:671:1 failed to be created! Error: Root workflow tried creating 50043 jobs, which is more than 50000, the max cumulative jobs allowed per root workflow
I believe that this occurs because the intervals file contains 10,187 intervals and the workflow scatters 5 times over those 10,187 intervals.
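As a rough check of that arithmetic, here is a minimal sketch. The figure of five scatters is the estimate given above, jobs outside those scatters are ignored, and the function name is made up for illustration:

```python
# Rough estimate of jobs created by scattering over an intervals file.
# Assumption: every scatter launches exactly one job per interval.

def estimated_scatter_jobs(interval_count: int, scatters_over_intervals: int) -> int:
    """Return interval_count * scatters_over_intervals."""
    return interval_count * scatters_over_intervals

INTERVALS = 10_187          # intervals in the intervals file (from the post)
SCATTERS = 5                # scatters over those intervals (poster's estimate)
MAX_JOBS_PER_ROOT = 50_000  # limit quoted in the error message

total = estimated_scatter_jobs(INTERVALS, SCATTERS)
max_full_scatters = MAX_JOBS_PER_ROOT // INTERVALS

print(total)                      # 50935
print(total > MAX_JOBS_PER_ROOT)  # True
print(max_full_scatters)          # 4
```

By this estimate only four full scatters over the intervals fit under the 50,000 limit, which is consistent with the failure appearing partway through the fifth.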
I have worked around this problem by commenting out the collection and gathering of metrics (CollectMetricsSharded, GatherMetrics, and the *_metrics_files workflow outputs) so that the scatter over ApplyRecalibration can run. I can then re-run the workflow with call caching enabled, this time commenting out ApplyRecalibration, in order to gather the metrics.
But it would be great if this maximum could be increased.
Hi Giulio Genovese,
You can find this max scatter definition for Terra here: https://github.com/broadinstitute/firecloud-develop/blob/dev/base-configs/cromwell/cromwell.conf.ctmpl#L134
Note that you need to be a member of the broadinstitute GitHub organization to access the file.
I've added a note to our internal documentation about scatter so others are made aware of this limit! Thank you for flagging this up.
Two additional notes:
1- My statement above was incorrect:
"I can then re-run the workflow with call caching enabled, commenting out the ApplyRecalibration, in order to gather the metrics."
CollectMetricsSharded takes the output of ApplyRecalibration as its input, so it isn't as simple as I indicated. We will need to craft a separate workflow that takes the ApplyRecalibration output as input and does the metrics collection and gathering.
2- I also noticed that the maximum number of jobs is configurable in Cromwell and the default is 1,000,000:
private val DefaultTotalMaxJobsPerRootWf = 1000000
private val DefaultMaxScatterSize = 1000000
private val TotalMaxJobsPerRootWf = params.rootConfig.getOrElse("system.total-max-jobs-per-root-workflow", DefaultTotalMaxJobsPerRootWf)
private val MaxScatterWidth = params.rootConfig.getOrElse("system.max-scatter-width-per-scatter", DefaultMaxScatterSize)
If possible, please raise the limit in the Terra configuration to 60,000 so that the joint discovery workflow can run to completion.
Note that I have added a GitHub issue for the workflow itself:
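To illustrate the fallback behaviour in the Scala lines quoted above, here is a small Python sketch. It is not Cromwell code: the dictionary lookup only mimics `getOrElse`, the function name is invented, and the 50,000 override is inferred from the error message rather than read from Terra's actual configuration.

```python
# Mimics "use the configured value if present, otherwise the default".
# Keys and default values are taken from the Scala snippet quoted above.
DEFAULTS = {
    "system.total-max-jobs-per-root-workflow": 1_000_000,
    "system.max-scatter-width-per-scatter": 1_000_000,
}

def effective_limit(root_config: dict, key: str) -> int:
    """Return the override from root_config, or the default for that key."""
    return root_config.get(key, DEFAULTS[key])

# Hypothetical deployment with no overrides: the defaults apply.
plain = {}

# Hypothetical deployment that overrides only the job limit
# (50,000 is the value reported in the error message).
overridden = {"system.total-max-jobs-per-root-workflow": 50_000}

jobs_key = "system.total-max-jobs-per-root-workflow"
width_key = "system.max-scatter-width-per-scatter"

print(effective_limit(plain, jobs_key))        # 1000000
print(effective_limit(overridden, jobs_key))   # 50000
print(effective_limit(overridden, width_key))  # 1000000
```

The point is only that the 1,000,000 defaults apply when a deployment sets nothing, so a lower limit seen in practice must come from the deployment's own configuration.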
If this 50,000 limit is going to stay as a hard limit, there are options within the workflow to examine.
In Terra, each batch analysis workflow is subject to a limit on the number of jobs it can launch. In this release, the limit is increasing from 50,000 to 200,000.
So this issue looks to have been addressed.
Just double checked our internal ticket and it does indeed look like this ticket was completed!
A job I submitted yesterday on Terra failed with the following message:
The scatter was not calling any task so I did not worry about this as an issue when I wrote the WDL. Thankfully it was easy to remove the scatter from the WDL and package it as a separate task. But I could not find this hard limit in the documentation. Where would developers learn about such limits?
Hi Giulio Genovese,
Thanks for writing in. Let me check with our documentation team and Cromwell team to see if we have this documented anywhere. If we don't, I'll make sure we get it documented!
I looked at the code and the error is generated from the ScatterKey.scala file:
The MaxScatterWidth is defined in file WorkflowExecutionActor.scala as follows:
So it is my understanding that by default Cromwell allows scatters with a width of 1,000,000 but somehow in Terra it is configured with a more modest limit of 35,000. Is there a way to see what configuration file is used to run the Cromwell server behind Terra?