Workflow failure: "Workflow is making no progress but has the following unstarted job keys"
Hi - I am getting the following error in a large scatter job submission in Terra:
"Workflow is making no progress but has the following unstarted job keys"
The workflow is a version of the Mutect2 scatter task with 200 scatter jobs, most of which appear to have been successful (no failures; some preemptions, with retries ongoing).
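For context, here is a minimal WDL sketch of the scatter structure involved (the inputs, task body, and disk-sizing expression are illustrative placeholders, not the production workflow):

version 1.0

workflow Multisample_Variant_Calling {
  input {
    Array[File] interval_shards  # one interval shard per scatter job; 200 in this submission
  }

  scatter (shard in interval_shards) {
    # Per-shard input expression, evaluated by Cromwell before each call starts
    Int task_disk = ceil(size(shard, "GB")) + 20

    call Mutect2_Call_Variants {
      input:
        intervals = shard,
        task_disk = task_disk
    }
  }

  output {
    Array[File] shard_vcfs = Mutect2_Call_Variants.vcf
  }
}

task Mutect2_Call_Variants {
  input {
    File intervals
    Int task_disk
  }

  command <<<
    # Placeholder for the real `gatk Mutect2 ... -L ~{intervals}` invocation
    touch shard.vcf
  >>>

  output {
    File vcf = "shard.vcf"
  }

  runtime {
    docker: "broadinstitute/gatk:4.1.4.0"
    disks: "local-disk ~{task_disk} HDD"
    preemptible: 3  # preempted shards are retried automatically, as described above
  }
}

Each shard runs as an independent job, which is why individual shards can be preempted and retried without the whole workflow failing.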
Comments
Hi Arvind Ravi,
Thanks for writing in. We'll be happy to take a closer look at this!
Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org? You'll find the Share option in the three-dots menu at the top-right of your workspace:
1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press Enter on your keyboard.
2. Click Save.
Let us know the workspace name, as well as the relevant submission and workflow IDs, and we'll take a closer look as soon as we can.
Kind regards,
Jason
That sounds great, Jason - I suspect other users may encounter a similar issue at some point in the future!
Workspace: broad-firecloud-ibmwatson/Getz_IBM_Ravi_SeqOnly_WGS_copy_3-3-2020
Submission ID: df522a01-9492-46a1-b5ab-0bf66e7a031c
Look forward to your thoughts.
Best,
Arvind
Hi Arvind,
Thank you for your patience. We've confirmed that this error was due to a recently discovered bug, which was patched on Friday. We've also confirmed that all of the submission's jobs are indeed Done. We don't expect future runs of this job to run into the same issue, for you or anyone else on the platform.
Many thanks for flagging this up, and let us know if we can be of any further assistance!
Kind regards,
Jason
Fantastic - thanks so much!
Hi Jason,
I just wanted to report I'm having the same issue in a new run.
Workspace (already shared):
Submission IDs:
4b690b28-e3dc-42ee-909e-f9780a4a995a
6bee5985-ea3d-428c-84c2-39a0f960b9fe
The first submission with 300 scatters had the following error:
Call 302 failure message:
"Task Multisample_Variant_Calling.Mutect2_Call_Variants:191:3 failed. The job was stopped before the command finished. PAPI error code 2. Execution failed: selecting resources: querying available resources: getting project quota: querying machine types: listing machine types: googleapi: Error 503: Internal error. Please try again or contact Google Support. (Code: '5B19C1131CB9C.A302243.420C9033'), backendError"
The second submission had 200 scatters and the following error:
Workflow level failure message (first few lines):
"Workflow is making no progress but has the following unstarted job keys: ExpressionKey_TaskCallInputExpressionNode_Multisample_Variant_Calling.Mutect2_Call_Variants.task_disk:166:1 BackendJobDescriptorKey_CommandCallNode_Multisample_Variant_Calling.Mutect2_Call_Variants:154:1 ExpressionKey_TaskCallInputExpressionNode_Multisample_Variant_Calling.Mutect2_Call_Variants.task_disk:74:1 BackendJobDescriptorKey_CommandCallNode_Multisample_Variant_Calling.Mutect2_Call_Variants:191:1 ExpressionKey_TaskCallInputExpressionNode_Multisample_Variant_Calling.Mutect2_Call_Variants.task_disk:138:1 ExpressionKey_anon$1_Multisample_Variant_Calling.Mutect2_Passing_Calls_Index:-1:1 ExpressionKey_TaskCallInputExpressionNode_Multisample_Variant_Calling.Mutect2_Call_Variants.task_disk:146:1 ScatterCollectorKey_PortBasedGraphOutputNode_Mutect2_Call_Variants.Mutect2CallStats:-1:1
..."
Any suggestions on how to resolve this? Thanks so much!
Hi Arvind Ravi,
I've raised this with our engineers, and they have created a bug ticket to investigate further. However, it's possible that the errors were caused by an unusually high number of service restarts on Friday, so simply resubmitting your workflows might be enough to get them through this time.
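If the transient errors do recur, they can also be absorbed at the task level. A minimal sketch, assuming your Cromwell version supports the maxRetries runtime attribute (the task body and values here are illustrative):

version 1.0

task Mutect2_Call_Variants {
  input {
    File intervals
  }

  command <<<
    # Placeholder for the real Mutect2 invocation over ~{intervals}
    touch shard.vcf
  >>>

  output {
    File vcf = "shard.vcf"
  }

  runtime {
    docker: "broadinstitute/gatk:4.1.4.0"
    preemptible: 3  # up to 3 attempts on preemptible VMs before a full-price VM
    maxRetries: 2   # additionally retry transient backend failures, such as the PAPI 503 above
  }
}

maxRetries re-runs a failed attempt regardless of the cause, so a one-off backend error is retried rather than failing the call; note that each retry does consume additional compute.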
Best,
Samantha
I wanted to mention that I encountered the same issue last Saturday night / Sunday morning while running a workflow on GCS with my own instance of Cromwell. I have reported the problem here.