Workflow failure: "Workflow is making no progress but has the following unstarted job keys"
Hi - I am getting the following error in a large scatter job submission in Terra:
"Workflow is making no progress but has the following unstarted job keys"
The workflow is a version of the Mutect2 scatter task with 200 scatter jobs, most of which appear to have been successful (no failures; some preemptions, with retries ongoing).
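For context, here is a minimal WDL sketch of the scatter structure involved (the inputs, task body, and disk-sizing expression are illustrative placeholders, not the production workflow):

version 1.0

workflow Multisample_Variant_Calling {
  input {
    Array[File] interval_shards  # one interval shard per scatter job; 200 in this submission
  }

  scatter (shard in interval_shards) {
    # Per-shard input expression, evaluated by Cromwell before each call starts
    Int task_disk = ceil(size(shard, "GB")) + 20

    call Mutect2_Call_Variants {
      input:
        intervals = shard,
        task_disk = task_disk
    }
  }

  output {
    Array[File] shard_vcfs = Mutect2_Call_Variants.vcf
  }
}

task Mutect2_Call_Variants {
  input {
    File intervals
    Int task_disk
  }

  command <<<
    # Placeholder for the real `gatk Mutect2 ... -L ~{intervals}` invocation
    touch shard.vcf
  >>>

  output {
    File vcf = "shard.vcf"
  }

  runtime {
    docker: "broadinstitute/gatk:4.1.4.0"
    disks: "local-disk ~{task_disk} HDD"
    preemptible: 3  # preempted shards are retried automatically, as described above
  }
}

Each shard runs as an independent job, which is why individual shards can be preempted and retried without the whole workflow failing.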
Comments
Hi Arvind Ravi,
Thanks for writing in. We'll be happy to take a closer look at this!
Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org? You'll find the Share option in the three-dots menu at the top-right of your workspace:
1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press Enter on your keyboard.
2. Click Save.
Let us know the workspace name, as well as the relevant submission and workflow IDs, and we'll take a closer look as soon as we can.
Kind regards,
Jason
That sounds great, Jason - I suspect other users may encounter a similar issue at some point in the future!
Workspace: broad-firecloud-ibmwatson/Getz_IBM_Ravi_SeqOnly_WGS_copy_3-3-2020
Submission ID: df522a01-9492-46a1-b5ab-0bf66e7a031c
Look forward to your thoughts.
Best,
Arvind
Hi Arvind,
Thank you for your patience. We've confirmed that this error was due to a recently discovered bug, which was patched on Friday. We've also confirmed that all of the submission's jobs are indeed Done. We don't expect future runs of this job to run into the same issue, for you or anyone else on the platform.
Many thanks for flagging this up, and let us know if we can be of any further assistance!
Kind regards,
Jason
Fantastic - thanks so much!
Hi Jason,
I just wanted to report I'm having the same issue in a new run.
Workspace (already shared):
Submission IDs:
4b690b28-e3dc-42ee-909e-f9780a4a995a
6bee5985-ea3d-428c-84c2-39a0f960b9fe
The first submission with 300 scatters had the following error:
Call 302 failure message:
"Task Multisample_Variant_Calling.Mutect2_Call_Variants:191:3 failed. The job was stopped before the command finished. PAPI error code 2. Execution failed: selecting resources: querying available resources: getting project quota: querying machine types: listing machine types: googleapi: Error 503: Internal error. Please try again or contact Google Support. (Code: '5B19C1131CB9C.A302243.420C9033'), backendError"
The second submission had 200 scatters and the following error:
Workflow level failure message (first few lines):
"Workflow is making no progress but has the following unstarted job keys: ExpressionKey_TaskCallInputExpressionNode_Multisample_Variant_Calling.Mutect2_Call_Variants.task_disk:166:1 BackendJobDescriptorKey_CommandCallNode_Multisample_Variant_Calling.Mutect2_Call_Variants:154:1 ExpressionKey_TaskCallInputExpressionNode_Multisample_Variant_Calling.Mutect2_Call_Variants.task_disk:74:1 BackendJobDescriptorKey_CommandCallNode_Multisample_Variant_Calling.Mutect2_Call_Variants:191:1 ExpressionKey_TaskCallInputExpressionNode_Multisample_Variant_Calling.Mutect2_Call_Variants.task_disk:138:1 ExpressionKey_anon$1_Multisample_Variant_Calling.Mutect2_Passing_Calls_Index:-1:1 ExpressionKey_TaskCallInputExpressionNode_Multisample_Variant_Calling.Mutect2_Call_Variants.task_disk:146:1 ScatterCollectorKey_PortBasedGraphOutputNode_Mutect2_Call_Variants.Mutect2CallStats:-1:1
..."
Any suggestions on how to resolve this? Thanks so much!
Hi Arvind Ravi,
I've raised this with our engineers, and they have created a bug ticket to investigate further. However, it's possible that the errors were caused by an unusually high number of service restarts on Friday, so simply resubmitting your workflows might be enough to get them through this time.
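If the transient errors do recur, they can also be absorbed at the task level. A minimal sketch, assuming your Cromwell version supports the maxRetries runtime attribute (the task body and values here are illustrative):

version 1.0

task Mutect2_Call_Variants {
  input {
    File intervals
  }

  command <<<
    # Placeholder for the real Mutect2 invocation over ~{intervals}
    touch shard.vcf
  >>>

  output {
    File vcf = "shard.vcf"
  }

  runtime {
    docker: "broadinstitute/gatk:4.1.4.0"
    preemptible: 3  # up to 3 attempts on preemptible VMs before a full-price VM
    maxRetries: 2   # additionally retry transient backend failures, such as the PAPI 503 above
  }
}

maxRetries re-runs a failed attempt regardless of the cause, so a one-off backend error is retried rather than failing the call; note that each retry does consume additional compute.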
Best,
Samantha
I wanted to mention that I encountered the same issue last Saturday night / Sunday morning while running a workflow on GCS with my own instance of Cromwell. I have reported the problem here.