Job stuck at aborting and possibly affecting call-caching

October 26, 2020 12:19
12 comments

Hi,

I submitted a job (d730c1c3-4d60-44a6-9ba2-4f0a626f5070) on 10/23 and it failed at the 15th task.
I fixed the problem in the method and re-submitted the job with call caching, along with two other workflows, but when I checked Job Manager several hours later, I found no "List View" tab.
When I looked into the Google bucket of the submission, there was no folder for the workflow ID under
gs://fc-1010661f-0156-4d73-a52f-27c25da77db7/d2bbea9a-1b3e-4f39-9c4f-c6a76b4c0966/BamRealigner/
I aborted the job and submitted the same job again (f5e9b09d-3f1e-470a-8483-7982d620d2c5), but
instead of picking up after the 14 tasks that had run successfully before, it started from the beginning. In addition, the first 5 tasks ran successfully, but then the workflow got stuck and did not advance.

Then I realized that all the jobs I had aborted in the past were still displayed as "Aborting", which made me suspect that the stuck aborts are somehow affecting call caching. Is there a way to completely abort the jobs below?

Comments

Brendan Reardon
- October 26, 2020 13:19
I've also been observing very long queue times, and simple operations such as aborting and starting jobs are taking much longer than expected. Has Cromwell been having issues lately? Any transparency in this regard would be greatly appreciated.

Jason Cerrato
- October 26, 2020 14:13
Hi Seunghun,

We'll be happy to take a closer look at what's going on here.

Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.

1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press enter on your keyboard.
2. Click Save.

Let us know the workspace name or share a link. We’ll be happy to take a closer look as soon as we can.

Brendan, at present there are no known issues with Cromwell, but if you have an example of this type of behavior, please share that workspace along with the submission and workflow ID(s) and we'll investigate those as well. I will let you all know if we identify a broader issue.

Kind regards,

Jason

Seunghun Han
- October 26, 2020 17:11
Hi Jason,

I just shared the workspace named "Undiagnosed_Familial_Cancers_DFCI_WGS_SAUD_2018 copy" with the given email.

I noticed another odd issue with the jobs I submitted.
If you look at the submission with ID d4396587-a6c9-4706-8faa-fd11e9e7dfbf, you will find three jobs. Of the three, only one (d7aa2e0e-3e82-4847-bee7-57b20c5155f0) seems to be running properly; another (a855f27d-a2e8-4fca-ba7d-d1de6edaaaf0) has been stuck for several hours at the very first task, which usually takes about 5 minutes; and the last one (5d500656-6a27-408e-b073-305110587139) is not displaying the List View.

I suspected this had something to do with the jobs stuck at aborting, so I kicked off another job with the same method on 3 different samples on which I had never run the method (submission ID dac9eddd-b488-4729-ba3f-5b9a38953cb7). If you look at that job, you will find the same pattern: one running, another stuck at the first task, and the last one not running at all.

After seeing this, I launched three separate jobs with another method, and weirdly enough, these three are showing the same pattern I described above.

It would be great if you could take a look at this.

Best,

Seunghun

Jason Cerrato
- October 26, 2020 18:22
Hi Seunghun,

Thanks for those details. This appears to be an example of the current in-progress service incident, which our engineers are actively investigating. I will post all relevant updates to that page; please Follow the page to have the latest updates sent directly to you by email. I will also update you on this thread once the issue is resolved.

Kind regards,

Jason

Seunghun Han
- October 27, 2020 06:23
Hi Jason,

I just checked my job history and it looks like the jobs were properly aborted. Also, for the jobs I recently resubmitted, Job Manager is now showing the progress correctly.

Thank you for your help

-Seunghun

Jason Cerrato
- October 27, 2020 13:12
Hi Seunghun,

Happy to see that the remediation step put in place by the engineers was successful in resolving the state of your workflows. If you run into this problem again, please let us know and I will pass those details on to our engineers.

Kind regards,

Jason

Seunghun Han
- October 27, 2020 16:41
Hi Jason,

I thought all my jobs were running properly after the remediation, but I just noticed that some jobs are still hanging for several hours at various steps.

Based on the Terra announcement, it looks like you are still investigating this, but I wanted to follow up to prevent any misunderstanding arising from my previous post.

-Seunghun

Sophia Kamran
- October 27, 2020 16:59
Hello,

I am also having difficulty running jobs. They are hanging for hours, and one job may not have started as there is no evidence that anything is happening.

It sounds like you are already aware, but I just wanted to note that this is affecting my jobs too.

Thanks!

-Sophia

Jason Cerrato
- October 27, 2020 18:31
Hi both,

Thank you for the updates. We are still investigating the issue. If either of you comes across any workflow behaviors or states that seem different from those described in the article, please let us know.

Many thanks for keeping us in the loop!

Kind regards,
Jason

Riaz Gillani
- October 27, 2020 21:55
Hi Jason, I too am seeing this issue: a workflow with several tasks failed at the 4th task and then restarted. The workspace is called "St_Jude_WGS_sample_download" under the "vanallen-firecloud-nih" namespace. The submission ID is "e092f46e-083b-421c-bf6a-127b1dc131fb". I've shared it with GROUP_FireCloud-Support@firecloud.org as you suggested above.

I imagine this is part of the same issue that you've reported here: https://support.terra.bio/hc/en-us/articles/360051597071-Service-Incident-October-26-2020-Cromwell-

Do you recommend killing the current jobs and restarting once this is resolved?

Thanks,

Riaz

Jason Cerrato
- October 29, 2020 13:17
Hi Riaz,

Thank you for writing in. I'll take a look at this submission to confirm it's an example of the current service incident and let you know. For now, please do not abort the current jobs.

Kind regards,

Jason

Jason Cerrato
- October 29, 2020 15:34
Hi everyone,

Our engineers have identified the root cause of the workflow issues and have taken the appropriate resolution steps. We believe this issue is now resolved. Your workflow submissions should now resolve normally, if they haven't already.

Thank you so much for your patience and for flagging up your workflow(s) in error. This data was extremely valuable for our engineers in investigation of the underlying cause. If you believe you are still experiencing the issues associated with the service incident, please let us know and we'll be happy to take a look as soon as we can.

Thanks again, and if there's anything else we can assist with please don't hesitate to let us know.

Kind regards,

Jason
