submission queued for an hour

Chet Birger March 02, 2020 20:13

Is there a current issue with Terra? A submission I made has been in the queued state for an hour.

Thanks,
Chet

Comments

Jason Cerrato March 10, 2020 15:32 (Official comment)

Hi all,

I've created a Known Issues post for this issue. All relevant updates will be posted there.

https://support.terra.bio/hc/en-us/community/posts/360058314871-Queued-submissions-and-inaccurate-job-status

Kind regards,
Jason

Chet Birger March 02, 2020 20:52

It began processing after sitting in the queue for an hour and 20 minutes. Was there a known issue that caused this delay?

Jason Cerrato March 03, 2020 15:13

Hi Chet,

Thanks for writing in about this. One of our engineers checked the logs, and it looks like there was a huge submission that you likely got stuck behind. Are you experiencing any queue issues at this time?

Kind regards,
Jason

Chet Birger March 03, 2020 15:22

Not experiencing any issues currently. Thank you for looking into this. Is there a way of monitoring the queue or reporting an estimated wait time in the queue?

Thanks,
Chet

Liudmila Elagina March 05, 2020 18:05

Today workflows are getting stuck for hours in the queue.

Liudmila Elagina March 05, 2020 18:41

[screenshot attachment]

Jason Cerrato March 05, 2020 18:45

Hi Luda,

Thank you for reporting these wait times. The FC wait time estimator is not always accurate (which is why it is no longer present in the Terra interface). Would you be willing to let us know whether the original submission you showed as having an estimated wait time of four hours actually did need four hours to start? Please provide the submission ID and workflow ID if so, and share the workspace with GROUP_FireCloud-Support@firecloud.org.

Many thanks,
Jason

Liudmila Elagina March 05, 2020 19:02

My workflows have so far been stuck for an hour, but I see in another workspace that workflows are stuck for 2 hours. Those are different workspaces/workflows/billing projects.
Liudmila Elagina March 05, 2020 19:16

My workspace is called broad-firecloud-ibmwatson/Wu_Richters_IBM. It is already shared with GROUP_FireCloud-Support@firecloud.org.

Thank you,
Luda

Chet Birger March 05, 2020 19:29

I am also again experiencing a workflow submission that is stuck waiting in the queue.

-Chet

Sarah Walker March 05, 2020 20:03

I also have this problem, and have been waiting 3 hours for my job to run.

Workspace: broadtagteam/TAG_735_CompareArrayWGSSites
Id: 6fd328de-3aab-4a3d-b2f3-5da0ee3514ab

Jason Cerrato March 05, 2020 20:06

Hi Chet and Luda,

It looks like there was a massive submission this morning that is the root cause of this queue. Our Cromwell team has been looking into why this happened and how we can make sure this type of submission doesn't cause issues for users going forward. They are also looking at refactoring our submission service, as well as adding more elastic scalability to Cromwell submissions. The queue backlog should be cleared out shortly (probably under 30 minutes).

Kind regards,
Jason

Liudmila Elagina March 05, 2020 21:06

Hello Jason,

Thank you for this update; however, my workflows are still stuck in the queue (~3+ hours). I will keep you posted on the progress.

Luda

Liudmila Elagina March 05, 2020 21:11

I am just worried that at 8:00 pm tonight I won't be able to submit any workflows. I have 4 hours to start my workflows, and the monitor still states a 3-hour wait; I really hope, as you said, it is not accurate.

Adam Nichols March 05, 2020 21:29

Hi Luda,

Terra developer here. We're able to confirm that we had a high volume of submissions today. The system does not provide specific guarantees around when new submissions will start, but we are looking into how to make this more fair and efficient so that a single large submission does not impact other users.

Best,
Adam

Liudmila Elagina March 05, 2020 22:00

Hello Adam,

Thank you for the update. I am not sure what is happening now, as I do not see many jobs running.
It states that only ~3K jobs are active, and yet I have to wait for 4 more hours. My jobs have already been sitting for 4 hours.

Adam Nichols March 05, 2020 22:58

Hello again Luda,

The system's global queued submission count has returned to zero, so I would expect that your submissions should be running. Please let me know if you see otherwise.

Best,
Adam

Liudmila Elagina March 05, 2020 23:00

It is running. Thank you.

Cora Ricker March 06, 2020 15:49

Though some of my jobs are running, they have been running very slowly, and some tasks have been queued for hours. Is there still a queue backlog?

Jason Cerrato March 06, 2020 17:25

Hi Cora,

Yesterday's queue issue was related to Rawls; in this case, the task is queued in Cromwell. One of our engineers has taken a look and has confirmed that this queueing is due to the large number of jobs submitted from your billing project. I hope this answers your question. If I can help clarify anything else, please let me know!

Kind regards,
Jason

Cora Ricker March 06, 2020 19:40

That answered my question. Thank you for your help!

Liudmila Elagina March 09, 2020 17:06

Hello Jason,

I am curious what counts as a large number of submissions from the same billing project ("queueing is due to the large number of jobs submitted from your billing project"). My workflow got stuck queued in Cromwell. I see in the monitor that there are 2910 active workflows. Given that most likely not all of those workflows are from the same billing project, what is the upper bound on the number of workflows from the same billing project?

Thank you,
Luda

Liudmila Elagina March 09, 2020 17:09

Also, I just checked our billing project and see no VMs running (I am the owner of the billing project broad-firecloud-ibmwatson). Could there be another reason for this stuck-in-Cromwell issue?

Liudmila Elagina March 09, 2020 17:19

And I just submitted a job from another billing project, and it is also stuck in Cromwell.
Adam Nichols March 09, 2020 17:21

Hi Luda,

While it may be frustrating that your jobs are taking longer than you're used to, it is normal for jobs to queue in a multi-user system that experiences variability in load. We do not have any evidence that jobs are getting "stuck" such that they never make progress.

Best,
Adam

Sarah Walker March 09, 2020 17:27

Hi,

My jobs were also queued for over an hour, but now the status says that they've been "running" for almost 2 hours. I cannot see the log files or anything, and I don't actually think they're running. Is there a way to ensure they are indeed running? I should be able to see the log files and gs directories. These issues of jobs queuing and things taking forever to start running are happening more and more frequently, and it is really frustrating, especially since I thought this system was supposed to be scalable.

Sarah

Liudmila Elagina March 09, 2020 17:28

Hello Adam,

I understand that these kinds of issues are bound to happen. What I am trying to understand is what causes this, as I do not see an enormous number of jobs currently running. Also, I do not think waiting 2+ hours for a job to start is the behavior we should expect from a multi-user system.

Thank you,
Luda

Liudmila Elagina March 09, 2020 18:41

Hello Adam,

My job that takes 15 minutes to run has been stuck in Cromwell for the past 3 hours. It is faster for me to spin up my own VM and run it.

Thank you,
Luda

Adam Nichols March 09, 2020 18:44

Hi all,

Your concerns are heard! We don't have any immediate remedies to offer, but it appears that the queue is once again on its way to resolving itself.

Hope this helps,
Adam

Sarah Walker March 09, 2020 18:54

Thanks for that update, but what about my job that says it's running while the directory is completely empty and it's not actually running?
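Several comments in this thread ask how to tell whether a queued submission is making progress, since the wait-time estimator was removed from the interface. One option is to poll the submission's status yourself, for example via the FireCloud API's submissions endpoint. The sketch below is a minimal, hypothetical polling loop, not part of Terra: `get_status` is a stand-in for whatever call retrieves the current status string, and the set of terminal state names is an assumption based on the statuses mentioned in this thread.

```python
import time

# Assumed terminal states; statuses such as Queued, Running, Succeeded,
# Failed, and Aborted are mentioned in this thread, but this set is not
# an authoritative list from Terra documentation.
TERMINAL_STATES = {"Succeeded", "Failed", "Aborted"}

def wait_for_submission(get_status, poll_seconds=60, max_polls=240):
    """Poll until the submission reaches a terminal state or we give up.

    get_status: a zero-argument callable returning the current status
    string (e.g. a wrapper around an API call for one submission ID).
    Returns the last status observed.
    """
    status = get_status()
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
        status = get_status()
    return status
```

With the defaults this checks once a minute for up to four hours, roughly the longest waits reported above; a real client would also handle API errors and back off between retries.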