Submission queued for an hour

Post author
Chet Birger

Is there a current issue with Terra?  A submission I made has been in the queued state for an hour.


Thanks,  Chet

Comments

34 comments

  • Comment author
    Jason Cerrato
    • Official comment

    Hi all,

    I've created a Known Issues post for this issue. All relevant updates will be posted there.

    https://support.terra.bio/hc/en-us/community/posts/360058314871-Queued-submissions-and-inaccurate-job-status

    Kind regards,

    Jason

  • Comment author
    Chet Birger

    It began processing after sitting in queue for an hour and 20 minutes.  Was there a known issue that caused this delay?

  • Comment author
    Jason Cerrato

    Hi Chet,

    Thanks for writing in about this. One of our engineers checked the logs and it looks like there was a huge submission that you likely got stuck behind. Are you experiencing any queue issues at this time?

    Kind regards,

    Jason

  • Comment author
    Chet Birger

    Not experiencing any issues currently. Thank you for looking into this. Is there a way of monitoring the queue or reporting an estimated wait time in queue?

    thanks,

    Chet

  • Comment author
    Liudmila Elagina

    Today, workflows are getting stuck in the queue for hours.

  • Comment author
    Liudmila Elagina

  • Comment author
    Jason Cerrato

    Hi Luda,

    Thank you for reporting these wait times. The FC wait time estimator is not always accurate (which is why it is no longer present in the Terra interface)—would you be willing to let us know if the original submission you showed as having an estimated wait time of four hours actually did need four hours to start? Please provide the submission ID & workflow ID if so, and share the workspace with GROUP_FireCloud-Support@firecloud.org.

    Many thanks,

    Jason

  • Comment author
    Liudmila Elagina

    My workflows have been stuck for an hour so far, but I see that workflows in another workspace have been stuck for 2 hours. Those are different workspaces, workflows, and billing projects.

  • Comment author
    Liudmila Elagina

    My workspace is called broad-firecloud-ibmwatson/Wu_Richters_IBM. It is already shared with GROUP_FireCloud-Support@firecloud.org.

    Thank you,

    Luda

  • Comment author
    Chet Birger

    I am once again seeing a workflow submission that is stuck waiting in the queue.

    -Chet

  • Comment author
    Sarah Walker

    I also have this problem, and have been waiting 3 hours for my job to run. Workspace: broadtagteam/TAG_735_CompareArrayWGSSites Id: 6fd328de-3aab-4a3d-b2f3-5da0ee3514ab

  • Comment author
    Jason Cerrato

    Hi Chet and Luda,

    It looks like there was a massive submission this morning that's the root cause of this queue.

    Our Cromwell team has been looking into why this happened and how we can make sure this type of submission doesn't cause issues for users going forward. They are also looking at refactoring our submission service, as well as adding more elastic scalability to Cromwell submissions.

    The queue backlog should be cleared out shortly (probably <30 minutes).

    Kind regards,

    Jason
  • Comment author
    Liudmila Elagina

    Hello Jason,

    Thank you for this update; however, my workflows are still stuck in the queue (3+ hours so far). I will keep you posted on the progress.

    Luda

  • Comment author
    Liudmila Elagina

    I am just worried that at 8:00 pm tonight I won't be able to submit any workflows. So I have 4 hours to start my workflows, and the monitor still shows a 3-hour wait. I really hope that, as you said, it is not accurate.

  • Comment author
    Adam Nichols

    Hi Luda,

    Terra developer here. We're able to confirm that we had a high volume of submissions today. The system does not provide specific guarantees around when new submissions will start, but we are looking into how to make this more fair and efficient so that a single large submission does not impact other users.

    Best,

    Adam

  • Comment author
    Liudmila Elagina

    Hello Adam,

    Thank you for the update. I am not sure what is happening now, as I do not see many jobs running. The monitor states that only ~3K jobs are active, and yet I have to wait for 4 more hours. My jobs have already been sitting for 4 hours.

  • Comment author
    Adam Nichols

    Hello again Luda,

    The system's global queued submission count has returned to zero, so I would expect that your submissions should be running. Please let me know if you see otherwise.

    Best,

    Adam

  • Comment author
    Liudmila Elagina

    It is running. Thank you.

  • Comment author
    Cora Ricker

    Though some of my jobs are running, they have been running very slowly and some tasks have been queued for hours. Is there still a queue backlog?

  • Comment author
    Jason Cerrato

    Hi Cora,

    Yesterday's queue issue was related to Rawls. In this case, the task is queued in Cromwell. One of our engineers has taken a look and has confirmed that this queueing is due to the large number of jobs submitted from your billing project.

    I hope this answers your question. If I can help clarify anything else please let me know!

    Kind regards,

    Jason

  • Comment author
    Cora Ricker

    That answered my question. Thank you for your help! 

  • Comment author
    Liudmila Elagina

    Hello Jason,

    I am curious what counts as a large number of submissions from the same billing project (you wrote that "queueing is due to the large number of jobs submitted from your billing project").

    My workflow is stuck in the queued state in Cromwell. I see in the monitor that there are 2910 active workflows. Given that most likely not all of those workflows are from the same billing project, what is the upper bound on the number of workflows from the same billing project?

    Thank you,

    Luda

  • Comment author
    Liudmila Elagina

    Also, I just checked our billing project and see no VMs running (I am the owner of the billing project broad-firecloud-ibmwatson). Could there be another reason for this stuck-in-Cromwell issue?

  • Comment author
    Liudmila Elagina

    And I just submitted the job from another billing project, and it is also stuck in Cromwell.

  • Comment author
    Adam Nichols

    Hi Luda,

    While it may be frustrating that your jobs are taking longer than you're used to, it is normal for jobs to queue in a multi-user system that experiences variability in load.

    We do not have any evidence that jobs are getting "stuck" such that they never make progress.

    Best,

    Adam

  • Comment author
    Sarah Walker

    Hi,

    My jobs were also queued for over an hour. Now the status says that they've been "running" for almost 2 hours, but I cannot see the log files or anything, and I don't think they're actually running. Is there a way to ensure they are indeed running? I should be able to see the log files and gs directories.

    These issues of jobs queuing and things taking forever to start running are happening more and more frequently, and it is really frustrating, especially since I thought this system was supposed to be scalable.

    Sarah

  • Comment author
    Liudmila Elagina

    Hello Adam,

    I understand that these kinds of issues are bound to happen. What I am trying to understand is what causes this, as I do not see an enormous number of jobs currently running. Plus, I do not think waiting 2+ hours for a job to start is the behavior we should expect from a multi-user system.

    Thank you,

    Luda

  • Comment author
    Liudmila Elagina

    Hello Adam,

    So my job that takes 15 minutes to run has been stuck in Cromwell for the past 3 hours. It is faster for me to spin up my own VM and run it.

    Thank you,

    Luda

  • Comment author
    Adam Nichols

    Hi all,

    Your concerns are heard! We don't have any immediate remedies to offer, but it appears that the queue is once again on its way to resolving itself.

    Hope this helps,

    Adam

  • Comment author
    Sarah Walker

    Thanks for that update, but what about my job that says it's running, even though the directory is completely empty and it's not actually running?
