Indefinite workflow stalls
Workflows are stalling randomly and indefinitely for subsets of sample submissions. ~2.5% of samples may be stalling, and the rate appears to be higher in some workspaces than in others within the same project. We have not been able to identify a definitive link to any particular workspace metadata. Rerunning samples, sometimes more than twice, resolves the issue. However, the stalls are affecting automated workflows and high-priority clinical sample runs, which is significantly impeding sample processing. No errors are being reported, and run cost is not accruing even when stalls last more than 24 hours. Sometimes samples proceed past the stalled steps, though runtime is ~10-100x longer than anticipated for samples of similar data size.
Comments
Hi Zachary,
I wanted to let you know about a new community post providing a few more details about the recent slowdowns and the improvements we have in the works: https://support.terra.bio/hc/en-us/community/posts/50288990075931-Workflow-delays-and-upcoming-improvements
We'll be posting updates there as things roll out. Feel free to share it with colleagues, Follow the thread for email notifications, and drop a comment if you have any follow-up questions.
Hey Zachary,
Thank you for the thorough description of your situation. I'm glad to hear that run cost is not being accrued inappropriately, but I can see how frustrating it is to see these stalls.
We do see that there's been heavy load on Terra's workflow system in recent weeks, which is the likely cause of these stalls. During periods of high utilization, and/or when very large jobs are submitted, we're seeing higher rates of workflows sitting in the queue.
The engineering team is working on some adjustments to the workflow system to try to help improve queue times and general performance. I expect I'll hear more details about that soon, and I'll share what I learn as soon as I can.