We've discovered issues related to aborting, submitting, and completing workflows in Terra.
- There may be delays updating the status of aborted workflows.
- Newly submitted workflows may take longer than normal to launch. These submitted workflows will not incur costs until they actually start running.
- Some workflow tasks appear to have finished per the log and outputs but do not show a Succeeded status. Instead, these continue to show as Running.
- Some workflow tasks remain in Submitted status for long periods of time without starting.
NOTE: If your submission or workflow is experiencing any of the symptoms above, please do not abort it. Aborting will put it in a hanging Aborting state.
NOTE: If your workflow task appears to have completed but is hanging in Running state, it should have stopped accruing costs at the time the log shows that the work is completed. Support is happy to verify this for you if needed.
See the Timeline section for the latest troubleshooting and resolution updates and the Impact section to understand how this could impact your use of the system.
Timeline
October 29, 2020 12:00 AM ET - Issue resolution - Our engineers have identified the root cause of the issue and have taken the appropriate resolution steps. We believe this issue is now resolved.
If you believe you are still experiencing one of the symptoms of this service incident listed above, please write to us at firstname.lastname@example.org with details and we will be happy to investigate.
October 28, 2020 4:45 PM ET - Log enhancements - Our engineers have built log enhancements to aid in identifying the root cause of the ongoing issue. These enhancements are now in effect.
October 27, 2020 4:30 PM ET - Issue update - We have discovered that Google is reporting issues with Google Cloud Storage. Our engineers are investigating whether this is a contributing factor for the ongoing issues.
October 27, 2020 10:30 AM ET - Issue update - The remediation appears to have resolved the states of jobs run before it was applied, but we are seeing reports that new workflow submissions (and older submissions with workflows that started after the remediation) are still affected by this issue. We recommend not aborting any jobs currently stuck in Running, as they may actually be complete despite their state. Please write to us with details for any jobs currently in an error state. The engineering team is continuing to investigate.
October 26, 2020 4:30 PM ET - Issue remediation - Our Cromwell engineers applied a remediation that appears to be resolving the hanging states users were previously experiencing. If you are still seeing this issue, please write to email@example.com with details.
October 26, 2020 11:30 AM ET - Issue discovered - Our Cromwell engineers identified issues related to delays in status updates and submission launches after reviewing the Terra internal metrics.
Impact
Users who have aborted workflows may experience delays in seeing the workflow fully aborted.
Users who have submitted jobs may experience longer-than-usual wait times to see the workflows launched.
Users may see workflow tasks not truly starting for a long time (no log generation or indication that work is being done). Others may see tasks that should be completed appear as still running, even though no work is being done.
For more information
Please follow this article to get the most up-to-date information on this incident. If you would like to be notified of all service incidents or upcoming scheduled maintenance, click Follow on this page.
Hello - thanks for investigating this issue. My jobs started on 10/22 are still stuck at the same step of the workflow. I've posted about this here: https://support.terra.bio/hc/en-us/community/posts/360074245312-Job-stuck-at-aborting-and-possibly-affecting-call-caching.
Do you know when the underlying issue will be addressed?
Hi Riaz Gillani,
Thanks for your patience here. Our engineers are working as quickly as they can to resolve this issue! I will take a look at this submission right away to let you know if this is an example of the current service incident. I will respond in the original thread with my findings.