The issue was found at 12:02 PM EDT on Monday, June 17, and affects users who ran workflows with call caching enabled between June 15 and June 21 at 2:00 AM EDT. See the Timeline section for the latest troubleshooting and resolution updates, and the Impact section to understand how this could affect your use of the system.
Timeline
June 21 - Issue resolved
June 20, 12:00 PM EDT - Release extended until 2:00 AM EDT
Today's rollout of the Cromwell call cache fix was expected to require 5 hours of downtime (4:00-9:00 PM EDT). Unfortunately, it is taking longer than expected, and the downtime is now expected to last until 2:00 AM EDT tomorrow.
June 20, 4:00-9:00 PM EDT - Release scheduled
June 18, 12:28 PM EDT - Bug fix rescheduled to be included in a release on June 20 from 4:00-9:00 PM EDT. During the release window, Terra will be accessible, but you will not be able to look at workflow details in Job Manager, running workflows will be paused, and new workflows will be queued. Workflows will resume again after the outage.
June 17, 9:15 PM EDT - Bug fix is ready for release on June 18. The expected outage time is around 4:00-9:00 PM EDT.
June 17, 1:18 PM EDT - Testing a bug fix
June 17, 12:02 PM EDT - Issue reported
June 15 - Issue starts
Impact
Call caching is not working as expected. Jobs that completed successfully with call caching enabled between June 15 and June 21 (the fix was released at 2:00 AM EDT on June 21) will rerun if relaunched. If you launch a submission between those dates, call caching will only pick up successful jobs run up until June 15. Here is a visual diagram that explains the impact on call caching during this incident.
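Separately from the diagram, the date rule described above can be sketched as a small check on a job's completion time. This is only an illustrative sketch, not an official Terra or Cromwell tool: the function names are hypothetical, the year is left as a parameter because this notice does not state it, and the window is assumed to begin at the start of June 15 in EDT.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# EDT is UTC-4; a fixed offset is enough for a window that lies entirely in June.
EDT = timezone(timedelta(hours=-4), "EDT")


def affected_window(year: int) -> Tuple[datetime, datetime]:
    """Return (start, end) of the affected window for the given year."""
    # Assumption: "June 15" is taken as the start of that day in EDT.
    start = datetime(year, 6, 15, 0, 0, tzinfo=EDT)
    # The fix was released at 2:00 AM EDT on June 21.
    end = datetime(year, 6, 21, 2, 0, tzinfo=EDT)
    return start, end


def completed_in_affected_window(completed_at: datetime, year: int) -> bool:
    """True if a job that finished at `completed_at` falls inside the window."""
    if completed_at.tzinfo is None:
        raise ValueError("completed_at must be timezone-aware")
    start, end = affected_window(year)
    return start <= completed_at < end
```

A job for which this check is true would not be picked up by call caching and would rerun if relaunched. A job that finished before June 15, or at 2:00 AM EDT on June 21 or later, falls outside the window.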
For more information
Please follow this article to get the most up to date information on this incident. If you would like to be notified of all service incidents or upcoming scheduled maintenance, click Follow on this page.
Hi Tiff & Terra Team,
As always, thanks for your hard work and for posting these updates.
One question regarding the most recent update: would you advise not submitting new workflows (if possible) until after the scheduled downtime tonight? Or does it not matter?
We have a large workflow waiting in the wings to be launched, but I’d rather wait until this issue is resolved and not add further stress to the system.
I would advise not launching a big submission with multiple jobs if you are using call caching, because if any jobs fail during this outage, you will have to rerun everything, costing you extra compute $.
We have now scheduled a fix to go out on June 20, so I'd wait until Friday.
Hope this helps,
Hi again Tiff,
Wanted to share a brief update / confirmation: we launched our large workflow this morning, and call caching appears to be working as expected.
Thanks to you & the team for getting this issue resolved! We appreciate it!
Thanks for confirming Ryan!! Our pleasure!