Service Incident - June 17, 2019

Summary

The issue was found at 12:02 PM EDT on Monday, June 17. It affects users who ran workflows with call caching enabled between June 15 and June 21 at 2:00 AM EDT, when the fix was released. See the Timeline section for the latest troubleshooting and resolution updates, and the Impact section to understand how this could affect your use of the system.

Timeline

June 21 - Issue resolved

June 20, 12:00 PM EDT - Release extended until 2:00 AM EDT

The rollout of the Cromwell call cache fix today was expected to require 5 hours of downtime (4:00-9:00 PM EDT). Unfortunately, it is taking longer than expected, and the downtime is now expected to last until 2:00 AM EDT tomorrow morning.

June 20, 4:00-9:00 PM EDT - Release scheduled

June 18, 12:28 PM EDT - Bug fix rescheduled to be included in a release on June 20 from 4:00-9:00 PM EDT. During the release window, Terra will be accessible, but you will not be able to view workflow details in Job Manager, running workflows will be paused, and new workflows will be queued. Workflows will resume after the outage.

June 17, 09:15 PM EDT - Bug fix is ready for release on June 18. The expected outage time is around 4:00-9:00 PM EDT.

June 17, 01:18 PM EDT - Testing a bug fix

June 17, 12:02 PM EDT - Issue reported

June 15 - Issue starts

Impact

Call caching is not working as expected. Jobs that completed successfully with call caching enabled between June 15 and June 21 (the fix was released at 2:00 AM EDT on June 21) will rerun if relaunched. If you launch a submission between those dates, call caching will only pick up successful jobs run up to June 15. Here is a visual diagram that explains the impact on call caching during this incident.
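
As a rough illustration of the rule above, the following Python sketch checks whether a job that completed successfully could be picked up by call caching on a relaunch. It is not part of Cromwell or Terra: the helper name can_be_reused_by_call_caching is hypothetical, the midnight EDT start time is an assumption (this article gives only the date June 15), and the fixed UTC-4 offset stands in for EDT.

```python
from datetime import datetime, timedelta, timezone

# EDT is UTC-4; a fixed offset is sufficient for this June 2019 window.
EDT = timezone(timedelta(hours=-4), "EDT")

# Assumption: the article gives only the date, so midnight EDT is a guess.
INCIDENT_START = datetime(2019, 6, 15, 0, 0, tzinfo=EDT)
# From the article: the fix was released at 2:00 AM EDT on June 21.
FIX_RELEASED = datetime(2019, 6, 21, 2, 0, tzinfo=EDT)


def can_be_reused_by_call_caching(completed_at: datetime) -> bool:
    """Hypothetical helper: False if the job completed during the incident window.

    Jobs that completed inside the window rerun when relaunched; jobs that
    completed before it, or after the fix, can be picked up as usual.
    """
    if completed_at.tzinfo is None:
        raise ValueError("completed_at must be timezone-aware")
    return not (INCIDENT_START <= completed_at < FIX_RELEASED)


if __name__ == "__main__":
    examples = [
        datetime(2019, 6, 14, 18, 0, tzinfo=EDT),  # before the incident
        datetime(2019, 6, 18, 9, 30, tzinfo=EDT),  # during the incident
        datetime(2019, 6, 21, 8, 0, tzinfo=EDT),   # after the fix
    ]
    for ts in examples:
        print(ts.isoformat(), can_be_reused_by_call_caching(ts))
```

Comparing a job's completion time against these two boundaries gives a quick estimate of how much of a relaunched submission is likely to rerun rather than hit the cache.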

For more information

Please follow this article to get the most up-to-date information on this incident. If you would like to be notified of all service incidents or upcoming scheduled maintenance, click Follow on this page.

Comments

4 comments

Ryan Collins
- June 18, 2019 11:40
Hi Tiff & Terra Team,

As always, thanks for your hard work and for posting these updates.

One question regarding the most recent update: would you advise not submitting new workflows (if possible) until after the scheduled downtime tonight? Or does it not matter?

We have a large workflow waiting in the wings to be launched, but I’d rather wait until this issue is resolved and not add further stress to the system.

Thanks,
Ryan

Tiffany Miller
- June 18, 2019 16:43
Hi Ryan,

I would advise not launching a big submission with multiple jobs if you are using call caching because if any fail during this outage, you have to rerun everything costing you extra compute $.

We have now scheduled a fix to go out on June 20, so I'd wait until Friday.

Hope this helps,

Tiff

RLCollins
- June 21, 2019 12:25
Hi again Tiff,

Wanted to share a brief update / confirmation: we launched our large workflow this morning, and call caching appears to be working as expected.

Thanks to you & the team for getting this issue resolved! We appreciate it!

- Ryan

Tiffany Miller
- June 21, 2019 18:25
Thanks for confirming Ryan!! Our pleasure!
