Service Incident - June 17, 2019

Tiffany Miller

Summary

The issue was found at 12:02 PM EDT on Monday, June 17. It affects users who ran workflows with call caching enabled between June 15 and June 21 at 2:00 AM EDT, when the fix was released. See the Timeline section for troubleshooting and resolution updates, and the Impact section to understand how this could affect your use of the system.

Timeline

June 21 - Issue resolved

June 20, 12:00 PM EDT - Release extended until 2:00 AM EDT

Today's rollout of the Cromwell call cache fix was expected to require 5 hours of downtime (4:00 PM to 9:00 PM EDT). Unfortunately, it is taking longer than expected, and the downtime is now expected to last until 2:00 AM tomorrow morning.

June 20, 4:00-9:00 PM EDT - Release scheduled

June 18, 12:28 PM EDT - Bug fix rescheduled to be included in a release on June 20 from 4:00-9:00 PM EDT. During the release window, Terra will be accessible, but you will not be able to look at workflow details in Job Manager, running workflows will be paused, and new workflows will be queued. Workflows will resume again after the outage.

June 17, 09:15 PM EDT - Bug fix is ready for release on June 18. The expected outage time is around 4:00-9:00 PM EDT.

June 17, 01:18 PM EDT - Testing a bug fix

June 17, 12:02 PM EDT - Issue reported

June 15 - Issue starts

Impact

Call caching is not working as expected. Jobs that completed successfully with call caching enabled between June 15 and June 21 (fix released at 2:00 AM EDT) will rerun if relaunched. If you launched a submission between those dates, call caching will only pick up successful jobs run up until June 15. The diagram below illustrates the impact on call caching during this incident.

PROD-158_diagram_1_.png
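As a rough illustration of the window described above, the following minimal Python sketch checks whether a successful job's completion time falls inside the affected period. The function name and constants are hypothetical and are not part of Terra or Cromwell. The article only gives June 15 as the start date, so the start of that day (EDT) is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-4 offset for EDT, so the sketch does not depend on a timezone database.
EDT = timezone(timedelta(hours=-4), "EDT")

# Assumption: the article only says the issue started on June 15,
# so midnight EDT on that day is used as the start of the window.
WINDOW_START = datetime(2019, 6, 15, 0, 0, tzinfo=EDT)
# The fix was released on June 21 at 2:00 AM EDT.
WINDOW_END = datetime(2019, 6, 21, 2, 0, tzinfo=EDT)


def completed_in_affected_window(completed_at: datetime) -> bool:
    """Return True if a successful job finished inside the incident window.

    Per the Impact section, such jobs will rerun if relaunched rather than
    being picked up by call caching. This is a hypothetical helper.
    """
    return WINDOW_START <= completed_at < WINDOW_END


if __name__ == "__main__":
    for finished in (
        datetime(2019, 6, 14, 12, 0, tzinfo=EDT),
        datetime(2019, 6, 18, 9, 30, tzinfo=EDT),
        datetime(2019, 6, 21, 3, 0, tzinfo=EDT),
    ):
        print(finished.isoformat(), completed_in_affected_window(finished))
```

A job for which this returns True should be expected to rerun if relaunched, even though call caching was enabled when it first ran.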

For more information

Please follow this article to get the most up-to-date information on this incident. If you would like to be notified of all service incidents or upcoming scheduled maintenance, click Follow on this page.


Comments


  • Comment author
    Ryan Collins

    Hi Tiff & Terra Team,

    As always, thanks for your hard work and for posting these updates.

    One question regarding the most recent update: would you advise not submitting new workflows (if possible) until after the scheduled downtime tonight? Or does it not matter?

    We have a large workflow waiting in the wings to be launched, but I’d rather wait until this issue is resolved and not add further stress to the system.

    Thanks,
    Ryan

  • Comment author
    Tiffany Miller

    Hi Ryan, 

    I would advise against launching a big submission with multiple jobs if you are using call caching: if any jobs fail during this outage, you will have to rerun everything, costing you extra compute $.

    We have now scheduled a fix to go out on June 20, so I'd wait until Friday. 

    Hope this helps,

    Tiff

  • Comment author
    RLCollins

    Hi again Tiff,

    Wanted to share a brief update / confirmation: we launched our large workflow this morning, and call caching appears to be working as expected.

    Thanks to you & the team for getting this issue resolved! We appreciate it!

    - Ryan

  • Comment author
    Tiffany Miller

    Thanks for confirming Ryan!! Our pleasure!
