Service Incident - May 6, 2020

Jason Cerrato

Summary

On the morning of May 6, 2020, we reported to Google a massive spike in jobs failing with PAPI error code 2. Google has concluded that the error is on their end and is working to release a fix within the week.

See the Timeline section for the latest troubleshooting and resolution updates and the Impact section to understand how this could impact your use of the system. 

Timeline 

May 7, 2020 9:33 AM ET - Issue remediation - Google has reported that the fix for this issue has finished rolling out. The error should no longer occur for new pipelines, but pipelines that were already in flight might still encounter it.

May 6, 2020 3:37 PM ET - Root cause - Google has reported that the affected jobs are failing due to preemption. These jobs should return error code 10 or 14 but are incorrectly returning error code 2. See the workaround in the Impact section.

May 6, 2020 1:41 PM ET - Issue identification - Google has informed us that this is an issue on their end and that they are working on a fix.

May 6, 2020 11:03 AM ET - Issue discovered and reported - Our engineers noticed a sharp increase in the number of PAPI error code 2 failures over the preceding 24 hours and reported it to Google.

Impact

Users may see their jobs failing with PAPI error code 2. 

Workaround: Add the maxRetries option to your task's runtime attributes so that jobs are retried after a transient failure. More info here: https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/#maxretries

Alternatively, users can run their jobs on non-preemptible machines, which are not subject to preemption.
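
As a minimal sketch, both options might look like this in a WDL task's runtime block, assuming Cromwell's maxRetries and preemptible runtime attributes. The task name, command, Docker image, and retry counts are hypothetical placeholders, not values recommended by this notice.

```wdl
version 1.0

# Hypothetical task, for illustration only.
task example_task {
  command <<<
    echo "hello"
  >>>

  runtime {
    docker: "ubuntu:20.04"  # placeholder image (assumption)
    maxRetries: 3           # retry a failed job up to 3 times (assumed count)
    preemptible: 3          # preemptible attempts before falling back to a non-preemptible machine
  }
}
```

Setting preemptible to 0 would correspond to the non-preemptible alternative. maxRetries can be kept alongside either setting.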

For more information

Please follow this article to get the most up-to-date information on this incident. If you would like to be notified of all service incidents or upcoming scheduled maintenance, click Follow on this page.

There is also a Known Issues post for this issue. Please follow the Known Issues board to receive email updates about new posts.
