In general, if you ever see your workflow fail with "PAPI error code 10. The assigned worker has failed to complete the operation", then one of two things happened:
1. Your machine actually failed.
2. The machine was preempted in such a way that it looks like machine failure.
This makes "error code 10" something of a catch-all error message. Because this error is not retried, more of you will tend to notice workflows failing with it.
So what's next? We are having discussions with the Google team on how to catch such a failure and retry on behalf of users, so that no extra action is needed on your part. Before doing so, we're starting to monitor how frequent such failures are and whether they're more prevalent in some projects, to understand the nature of this failure more thoroughly.
Meanwhile, we are sharing some mitigation strategies to reduce the failure rate. Would these strategies take away all failures? Definitely not. Sometimes your machine has truly crashed for reasons that have nothing to do with preemption. In such cases it's worth looking into your logs to see whether the job always crashes a certain number of minutes into the process, or always while copying inputs/outputs or pulling Docker images. Those patterns could help indicate why your tasks are failing with `error code 10`.
Workarounds for the short term:
1. Turn preemptibles off. This removes the possibility of silent preemption failures.
2. Use the `maxRetries` runtime attribute to retry tasks whose command fails with a non-zero return code, so that transient failures get more attempts without manual intervention.
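As a rough illustration, both workarounds can be expressed in a task's `runtime` section. The sketch below assumes a WDL 1.0 task run by Cromwell; the task name, input, command, and Docker image are hypothetical placeholders, and the retry count of 3 is an arbitrary choice rather than a recommended value.

```wdl
version 1.0

# Hypothetical task used only to show where the runtime attributes go.
task example_task {
  input {
    File input_file
  }

  command <<<
    wc -l ~{input_file} > line_count.txt
  >>>

  output {
    File line_count = "line_count.txt"
  }

  runtime {
    docker: "ubuntu:20.04"  # placeholder image (assumption)
    preemptible: 0          # workaround 1: do not use preemptible machines
    maxRetries: 3           # workaround 2: retry when the command exits non-zero
  }
}
```

Setting `preemptible` to 0 typically increases cost, since every attempt runs on a standard machine, so you may prefer to apply it only to the tasks that fail most often.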