The Error
PAPI error code 10. The assigned worker has failed to complete the operation.
What it means
In general, if your workflow fails with "PAPI error code 10. The assigned worker has failed to complete the operation", one of two things happened:
- Your machine actually failed.
- The machine was preempted in such a way that it looks like machine failure.
This makes error code 10 something of a catch-all error message. Because this error is not retried automatically, you are more likely to notice workflows failing with it.
Workarounds
- You can turn preemptibles off, which removes the possibility of silent preemption failures.
- Use the maxRetries attribute to retry tasks whose command fails with a non-zero return code, so that transient failures get more attempts without manual intervention.
- PAPI error code 10 can also occur because of insufficient memory or disk space. If the problem does not appear to be related to preemptibles or retries, we recommend increasing the allocated memory to see whether the task gets further or the issue is resolved. If increasing the memory does not resolve the issue, try increasing the disk size.
- Insufficient disk space is a common cause of this error when the allocated disk is too small to localize the BAMs that your workflow processes.
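As a rough illustration of how these workarounds might look together, here is a minimal WDL sketch of a task's runtime section. The task name, command, Docker image, and all numeric values are hypothetical placeholders, not recommendations. The attribute names (preemptible, maxRetries, memory, disks) are the Cromwell runtime attributes for the Google backend.

```wdl
version 1.0

# Hypothetical task: the name, command, image, and sizes are placeholders.
task process_bam {
  input {
    File input_bam
    Int memory_gb = 16        # raise this first if you suspect an out-of-memory failure
    Int extra_disk_gb = 20    # headroom on top of the localized input
  }

  # Size the disk from the input so that localizing the BAM has enough room.
  Int disk_gb = ceil(size(input_bam, "GB")) + extra_disk_gb

  command <<<
    samtools flagstat ~{input_bam} > flagstat.txt
  >>>

  output {
    File stats = "flagstat.txt"
  }

  runtime {
    docker: "example/samtools:latest"       # placeholder image name
    memory: "~{memory_gb} GB"
    disks: "local-disk ~{disk_gb} HDD"
    preemptible: 0                          # 0 attempts on preemptible machines
    maxRetries: 3                           # retry a failed task up to 3 times
  }
}
```

You would normally apply only the settings that match your suspected cause. For example, keep preemptibles on and add maxRetries if the failures look transient, or leave retries alone and raise memory or disk if the task fails at the same point every time.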
What to look for
Sometimes your machine has truly crashed, and the failure has nothing to do with preemption. In such cases, it is worth checking your logs to see whether the job always crashes a certain number of minutes into the process, or always while copying inputs or outputs or while pulling Docker images. Such patterns can help indicate why your tasks are failing with error code 10.