Are you experiencing PAPI error code 10? Read this for a workaround!
In general, if you ever see your workflow fail with "PAPI error code 10: The assigned worker has failed to complete the operation", then one of two things has happened:
1. Your machine actually failed.
2. The machine was preempted in such a way that it looks like machine failure.
This means that "error code 10" is something of a catch-all error message. Because this error is not retried automatically, more of you will tend to notice workflows failing with it.
So what's next? We are discussing with the Google team how to catch such failures and retry on behalf of users, so that no extra action is needed on your part. Before doing so, we are starting to monitor how frequent these failures are, and whether they are more prevalent in some projects, to understand the nature of this failure more thoroughly.
Meanwhile, we are sharing some mitigation strategies to reduce the failure rate. Will these strategies eliminate all failures? Definitely not: sometimes your machine has truly crashed for reasons that have nothing to do with preemption. In such cases it's worth looking into your logs to see whether the job always crashes a certain number of minutes into the process, or always while copying inputs/outputs or pulling Docker images. Those patterns can help indicate why your tasks are failing with `error code 10`.
Short-term workarounds:
1. You can turn preemptibles off, which removes the possibility of silent preemption failures (see the sketch after this list).
2. Use the runtime attribute `maxRetries` to retry a task *after your command* fails with a non-zero return code, so that transient failures get more attempts without manual intervention.
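Both settings live in a task's `runtime` block. Here is a minimal sketch of a task with preemptibles off and retries on; the task itself is a placeholder, and the values are illustrative rather than recommendations:

```wdl
task SayHello {
  command {
    echo "hello"
  }

  output {
    String greeting = read_string(stdout())
  }

  runtime {
    docker: "ubuntu:latest"
    preemptible: 0   # 0 means never request a preemptible VM, removing silent preemption failures
    maxRetries: 2    # automatically retry up to 2 times if the command exits non-zero
  }
}
```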
Comments
4 comments
Hi @Sushma Chaluvadi, would you be able to advise how one would turn preemptibles off and set maxRetries for a given FireCloud workspace? Apologies if this is a dumb question.
@Owen Wilkins, not a dumb question.
maxRetries and preemptible settings are set at the task level, so they live in the method, not the workspace. Specifically, they are specified in the runtime block of each task in the WDL(s) that comprise the method. You can read more about writing this into the WDL here: https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/#maxretries
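To make this adjustable per workspace rather than hard-coded in the method, a common pattern is to expose these attributes as task inputs with defaults, so they can be overridden from the workspace's inputs page. A sketch with hypothetical names:

```wdl
task AlignSample {
  Int preemptible_tries = 0   # 0 disables preemptible VMs; override from the inputs page if desired
  Int max_retries = 2

  command {
    echo "the real alignment command would run here"
  }

  runtime {
    docker: "ubuntu:latest"
    preemptible: preemptible_tries
    maxRetries: max_retries
  }
}
```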
Hi
I want to share my experience to save others a lot of time, in case someone faces the same problem I had.
I'm working with WGS samples from a NovaSeq 6000, using the workflow 1-Processing-For-Variant-Discovery v1.1.0 from Broad's showcase workspace Germline-SNPs-Indels-GATK4-hg38. I struggled for days trying to figure out why all my jobs were failing on the SortAndFixTags step with this uninformative "PAPI error code 10. The assigned worker has failed to complete the operation" message.
According to the logs, the exact same sample failed on a slightly different step each time. The logs were suddenly interrupted without any informative error description, and this issue was driving me crazy. After digging deep and running tons of tests, I figured out the problem was a simple out-of-memory error. Simply increasing the default mem_size parameter in the workflow inputs from "7500 MB" to "10000 MB" completely solved the problem for me.
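In WDL terms, the fix works because the task's runtime memory is driven by that input, so raising the input raises the VM's memory allocation. Roughly like this; it's a sketch, and the actual variable names in the showcase workflow may differ:

```wdl
task SortAndFixTags {
  String mem_size = "10000 MB"   # raised from the original "7500 MB" default

  command {
    echo "the real SortSam / SetNmMdAndUqTags commands would run here"
  }

  runtime {
    memory: mem_size   # Cromwell sizes the VM from this attribute
  }
}
```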
This may be largely explained by my lack of previous experience with WDL and Terra, but I hope this tip helps other people, and helps the Broad make Terra's logs more informative.
Best,
Rodrigo Guarischi Sousa
Many thanks to Rodrigo Guarischi-Sousa. I am stuck on the same problem as yours. I have years of experience running GATK workflows on my local server; however, on Terra we cannot see all error messages, such as out-of-memory errors.