I have a task that allows 1 preemptible attempt and is scattered into 100 shards. Several of the shards get preempted and then retry on a regular (non-preemptible) machine as expected. Over a few attempts to run this task, though, it looks like a few shards get preempted and then fail to retry on a regular machine. This causes the entire task to fail, and when I rerun the task the individual shards don't appear to call-cache, so they all have to run again. Could you take a look and see if I'm understanding this failure mode correctly? Are there shards that are preempted but fail to rerun on a non-preemptible machine? Thanks for your help!
Example - Workflow ID: 8f858803-87af-416a-9b3a-1884f7ec5f69, Task: M2, Shard: 46