Preemptible task not rerunning after being preempted Answered
I have a task that I'm allowing 1 preemptible attempt and that is scattered into 100 shards. Several of the shards get preempted and retry on a regular machine, but over a few attempts to run this task it looks like a few shards get preempted and then fail to retry on a regular machine. This causes the entire task to fail, and if I try to rerun the task the individual shards don't call-cache, so they all have to rerun. Could you take a look and see if I'm understanding this failure mode correctly? Are there shards that are preempted but fail to rerun on a non-preemptible machine? Thanks for your help!
Example - Workflow ID: 8f858803-87af-416a-9b3a-1884f7ec5f69, Task: M2, Shard: 46
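For reference, the relevant part of the setup looks roughly like this (a simplified sketch with placeholder names, not my actual WDL):

```wdl
version 1.0

workflow PreemptibleScatterSketch {
  input {
    Array[File] shard_intervals   # e.g. the 100 scattered interval lists
  }
  scatter (intervals in shard_intervals) {
    call ShardTask { input: intervals = intervals }
  }
}

task ShardTask {
  input {
    File intervals
  }
  command <<<
    echo "processing ~{intervals}"
  >>>
  runtime {
    docker: "ubuntu:18.04"
    # 1 attempt on a preemptible VM; if that attempt is preempted, Cromwell
    # should retry the shard on a regular (non-preemptible) machine.
    preemptible: 1
  }
}
```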
Comments
Hi Justin,
If you are able to, can you share the workspace with GROUP_FireCloud-Support@firecloud.org as a Writer (you can remove this permission once we have resolved the issue)? You may also need to share any Workflows that are not already publicly readable independently of the Workspace; this can be done through the FireCloud Methods Repository, which you can access at https://portal.firecloud.org/?return=terra#methods. Sharing the Workspace will allow us to look directly into the logs and troubleshoot more efficiently.
Best,
Marianie
Hi Marianie,
The workspace should be shared already with that account. The workflow should be public as well.
workspace-id: c1d3840a-733b-4a89-8a78-c794e3acd032 submission-id: 547cd2da-7d2e-4462-85d8-86cf2c76d521
Best,
Justin
Hi Justin,
Thank you for sharing. We are looking into it now. In the meantime, can you also share the workspace name?
Best,
Marianie
Hi Marianie,
The workspace is blood-biopsy/early_stage_BC_whole_genome_analysis.
Best,
Justin
Hi Justin,
We saw this error "A USER ERROR has occurred: Argument -L, --interval-set-rule has a bad value: [gs://fc-c1d3840a-733b-4a89-8a78-c794e3acd032/547cd2da-7d2e-4462-85d8-86cf2c76d521/Mutect2/8f858803-87af-416a-9b3a-1884f7ec5f69/call-SplitIntervals/glob-0fc990c5ca95eebc97c4c204e3e303e1/0099-scattered.interval_list, gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf],INTERSECTION. The specified intervals had an empty intersection" in the JES's M2-99.log for call #136.
Can you inspect the 0099-scattered.interval_list to see if there is an intersection issue with the small_exac_common_3.vcf?
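For reference, the failing call is presumably a GetPileupSummaries-style invocation along these lines (a sketch only; the tool, inputs, and docker image here are assumptions based on the error message, not taken from your WDL):

```wdl
version 1.0

task GetPileupSummariesSketch {
  input {
    File bam
    File bam_index
    File variants    # e.g. small_exac_common_3.vcf
    File intervals   # e.g. 0099-scattered.interval_list for this shard
  }
  command <<<
    # The two -L arguments are intersected; if this shard's intervals share
    # no positions with the variant sites, GATK reports an empty intersection.
    gatk GetPileupSummaries \
      -I ~{bam} \
      -V ~{variants} \
      -L ~{intervals} -L ~{variants} --interval-set-rule INTERSECTION \
      -O pileups.table
  >>>
  output {
    File pileups = "pileups.table"
  }
  runtime {
    docker: "broadinstitute/gatk:4.1.4.0"   # assumed image/version
  }
}
```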
Best,
Marianie
Hi Marianie,
I saw that failure and think I understand what happened and how to prevent it from happening again on a subsequent run. My real question is what happened to shard 46. It looks like it failed to relaunch after it was preempted. Can you check that one for me?
Best,
Justin
Hi Justin,
You are right that shard 46 failed due to preemption, and indeed it should have been retried on a non-preemptible machine. We think it failed because shard 99 returned a non-zero return code, which signaled the entire workflow to stop running before shard 46 had a chance to try again on a non-preemptible machine. Have you had the opportunity to fix the error causing shard 99 to fail and to re-run the workflow?
Sushma
Hi Sushma,
I think I've fixed the problem with shard 99 and am re-running now to see if that solves the problem.
Justin
Hi Sushma,
I've been able to get the problematic task to run successfully, and several subsequent tasks as well. Eventually one task failed after running out of memory, and I relaunched the workflow giving that task more memory, but I noticed that the long-running task that was previously problematic was not call-cached. Are tasks that are scattered not call-cached?
-Justin
Hi Justin,
Call caching should be enabled automatically. Can you confirm that it was not disabled somewhere along the way? It should appear in your Job History tab within the Submission view, listed as either Call Caching: Enabled or Call Caching: Disabled. If the previous tasks were not modified in any way, they should be call-cached. I am checking with the team to see if there might be another reason why it did not work as expected.
Sushma
Hi Sushma,
I took a look at the previous run of the workflow and the current rerun, and call caching appears to have been enabled for both. As far as I can tell, I only changed the optional mem parameter for the LearnReadOrientationModel task.
-Justin
Hi Justin,
We are taking a look, and I'll get back to you! I have access to the workspace you shared before, but can you confirm the two submission IDs you are comparing?
Sushma
Thanks for taking a look. The initial submission ID should be a6115b0f-b599-44f7-a363-08fc5b4bd850 and the submission ID that is running now is ff790ada-9534-4db1-8c59-f94c09977c46. The particular task I thought should have been call cached is M2.
-Justin
Hi Sushma,
Any updates on this?
Hi Justin,
Sorry about the delay, I will check with the team again for an update.
Posting solution on forum for other users:
Here is an update of what I have been able to find so far:
1. I did a diff check and saw that your first submission was indeed call-cached successfully, but the reason the second submission did not call-cache the M2 task is that the output of the SplitIntervals task (the interval files used to shard) is different.
2. One would assume that the output of SplitIntervals should stay the same if the task was not modified. While the files themselves do stay the same, each submission comes with a unique ID, and within each submission, each workflow comes with another unique ID. This means that the interval files generated in the first workflow have a different gs:// bucket path than those in the second workflow. This would not normally matter, but the M2 task takes its intervals input as a String, so call caching looks for the exact same string for the input intervals file. This is where we think the issue is happening.
For example, assume the following structure for bucket paths: gs://bucket-id/submission-id/workflow-id/task/output.file
Submission 1's intervals file path: gs://bucket-id/submission-1-id/workflow-1-id/SplitIntervals/intervals_1.list
Submission 2's intervals file path: gs://bucket-id/submission-2-id/workflow-2-id/SplitIntervals/intervals_1.list
This is to illustrate that while the intervals_1.list output file may be exactly the same, the bucket path now differs between submissions.
3. Why does this matter? In the M2 task, the input is String? intervals. That means the path to intervals_1.list is read in as a String, and when the two strings are compared they are not the same, so call caching is not activated (see the sketch after this list). It looks as though the choice to make intervals a String rather than a File was made to take advantage of NIO streaming of files (to avoid localization). I looked up the M2 WDL on GitHub, and in that version you can see that the same variable is a File in the M2 task (instead of a String): https://github.com/gatk-workflows/gatk4-somatic-with-preprocessing/blob/master/mutect2.wdl
4. What next? I have not been able to do so yet, but I would like to import the mutect2.wdl from GitHub into the Broad repository, test-run a small sample, and then run it again to see if that triggers call caching. I wanted to ask if you would be able to run this test. If not, I can try to do so as well.
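As a sketch of the difference (simplified, with placeholder task names; the real M2 task has many more inputs):

```wdl
version 1.0

# Sketch only: shows the single declaration that affects call caching.
task M2_with_string_intervals {
  input {
    # Hashed as the literal string value, i.e. the gs:// path itself.
    # A new submission produces a new path, so the hash changes and the call
    # does not cache, but the path can be streamed via NIO without localization.
    String? intervals
  }
  command <<<
    echo ~{default="" intervals}
  >>>
  runtime { docker: "ubuntu:18.04" }
}

task M2_with_file_intervals {
  input {
    # Hashed by the file's content, so an identical intervals file
    # call-caches even though its gs:// path differs between submissions.
    File? intervals
  }
  command <<<
    echo ~{default="" intervals}
  >>>
  runtime { docker: "ubuntu:18.04" }
}
```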
Sushma