Preemptible task not rerunning after being preempted

Answered

16 comments

  • Marianie Simeon

    Hi Justin,

    If you are able to, can you share the workspace with GROUP_FireCloud-Support@firecloud.org as a Writer? (You can remove this permission once we have resolved the issue.) You may also need to share any Workflows that are not already publicly readable independently of the Workspace; this can be done through the FireCloud Methods Repository at https://portal.firecloud.org/?return=terra#methods. Sharing the Workspace will allow us to look directly into the logs and troubleshoot more efficiently.

     

    Best,

    Marianie

     

     

  • Justin Rhoades

    Hi Marianie,

    The workspace should be shared already with that account.  The workflow should be public as well.

    workspace-id: c1d3840a-733b-4a89-8a78-c794e3acd032
    submission-id: 547cd2da-7d2e-4462-85d8-86cf2c76d521

     

    Best,

    Justin

  • Marianie Simeon

    Hi Justin, 

     

    Thank you for sharing. We are looking into it now. In the meantime, can you also share the workspace name?

     

    Best,

    Marianie

  • Justin Rhoades

    Hi Marianie,

     

    The workspace is blood-biopsy/early_stage_BC_whole_genome_analysis.

     

    Best,

    Justin

  • Marianie Simeon

    Hi Justin,

    We saw this error "A USER ERROR has occurred: Argument -L, --interval-set-rule has a bad value: [gs://fc-c1d3840a-733b-4a89-8a78-c794e3acd032/547cd2da-7d2e-4462-85d8-86cf2c76d521/Mutect2/8f858803-87af-416a-9b3a-1884f7ec5f69/call-SplitIntervals/glob-0fc990c5ca95eebc97c4c204e3e303e1/0099-scattered.interval_list, gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf],INTERSECTION. The specified intervals had an empty intersection" in the JES's M2-99.log for call #136.

    Can you inspect 0099-scattered.interval_list to see whether its intervals actually intersect small_exac_common_3.vcf?
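
    For a quick local check, something like the sketch below may help (a rough illustration; the file names assume both inputs were first copied down with gsutil cp, and zero overlapping variants reproduces the error):

        # Does any variant in small_exac_common_3.vcf fall inside the intervals
        # of 0099-scattered.interval_list? Both formats are 1-based, inclusive.

        def read_intervals(path):
            """Picard interval_list: @-prefixed header, then chrom/start/end/strand/name."""
            intervals = []
            with open(path) as fh:
                for line in fh:
                    if line.startswith("@") or not line.strip():
                        continue
                    chrom, start, end = line.split("\t")[:3]
                    intervals.append((chrom, int(start), int(end)))
            return intervals

        def read_vcf_sites(path):
            """Plain-text VCF: skip # header lines, yield (CHROM, POS) per record."""
            with open(path) as fh:
                for line in fh:
                    if line.startswith("#") or not line.strip():
                        continue
                    chrom, pos = line.split("\t")[:2]
                    yield chrom, int(pos)

        intervals = read_intervals("0099-scattered.interval_list")
        overlaps = sum(
            1
            for chrom, pos in read_vcf_sites("small_exac_common_3.vcf")
            if any(c == chrom and s <= pos <= e for c, s, e in intervals)
        )
        print(f"{overlaps} variant(s) overlap the shard's intervals")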

    Best,
    Marianie

  • Justin Rhoades

    Hi Marianie,

    I saw that failure and think I understand what happened and how to prevent it from happening again on a subsequent run.  My real question is what happened to shard 46.  It looks like it failed to relaunch after it was preempted.  Can you check that one for me?

    Best,

    Justin

  • Sushma Chaluvadi

    Hi Justin,

    You are right that shard 46 failed due to preemption, and it should indeed have been retried on a non-preemptible machine. We think the retry never happened because shard 99 exited with a non-zero return code, which signaled the entire workflow to stop before shard 46 had a chance to try again on a non-preemptible VM. Have you had a chance to fix the error causing shard 99 to fail and to re-run the workflow?
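
    To picture the ordering (purely a toy sketch of the behavior described above, not Cromwell internals):

        # shard 46 is preempted and queued for a non-preemptible retry, but
        # shard 99's permanent failure stops the workflow first, so the
        # queued retry is dropped before it can run.
        events = [
            ("shard-46", "preempted"),     # transient: eligible for retry
            ("shard-99", "non-zero rc"),   # permanent: the user error above
        ]

        retry_queue, dropped = [], []
        for shard, outcome in events:
            if outcome == "preempted":
                retry_queue.append(shard)
                print(f"{shard}: preempted -> queued for non-preemptible retry")
            else:
                print(f"{shard}: {outcome} -> workflow told to stop")
                dropped, retry_queue = retry_queue, []
                break

        print("retries that never ran:", dropped)   # ['shard-46']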

     

    Sushma

  • Justin Rhoades

    Hi Sushma,

    I think I've fixed the problem with shard 99 and am re-running now to see if that solves the problem.

    Justin

  • Justin Rhoades

    Hi Sushma,

    I've been able to get the problematic task to run successfully, and several subsequent tasks as well. Eventually one task failed from running out of memory, so I relaunched the workflow giving that task more memory, but I noticed that the long-running task that was previously problematic was not call cached. Are scattered tasks not call cached?

    -Justin 

  • Sushma Chaluvadi

    Hi Justin,

    Call caching should be enabled automatically. Can you confirm that it was not disabled somewhere along the way? It appears in your Job History tab within the Submission view, listed as Call Caching: Enabled or Disabled. If the previous tasks were not modified in any way, they should be call-cached. I am checking with the team to see if there might be another reason why it did not work as expected.
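
    For reference, the toggle corresponds to Cromwell's documented call-caching workflow options; that Terra passes exactly these under the hood is an assumption on my part:

        import json

        # Cromwell's call-caching workflow options (documented option names);
        # assuming the Terra call-caching toggle maps onto these at submission.
        workflow_options = {
            "read_from_cache": True,   # reuse previous results where possible
            "write_to_cache": True,    # record this run's results for later hits
        }
        print(json.dumps(workflow_options, indent=2))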

     

    Sushma

  • Justin Rhoades

    Hi Sushma,

    I took a look at the previous run of the workflow and the current rerun, and it appears that call caching was enabled for both. As far as I can tell, I only changed the optional mem parameter for the LearnReadOrientationModel task.

    -Justin

  • Sushma Chaluvadi

    Hi Justin,

    We are taking a look; I'll get back to you! I have access to the workspace you shared before, but can you confirm the two submission IDs you are comparing?

     

    Sushma

  • Justin Rhoades

    Thanks for taking a look.  The initial submission ID should be a6115b0f-b599-44f7-a363-08fc5b4bd850 and the submission ID that is running now is ff790ada-9534-4db1-8c59-f94c09977c46.  The particular task I thought should have been call cached is M2.

    -Justin

  • Justin Rhoades

    Hi Sushma,

    Any updates on this?

  • Sushma Chaluvadi

    Hi Justin,

    Sorry about the delay, I will check with the team again for an update.

     

  • Sushma Chaluvadi

    Posting the solution on the forum for other users:

    Here is an update of what I have been able to find so far:

    1. I did a diff check and saw that your first submission was indeed call-cached successfully, but the second submission did not call-cache the M2 task because the output of the SplitIntervals task (the intervals used to shard) is different.

    2. One would assume that the output of SplitIntervals should stay the same if the task was not modified. The file contents do stay the same, but each submission comes with a unique ID, and within each submission each workflow comes with another unique ID. This means the interval files generated in the first workflow have a different gs:// bucket path than those in the second workflow. Normally this would not matter, but the M2 task takes its intervals input as a String, so call caching looks for the exact same string for the input intervals file. This is where we think the issue is happening.

    For example, assume the following structure for bucket paths: gs://bucket-id/submission-id/workflow-id/task/output.file

    Submission 1's intervals file path: 

    gs://fc-c1d3840a-733b-4a89-8a78-c794e3acd032/ff790ada-9534-4db1-8c59-f94c09977c46/9bf64d30-e357-47a5-9115-2221a548eb7a/intervals_1.list

    Submission 2's intervals file path:

    gs://fc-c1d3840a-733b-4a89-8a78-c794e3acd032/a6115b0f-b599-44f7-a363-08fc5b4bd850/57feedc6-cd51-48f7-b643-5a6216e270ce/intervals_1.list

    This illustrates that while the intervals_1.list output file may be exactly the same, the bucket path now differs between submissions.

    3. Why does this matter? In the M2 task, the input is declared as String? intervals. That means the path to intervals_1.list is read in as a String, and since the two strings differ between submissions, call caching is not activated (see the sketch after point 4). It looks as though intervals was made a String rather than a File in order to use NIO streaming of files (avoiding localization). In the M2 WDL on GitHub, you can see that the same variable is a File (instead of a String) in the M2 task: https://github.com/gatk-workflows/gatk4-somatic-with-preprocessing/blob/master/mutect2.wdl

    4. What next? I have not been able to do so yet, but I would like to import the mutect2.wdl from GitHub into the Broad repository, test-run a small sample, and then run it again to see if that triggers call caching. Would you be able to run this test? If not, I can try to do so as well.
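
    To make points 2 and 3 concrete, here is a minimal sketch (an illustration of the idea only, not Cromwell's actual hashing code): a String input is keyed on its literal value, while a File input is keyed on the file itself, so the same intervals file under two submission prefixes misses the cache only in the String case.

        import hashlib

        def string_input_key(path: str) -> str:
            # A String input's cache key covers the literal value: the full
            # gs:// path, which embeds the submission and workflow IDs.
            return hashlib.md5(path.encode()).hexdigest()

        def file_input_key(content: bytes) -> str:
            # A File input is keyed on the file itself, unchanged across
            # submissions (sketched here as a content hash).
            return hashlib.md5(content).hexdigest()

        path_sub1 = "gs://bucket-id/submission-1/workflow-1/SplitIntervals/intervals_1.list"
        path_sub2 = "gs://bucket-id/submission-2/workflow-2/SplitIntervals/intervals_1.list"
        content = b"1\t1\t1000000\t+\tshard_0\n"   # identical bytes in both runs

        print(string_input_key(path_sub1) == string_input_key(path_sub2))  # False -> miss
        print(file_input_key(content) == file_input_key(content))          # True  -> hit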

    Sushma

