Intermittent task failures during localization of BAMs

Post author
Chet Birger

This may be related to the earlier bug where tasks were hanging during localization of BAMs. In my case tasks are not hanging, but rather reporting a failure. The failures are intermittent, but when scattering a job in which every shard takes the same BAM as input, there is a high probability that at least one shard will fail; e.g., when scattering mutect1 over 10 VMs, the first launch of the workflow produced two failures, and the relaunch produced three. I wanted to run the workflow WITHOUT call caching in order to estimate the cost of the entire workflow, but I find I cannot, because the only way to get all the way through is to run it repeatedly with call caching.

The BAMs are whole exomes, but still large, between 30 and 50 GB.
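
For concreteness, the scatter pattern in question looks roughly like the sketch below. The task and variable names are illustrative, not the actual pipeline code; the point is simply that every shard takes the same large BAM as input and must localize it onto its own VM.

    version 1.0

    # Illustrative sketch only: every shard localizes the same 30-50 GB BAM.
    workflow scatter_mutect1_sketch {
      input {
        File tumor_bam
        File tumor_bai
        Array[File] interval_lists   # e.g. 10 interval lists -> 10 shards/VMs
      }
      scatter (shard_intervals in interval_lists) {
        call mutect1_shard {
          input: bam = tumor_bam, bai = tumor_bai, intervals = shard_intervals
        }
      }
    }

    task mutect1_shard {
      input {
        File bam
        File bai
        File intervals
      }
      command <<<
        echo "placeholder for the MuTect1 command over ~{intervals}"
      >>>
      runtime {
        docker: "ubuntu:20.04"
        disks: "local-disk 100 HDD"   # must hold the localized BAM plus outputs
      }
    }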

Comments


  • Comment author
    Jason Cerrato

    Hi Chet,

    Thanks for writing in about this. Would you be able to share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace (see the icon with the three dots at the top-right)?

    1. Add GROUP_FireCloud-Support@firecloud.org to the User email field
    2. Click Add User
    3. Click Save

    Please let us know the submission IDs for the jobs where you are seeing this issue, and we'll take a look.

    Kind regards,

    Jason

  • Comment author
    Chet Birger

    Done. The workspace URL is: https://app.terra.bio/#workspaces/broad-firecloud-cptac/CBB_20191122_hg38_wes_char_pipeline_LUAD

    I see this a lot. Here is a submission ID in which it occurred: 15b215a2-b2d2-4d73-9436-801e713ee47c. Take a look at the failed mutect1 shards.

    -Chet

  • Comment author
    Jason Cerrato

    Hi Chet,

    A few other users have reported running into the same issue, and our Cromwell engineers are currently investigating it in conversation with Google. We've done a few tests showing that, somewhat like the previous localization issue, this one seems exacerbated by disks that are (for whatever reason) too small and/or too slow. Increasing the HDD disk space or switching to SSD has helped a couple of other users get their jobs moving, so if this is time-sensitive, we recommend making one of those two changes.
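
    For reference, both adjustments go through the task's "disks" runtime attribute in the WDL. A quick illustrative sketch (the sizes are examples, not recommendations for your particular tasks):

        runtime {
          # Option 1: stay on HDD but provision well beyond the summed input size
          disks: "local-disk 500 HDD"

          # Option 2: switch the persistent disk type to SSD
          # disks: "local-disk 200 SSD"
        }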

    One of our engineers is building a script to collect data for Google and will be looking to point it at this submission once it's built. I can get back to you with anything we hear as a result of that, if you're interested.

    Kind regards,

    Jason

  • Comment author
    Chet Birger

    Yes, Romanos and Luda in our lab also reported having this issue.  I will look into expanding the size of the disks, or changing to SSD.

    -Chet

  • Comment author
    Jason Cerrato

    Hi Chet,

    Yes, I have been working with Romanos on his issue. Let us know how the disk space change goes.

    Many thanks,

    Jason

  • Comment author
    Jason Cerrato

    Hi Chet,

    I was able to successfully run a workflow on which Romanos had been hitting the "Localization script execution completed" issue by changing the WDL to use SSD instead of HDD. I've let him know and passed him the edited workflow, along with a suggestion that it may be worth experimenting with larger HDDs and smaller SSDs to see what works better cost-wise.

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato

    Hi Chet,

    The latest from Google is that they've identified this relatively new behavior as a bug on their end, so they are working on determining the best fix. I will update you when we hear more from them.

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato

    Hi Chet,

    We've received word from Google that they've released a fix for the new localization-then-failure behavior. If you run one of the configurations that used to work, would you mind letting us know if you see success?

    Many thanks,

    Jason

  • Comment author
    Chet Birger

    Jason,

    I ran the workflow yesterday. The mutect1 scatter jobs used a local SSD disk that was not over-provisioned. (Changing back to HDD would have required a WDL change, and I didn't have time to do that.) The mutect1 jobs succeeded, but the final job, mutation_validator, which takes both normal and tumor BAMs as input, failed with the same PAPI error code 10. The log file indicated the failure occurred during BAM file localization. The mutation_validator task uses an HDD local disk and a disk "padding" of 20 GB.
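
    (For clarity, the "padding" is extra space requested on top of the summed sizes of the input BAMs. A rough sketch of how that kind of sizing is typically written in WDL; the names and values are illustrative, not copied from our actual task:)

        Int disk_pad = 20
        Int disk_gb = ceil(size(tumor_bam, "GB") + size(normal_bam, "GB")) + disk_pad

        runtime {
          disks: "local-disk " + disk_gb + " HDD"
        }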

    -Chet

  • Comment author
    Jason Cerrato

    Hi Chet,

    Thanks for letting us know. Can you share the .log file as well as the submission and workflow IDs? I'm curious to see whether the log shows the same behavior as before, failing right after "Localization script execution completed," or something else.

    Kind regards,

    Jason

  • Comment author
    Chet Birger

    I shared the workspace broad-firecloud-cptac/CBB_20200415_CPTAC3_LUAD with the support group address. Take a look at the most recent submission; you can pull the submission and workflow IDs, and the log file, from there.

    -Chet

  • Comment author
    Jason Cerrato

    Hi Chet,

    I see that this workspace is protected by the CPTAC3-dbGaP-Authorized authorization domain. I have requested access to it.

    Kind regards,

    Jason

  • Comment author
    Chet Birger

    To speed things up, then:

    Submission ID: 467044f4-bfe7-4130-9d8a-867e9e2b99dc

    Workflow ID: 44cc916b-810a-4511-a45d-149a167a0c9f

    I will send you the log file via email (there doesn't seem to be a way to attach a log file to a posting).

  • Comment author
    Jason Cerrato

    Hi Chet,

    Thank you for that information. Based on the log you provided, it does not appear that this is running into the same bug Google has reportedly resolved. That bug manifested as tasks reaching the "Localization script execution completed" step and then failing. I see from your log that your task was in the process of copying a BAM and then stopped abruptly partway through, which potentially points to preemption, or to inadequate memory or hard disk space.
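
    If you'd like to rule those factors out, they are all controlled by the task's runtime block. An illustrative sketch (the values are examples, not recommendations):

        runtime {
          preemptible: 0                 # a non-preemptible VM rules out preemption
          memory: "16 GB"                # raise this if the task may be memory-bound
          disks: "local-disk 300 HDD"    # generous headroom for the localized BAMs
        }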

    Can you share the name of the task that got this PAPI error code 10?

    Kind regards,

    Jason

  • Comment author
    Chet Birger

    The task was mutation_validator. 

  • Comment author
    Jason Cerrato

    Hi Chet,

    Looking at the logs for both attempts of this task, this does not appear to be related to the Google bug. As previously mentioned, that bug manifested as tasks reaching the "Localization script execution completed" step and then failing. In this case, the logs of both attempts of the mutation_validator task show the copying of the .bam failing abruptly.

    If you are interested, I can try to do more digging to see if there are any signs of what could have caused it. 

    Kind regards,

    Jason

