Intermittent task failures during localization of BAMs
This may be related to the earlier bug where tasks were hanging during localization of BAMs. In my case the tasks are not hanging, but rather reporting a failure. The failures are intermittent, but when scattering a job in which each shard takes the same BAM as input, there is a high probability that at least one shard will fail; e.g. when scattering mutect1 over 10 VMs, the first time I launched the workflow I got two failures, and when I relaunched I got three. I wanted to run the workflow WITHOUT call caching in order to estimate the cost of the entire workflow, but I find I cannot, because the only way to get through the workflow successfully is to run it repeatedly with call caching enabled.
The BAMs are whole exomes, but still large, between 30 and 50 GB.
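To make the setup concrete, the scatter looks roughly like this (heavily simplified; the workflow, task, and variable names here are just illustrative, not copied from the actual pipeline WDL):

workflow scatter_sketch {                    # hypothetical name
  input {
    File input_bam                           # the same 30-50 GB BAM is fed to every shard
    Array[File] interval_lists               # e.g. 10 interval lists -> 10 VMs
  }
  scatter (shard_intervals in interval_lists) {
    call mutect1_shard {                     # task definition omitted for brevity
      input:
        bam = input_bam,                     # each shard localizes this BAM independently
        intervals = shard_intervals
    }
  }
}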
Comments (16)
Hi Chet,
Thanks for writing in about this. Would you be able to share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace (see the icon with the three dots at the top-right)?
1. Add GROUP_FireCloud-Support@firecloud.org to the User email field
2. Click Add User
3. Click Save
Please let us know the submission IDs for these jobs where you are seeing this issue occurring and we'll take a look.
Kind regards,
Jason
Done. The workspace URL is: https://app.terra.bio/#workspaces/broad-firecloud-cptac/CBB_20191122_hg38_wes_char_pipeline_LUAD
I see this a lot. Here is a submission ID in which it occurred: 15b215a2-b2d2-4d73-9436-801e713ee47c. Take a look at the failed mutect1 shards.
-Chet
Hi Chet,
A few other users have reported running into the same issue, and our Cromwell engineers are currently investigating it in conversation with Google. Our tests so far show that, much like the previous localization issue, this one seems to be exacerbated by disks that are (for whatever reason) too small and/or too slow. Increasing the HDD disk size or switching to SSD has helped a couple of other users get their jobs moving, so if this is time-sensitive, we recommend trying one of those two changes.
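For reference, in most WDLs that change only touches the task's runtime "disks" attribute; here's a rough sketch of what I mean (the sizes and docker image below are illustrative, not taken from your workflow):

runtime {
  docker: "broadinstitute/mutect:1.1.7"   # illustrative; keep whatever image the task already uses
  memory: "7 GB"
  # before: disks: "local-disk 200 HDD"
  disks: "local-disk 200 SSD"             # switch HDD -> SSD, and/or increase the size
  preemptible: 1
}

If I recall correctly, the disk type Cromwell accepts there is HDD, SSD, or LOCAL, so it is a one-word change once you've settled on a size.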
One of our engineers is building a script to collect data for Google and will point it at this submission once it's ready. If you're interested, I can get back to you with anything we learn as a result.
Kind regards,
Jason
Yes, Romanos and Luda in our lab also reported having this issue. I will look into expanding the size of the disks, or changing to SSD.
-Chet
Hi Chet,
Yes, I have been working with Romanos on his issue. Let us know how the disk space change goes.
Many thanks,
Jason
Hi Chet,
By changing the WDL to use SSD instead of HDD, I was able to successfully run a workflow for which Romanos had been getting the "Localization script execution completed" failure. I've let him know and passed him that edited workflow, along with a suggestion that it may be worth experimenting with larger HDDs versus smaller SSDs to see which works better cost-wise.
Kind regards,
Jason
Hi Chet,
The latest from Google is that they've identified this relatively new behavior as a bug on their end, and they are working out the best fix. I will update you when we hear more from them.
Kind regards,
Jason
Hi Chet,
We've received word from Google that they've released a fix for the new localization-then-failure behavior. If you run one of the configurations that used to work, would you mind letting us know if you see success?
Many thanks,
Jason
Jason,
I ran the workflow yesterday. The mutect1 scatter jobs used a local SSD disk that was not over-provisioned. (Changing back to HDD would have required a WDL change, and I didn't have time to do that.) The mutect1 jobs succeeded, but the final job, mutation_validator, which takes both normal and tumor BAMs as input, failed with the same PAPI error code 10. The log file indicated the failure occurred during BAM file localization. The mutation_validator task uses an HDD local disk with a disk "padding" of 20 GB.
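If it helps, the disk request in that task follows the usual size-plus-padding pattern; roughly this (I'm paraphrasing from memory, so the names below are not copied from the actual WDL):

Int disk_gb = ceil(size(tumor_bam, "GB") + size(normal_bam, "GB")) + 20   # 20 GB "padding"

runtime {
  disks: "local-disk " + disk_gb + " HDD"
}

With two 30-50 GB BAMs, that works out to roughly 80-120 GB of HDD.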
-Chet
Hi Chet,
Thanks for letting us know. Can you share the .log file as well as the submission and workflow IDs? I'm curious to see whether the log shows the same behavior as before, failing right after "Localization script execution completed," or something else.
Kind regards,
Jason
I shared the workspace broad-firecloud-cptac/CBB_20200415_CPTAC3_LUAD with the support group address. Take a look at the most recent submission; you can pull the submission and workflow IDs, and the log file, from there.
-Chet
Hi Chet,
I see that this workspace is protected by the CPTAC3-dbGaP-Authorized authorization domain. I have requested access to it.
Kind regards,
Jason
To speed things up then:
Submission ID: 467044f4-bfe7-4130-9d8a-867e9e2b99dc
Workflow ID: 44cc916b-810a-4511-a45d-149a167a0c9f
I will send you the log file via email (there doesn't seem to be a way to attach a log file to a posting).
Hi Chet,
Thank you for that information. Based on the log you provided, it does not appear that this is the same bug Google has reportedly resolved. That bug manifested as tasks reaching the "Localization script execution completed" step and then failing. Your log shows the task in the middle of copying a BAM and then stopping abruptly partway through, which potentially points to preemption, or to inadequate memory or disk space.
Can you share the name of the task that got this PAPI error code 10?
Kind regards,
Jason
The task was mutation_validator.
Hi Chet,
Looking at the logs for the two attempts of this task, this does not appear to be related to the Google bug. As previously mentioned, that bug manifested as tasks reaching the "Localization script execution completed" step and then failing. In this case, the logs of both attempts of the mutation_validator task show the copy of the .bam failing abruptly.
If you are interested, I can try to do more digging to see if there are any signs of what could have caused it.
Kind regards,
Jason