Intermittent task failures during localization of BAMs
This may be related to the earlier bug where tasks were hanging during localization of BAMs. In my case the tasks are not hanging, but rather reporting a failure. The failures are intermittent, but when scattering a job in which each shard takes the same BAM as input, there is a high probability that at least one shard will fail; e.g. when scattering mutect1 over 10 VMs, the first time I launched the workflow I got two failures, and when I relaunched I got three. I wanted to run the workflow WITHOUT call caching in order to estimate the cost of the entire workflow, but I find I cannot, because the only way to get through the workflow successfully is to run it repeatedly with call caching enabled.
The BAMs are whole exomes, but still large, between 30 and 50 GB.
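To make the setup concrete, the scatter looks roughly like this (heavily simplified; the workflow, task, and variable names here are just illustrative, not copied from the actual pipeline WDL):

workflow scatter_sketch {                    # hypothetical name
  input {
    File input_bam                           # the same 30-50 GB BAM is fed to every shard
    Array[File] interval_lists               # e.g. 10 interval lists -> 10 VMs
  }
  scatter (shard_intervals in interval_lists) {
    call mutect1_shard {                     # task definition omitted for brevity
      input:
        bam = input_bam,                     # each shard localizes this BAM independently
        intervals = shard_intervals
    }
  }
}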
Comments (16)
Hi Chet,
Thanks for writing in about this. Would you be able to share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace (see the icon with the three dots at the top-right)?
1. Add GROUP_FireCloud-Support@firecloud.org to the User email field
2. Click Add User
3. Click Save
Please let us know the submission IDs for these jobs where you are seeing this issue occurring and we'll take a look.
Kind regards,
Jason
Done. The workspace URL is: https://app.terra.bio/#workspaces/broad-firecloud-cptac/CBB_20191122_hg38_wes_char_pipeline_LUAD
I see this a lot. Here is a submission ID in which it occurred: 15b215a2-b2d2-4d73-9436-801e713ee47c. Take a look at the failed mutect1 shards.
-Chet
Hi Chet,
A few other users have reported running into the same issue, and our Cromwell engineers are currently investigating it in conversation with Google. Our tests so far show that, much like the previous localization issue, this one seems to be exacerbated by disks that are (for whatever reason) too small and/or too slow. Increasing the HDD disk size or switching to SSD has helped a couple of other users get their jobs moving, so if this is time-sensitive, we recommend trying one of those two changes.
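For reference, in most WDLs that change only touches the task's runtime "disks" attribute; here's a rough sketch of what I mean (the sizes and docker image below are illustrative, not taken from your workflow):

runtime {
  docker: "broadinstitute/mutect:1.1.7"   # illustrative; keep whatever image the task already uses
  memory: "7 GB"
  # before: disks: "local-disk 200 HDD"
  disks: "local-disk 200 SSD"             # switch HDD -> SSD, and/or increase the size
  preemptible: 1
}

If I recall correctly, the disk type Cromwell accepts there is HDD, SSD, or LOCAL, so it is a one-word change once you've settled on a size.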
One of our engineers is building a script to collect data for Google and will point it at this submission once it's ready. If you're interested, I can get back to you with anything we learn as a result.
Kind regards,
Jason
Yes, Romanos and Luda in our lab also reported having this issue. I will look into expanding the size of the disks, or changing to SSD.
-Chet
Hi Chet,
Yes, I have been working with Romanos on his issue. Let us know how the disk space change goes.
Many thanks,
Jason
Hi Chet,
By changing the WDL to use SSD instead of HDD, I was able to successfully run a workflow for which Romanos had been getting the "Localization script execution completed" failure. I've let him know and passed him that edited workflow, along with a suggestion that it may be worth experimenting with larger HDDs versus smaller SSDs to see which works better cost-wise.
Kind regards,
Jason
Hi Chet,
The latest from Google is that they've identified this relatively new behavior as a bug on their end, and they are working out the best fix. I will update you when we hear more from them.
Kind regards,
Jason
Hi Chet,
We've received word from Google that they've released a fix for the new localization-then-failure behavior. If you run one of the configurations that used to work, would you mind letting us know if you see success?
Many thanks,
Jason
Jason,
I ran the workflow yesterday. The mutect1 scatter jobs used a local SSD disk that was not over-provisioned. (Changing back to HDD would have required a WDL change, and I didn't have time to do that.) The mutect1 jobs succeeded, but the final job, mutation_validator, which takes both normal and tumor BAMs as input, failed with the same PAPI error code 10. The log file indicated the failure occurred during BAM file localization. The mutation_validator task uses an HDD local disk with a disk "padding" of 20 GB.
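If it helps, the disk request in that task follows the usual size-plus-padding pattern; roughly this (I'm paraphrasing from memory, so the names below are not copied from the actual WDL):

Int disk_gb = ceil(size(tumor_bam, "GB") + size(normal_bam, "GB")) + 20   # 20 GB "padding"

runtime {
  disks: "local-disk " + disk_gb + " HDD"
}

With two 30-50 GB BAMs, that works out to roughly 80-120 GB of HDD.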
-Chet
Hi Chet,
Thanks for letting us know. Can you share the .log file as well as the submission and workflow IDs? I'm curious to see whether the log shows the same behavior as before, failing right after "Localization script execution completed," or something else.
Kind regards,
Jason
I shared the workspace broad-firecloud-cptac/CBB_20200415_CPTAC3_LUAD with the support group address. Take a look at the most recent submission; you can pull the submission and workflow IDs, and the log file, from there.
-Chet
Hi Chet,
I see that this workspace is protected by the CPTAC3-dbGaP-Authorized authorization domain. I have requested access to it.
Kind regards,
Jason
To speed things up then:
Submission ID: 467044f4-bfe7-4130-9d8a-867e9e2b99dc
Workflow ID: 44cc916b-810a-4511-a45d-149a167a0c9f
I will send you the log file via email (there doesn't seem to be a way to attach a log file to a posting).
Hi Chet,
Thank you for that information. Based on the log you provided, it does not appear that this is the same bug Google has reportedly resolved. That bug manifested as tasks reaching the "Localization script execution completed" step and then failing. Your log shows the task in the middle of copying a BAM and then stopping abruptly partway through, which potentially points to preemption, or to inadequate memory or disk space.
Can you share the name of the task that got this PAPI error code 10?
Kind regards,
Jason
The task was mutation_validator.
Hi Chet,
Looking at the logs for the two attempts of this task, this does not appear to be related to the Google bug. As previously mentioned, that bug manifested as tasks reaching the "Localization script execution completed" step and then failing. In this case, the logs of both attempts of the mutation_validator task show the copy of the .bam failing abruptly.
If you are interested, I can try to do more digging to see if there are any signs of what could have caused it.
Kind regards,
Jason