CGA pipeline scatter getting stuck at localization step for hours
Hello Terra Team,
I am running the CGA pipeline (CGA_WES_Characterization_Pipeline_v0.2_Jun2019) over about 15 pairs. For each pair, the workflow runs a scatter for the MuTect1 and MuTect2 tasks. The workflow gets stuck on various shards of those two tasks, most commonly while localizing input files, and it keeps accumulating cost while it is stuck.
We experienced the same issue during Christmas break and the cost of the running pipeline increased significantly because of that.
I am happy to share the workspace in order to figure out this problem.
As an example, I attached a screenshot of Call #11, which has been running for 3 hours, but the log file states that it is still localizing the interval list file (which is a really small file).

Thank you,
Luda
Comments
Hi Luda,
We have opened a ticket with Google to investigate this, as it seems to be a problem across many users. Can you please share the stdout, stderr, and **Task*.log files as well as the Operation ID so that we can pass them along to the team looking into this issue? If you cannot post on the forum, feel free to email your information to Terra-support@broadinstitute.zendesk.com.
Sushma
Hello Sushma,
I emailed the log file to the address you provided. Only the log file was generated; there are no stdout or stderr files.
Thank you,
Luda
Hello Sushma,
I shared the workspace broad-firecloud-ibmwatson/Wu_Richters_IBM with GROUP_FireCloud-Support@firecloud.org. The CGA_WES_Characterization_Pipeline_v0.2_Jun2019 pipeline has been running since yesterday (~23 hours) and the current estimated cost is > $100 for 6 pairs. It has completed only 2 out of 6 pairs.
In the same workspace, you can find runs of the same pipeline from just a few months ago where the cost was $1-5 per pair. What has changed? Why do scatter tasks hang indefinitely?
Is it possible to reimburse the billing project for this type of issue?
Thank you,
Luda
Hi Terra,
We have also observed similar hang-ups but have been unable to verify the scary cost increase that Liudmila describes. We would love to hear something about this time-sensitive issue. Thank you,
Hello Brendan,
We are updating this thread with new information as we hear it: https://support.terra.bio/hc/en-us/community/posts/360056045911-Hanging-Localization-Step
Sushma
Thank you, Sushma! I'll keep an eye on that thread.
Hello Sushma,
I just restarted the pipeline on 6 pairs and see the same issue where scatter tasks get stuck for an hour at the localization step. You can see it in the same workspace I shared with you.
Thank you,
Luda
I shared the workspace with GROUP_FireCloud-Support@firecloud.org. It is called broad-firecloud-ibmwatson/Wu_Richters_IBM. Let me know if you have any issues accessing it.
The pipeline has been running since yesterday and is still stuck on the MuTect1 and MuTect2 scatter tasks.
Thank you,
Luda
Hello Luda,
We are collecting new information to pass back to our Google partners to help determine what is happening. You shared your workspace, but it seems that an Authorization Domain is protecting it, so we are unable to access it. Would you also add us to the Authorization Domain?
Thank you,
Sushma
Hello Sushma,
I am not an owner of this authorization domain. I will request access for you. Are you able to replicate the issue in your workspace?
Thank you,
Luda
Hello Luda,
We were hoping to look at your workspace, but since we are waiting on access to the Auth Domain, we can try to replicate the issue in our test workspace. Once you can get us access, details specific to your run would be great information to pass back!