
Hanging Localization Step

Completed

Comments

45 comments

  • Official comment
    Jason Cerrato

    Update as of February 12, 2020 9:30AM ET:

    Google's engineers released a fix for the hanging localization issue yesterday, February 11, 2020 at 1:00PM ET. The fix targeted an issue in which small or slow disks were unable to register that localization had actually finished, which caused the task to hang. Our engineering team has run a series of tests that validate the efficacy of the fix. We would like to hear from any previously affected users whether their jobs are now running as expected.

    In general, we recommend that advanced users still consider using larger disks for increased performance. See Adam's breakdown of costs between an 80 GB and 500 GB disk here: https://support.terra.bio/hc/en-us/community/posts/360056045911/comments/360009184152
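
    For a rough sense of why the larger disk adds so little cost, here is a back-of-the-envelope sketch in Python. The ~$0.04 per GB-month rate for a standard persistent disk and the 730-hour month are assumptions for illustration only; check current Google Cloud pricing for exact figures.

    ```python
    # Back-of-the-envelope persistent disk cost comparison (illustrative only).
    # Assumes ~$0.04 per GB-month for a standard persistent disk and a
    # 730-hour month; verify against current Google Cloud pricing.
    GB_MONTH_RATE = 0.04     # assumed $/GB/month
    HOURS_PER_MONTH = 730

    def disk_cost(size_gb: float, hours: float) -> float:
        """Approximate cost of keeping a size_gb disk attached for `hours` hours."""
        return size_gb * GB_MONTH_RATE * hours / HOURS_PER_MONTH

    for size_gb in (80, 500):
        print(f"{size_gb:>3} GB disk for a 10-hour task: ~${disk_cost(size_gb, 10):.2f}")
    ```

    Even at 500 GB, the disk for a 10-hour task comes to well under a dollar by this estimate, which is why the larger disk is usually a cheap way to avoid slow I/O.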

  • Sushma Chaluvadi

    Hello All,

    Update on the above described issue as of January 13, 2020 12:00 PM ET:

    1. You may have already done this, but if not, we highly recommend aborting any currently running jobs that are stuck in a hanging state to avoid accruing additional charges. We recommend holding off on submitting any other jobs if possible as well, just in case the localization issue persists (a scripted way to abort a submission is sketched just after this list).


    2. We believe we have found a root cause and solution to this issue. We can provide you with an update this afternoon once we determine the timing for a release with the engineers.
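
    For anyone who prefers to abort a hung submission from a script rather than through the Terra UI, here is a minimal sketch using the FISS client (the firecloud Python package). The billing project, workspace name, and submission ID below are placeholders; substitute your own values.

    ```python
    # Minimal sketch: abort a hung submission with the FISS client
    # (pip install firecloud). All values below are placeholders.
    from firecloud import api as fapi

    namespace = "my-billing-project"                          # placeholder
    workspace = "my-workspace"                                # placeholder
    submission_id = "00000000-0000-0000-0000-000000000000"    # placeholder

    resp = fapi.abort_submission(namespace, workspace, submission_id)
    # A 2xx status code means the abort request was accepted.
    print(resp.status_code, resp.text)
    ```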

    Please "FOLLOW" this post to get notifications of further updates. This thread will be updated as we get more information.

  • Sushma Chaluvadi

    Hello,

    Update as of January 14, 2020 4:12 PM ET:

    A fix for this bug was released today at 4:11 PM ET. Please let us know via this thread if you continue to experience the same error with localization.

     

    Thank you for your patience!

  • Jason Cerrato

    Hello all,

    Update on the above described issue as of January 15, 2020 10:30 AM ET:

    Our engineering team released an update yesterday at 4:11 PM ET that we believe should resolve the hanging localization issue some users were recently experiencing. Anybody who was previously affected by this issue should be able to re-submit their workflows without running into this problem. We appreciate your participation and patience.

    If any of your submissions from today (Jan 15) onward exhibit the same behavior, please let us know as soon as possible. If you have any other questions or concerns, don't hesitate to reach out to us via this thread, or by emailing terra-support@broadinstitute.zendesk.com.

  • Liudmila Elagina

    Hello Jason,

     

    I resubmitted my workflows, and the scatter tasks are still hanging at the localization step after an hour.

     

    Thank you,

    Luda

  • Liudmila Elagina

    Hello Jason,

     

    I already shared the workspace with GROUP_FireCloud-Support@firecloud.org. It is called broad-firecloud-ibmwatson/Wu_Richters_IBM. Let me know if you have any issues accessing it.

     

    The pipeline is still running from yesterday and it is stuck on MuTect1 and MuTect2 scatter tasks. 

     

    Thank you,

    Luda

  • Jason Cerrato

    Hello all,

    Update on the above described issue as of January 16, 2020 3:15 PM ET:

    Our engineering team released an update on Tuesday, January 14 at 4:11 PM ET that we believed would resolve the hanging localization issue. Due to the number of users who wrote in experiencing the issue, and the impact the bug had on the time and cost of workflows, the engineering team quickly released a hot fix based on the best available data at the time.

    Due to reports that some workflows have run into the same issue after the hot fix, the engineering team has gone back to review the provided data and has found an addendum to our original solution that we feel will ultimately resolve this problem. We will update this page once the release has been scheduled, and again once it has gone live.

    Thank you to all members of the community who have written in about their experience with this error and have provided our engineering team with data. Please "FOLLOW" this post to get notifications of further updates. This thread will be updated as we get more information.

  • Liudmila Elagina

    Hello Jason,

     

    Thank you for the update. Unfortunately, the addendum also did not fix this issue. Attached is a screenshot of the scatter task being stuck for almost an hour at the localization step.

     

    Thank you,

    Luda

  • Jason Cerrato

    Hi Luda,

    Apologies for the confusion—we have not yet released the aforementioned addendum. The previous message was simply an indication that one has been identified. We will be updating this page once the new release is scheduled, and then again once it's been integrated. I will update the original message for the sake of clarity.

    Kind regards,

    Jason

  • Jason Cerrato

    Update as of January 17, 2020 4:00PM ET:

    Our engineering team released the aforementioned addendum at 2:23PM ET today, January 17. After some testing, we have reasonable confidence that the hanging localization issue is fully resolved. Anybody who was previously affected by this issue may re-submit their workflows to test whether they are able to get past the localization step.

    If any of your submissions from today (Jan 17) onward exhibit the same behavior, please let us know as soon as possible. If you have any other questions or concerns, don't hesitate to reach out to us via this thread, or by emailing support@terra.bio.

  • Liudmila Elagina

    Hello Jason,

    I restarted my workflows today (Tuesday), after the addendum was implemented, and there is still a hanging issue. Localization of files completed in 30 minutes, and the task has now been running for 1 hour. The last step in the log file is "Localization script execution complete." It has been hanging at that step for 30+ minutes.

    You should now be able to access the workspace (broad-firecloud-ibmwatson/Wu_Richters_IBM). Let me know if you have any issues with it. The workflow name is CGA_WES_Characterization_Pipeline_v0.2_Jun2019.

    Thank you,
    Luda

     
     
  • Liudmila Elagina

    Tasks have now been at the "Localization script execution complete." step for 2 hours.

  • Sushma Chaluvadi

    Hello Luda,

    Thank you for reporting that this bug is continuing. Please abort the workflows that are stuck at the localization step. We have isolated the information we need to report to our Google partners; we believe that the issue may not be fully resolved.

    Sushma

  • Liudmila Elagina

    Hello Sushma,

     

    Jason stated that the latest addendum was tested before the release. How come they did not encounter the same issue? I just ran it on 5 pairs.

    This is a crucial part of our analysis, and we have been unable to run anything since Christmas.

     

    Thank you,

    Luda

  • Jason Cerrato

    Hi Luda,

     

    Our tests did indeed pass with the latest patch, but it is becoming clearer with each step that the underlying problem is quite complex. Each of the fixes thus far has resulted in some previously affected users now being able to run their workflows successfully. Why the fixes work only for some users, but not all, requires further investigation. Although we hoped to see the issue resolved for everyone with the very first patch, the complexity of the problem requires our continued, concerted effort in coordination with Google to resolve all cases of this issue.

    Please accept our sincere apologies. We understand your frustration with the error and with how long it is taking to resolve. We are truly doing all that we can to put this issue to rest.

     

    Jason

  • Susan Klaeger

    Hello Jason, 

    My workflow has also been stuck over the long weekend, and I just aborted it now.

    This task usually takes 30 minutes to complete but is now stuck due to the scatter. 

    Can this be reimbursed? 

    Let me know if you need any additional information.

    Thank you, 

    Susan

  • Jason Cerrato

    Hi Susan,

    We are currently collecting information to request reimbursement from Google for users affected by this error. Please email support@terra.bio with the following pieces of information:

    Project Name: 
    Project Number:
    Billing Account ID:
    Amount for reimbursement:
    Proof of charges (screenshot or attached file):

    To find the Project Name and Project Number, visit https://console.cloud.google.com/, select the appropriate project, and look in the Project info box at the top left. To get the Billing Account ID, go to https://console.cloud.google.com/billing/. You will see it next to the appropriate billing account as an 18-character string separated by two dashes.
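
    If it is easier to collect these details from a script, here is a rough sketch using the Google API Python client. It assumes google-api-python-client is installed and Application Default Credentials are configured (for example via gcloud auth application-default login); the project ID below is a placeholder.

    ```python
    # Sketch: look up Project Name, Project Number, and Billing Account ID.
    # Assumes google-api-python-client is installed and Application Default
    # Credentials are available. Replace the placeholder project ID.
    from googleapiclient import discovery

    project_id = "my-terra-billing-project"   # placeholder

    crm = discovery.build("cloudresourcemanager", "v1")
    proj = crm.projects().get(projectId=project_id).execute()
    print("Project Name:  ", proj["name"])
    print("Project Number:", proj["projectNumber"])

    billing = discovery.build("cloudbilling", "v1")
    info = billing.projects().getBillingInfo(name=f"projects/{project_id}").execute()
    # billingAccountName looks like "billingAccounts/XXXXXX-XXXXXX-XXXXXX"
    print("Billing Account ID:", info["billingAccountName"].split("/")[-1])
    ```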

    Many thanks,

    Jason

  • birger

    What is the status of this issue?  It is effectively preventing us from running our production somatic variant calling workflows and directly impacting several of our research projects.  In addition, the hanging jobs accrue large costs that we will need to request reimbursements for.

    thanks,

    Chet

  • Jason Cerrato

    Hi Chet,

    We are currently awaiting the latest from Google's end of the investigation. If you are interested in pursuing reimbursement, please email support@terra.bio with the following pieces of information:

    Project Name: 
    Project Number:
    Billing Account ID:
    Amount for reimbursement:
    Proof of charges (screenshot or attached file):

    To find the Project Name and Project Number please visit https://console.cloud.google.com/, select the appropriate project, and the details can be found in the box at the top-left called Project info. To get the Billing Account ID, please go to https://console.cloud.google.com/billing/. You will see it next to the appropriate billing account as an 18 character string separated by two dashes.

    We recommend aborting any jobs that are currently in the hung state, and avoiding new submissions of workflows where this issue has been experienced until we hear more. If we don't hear back from Google within the day, I will reach out to get the latest on their end of the investigation.

    Many thanks,

    Jason

  • birger

    Hi Jason,

    We do have a number of reimbursement requests in the pipeline.

    Could you provide us with a brief technical summary of what the Terra engineering team believes the issue is?

    Thanks!

    Chet

  • Liudmila Elagina

    Hello Jason,

     

    I am currently running the GATK workflow CNV_Somatic_Pair_Workflow on 71 pairs, and I have encountered the same hanging issue. The task CollectCountsTumor has been running for 4 hours: localization completed in less than 30 minutes, and it has been hanging for another 3.5 hours. There is no scatter in that workflow.

    Workspace called: broad-firecloud-ibmwatson/Wu_Richters_IBM (already shared with GROUP_FireCloud-Support@firecloud.org)

    Submission ID: 8a0927cb-7122-4b37-9f89-97ffa35e38ef

     

    Thank you,

    Luda

  • Jason Cerrato

    Hi all,

    Thank you for your latest updates, as well as your continued patience. We've received word that our engineers and Google's engineers are in contact with a group who has been able to reliably reproduce the bug, and they are currently testing a possible workaround. We will provide an update once we hear more.

    Kind regards,

    Jason

  • Yosef Maruvka

    Hello Jason,

     

    I have been analyzing WGS samples. While most of them finished after 10-15 hours, some samples have been running for more than two days and are still running. You can see it here. Please let me know if you have access to the workspace. This bug has already cost something like $450-$550 instead of the expected ~$250.

     

    Thanks,

     

    Yosi

  • Jason Cerrato

    Hi Yosi,

    Thank you for reporting this. We are unfortunately unable to see the submission details. Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org? You can do this by clicking the Share button, found via the three-dot icon next to your workspace in the workspace list or inside the workspace dashboard. Please also let us know the workspace name and the relevant submission ID.
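
    If the Share dialog is not convenient, the same access can also be granted from a script; below is a minimal sketch using the FISS client (the firecloud Python package), with placeholder workspace values.

    ```python
    # Minimal sketch: grant the support group read access to a workspace
    # using FISS (pip install firecloud). Workspace values are placeholders.
    from firecloud import api as fapi

    namespace = "my-billing-project"   # placeholder
    workspace = "my-workspace"         # placeholder

    acl_update = [{
        "email": "GROUP_FireCloud-Support@firecloud.org",
        "accessLevel": "READER",
        "canShare": False,
        "canCompute": False,
    }]
    resp = fapi.update_workspace_acl(namespace, workspace, acl_update)
    print(resp.status_code, resp.json())
    ```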

    Would you also be willing to write an email to support@terra.bio with the following details so that we can pass them to Google when we seek reimbursement for users affected by this bug?

    Project Name:
    Project Number:
    Billing Account ID:
    Amount for reimbursement:
    Proof of charges (screenshot or attached file):

    To find the Project Name and Project Number please visit https://console.cloud.google.com/, select the appropriate project, and the details can be found in the box at the top-left called Project info. To get the Billing Account ID, please go to https://console.cloud.google.com/billing/. You will see it next to the appropriate billing account as an 18 character string separated by two dashes.

    We recommend aborting any jobs that are currently in the hung state, and avoiding new submissions of workflows where this issue has been experienced until we hear more. We'll continue to post updates on the identification and resolution of this bug in this thread, so please press "Follow" at the top if you would like the latest.

    Many thanks,

    Jason

  • Kristy Schlueter-Kuck

    Hello Jason,

    I recently encountered the hanging localization issue and had to abort 29 of 30 jobs, with a total cost of over $650 accumulated in less than 24 hours (the one job that completed cost only $4.12 and took approximately 9 hours). I have already submitted a request for reimbursement. Thanks for your support.

    Best,

    Kristy

  • Yosef Maruvka

    Thanks Jason,

     

    I shared the workspace "Permissions for broad-getzlab-starrlynch-terra/tag_696_Getz_Lipkin_STARR_LynchSyndrome_ULP-WGS_MSIDetect" with your email.

     

    I sent a reimbursement email after I posted here.

     

    Best,

    Yosi

  • Yosef Maruvka

    Hi Jason,

    I aborted the runs that were hanging and reran them. The success rate was lower than in the first run, but it went from 9 hanging runs down to only 4. I tried to repeat the process (aborting and rerunning), but the remaining 4 runs are all still hanging. So maybe there is an association between particular samples and getting stuck in the hanging state.

    It looks like the best solution for me now will be to download these 4 WGS samples locally (it will cost around $50) and just use UGER to run them.

    Do you have any other suggestions?

     

    Thanks,

     

    Yosi

  • Liudmila Elagina

    Hello Jason,

     

    I just wanted to check in to see if there are any updates on the solution for this issue.

     

    Thank you,

    Luda

  • breardon

    Ditto, just following up. 

