Error Calling URI (Rawls/Martha/DRS issue) Completed

Post author: Jason Cerrato

We've identified an issue affecting users who make large numbers of DRS URI calls in their workflows. This error occurs even when a user has the appropriate external server linkage set up in their profile (https://app.terra.bio/#profile). The relevant linkages are:

  • NHLBI BioData Catalyst Framework Services
  • NCI CRDC Framework Services
  • and/or NHGRI AnVIL Data Commons Framework Services

Submissions that make these calls fail at the submission level with the following message before any workflow runs:

ErrorReport(rawls,http error calling uri https://us-central1-broad-dsde-prod.cloudfunctions.net/martha_v2,Some(502 Bad Gateway),List(),List(),None)

The underlying problem is that datastage.io does not scale properly when handling large numbers of DRS object requests. We're tackling the issue on two fronts: working with DataStage to see where we can help improve their app's performance, and identifying possible measures on our end that may work around the scaling issue.

If you urgently need to make progress, you can try batching your data so that fewer DRS object requests are made in any one submission. However, we do not yet know the "sweet spot" for how many DRS objects can be used as input without error. We recommend starting with 100 DRS objects per batch and scaling up or down from there depending on the results.
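
A minimal, hypothetical Python sketch of the batching approach described above follows. It is not a Terra or DRS API: it assumes you already have your DRS URIs in a Python list, splits them into batches of a chosen size starting at the recommended 100 (each batch would then go into its own submission), and applies a simple doubling/halving rule, an assumption rather than a Terra recommendation, to pick the next batch size after each run. The function names, the placeholder drs:// URIs, and the floor/ceiling limits are illustrative only.

```python
from typing import Iterator, List


def batches(uris: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` DRS URIs (hypothetical helper)."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(uris), size):
        yield uris[start:start + size]


def next_batch_size(current: int, succeeded: bool,
                    floor: int = 10, ceiling: int = 1000) -> int:
    """Assumed tuning rule: double after a clean run, halve after a failure.

    `floor` and `ceiling` are arbitrary illustrative limits, not Terra values.
    """
    if succeeded:
        return min(current * 2, ceiling)
    return max(current // 2, floor)


if __name__ == "__main__":
    # Placeholder URIs; substitute the drs:// values from your own data table.
    uris = [f"drs://example.org/object-{i}" for i in range(250)]
    size = 100  # recommended starting point from the post above
    print([len(b) for b in batches(uris, size)])  # [100, 100, 50]
    print(next_batch_size(size, succeeded=True))   # 200
    print(next_batch_size(size, succeeded=False))  # 50
```

Doubling on success and halving on failure is just one simple way to search for a working batch size; stepping by a fixed amount (for example, 100 then 200, as in the tests reported in the comments below) is an equally reasonable alternative.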

If you are testing how many DRS objects work for your workflow, please post your results here so that other users can benefit. We will also post our own test results here, along with other relevant updates from the development team as we receive them.

PLEASE NOTE: This error can also occur if you do not have the above-mentioned external server account(s) linked in your Profile, or if the link has expired. See this article about linking for details on how to link.

Comments

16 comments

  • Jason Cerrato (Official comment)

    Users should no longer run into this specific error message when running DOS/DRS workflows. Although this issue can be considered remediated, note that using DOS/DRS URIs in large, cloud-scale workflows may still result in other kinds of errors. If you experience any issues running large workflows that use DOS/DRS, please report them on our General Discussion board and we will be happy to investigate.

    Many thanks to everyone who contributed their experiences to our investigation.

  • Jason Cerrato

    I was able to successfully run a 1_MuTect1_Variants_Calling workflow on 100 DRS objects. I will do a test of 200 DRS objects and report the results here.

  • Jason Cerrato

    I was able to successfully run a 1_MuTect1_Variants_Calling workflow on 200 DRS objects. However, one user has reported running into the same error with 100 DRS object calls.

  • Arvind Ravi

    I'm unfortunately running into the related error below on single samples. Are there any updates on whether this has been resolved? Thank you!

    Error:

    ErrorReport(rawls,http call failed: https://us-central1-broad-dsde-prod.cloudfunctions.net/martha_v2: TCP idle-timeout encountered on connection to [us-central1-broad-dsde-prod.cloudfunctions.net:443], no bytes passed in the last 1 minute,Some(500 Internal Server Error),WrappedArray(),WrappedArray(),None)

    Workspace (shared with support):

    nci-chip-su2c-gmail-com/SU2C-LUNG-data_TCGA_LUAD_OpenAccess_V1-0_DATA_copy_2-14-18

  • Jason Cerrato

    Hi Arvind Ravi,

    There is currently an outage affecting connections to DRS/DOS objects on the Terra platform. This may be the cause of your error. You can read more in the banner on Terra, or here: https://support.terra.bio/hc/en-us/articles/360050614031-Service-Incident-October-6-2020

    I will follow-up once the incident is over to see if you are still running into issues.

    Kind regards,
    Jason

  • Jason Cerrato

    Hi Arvind Ravi,

    The issue is now resolved. Can you try again to see if you run into the error? If you do, please write to support@terra.bio with the details or contact support through the Contact Us module in the Terra UI, found in the left-hand side menu.

    Kind regards,
    Jason

  • palash pandey

    Jason Cerrato, I am facing the same issue when trying to run the gatk/mutect2-gatk4 workflow. Are you sure this was resolved? If so, can you suggest any other reason for this error?

    Thanks!

    -Palash

  • Jason Cerrato

    Hi Palash,

    As mentioned in our ticket, the issue seems to be caused by an attempt to access a requester pays bucket without providing a billing project to bill. I will continue working with you via that ticket thread, and we can revisit this conversation if we find that this error is still at play.

    Kind regards,

    Jason

  • Arvind Ravi

    Hi Jason,

    Just a quick follow-up: there are now 3 submissions in my workspace that have stalled (listed as "running" even though they delocalized their final outputs hours ago):

    workspace: broad-firecloud-ibmwatson/Getz_IBM_Ravi_SeqOnly_WGS_copy_3-3-2020

    submission IDs: 51799d75-8c65-483b-9670-e7a13a5997e7, 33bad55a-457c-4a7f-833c-1e707992b563, 52d19af2-992e-4f02-b687-83ee11863064

    Is this related to the issue above?

    Thanks so much,

    Arvind

  • Jason Cerrato

    Hi Arvind Ravi,

    That seems curious. It doesn't sound related to the issue described above since the behavior you would expect to see is failure at the submission level with the message

    ErrorReport(rawls,http error calling uri https://us-central1-broad-dsde-prod.cloudfunctions.net/martha_v2,Some(502 Bad Gateway),List(),List(),None)

    even when your external server linkage is set up correctly.

    We are happy to take a closer look at the issue you're experiencing. Would you mind sharing the workspace with GROUP_FireCloud-Support@firecloud.org if you haven't already? A member of our team will be in contact later today.

    Many thanks,

    Jason

  • Qing Zhang

    Hi Jason Cerrato,

    I'm getting the same error now, and I can confirm that I have linked my NIH account. Should I share the workspace with you for debugging?

  • Jason Cerrato

    Hi Qing Zhang,

    Yes, please share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace. The Share option is in the three-dots menu at the top-right.

    1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press enter on your keyboard.
    2. Click Save.

    Let us know the workspace name, as well as the relevant submission and workflow IDs. We’ll be happy to take a closer look as soon as we can.

    Kind regards,

    Jason

  • Seunghun Han

    Hi Jason Cerrato,

    Have there been any updates on this issue?
    I have repeatedly experienced this issue when submitting multiple parallel jobs using TCGA samples with DRS paths. When that happened, I just resubmitted the failed jobs, and they would eventually work.
    However, I'm now trying to run a Terra method that requires an array of samples together as input for cohort mode, and since my batches have over 200 samples, it keeps giving me the error in the screenshot. I can't split the batches into smaller ones, because the method is supposed to run on all the samples in a batch together.
    Is there any way to solve this issue?

    Best,
    Seunghun

  • Jason Cerrato

    Hi Seunghun Han,

    Thanks for writing in. The issue described in this thread is now fixed. It looks like you may be experiencing the issue described here: https://support.terra.bio/hc/en-us/community/posts/4406795903515

    Since you are using TCGA data, can you try unlinking and re-linking your NCI CRDC Framework Services account and then running the workflow again?

    Kind regards,

    Jason

  • Seunghun Han

    Thank you, Jason Cerrato. I followed the instructions and it seems to have fixed the issue.

    Seunghun

  • Jason Cerrato

    Thank you for following up, Seunghun Han! I've updated our Known Issues post accordingly.
