Error Calling URI (Rawls/Martha/DRS issue) Completed

Edited October 22, 2020 13:04
16 comments

We've identified an issue affecting users who make large numbers of DRS URI calls in their workflows. The error occurs when a user has the appropriate external server linked in their profile at https://app.terra.bio/#profile. The relevant linkages are:

NHLBI BioData Catalyst Framework Services
NCI CRDC Framework Services
and/or NHGRI AnVIL Data Commons Framework Services

The submission that utilizes these calls errors out with the following message before a workflow is successfully run:

ErrorReport(rawls,http error calling uri https://us-central1-broad-dsde-prod.cloudfunctions.net/martha_v2,Some(502 Bad Gateway),List(),List(),None)

The underlying problem is that datastage.io is unable to scale properly for large requests of DRS objects. We're tackling the issue on two fronts: communicating with datastage to see where we can help improve their app performance, and identifying possible measures on our end that may work around the scaling issue.

If you urgently need progress on this, you are welcome to try batching the data in such a way that reduces the overall number of requests to DRS objects. However, we do not know what the "sweet spot" is for how many DRS objects can be used as input without error at this time. We recommend starting with 100 DRS objects and scaling up or down from there depending on the results.
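
As a rough illustration of the batching suggestion above, the following Python sketch splits a list of DRS URIs into groups of at most 100 so that each group can be submitted separately. The function name `batch_drs_uris`, the placeholder URIs, and the default batch size are assumptions made for this example; they are not part of Terra's tooling.

```python
from typing import Iterator, List, Sequence


def batch_drs_uris(uris: Sequence[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` DRS URIs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(uris), batch_size):
        yield list(uris[start:start + batch_size])


# Hypothetical placeholder URIs, used only to show the batch sizes.
example_uris = [f"drs://example.org/object-{i}" for i in range(250)]
batches = list(batch_drs_uris(example_uris))
print([len(batch) for batch in batches])  # prints [100, 100, 50]
```

If a batch of 100 still fails, lowering `batch_size` and resubmitting is one way to search for a workable number.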

If you are testing what number of DRS objects work for your workflow, please post here with your results so that other users can benefit from the information. We will also be posting here with our own test results, as well as other relevant updates from the development team as we receive them.

PLEASE NOTE: This error can also occur when you do not have the above-mentioned external server account(s) linked in your Profile, or when the link has expired. See this article about linking for details on how to link.

Comments

Jason Cerrato
- October 28, 2020 15:05
- Official comment
Users should no longer run into this specific error message when running DOS/DRS workflows. While this issue can be considered remediated, note that using DOS/DRS URIs for large cloud-scale workflows may still result in other kinds of errors. If you experience any issues running large workflows that use DOS/DRS, please report them to our General Discussion board and we will be happy to investigate.

Many thanks to everyone who contributed their experiences to our investigation.

Jason Cerrato
- February 05, 2020 13:49
I was able to successfully run a 1_MuTect1_Variants_Calling workflow on 100 DRS objects. I will do a test of 200 DRS objects and report the results here.

Jason Cerrato
- February 10, 2020 14:38
I was able to successfully run a 1_MuTect1_Variants_Calling workflow on 200 DRS objects. However, one user has reported running into the same error with 100 DRS object calls.

Arvind Ravi
- October 06, 2020 14:29
I'm unfortunately running into the related error below on single samples. Are there any updates on whether this has been resolved? Thank you!

Error:

ErrorReport(rawls,http call failed: https://us-central1-broad-dsde-prod.cloudfunctions.net/martha_v2: TCP idle-timeout encountered on connection to [us-central1-broad-dsde-prod.cloudfunctions.net:443], no bytes passed in the last 1 minute,Some(500 Internal Server Error),WrappedArray(),WrappedArray(),None)

Workspace (shared with support):

nci-chip-su2c-gmail-com/SU2C-LUNG-data_TCGA_LUAD_OpenAccess_V1-0_DATA_copy_2-14-18

Jason Cerrato
- October 06, 2020 15:30
Hi Arvind Ravi,

There is currently an outage for connection to DRS/DOS objects in the Terra platform. This may be the reason for your error. You can read more in the banner on Terra, or by going here: https://support.terra.bio/hc/en-us/articles/360050614031-Service-Incident-October-6-2020

I will follow up once the incident is over to see if you are still running into issues.

Kind regards,
Jason

Jason Cerrato
- October 06, 2020 17:11
Hi Arvind Ravi,

The issue is now resolved. Can you try again to see if you run into the error? If you do, please write to support@terra.bio with the details or contact support through the Contact Us module in the Terra UI, found in the left-hand side menu.

Kind regards,
Jason

palash pandey
- October 09, 2020 13:49
Jason Cerrato, I am facing the same issue when trying to run the gatk/mutect2-gatk4 workflow. Are you sure this was resolved? If so, can you suggest any other reason for this error?

Thanks!

-Palash

Jason Cerrato
- October 09, 2020 19:01
Hi Palash,

As mentioned in our ticket, the issue seems to be related to an attempt to access a requester-pays bucket without specifying a billing project. I will continue working with you via that ticket thread, and we can revisit this conversation if we find that this error is still at play.

Kind regards,

Jason

Arvind Ravi
- October 22, 2020 11:47
Hi Jason,

Just a quick follow up that there are now 3 submissions in my workspace that have stalled (listed as "running" despite having delocalized their final outputs hours ago):

workspace: broad-firecloud-ibmwatson/Getz_IBM_Ravi_SeqOnly_WGS_copy_3-3-2020

submission IDs: 51799d75-8c65-483b-9670-e7a13a5997e7, 33bad55a-457c-4a7f-833c-1e707992b563, 52d19af2-992e-4f02-b687-83ee11863064

Is this related to the issue above?

Thanks so much,

Arvind

Jason Cerrato
- October 22, 2020 12:58
Hi Arvind Ravi,

That seems curious. It doesn't sound related to the issue described above since the behavior you would expect to see is failure at the submission level with the message

ErrorReport(rawls,http error calling uri https://us-central1-broad-dsde-prod.cloudfunctions.net/martha_v2,Some(502 Bad Gateway),List(),List(),None)

when you have the correct external server linkage set up.

We are happy to take a closer look at the issue you're experiencing. Would you mind sharing the workspace with GROUP_FireCloud-Support@firecloud.org if you haven't already? A member of our team will be in contact later today.

Many thanks,

Jason

Qing Zhang
- April 26, 2021 17:50
Hi Jason Cerrato,

I get the same error now and I can confirm that I have linked my NIH account. Should I share with you the workspace to debug?

Jason Cerrato
- April 26, 2021 18:27
Hi Qing Zhang,

Yes, please share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace. The Share option is in the three-dots menu at the top-right.
1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press enter on your keyboard.
2. Click Save.
Let us know the workspace name, as well as the relevant submission and workflow IDs. We’ll be happy to take a closer look as soon as we can.

Kind regards,

Jason

Seunghun Han
- September 28, 2021 22:01
Hi Jason Cerrato,

Have there been any updates on this issue?
I have repeatedly run into this error when submitting multiple parallel jobs using TCGA samples with DRS paths. When that happened, I simply resubmitted the failed jobs, and they would eventually work.
However, I'm now trying to run a Terra method in cohort mode, which requires an array of samples together as a single input. Since my batches have over 200 samples, it keeps giving me the error in the screenshot.
I can't split the batches into smaller ones, since the method is supposed to be run on all the samples in a batch together.
Is there any way to solve this issue?

Best,
Seunghun

Jason Cerrato
- September 29, 2021 13:15
Hi Seunghun Han,

Thanks for writing in. The issue described in this thread is now fixed. It looks like you may be experiencing the issue described here: https://support.terra.bio/hc/en-us/community/posts/4406795903515

Since you are using TCGA data, can you try unlinking and re-linking your NCI CRDC Framework Services account and then running the workflow again?

Kind regards,

Jason

Seunghun Han
- September 29, 2021 15:39
Thank you, Jason Cerrato. I followed the instructions and it seems to have fixed the issue.

Seunghun

Jason Cerrato
- September 29, 2021 16:31
Thank you for following up, Seunghun Han! I've updated our Known Issues post accordingly.
