Transient error resolving/localizing DRS
As the title says. The backend log is pasted below. Localization of the results from the preceding tasks works (file names are redacted in the log); after that, the VM gets stuck downloading the additional files specified by DRS URIs. The error occurs only in some tasks and does not appear to be reproducible: for example, other tasks from the same workflow localized the same DRS objects successfully. The task whose backend log is pasted below ran for almost two days with no progress. This kind of error appears periodically in our workflows.
Thank you for any help!
2023/11/03 22:35:34 Starting container setup.
2023/11/03 22:35:35 Done container setup.
2023/11/03 22:35:36 Starting localization.
2023/11/03 22:35:45 Localization script execution started...
2023/11/03 22:35:45 Localizing input gs://fc-secure-.../.../... -> /cromwell_root/some_file_name
2023/11/03 22:35:48 Localizing input gs://fc-secure-...redacted
2023/11/03 22:35:58 Localization script execution complete.
Using https://us-central1-broad-dsde-prod.cloudfunctions.net/martha_v3 to resolve DRS Objects
Attempting to download data
getm --manifest getm-manifest.json
Comments
5 comments
Hi Alexander,
Thanks for writing in with this issue! A member of the Terra support team will follow up with you as soon as they are able.
If relevant, please let us know if there is any urgency around this request so that the team can prioritize it appropriately.
Kind regards,
Josh
Hi Alexander,
Thanks again for writing in! Are there any error messages that you can take a screenshot of? Can you provide a Submission ID and Workflow ID so that we can take a look at the metadata for the submission? Also is it possible for you to provide a link to where the WDL is hosted for this workflow? That should be on either Dockstore or the Broad Methods Repository.
Please let me know if you have any questions.
Best,
Josh
Hi Josh,
Thank you for responding.
submission id: will send in a private ticket
workflow id: will send in a private ticket
This is a new submission, and the problem appeared again.
Log from one of the tasks:
The problem occurs before the task starts. The task uses inputs specified by DRS URIs (https://support.terra.bio/hc/en-us/articles/6635144998939-How-to-use-DRS-URIs-in-a-workflow) for TCGA data.
This is our custom WDL. It is in the Terra workspace; to the best of my knowledge, it is not on Dockstore or the Broad Methods Repository. Should we send it to you or grant any permissions in GCP?
Some of my thoughts: maybe there is something wrong with the TCGA data? I noticed that the GDC portal uses the same UUIDs for files. For example, for the tumor BAM for C4-A0F7 the DRS URI is drs://dg.4DFC:8dc521e9-f975-440b-a5e2-c78261057907, and the last part is the same as the TCGA UUID (https://portal.gdc.cancer.gov/files/8dc521e9-f975-440b-a5e2-c78261057907 is the page for this file in the GDC portal). I once had a problem with the GDC client stalling (on my workstation, not in Terra), and its documentation also mentions that "network time out or dropped network connect can manifest as a hung or unresponsive download session" (https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Appendix_B_TroubleShooting/#very-large-manifests).
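To illustrate the UUID correspondence described above, here is a minimal Python sketch that takes a DRS URI of the form shown (drs://dg.4DFC:<UUID>) and builds the matching GDC portal page URL. The function name drs_to_gdc_url is hypothetical, and the sketch only assumes the URI layout seen in this one example; it does not contact the GDC or resolve the DRS object.

```python
import uuid

# URL pattern taken from the GDC portal link quoted in this post.
GDC_FILE_PAGE = "https://portal.gdc.cancer.gov/files/{}"


def drs_to_gdc_url(drs_uri: str) -> str:
    """Map a drs://<prefix>:<uuid> URI to the GDC portal page for that UUID.

    Hypothetical helper: it assumes the part after the last ':' is the
    file UUID, as in the single example from this thread.
    """
    scheme = "drs://"
    if not drs_uri.startswith(scheme):
        raise ValueError(f"not a DRS URI: {drs_uri!r}")
    # Split "dg.4DFC:8dc5..." into the prefix and the object ID.
    _, sep, object_id = drs_uri[len(scheme):].rpartition(":")
    if not sep:
        raise ValueError(f"no ':' separator in DRS URI: {drs_uri!r}")
    # uuid.UUID raises ValueError if the object ID is not a valid UUID.
    uuid.UUID(object_id)
    return GDC_FILE_PAGE.format(object_id)


if __name__ == "__main__":
    print(drs_to_gdc_url("drs://dg.4DFC:8dc521e9-f975-440b-a5e2-c78261057907"))
```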
I do not know how Terra downloads these data, but maybe the getm command hangs for some reason as well? Killing and restarting usually helps.
Is there any way to stop the VM and fail the task if the task has not started within, say, 2 or 3 hours? If the hang were within the task, one could prefix the command with something like timeout 150m, but this happens before the task.
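To make the kill-and-restart idea concrete, here is a minimal Python sketch of running a command with a time limit and retrying it if it hangs. The function name run_with_timeout and its parameters are hypothetical, and whether anything like this can be applied to the localization step that Terra runs before the task is exactly the open question; the sketch only shows what the behaviour would look like for a command under our own control.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, attempts=3):
    """Run cmd, killing and restarting it if it exceeds timeout_s seconds.

    Returns the number of the attempt that succeeded. Raises RuntimeError
    if every attempt times out. A non-zero exit status is not retried:
    subprocess.CalledProcessError propagates to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child process when the timeout expires
            # and then raises TimeoutExpired.
            subprocess.run(cmd, check=True, timeout=timeout_s)
            return attempt
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt} timed out after {timeout_s}s",
                  file=sys.stderr)
    raise RuntimeError(f"{cmd!r} timed out on all {attempts} attempts")


if __name__ == "__main__":
    # Stand-in command that finishes immediately.
    print(run_with_timeout([sys.executable, "-c", "print('done')"], timeout_s=30))
```

With the command from the backend log, a call might look like run_with_timeout(["getm", "--manifest", "getm-manifest.json"], timeout_s=150 * 60), assuming the download can be restarted safely from the beginning.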
Sincerely,
Alexander.
Hi Alexander,
Thanks for the reply. I'd like to take a look at this workspace if I could. Can you share the workspace where you are seeing this issue with Terra Support by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.
Please provide us with
We’ll be happy to take a closer look as soon as we can!
Kind regards,
Josh
Was there a resolution to this issue? I'm experiencing the same thing (i.e. ~10% of DRS localizations hang indefinitely, stopping at the same point as the original poster). Thank you,
Sung