Disparity in RAM usage between a GCP VM and a Terra workflow
I have a workflow task that keeps dying with an out-of-memory error when it runs on Terra. I have maxRetries set and the 'Retry with more memory' option enabled, and I have tried the different recommended 'Factors', but what happens now is that eventually the amount of RAM requested on retry exceeds the limits of the available machine configurations, and the workflow dies with the "custom machine with many CPUs and tons of RAM not available in your region/zone" error (because there simply are no GCP VMs available to Terra that satisfy the request). So that is one curious bug/feature.

What is more perplexing is that when I run the same command on a GCP VM from the command line, with all of the equivalent input files and flags/options (but not running under Cromwell from a WDL file), the "task" completes and never uses more than 26 GB of RAM. Hence the disparity. Any suggestions on how to unpack/debug what is going on here?
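For concreteness, the task is shaped roughly like this (a sketch with simplified, hypothetical names; the real WDL has more inputs and options):

    version 1.0

    task Extract {
        input {
            File tree_pb    # hypothetical input name; the real task takes more
        }
        command <<<
            # placeholder for the actual invocation
            matUtils extract -i ~{tree_pb}
        >>>
        runtime {
            docker: "pathogengenomics/usher:latest"   # placeholder image
            memory: "64 GB"    # starting request; 'Retry with more memory' multiplies this on each retry
            cpu: 16
            maxRetries: 3
        }
    }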
Comments
Hi Marc,
Thanks for writing in. We'll take a look at your inquiry and get back to you as soon as we can!
Kind regards,
Jason
Hey Marc,
Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org? The Share option is in the three-dots menu at the top-right of the workspace.
Please provide us with the submission ID and workflow ID of a run that hit this error.
We’ll be happy to take a closer look as soon as we can.
Kind regards,
Jason
Hi Jason,
I tried to provide this information on Oct 12 by replying to your email, but perhaps that method does not work the way I imagined.
This workspace already has that user added as a Reader: ucsc-idgc/ucsc_gi_usher
Here is the actual link: https://app.terra.bio/#workspaces/ucsc-idgc/ucsc_gi_usher
Relevant submission ID: 2949c88e-8f26-472c-8513-271a2378a3ce
Relevant workflow ID: a80cb571-9217-4f8f-a21e-8a93e8e1bd49
Thanks,
-- Marc
Hey Marc,
Thanks for following up. I hadn't received this information in our other thread, but I can confirm I have access to this workspace. I'll take a closer look and get back to you as soon as I can.
I'm also happy to help troubleshoot the Oct 12 email reply if you are interested in finding out why it didn't come through.
Kind regards,
Jason
Hey Marc,
In order to exclude the possibility of inadequate memory, would you be willing to try submitting this workflow with 16 CPUs and 104 GB of memory requested for the Extract task, and with 'Retry with more memory' turned off? I recommend running with call caching enabled to save time and money on this test!
I know part of the confusion here comes from the fact that this seems to work just fine with much less memory on a non-Terra VM, so this is just a test to help rule out the possibility of some odd memory issue.
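If you want to see the actual peak from inside the task while you're at it, one option is to wrap the command with GNU time, which prints "Maximum resident set size" to stderr when the wrapped command exits. A sketch, assuming GNU time is installed in the image and substituting your real invocation:

    command <<<
        /usr/bin/time -v \
            matUtils extract -i ~{tree_pb}    # placeholder for the actual command
    >>>

The number in the task's stderr log can then be compared directly against what you observe on the stand-alone VM.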
Kind regards,
Jason
Hi Jason,
Sure thing. I will change those settings and re-run. I do run it with call caching enabled, but we are pulling data from the California Department of Public Health SARS-CoV-2 sequence database, so if new samples have been submitted the call cache gets invalidated (I guess sort of like how the UNIX make utility rebuilds when its inputs change). I will update here after the run.
Thanks,
-- Marc
Hi Jason,
That job died with this error:
"stderr for job `usherPlaceNewSamples.Extract:NA:1` contained one of the `memory-retry-error-keys: [OutOfMemory,Killed]` specified in the Cromwell config. Job might have run out of memory."
That was with the Extract task set to 16 vCPUs and 104 GB of memory, and the 'Retry with more memory' option turned off.
FYI: in order to get the Extract task in the workflow to run successfully, I used a process of trial and error and now use these settings:
In the runtime block I specify:

    cpu: "80"
    memory: "640 GB"
    cpuPlatform: "AMD Rome"
    zones: "us-central1-b"
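For what it's worth, the zones attribute accepts a space-separated list, e.g.:

    zones: "us-central1-a us-central1-b us-central1-c"

which may help with the region/zone availability error, though pinning cpuPlatform limits which zones can actually satisfy the request.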
Thanks,
-- Marc
Hi Marc,
Thanks for confirming. Do you happen to know what machine configuration you're using outside of Terra to get the successful result (CPU platform type, num CPUs, etc.)?
Kind regards,
Jason
Hi Jason,
Sure. Since I thought I was having a problem with limited RAM, I configured a GCP VM that was optimized for memory:
Hi Marc,
Ah, that's interesting! Would you be willing to run a test with the Docker container on an external VM and let us know if you still see that discrepancy?
Kind regards,
Jason
Hi Jason,
Yes, I ran it both as a WDL workflow using Cromwell and directly in the Docker container, and both runs used over 600 GB of RAM. However, this has drawn my attention to the fact that the latest GitHub release has not been incorporated into the Docker container, and I am hopeful that once the image is updated this perplexing discrepancy will disappear (maybe). I will let you know.
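Once the image is rebuilt, I plan to pin it by digest in the runtime block so that the Terra run and the local Docker run are guaranteed to use the identical image (the digest below is a placeholder):

    docker: "pathogengenomics/usher@sha256:<digest>"    # placeholder; use the actual digest of the rebuilt image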
Thanks for all your help with this.
-- Marc
Hi Marc,
Absolutely my pleasure. Do let us know how it goes and if we can help with troubleshooting anything else.
Kind regards,
Jason