Disparity between RAM usage on GCP VM compared to Terra Workflow

Post author
Marc Perry

I have a workflow task that keeps dying with an out-of-memory error when it runs on Terra.  I have maxRetries set and the 'Retry with more memory' option enabled, and I have tried the different recommended 'Factors', but what happens now is that eventually the amount of RAM requested on the retry exceeds the limits of the available machine configurations, and the workflow dies with a "custom machine with many CPUs and tons of RAM not available in your region/zone" error (because there simply are no GCP VMs available on Terra that satisfy the request).  So that is one curious bug/feature.

What is more perplexing is that when I run the same simple command on a GCP VM, from the command line, with all of the equivalent input files and flags/options (but not running in Cromwell from a WDL file), the "task" completes and never uses more than 26 GB of RAM.  Hence the disparity.  Any suggestions on how to unpack/debug what is going on here?
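(For context, my understanding is that Terra's 'Retry with more memory' factor corresponds to Cromwell's memory_retry_multiplier workflow option, where the requested memory is multiplied by the factor on each retry, which is how it eventually outgrows every available machine type.  A sketch of the equivalent workflow-options JSON, if my understanding is right:)

```json
{
  "memory_retry_multiplier": 1.8
}
```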

Comments

12 comments

  • Comment author
    Jason Cerrato

    Hi Marc,

    Thanks for writing in. We'll take a look at your inquiry and get back to you as soon as we can!

    Kind regards,

    Jason

    0
  • Comment author
    Jason Cerrato

    Hey Marc,

    Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.

    1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press enter on your keyboard.
    2. Click Save.

     

    Please provide us with:

    1. A link to your workspace
    2. The relevant submission ID
    3. The relevant workflow ID

    We’ll be happy to take a closer look as soon as we can.

    Kind regards,

    Jason

    0
  • Comment author
    Marc Perry
    • Edited

    Hi Jason,

    I tried to provide this information on OCT-12 by replying to your Email but perhaps that method does not work the way I imagined.

    This workspace already has that user added as a Reader: ucsc-idgc/ucsc_gi_usher

    Here is the actual link: https://app.terra.bio/#workspaces/ucsc-idgc/ucsc_gi_usher

    Relevant submission ID: 2949c88e-8f26-472c-8513-271a2378a3ce

    Relevant workflow ID: a80cb571-9217-4f8f-a21e-8a93e8e1bd49

    Thanks,

    -- Marc

    0
  • Comment author
    Jason Cerrato

    Hey Marc,

    Thanks for following up. I hadn't received this information in our other thread, but I can confirm I have access to this workspace. I'll take a closer look and get back to you as soon as I can.

    I'm also happy to help troubleshoot what went wrong replying to email on Oct 12 if you are interested in finding out why it didn't work.

    Kind regards,

    Jason

    0
  • Comment author
    Jason Cerrato

    Hey Marc,

    In order to exclude the possibility of inadequate memory, would you be willing to try submitting this workflow requesting 16 CPUs and 104 GB of memory for the Extract task, with 'Retry with more memory' turned off? I recommend running with call caching enabled to save time and money with this test!

    I know the source of the confusion here comes in part from the fact that this seems to work just fine with much lower amounts of memory in a non-Terra VM, so this is just a test to help rule out the possibility that there is some weird memory issue happening.
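    For reference, in the WDL that would look something like this in the Extract task's runtime block (a sketch with just the two values I suggested; your other runtime attributes stay as they are):

    ```wdl
    runtime {
        cpu: 16
        memory: "104 GB"
        # "Retry with more memory" is a Terra submission option,
        # not a runtime attribute, so it is turned off in the UI.
    }
    ```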

    Kind regards,

    Jason

    0
  • Comment author
    Marc Perry

    Hi Jason,

    Sure thing.  I will change those settings and re-run.  I do run it with call caching enabled, but we are pulling data from the California Department of Public Health SARS-CoV-2 sequence database, so if new samples have been submitted, the call cache gets invalidated (I guess sort of like the UNIX make utility, or something).  I will update here after the run.

    Thanks,

    -- Marc

    0
  • Comment author
    Marc Perry

    Hi Jason,

    That job died with this error:

    "stderr for job `usherPlaceNewSamples.Extract:NA:1` contained one of the `memory-retry-error-keys: [OutOfMemory,Killed]` specified in the Cromwell config. Job might have run out of memory."

    That was with the Extract task set to 16 vCPUs and 104 GB of memory, with the 'Retry with more memory' option turned off.
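    From what I can tell, those error keys come from the Cromwell server configuration (HOCON); presumably Terra runs with something like this fragment, which the error message implies are the defaults (Terra's actual config may differ):

    ```hocon
    system {
      # stderr substrings Cromwell greps for to flag a possible OOM
      memory-retry-error-keys = ["OutOfMemory", "Killed"]
    }
    ```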

    FYI: In order to get the Extract task in the workflow to run successfully, I have used a process of trial and error and now use these settings:

    In the runtime block I specify:

    cpu: "80"

    memory: "640 GB"

    cpuPlatform: "AMD Rome"

    zones: "us-central1-b"

    Thanks,

    -- Marc

    0
  • Comment author
    Jason Cerrato

    Hi Marc,

    Thanks for confirming. Do you happen to know what machine configuration you're using outside of Terra to get the successful result (CPU platform type, num CPUs, etc.)?

    Kind regards,

    Jason

    0
  • Comment author
    Marc Perry

    Hi Jason,

    Sure, since I thought I was having a problem with limited RAM, I configured a GCP VM that was optimized for memory: m1-megamem-96 (96 vCPUs, 1,433.6 GB memory) (!)

    But when I run it for testing I have used several methods of tracking the amount of RAM being used, and they all agree that 25 to 26 GB of RAM is the maximum used to process the same input files.  I limit the Extract task on this host to 32 vCPUs, and it runs in 25 min.

    So: 32 CPUs, the same input files, effectively unlimited RAM on demand, but it never needs more than 26 GB, and it runs in 25 min on a GCP VM.

    There is one additional difference between the two situations I am comparing.  On my VM I have installed the usher suite of tools and am calling them from the build directory (I think it is all C++), whereas on Terra, of course, I am pulling in the latest docker container of the same codebase.  So what I have not tested yet is whether running this task through Cromwell, with the WDL code and the docker container, would use more RAM on the VM; that was what I was thinking of testing when I decided to reach out to Terra Support directly (instead).
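    One of the tracking methods I use can be sketched as a small POSIX-sh helper that polls VmHWM (the kernel's peak-RSS high-water mark) from /proc while the command runs; here `sleep 2` stands in for the actual usher/matUtils invocation:

    ```shell
    # peak_rss: run a command and report its peak resident set size in kB,
    # polled from /proc/<pid>/status while it runs (a sketch; Linux-only,
    # 1-second polling, so commands that exit instantly may report 0).
    peak_rss() {
        "$@" &                  # launch the command in the background
        pid=$!
        peak=0
        while kill -0 "$pid" 2>/dev/null; do
            # VmHWM is the kernel's high-water mark for resident memory
            rss=$(awk '/^VmHWM/ {print $2}' "/proc/$pid/status" 2>/dev/null)
            if [ -n "$rss" ] && [ "$rss" -gt "$peak" ]; then
                peak=$rss
            fi
            sleep 1
        done
        wait "$pid" 2>/dev/null || true
        echo "$peak"
    }

    # Example; replace `sleep 2` with the real command and arguments:
    peak_rss sleep 2
    ```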
    Thanks,

    -- Marc
    0
  • Comment author
    Jason Cerrato

    Hi Marc,

    Ah, that's interesting! Would you be willing to run a test with the docker container on an external VM and let us know if you still see that discrepancy?

    Kind regards,

    Jason

    0
  • Comment author
    Marc Perry

    Hi Jason,

    Yes, I ran it both as a WDL workflow using Cromwell and directly in the Docker container, and both used over 600 GB of RAM.  However, this has drawn my attention to the fact that the latest GitHub release has not been incorporated into the docker container, and I am hopeful that once it is updated, this perplexing discrepancy will disappear (maybe).  I will let you know.

    Thanks for all your help with this.

    -- Marc

    0
  • Comment author
    Jason Cerrato

    Hi Marc,

    Absolutely my pleasure. Do let us know how it goes and if we can help with troubleshooting anything else.

    Kind regards,

    Jason

    0
