Out of Memory Retry

Is your workflow running out of memory? Learn how to configure it to immediately retry certain tasks.

Out of Memory Retry Overview

Sometimes, workflow submissions fail solely because the virtual machine (VM) doesn’t have the memory it needs to complete a task. You can fix this by having Terra increase the VM's memory and retry the job using the Out of Memory Retry feature.

This feature is officially supported for processes that exit with a Java OOM message. Such an exit is most likely when the JVM's maximum heap size is set several GB below the machine's total memory. Tasks that exhibit different OOM behavior may not be reliably recognized as memory failures.

Memory retry will multiply the runtime memory for specific tasks in the workflow's Workflow Description Language (WDL) script by a fixed amount (retry factor) a fixed number of times (maxRetries).

To enable this feature, do two things:

1. modify the WDL and

2. toggle the feature in the workflow configuration (setup) form.

Example

Say your workflow is initially set to use 2GB of runtime memory and the workflow fails because it uses more than 2GB of memory. You don’t want to manually rerun the workflow again and again, increasing the runtime memory each time. Instead, you decide to have Terra increase memory by a fixed multiplier every time it fails, up to a fixed number of retries.

You specify a retry factor of 2 for the workflow tasks, with a maximum of four retries. With these settings, Terra will try to re-run the WDL up to four times (or until it succeeds), with the following memory sizes:

1st run: 2GB
2nd run: 4GB
3rd run: 8GB
4th run: 16GB
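The schedule above is plain compounding, so it can be reproduced with a small shell helper. This is only an illustrative sketch: memory_schedule is a hypothetical function, not part of Terra or Cromwell, and it assumes the initial attempt counts as run 1.

```shell
# Hypothetical helper: print the memory requested on each run.
# Usage: memory_schedule INITIAL_GB RETRY_FACTOR RUNS
memory_schedule() {
    awk -v mem="$1" -v factor="$2" -v runs="$3" 'BEGIN {
        for (i = 1; i <= runs; i++) {
            printf "run %d: %gGB\n", i, mem
            mem *= factor   # each retry multiplies the previous request
        }
    }'
}

# Starting at 2GB with a retry factor of 2, over four runs:
memory_schedule 2 2 4
```

Changing the second argument to a smaller factor, such as 1.5, shows how much more slowly the request grows.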

Indicator strings trigger Out of Memory Retry

Cromwell is preconfigured with a set of “out of memory” indicator strings. If memory retry is enabled for a task, and the task fails, and one of these strings appears in your task’s stderr file, then Cromwell will retry the task with more memory (multiplying each time by the memory retry factor).

The current strings we look for in stderr are:
OutOfMemory
Killed

If you want to add more strings to this set, please contact support with your suggestion and reasoning. For example, if you would like to add a custom stderr printout “I RAN OUT OF MEMORY” to help catch failures from a specific task, we could add that to known patterns.
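To check by hand whether a failed task's stderr file contains one of the two indicator strings listed above, a simple grep is usually enough. The sketch below is an approximation: has_oom_indicator is a hypothetical helper, and it assumes a plain, case-sensitive substring match, which may not be exactly how Cromwell performs the check.

```shell
# Hypothetical helper: succeed (exit 0) if the given stderr file
# contains one of the indicator strings listed above.
# Usage: has_oom_indicator PATH_TO_STDERR
has_oom_indicator() {
    grep -q -e 'OutOfMemory' -e 'Killed' "$1"
}
```

For example, `has_oom_indicator stderr && echo "eligible for memory retry"` prints the message only when one of the strings is present in a file named stderr.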

Step 1: Modify the WDL

The first step to enabling the out-of-memory retry feature is to modify your workflow's WDL script.

1.1. Specify the maximum number of retries in the task's runtime block with the variable maxRetries. This is the number of times Terra will retry the task - multiplying the memory available for the task each time by the memory_retry_factor set in the workflow configuration form (step 2 below).

1.2. Make sure your commands/tools know how much memory is available. Some commands/tools might be able to pick up and use the additional memory provided by the out-of-memory retry feature automatically, without customization. Others - particularly if the tool is Java based - may need to be explicitly told how much memory they are allowed to use. For example, if you set your JVM to have a max 2GB of memory, increasing the workflow VM memory to 16GB will not help your workflow run better.

The solution is to tell the Java tool explicitly how much memory is available by specifying two Terra-specific environment variables within a task's command section: MEM_SIZE and MEM_UNIT. These environment variables represent the memory Cromwell is allowed to use (the actual number and the unit of measurement, respectively). The values will increase automatically as the WDL is rerun with more memory.

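Below is a modified task. The relevant changes are the MEM_SIZE and MEM_UNIT variables passed to the tool in the command section and the maxRetries setting in the runtime block: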

task max_retries_task {
   command <<<
       # My tool command which might potentially run out of memory:
       java -jar mytool.jar -mem "${MEM_SIZE} $MEM_UNIT" [...]
   >>>

   runtime {
       memory: "5GB"
       docker: "mytool:latest"
       maxRetries: 3
   }
}

See Create, edit, and share a workflow for more information about modifying a WDL.

Terra's system specifies bash environment variables that hold the numerical value (MEM_SIZE, e.g. 4) and unit (MEM_UNIT, e.g. GB) of available memory. When setting up the execution environment for the command in your WDL script, you can use these variables in the command statement instead of specifying a value directly. The value given to the command will automatically track the amount of memory that's actually available for the operation.

MEM_SIZE is an integer value representing how much memory will be available in Cromwell (e.g., 512). MEM_UNIT indicates the unit the size is measured in (i.e., KB, MB, GB, or TB). So a combined “$MEM_SIZE $MEM_UNIT” translates to “512 GB”, meaning that Cromwell has provisioned 512 gigabytes of memory for the machine.
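Many Java tools take their heap limit through the JVM's -Xmx option rather than a tool-specific argument. The sketch below shows one way to turn MEM_SIZE and MEM_UNIT into such a flag while leaving some headroom below the machine's total memory. The helper name jvm_heap_flag, the 1024MB default headroom, and the use of 1024-based unit conversion are all assumptions made for illustration, not Terra or Cromwell behavior.

```shell
# Hypothetical helper: print a JVM -Xmx flag in megabytes.
# Usage: jvm_heap_flag SIZE UNIT [HEADROOM_MB]
# Assumes UNIT is one of KB, MB, GB, TB and 1024-based conversion.
jvm_heap_flag() {
    awk -v size="$1" -v unit="$2" -v headroom="${3:-1024}" 'BEGIN {
        if      (unit == "KB") mb = size / 1024
        else if (unit == "MB") mb = size
        else if (unit == "GB") mb = size * 1024
        else if (unit == "TB") mb = size * 1024 * 1024
        else { print "unknown unit: " unit > "/dev/stderr"; exit 1 }

        heap = int(mb - headroom)   # leave room for non-heap memory
        if (heap < 1) { print "not enough memory for headroom" > "/dev/stderr"; exit 1 }
        printf "-Xmx%dm\n", heap
    }'
}
```

Inside a task's command section this could be used as `java "$(jvm_heap_flag "$MEM_SIZE" "$MEM_UNIT")" -jar mytool.jar`, so the heap limit grows along with each memory retry.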

Caveat when using MEM_SIZE and MEM_UNIT outside of Terra

Note: MEM_SIZE and MEM_UNIT are Terra-specific environment variables and may not work with other workflow engines or backends on other platforms (e.g., DNAnexus runs its own WDL interpreter, and MEM_SIZE and MEM_UNIT are not defined on that platform). If you plan to share your WDL with others who use a WDL execution engine other than Cromwell, the task can fail because the memory isn’t specified.

You may wish to deduce the free memory from within the task - by running a few “how much memory is available to me” commands before the main Java command - rather than relying on environment variables for this value.
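One possible “how much memory is available to me” command is sketched below. It assumes a Linux environment where /proc/meminfo exposes a MemAvailable line. The function name available_mem_mb is hypothetical, and inside some container setups the reported value may reflect the host rather than the task's limit.

```shell
# Hypothetical helper: print available memory in whole megabytes.
# Reads the MemAvailable line (in kB) from a Linux meminfo file.
# Usage: available_mem_mb [MEMINFO_FILE]   (default: /proc/meminfo)
available_mem_mb() {
    awk '/^MemAvailable:/ { print int($2 / 1024); found = 1 }
         END { exit (found ? 0 : 1) }' "${1:-/proc/meminfo}"
}
```

The result can then be used to size the Java heap before the main command runs, for example by subtracting some headroom and passing the remainder to the tool.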

Step 2: Adjust the workflow configuration form

Screenshot of the workflow configuration page for an example workflow. An orange rectangle highlights the checked 'retry with more memory' box and the memory retry factor setting.

2.1. Check the Retry with more memory box (highlighted in the screenshot above).

2.2. Specify the memory factor when you submit your workflow.

The memory factor is multiplicative and compounding (cost warning)

If you ask for a retry factor of two and the task retries 10 times, you will end up with 1024x the original memory request on your 10th retry (2^10). That means that if you started with 2GB of memory for your task, the 10th retry would run with 2TB! See the example above to see how the memory factor multiplies across retries.

It’s best to be conservative when specifying memory factors and maxRetries values.
Although Terra will allow you to select a memory factor as large as 10, it is strongly recommended to stay in the “typical” range between 1.1 and 2.0. If you go outside this range, Terra will warn you to consider whether you need such a high memory factor.

Hints for calculating memory factors and max retries
Think about what would happen if all of your tasks had to use all of their retries. Make sure you would be willing to use that much compute resource in your project, because there is no manual confirmation!
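A quick way to run that thought experiment is to compute the memory requested on the final retry for a given starting size, memory factor, and maxRetries value. The helper below is a hypothetical sketch for back-of-the-envelope checks only. It does not account for how many tasks run in parallel or for pricing.

```shell
# Hypothetical helper: memory (in GB) requested on the final retry.
# Usage: worst_case_memory INITIAL_GB RETRY_FACTOR MAX_RETRIES
worst_case_memory() {
    awk -v mem="$1" -v factor="$2" -v retries="$3" 'BEGIN {
        printf "%g\n", mem * factor ^ retries
    }'
}

worst_case_memory 2 2 10     # the 2GB, factor 2, 10 retries case from the warning
worst_case_memory 2 1.5 3    # a more conservative setting
```

The first call prints 2048 (GB, i.e. 2TB) and the second prints 6.75, which shows how strongly both settings affect the worst case.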

To learn more about this option, see the Cromwell documentation on maxRetries.

Command syntax

For details and examples of the correct command syntax, see the Cromwell documentation for memory retries.

Comments

2 comments

Yossi Farjoun
- Edited January 31, 2022 18:51
thanks for this guide! I was unable to find the various options that could be present in the variable MEM_UNIT. For example, if the units are GB, would the value be:

- "g"

- "GB"

-"Gb"

-1000000000

?

I see that in your example you use ${MEM_UNIT} as input to java's -mem argument, from which I deduce that it's `g`, (is that supposed to be -Xmx?) but it would be comforting to see the actual list somewhere.

Thanks!

Allie Cliffe
- April 06, 2022 14:52
Yossi Farjoun - Thanks for the feedback! I updated the docs (after consulting with Geraldine) to hopefully address your questions.
