Out of Memory Retry

Allie Hajian

Learn how to configure your workflow to immediately retry certain tasks if the only error is running out of memory.

Out of Memory Retry Overview

Sometimes, workflow submissions fail solely because the virtual machine (VM) doesn’t have the memory it needs to complete a task. You can fix this by having Terra increase the VM memory and retry the job using the Out of Memory Retry feature. Memory retry multiplies the runtime memory for specific tasks in the WDL by a fixed amount (the retry factor), up to a fixed number of times (maxRetries).

To enable this feature, do two things: 1. modify the WDL and 2. toggle the feature in the workflow configuration (setup) form.

Example

Say your workflow is initially set to use 2GB of runtime memory and the workflow fails because it uses more than 2GB of memory. You don’t want to manually rerun the workflow again and again, increasing the runtime memory each time. Instead, you decide to have Terra increase memory by a fixed multiplier every time it fails up to a fixed number of retries.

You specify a retry factor of 2 for the workflow and set maxRetries to 3. Terra will run the task up to four times (the initial run plus three retries), with the following memory sizes:

    1st run: 2GB
    2nd run: 4GB
    3rd run: 8GB
    4th run: 16GB
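
The schedule above can be reproduced with a few lines of shell. This is only a sketch of the arithmetic, assuming the memory request is multiplied by the retry factor after each out-of-memory failure and that the task gets one initial run plus three retries. The memory_schedule function is a hypothetical helper, not a Terra or Cromwell command.

```shell
#!/usr/bin/env bash
# memory_schedule is a hypothetical helper (not part of Terra or Cromwell).
# Arguments: initial memory in GB, retry factor, number of retries.
# Prints the memory requested on each run: the initial run plus each retry.
memory_schedule() {
  local initial_gb="$1" factor="$2" max_retries="$3"
  awk -v m="$initial_gb" -v f="$factor" -v r="$max_retries" 'BEGIN {
    for (run = 1; run <= r + 1; run++) {
      printf "run %d: %gGB\n", run, m
      m *= f   # each retry multiplies the previous request
    }
  }'
}

# The example from the text: 2GB initial memory, retry factor 2, 3 retries.
memory_schedule 2 2 3
```

Swapping in a fractional factor such as 1.5 shows how much more slowly the request grows in the recommended range.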

Triggering Out of Memory Retry

Cromwell is preconfigured with a set of “out of memory” indicator strings. If memory retry is enabled for a task, and the task fails, and one of these strings appears in your task’s stderr file, then Cromwell will retry the task with more memory (multiplying each time by the memory retry factor).

The current strings we look for in stderr are:
    OutOfMemory
    Killed

If you want to add more strings to this set, please contact support with your suggestion and reasoning. For example, if you would like to add a custom stderr printout “I RAN OUT OF MEMORY” to help catch failures from a specific task, we could add that to known patterns.
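
To check whether a failed task's stderr would match the current indicator strings, you can search the file yourself. The sketch below assumes a plain, case-sensitive substring match. That is an assumption about how the strings are compared, not a description of Cromwell's actual implementation, and looks_like_oom is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# looks_like_oom is a hypothetical helper (not part of Terra or Cromwell).
# Returns success (exit status 0) if the given stderr file contains one of
# the indicator strings listed above. Assumes case-sensitive substring
# matching, which may differ from what Cromwell does internally.
looks_like_oom() {
  local stderr_file="$1"
  grep -q -e 'OutOfMemory' -e 'Killed' "$stderr_file"
}

# Example: a stderr file that ends with a "Killed" line matches.
tmp_stderr=$(mktemp)
printf 'processing chunk 12\nKilled\n' > "$tmp_stderr"
if looks_like_oom "$tmp_stderr"; then
  echo "matches an out-of-memory indicator string"
fi
rm -f "$tmp_stderr"
```

A Java stack trace containing java.lang.OutOfMemoryError would also match, because it contains the substring OutOfMemory.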

How to enable Out of Memory Retry

Change the WDL to make sure it is compatible with memory retry (step 1) and toggle the feature in the workflow configuration form (step 2).

Step 1: Modify the WDL

Below is a modified WDL. The two changes are:

  1. Memory available to the task is read from the environment variables MEM_SIZE and MEM_UNIT.
  2. The task's runtime block includes the maxRetries attribute.

task max_retries_task {
   command <<<
      # My tool command which might potentially run out of memory:
      java -jar mytool.jar -mem "${MEM_SIZE} $MEM_UNIT" [...]
   >>>

   runtime {
       memory: "5GB"
       docker: "mytool:latest"
       maxRetries: 3
   }
}

See Create, edit, and share a workflow for more information about modifying a WDL.

1.1. Add a maxRetries runtime attribute to the task(s) you want Terra to retry.
This is the number of times Terra will retry the task, multiplying the memory available to the task each time by the memory_retry_factor set in the workflow configuration form (step 2 below).

1.2. Make sure your commands/tools know how much memory is available.
Some commands/tools can pick up and use the additional memory automatically, without customization. Others, particularly Java-based tools, may need to be told explicitly how much memory they are allowed to use.

Example: If you cap your JVM at 2GB of memory, increasing the workflow VM memory to 16GB will not help, because the tool still cannot use more than 2GB.

The solution is to tell the Java tool explicitly how much memory is available, using the Terra-specific environment variables MEM_SIZE and MEM_UNIT. These variables hold the amount of memory available to the task (the number and the unit of measurement, respectively). The values increase automatically each time the task is retried with more memory (see the explanation below).

How to control how much memory is available to the command

Terra sets bash environment variables that hold the numerical value (MEM_SIZE, e.g. 4) and the unit (MEM_UNIT, e.g. GB) of available memory. When setting up the execution environment for the command in your WDL script, you can use these variables in the command statement instead of specifying a value directly. The value given to the command will then automatically track the amount of memory that's actually available for the operation.

MEM_SIZE is an integer value representing how much memory will be available in Cromwell (e.g., 512). MEM_UNIT indicates the unit the size is measured in (KB, MB, GB, or TB). So a combined “$MEM_SIZE $MEM_UNIT” translates to, for example, “512 GB”, meaning that Cromwell has provisioned 512 gigabytes of memory for the machine.
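
Many tools do not accept a value such as “512 GB” directly. Java's own heap flag, for instance, is -Xmx followed by a number and a suffix such as m or g. The sketch below converts the two variables to a whole number of megabytes that can be passed to -Xmx. The mem_to_mb function is a hypothetical helper, and the 1024-based conversion between units is an assumption. The -mem flag in the WDL example above belongs to the illustrative mytool.jar, so check what format your own tool expects.

```shell
#!/usr/bin/env bash
# mem_to_mb is a hypothetical helper (not provided by Terra or Cromwell).
# Converts a size and unit (KB, MB, GB or TB) into whole megabytes.
# Assumes 1024-based units; fails with exit status 1 on an unknown unit.
mem_to_mb() {
  local size="$1" unit="$2"
  awk -v s="$size" -v u="$unit" 'BEGIN {
    if      (u == "KB") mb = s / 1024
    else if (u == "MB") mb = s
    else if (u == "GB") mb = s * 1024
    else if (u == "TB") mb = s * 1024 * 1024
    else exit 1
    printf "%d\n", mb
  }'
}

# With the values from the text (MEM_SIZE=512, MEM_UNIT=GB):
echo "-Xmx$(mem_to_mb 512 GB)m"   # prints -Xmx524288m

# Inside a task's command block this might be used as:
#   java -Xmx"$(mem_to_mb "$MEM_SIZE" "$MEM_UNIT")m" -jar mytool.jar [...]
```

In practice you may want to give the JVM somewhat less than the full amount, since the heap is not the only memory the process uses.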

Caveat when using MEM_SIZE and MEM_UNIT outside of Terra
Note: MEM_SIZE and MEM_UNIT are Terra-specific environment variables and may not work with other workflow engines or backends on other platforms (e.g., DNAnexus runs its own WDL interpreter, and MEM_SIZE and MEM_UNIT are not defined on that platform). If you share your WDL with others who use a WDL execution engine other than Cromwell, the task can fail because these variables are not set and the memory is therefore not specified.

You may wish to deduce the free memory from within the task - by running a few “how much memory is available to me” commands before the main Java command - rather than relying on environment variables for this value.
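
One way to do this is sketched below: use MEM_SIZE and MEM_UNIT when they are set, and otherwise fall back to the total memory reported by the operating system. The available_mb function is a hypothetical helper. The fallback assumes a Linux environment with /proc/meminfo, and it reports the memory of the whole machine, which is only a good estimate if the VM was sized for this one task.

```shell
#!/usr/bin/env bash
# available_mb is a hypothetical helper (not provided by Terra or Cromwell).
# Prints the memory available to the task in whole megabytes.
available_mb() {
  if [ -n "${MEM_SIZE:-}" ] && [ -n "${MEM_UNIT:-}" ]; then
    # Terra/Cromwell case: convert the two variables (1024-based, assumed).
    awk -v s="$MEM_SIZE" -v u="$MEM_UNIT" 'BEGIN {
      mult["KB"] = 1 / 1024; mult["MB"] = 1
      mult["GB"] = 1024;     mult["TB"] = 1048576
      if (!(u in mult)) exit 1
      printf "%d\n", s * mult[u]
    }'
  else
    # Fallback (Linux only): MemTotal in /proc/meminfo is reported in kB.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
  fi
}
```

The result can then be passed to the tool in whatever form it expects, so the same WDL runs whether or not the Terra-specific variables are defined.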

Step 2: Adjust the workflow configuration form

[Screenshot: workflow configuration form with the memory retry option highlighted]

2.1. Turn on memory retry (highlighted in the screenshot above).

2.2. Specify the retry factor when you submit your workflow.

The retry factor is multiplicative and compounding (cost warning)
If you ask for a retry factor of two and the task retries 10 times, the 10th retry will request 1024x the original memory (2^10). See the example above.

If you started with 2GB of memory for your task, the 10th retry would run with 2TB!

It’s best to be conservative when specifying memory retry factors and maxRetries values.
Although Terra will allow you to select values up to 10 for the retry factor, it is strongly recommended to stay in the “typical” range between 1.1 and 2.0. If you go outside this range, Terra will warn you to consider whether you need such a high retry factor.

Hints for calculating retry factors and max retries
Think about what would happen if all of your tasks had to use all of their retries. Make sure you would be willing to use that much compute resource in your project, because there is no manual confirmation!
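
A quick way to run this thought experiment is to compute the worst case before you submit. The sketch below prints the memory requested on the final attempt and the sum of the requests across all attempts for one task, assuming every attempt fails with an out-of-memory error until the retries are used up. The worst_case_memory function is a hypothetical helper, and the summed memory is only a rough proxy for cost, which also depends on how long each attempt runs.

```shell
#!/usr/bin/env bash
# worst_case_memory is a hypothetical helper (not part of Terra or Cromwell).
# Arguments: initial memory in GB, retry factor, number of retries.
# Prints the final attempt's request and the sum over all attempts.
worst_case_memory() {
  local initial_gb="$1" factor="$2" max_retries="$3"
  awk -v m="$initial_gb" -v f="$factor" -v r="$max_retries" 'BEGIN {
    total = 0
    for (i = 0; i <= r; i++) {
      last = m * f ^ i    # request on the i-th retry (i = 0 is the first run)
      total += last
    }
    printf "final=%gGB total=%gGB\n", last, total
  }'
}

# The cost-warning example: 2GB initial memory, retry factor 2, 10 retries.
worst_case_memory 2 2 10   # prints final=2048GB total=4094GB
```

Multiply the result by the number of tasks (or scattered shards) that could all hit their retry limit to get a sense of the upper bound for the whole submission.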

To learn more about this option, see the Cromwell documentation at https://cromwell.readthedocs.io/en/develop/RuntimeAttributes/#maxretries.

Command syntax

For details and examples of the correct command syntax, see the Cromwell documentation for memory retries: https://cromwell.readthedocs.io/en/develop/cromwell_features/RetryWithMoreMemory/#retry-with-more-memory.


Comments

2 comments

  • Yossi Farjoun (edited)

    thanks for this guide! I was unable to find the various options that could be present in the variable MEM_UNIT. For example, if the units are GB, would the value be: 

    - "g"

    - "GB"

    -"Gb"

    -1000000000

    ?

    I see that in your example you use ${MEM_UNIT} as input to java's -mem argument, from which I deduce that it's `g`, (is that supposed to be -Xmx?) but it would be comforting to see the actual list somewhere.

    Thanks!

  • Allie Cliffe

    Yossi Farjoun - Thanks for the feedback! I updated the docs (after consulting with Geraldine) to hopefully address your questions. 
