Out of Memory Retry

Allie Hajian

Learn how to configure your workflow to immediately retry certain tasks with more memory when the only error was running out of memory.

Contents

- How to enable memory retry
- Make sure your task is "more-memory aware" (compatible)
- How memory retry works (technical details and how to modify)
- Additional reading

How to enable memory retry

1. Turn on memory retry and specify the retry factor when you submit your workflow from the Terra UI.

2. Add a maxRetries runtime attribute to the task(s) you would like to retry.

task max_retries_task {
   command <<<
      # My tool command which might potentially run out of memory:
      java -jar mytool.jar -mem "${MEM_SIZE} $MEM_UNIT" [...]
   >>>

   runtime {
      memory: "5GB"
      docker: "mytool:latest"
      maxRetries: 3
   }
}

To learn more about this option, see the Cromwell documentation here: https://cromwell.readthedocs.io/en/develop/RuntimeAttributes/#maxretries

The retry factor is multiplicative and compounding (cost warning)

  • If you ask for a retry factor of 2 and the task retries 10 times, the 10th retry will request 2^10 = 1024x the original memory.
  • If your task started with 2GB of memory, the 10th retry would run with 2TB!
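
The compounding described above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming every retry multiplies the previous memory request by the retry factor; the function name `memory_on_retry` is made up for illustration and is not part of Terra or Cromwell.

```python
def memory_on_retry(initial_gb, retry_factor, retry_number):
    """Memory (GB) requested on a given retry, where retry 0 is the first attempt.

    Assumes the retry factor compounds: each retry multiplies the
    previous request by the same factor.
    """
    return initial_gb * retry_factor ** retry_number


# Growth of a 2GB task with a retry factor of 2.
for retry in (0, 1, 5, 10):
    print(f"retry {retry}: {memory_on_retry(2, 2, retry)} GB")
```

With a factor inside the recommended range, such as 1.5, the same call shows much slower growth, which is why small factors are the safer default.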

It’s best to be conservative when specifying memory retry factors and maxRetries values. 

Although Terra will allow you to select values up to 10 for the retry factor, it is strongly recommended to stay in the “typical” range between 1.1 and 2.0. If you go outside this range, Terra will warn you to consider whether you need such a high retry factor.

How to calculate retry factors and max retries
Think about what would happen if every one of your tasks had to use all of its retries, and make sure you would be willing to use that much compute resource in your project. Retries are launched automatically, with no manual confirmation!
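
One way to do this check is to list what each attempt would request in the worst case. The sketch below assumes that every retry is an out-of-memory retry and that the factor compounds on each one; `worst_case_attempts` and the task count of 100 are hypothetical values chosen for illustration.

```python
def worst_case_attempts(initial_gb, retry_factor, max_retries):
    """Memory (GB) requested by each attempt if every retry is used.

    Returns max_retries + 1 values: the first attempt plus each retry.
    """
    return [initial_gb * retry_factor ** n for n in range(max_retries + 1)]


# The example task above: 5GB, maxRetries of 3, with a retry factor of 2.
attempts = worst_case_attempts(initial_gb=5, retry_factor=2, max_retries=3)
peak_gb = attempts[-1]   # largest single machine that could be requested
n_tasks = 100            # hypothetical number of tasks in the workflow

print("per-attempt requests (GB):", attempts)
print("peak request per task (GB):", peak_gb)
print("peak request across all tasks (GB):", peak_gb * n_tasks)
```

If the peak figures are larger than you would knowingly request, lower the retry factor or the maxRetries value before submitting.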


Make sure your task is “more-memory aware” (compatible)

To prevent spending money on retries that don’t increase the memory, you might also need to modify your command’s operation based on how much memory it currently has. 

Example
If you cap your JVM at a maximum of 2GB of memory, running it on a machine with 16GB of memory will not help your task run any better.

Some commands might be able to pick up and use the additional memory automatically. Others, particularly Java-based tools, may need to be told how much memory they are allowed to use before they start.

How to control the command’s operation
Use the bash environment variables MEM_UNIT and MEM_SIZE to control the command’s operation.
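
As an illustration, a small helper script inside the task could turn these variables into a JVM heap flag. This is a sketch under several assumptions that are not confirmed by this article: that MEM_SIZE holds a number, that MEM_UNIT is one of the spellings in the table below, and that leaving 20% of the memory for non-heap use is reasonable for your tool. The name `jvm_heap_flag` is hypothetical.

```python
import os

# Assumed unit spellings; check what your backend actually exports.
UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1, "GB": 1024, "TB": 1024 * 1024}


def jvm_heap_flag(env=None, headroom=0.8):
    """Build a JVM -Xmx flag from MEM_SIZE and MEM_UNIT.

    headroom is the fraction of the task's memory given to the Java heap;
    the remainder is left for the JVM itself and other processes.
    """
    env = os.environ if env is None else env
    size = float(env["MEM_SIZE"])
    unit = env["MEM_UNIT"].strip().upper()
    heap_mb = int(size * UNIT_TO_MB[unit] * headroom)
    return f"-Xmx{heap_mb}m"


# Example with explicit values instead of the real environment.
print(jvm_heap_flag({"MEM_SIZE": "5", "MEM_UNIT": "GB"}))  # -Xmx4096m
```

Because the flag is derived from the variables on every attempt, a retry with more memory automatically gives the JVM a larger heap.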

Caveat when using Terra-specific environment variables
Be aware that using these non-standard environment variables may make your task incompatible with other workflow engines or backends on other platforms. If possible, you may wish to deduce the free memory from within the task rather than relying on being told the value via environment variables.
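
One portable way to deduce the memory from within the task, assuming it runs on Linux, is to read /proc/meminfo. The sketch below is not taken from the Terra or Cromwell documentation, and the function names are hypothetical. Note that inside a container /proc/meminfo generally reports the host machine's memory rather than any container-level limit, so treat the result as an upper bound.

```python
def meminfo_field_kb(meminfo_text, field="MemAvailable"):
    """Return one field of /proc/meminfo-style text, in kB."""
    prefix = field + ":"
    for line in meminfo_text.splitlines():
        if line.startswith(prefix):
            # Lines look like "MemAvailable:    8388608 kB".
            return int(line.split()[1])
    raise KeyError(field)


def available_memory_gb(path="/proc/meminfo"):
    """Estimate the memory currently available on this machine, in GB."""
    with open(path) as handle:
        return meminfo_field_kb(handle.read()) / (1024 * 1024)


if __name__ == "__main__":
    print(f"available memory: {available_memory_gb():.1f} GB")
```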

How memory retry works (technical details and how to modify)

Cromwell is preconfigured with a set of “out of memory” indicator strings. If memory retry is enabled for a task, and the task fails, and one of these strings appears in your task’s stderr file, then Cromwell will retry the task with more memory.

The current strings we look for in stderr are:

  • OutOfMemory
  • Killed
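
The decision described above can be sketched as follows. This is an illustration only, not Cromwell's actual implementation: it assumes a plain, case-sensitive substring match against the whole stderr text, and the function name is hypothetical.

```python
# The indicator strings listed above.
OOM_INDICATORS = ("OutOfMemory", "Killed")


def should_retry_with_more_memory(memory_retry_enabled, task_failed, stderr_text):
    """Return True only when all three conditions from the text hold."""
    found_indicator = any(marker in stderr_text for marker in OOM_INDICATORS)
    return memory_retry_enabled and task_failed and found_indicator


stderr_text = "Exception in thread main java.lang.OutOfMemoryError: Java heap space"
print(should_retry_with_more_memory(True, True, stderr_text))
```

One consequence of matching on strings is that a task printing "Killed" to stderr for an unrelated reason would also be treated as an out-of-memory failure.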

If you would like to add more strings to this set, please contact support with your suggestion and reasoning. For example, if you wanted to add a custom stderr printout “I RAN OUT OF MEMORY” to help catch failures from a specific task, we could add that to known patterns.

Additional reading

The Cromwell documentation for memory retries is here: https://cromwell.readthedocs.io/en/develop/cromwell_features/RetryWithMoreMemory/#retry-with-more-memory.
