Memory retry mechanism not retrying
We ran this example task:
task TestPythonMemRetry {
  command <<<
    echo "MEM_SIZE=$MEM_SIZE" >&2
    echo "MEM_UNIT=$MEM_UNIT" >&2
    python3 -c 'print(len([0] * (2**34)))'
  >>>
  runtime {
    docker: "google/cloud-sdk:slim"
    memory: "1 GB"
    maxRetries: 1
  }
}
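(For completeness: the task needs a small workflow wrapper to run in Terra. A minimal sketch, assuming a WDL 1.0 file and an illustrative workflow name, not the one in the shared workspace:)

version 1.0

workflow MemRetrySmokeTest {
  call TestPythonMemRetry
}

# The TestPythonMemRetry task definition shown above goes in the same file.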
When "Retry with more memory" is selected in Terra, it doesn't retry.
Here's an example workflow shared with Terra support via:
- workspace-id: d6d96bf4-7662-4cb2-85a6-fcbf92d692b5
- submission-id: 8e9a98ad-1d9d-496e-90b9-061b0b33fc0f
Looking at the log files in the Job Manager's linked "execution directory", it appears that the stderr no longer contains the "13 Killed" messages. However, the Job Manager "backend log" still contains the message indicating the job should be retried with more memory.
Comments
Hi Khalid,
Thank you for writing in about this issue. I appreciate the details you've included so far. Can you share the workspace where you are seeing this issue with Terra Support by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.
Please also provide us with the workspace name once it's shared.
We’ll be happy to take a closer look as soon as we can!
Kind regards,
Anthony
The workspace for ID d6d96bf4-7662-4cb2-85a6-fcbf92d692b5 above is https://app.terra.bio/#workspaces/bican_um1/pipeline_comparison_testing
Hi Khalid,
I apologize that we did not respond to this ticket sooner. Thank you for your patience and understanding. On an initial look at the workflow you mentioned, it does look like two of the tasks were retried with more memory, so the feature appears to be working as expected.
However, you're correct that the other two tasks did not retry. To retry with more memory, Cromwell looks for the keyword "killed" in the stderr file, which you pointed out was missing. I'm going to bring this to the engineers to see if they can look into why that might have happened.
Kind regards,
Pamela
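(For contrast, here is a minimal sketch of the "graceful" case described above: a task whose command writes an out-of-memory keyword to its own stderr before exiting, so the stderr scan has something to match. The exact keyword list Cromwell scans for is configurable per deployment, so the string below and the task name are assumptions.)

task GracefulOomMessageDemo {
  command <<<
    # Simulate a program that detects its own memory problem: write a
    # message containing a likely retry keyword to stderr, then exit
    # non-zero, so the keyword lands in the task stderr that gets scanned.
    echo "OutOfMemory: simulated allocation failure" >&2
    exit 1
  >>>
  runtime {
    docker: "google/cloud-sdk:slim"
    memory: "1 GB"
    maxRetries: 1
  }
}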
I'm still curious what the engineers are seeing when they try the Python code above.
Since last month, I've updated the more exhaustive WDL that you looked at with another example. It still shows that if the command inside the WDL task exits gracefully, first writing to its stderr that it is running out of memory, then Cromwell is able to read the message from the program's stderr and retry.
Since Java/Scala/Groovy/Kotlin/etc. programs execute inside yet another layer of virtual machine, often with -Xmx arguments, they can detect memory issues and exit before the GCE VM kills them. But if the Linux OOM Killer terminates the job, Cromwell isn't reading the message from the "backend logs" and doesn't automatically retry the jobs.
As a workaround, the user can manually open the Job Manager UI, click on "backend log", see from the "Killed" message that the job ran out of memory, then navigate back to Terra and re-run the whole workflow.
Python programs don't gracefully exit with out-of-memory errors. They are killed by the outer GCE virtual machine, and the OOM Killer then logs that it has killed the now-terminated Python program. Here's an example run of the updated workflow, which adds a Java job alongside the Python jobs that show "Killed" in the "backend logs": https://job-manager.dsde-prod.broadinstitute.org/jobs/7a002993-dba2-4eb7-8162-152f5c9f3b6d
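(A sketch of the failure mode just described, plus one possible in-task mitigation: when the OOM Killer SIGKILLs the Python process, the shell, if it survives, sees exit code 137 (128 + 9) and the task stderr contains no OOM message. Echoing a keyword to stderr on exit code 137 is one way to surface it; whether that actually triggers a retry depends on the deployment's configured keywords, so treat the string and task name as assumptions.)

task PythonMemRetryWithExitCodeCheck {
  command <<<
    set +e  # keep the script running even if the python process is killed
    python3 -c 'print(len([0] * (2**34)))'
    status=$?
    if [ "$status" -eq 137 ]; then
      # 137 = 128 + SIGKILL(9): the process was killed, most likely by the
      # OOM Killer. Write a keyword to stderr so a stderr-based scan has
      # something to match (assumes "Killed" is a configured retry keyword,
      # and that the shell itself survived the kill).
      echo "Killed: python exited with code 137 (possible out-of-memory)" >&2
    fi
    exit "$status"
  >>>
  runtime {
    docker: "google/cloud-sdk:slim"
    memory: "1 GB"
    maxRetries: 1
  }
}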
Hi Khalid,
Thanks for this great reply! I spoke with our engineers and, as it turns out, you're correct that Cromwell wasn't designed to read the way Python exits in a way that would activate the Retry With More Memory feature. I've already submitted a Feature Request to have our team develop a feature that allows Cromwell to understand when Python runs into these issues so that Retry With More Memory works with this kind of code.
We'll contact you if this feature is developed.
Please let me know if you have any questions.
Best,
Josh
Hi Josh,
Thanks for checking in with the team. This issue is not specific to Python. I'm sorry if I confused folks by providing examples using two different languages. I originally used Python because the syntax is less verbose.
I've updated the example Terra runs with a more verbose Java example called `TestJavaKilledRetry`. It also doesn't retry because the OOM isn't caught by the Java Virtual Machine. The JVM thinks the memory situation is fine, but the outer Google Virtual Machine terminates the program for going over the memory limit in the runtime attributes, leading to a "Killed" in the "backend logs" and an exit code of 137.
See this example run here: https://job-manager.dsde-prod.broadinstitute.org/jobs/491641a1-5bed-4b9f-95b8-b1f14c6b54cd. Even though an automatic Retry With More Memory has been requested and both jobs run out of memory according to the "backend logs", `TestJavaKilledRetry` does not get auto-retried while `TestJavaMemRetry` does.
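(The two Java tasks themselves aren't reproduced in this thread; a self-contained sketch of the pattern being described, with the same allocation loop and only the -Xmx setting changed, might look like the task below. The task name, docker image, and heap sizes are illustrative assumptions: with -Xmx below the runtime memory the JVM throws java.lang.OutOfMemoryError to stderr itself, while with -Xmx above it the host kills the process first and only exit code 137 remains.)

task JavaHeapHog {
  input {
    # "256m" stays under the 1 GB runtime memory, so the JVM itself throws
    # OutOfMemoryError to stderr (the TestJavaMemRetry pattern).
    # "4g" exceeds it, so the host kills the JVM first and the task ends
    # with exit code 137 (the TestJavaKilledRetry pattern).
    String xmx
  }
  command <<<
    cat > Oom.java << 'EOF'
public class Oom {
  public static void main(String[] args) {
    // Keep allocating ~8 MB arrays until something gives.
    java.util.ArrayList<long[]> hog = new java.util.ArrayList<>();
    while (true) { hog.add(new long[1 << 20]); }
  }
}
EOF
    javac Oom.java
    java -Xmx~{xmx} Oom
  >>>
  runtime {
    docker: "eclipse-temurin:17"
    memory: "1 GB"
    maxRetries: 1
  }
}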
Instead of describing the issue as Python vs. Java, perhaps it could be described as "if your command block goes over the memory limits from the runtime attributes, your job will not be automatically retried".
https://github.com/broadinstitute/cromwell/pull/7430 (a PR open in a chain of candidate PRs) is one option for a patch; it has been auto-retrying Google and GridEngine jobs in our own Cromwell instance, which we use in addition to Terra.
Thanks,
-k