Memory retry mechanism not retrying

Post author
Khalid Shakir

We ran this example task:

task TestPythonMemRetry {
    command <<<
        echo "MEM_SIZE=$MEM_SIZE" >&2
        echo "MEM_UNIT=$MEM_UNIT" >&2
        python3 -c 'print(len([0] * (2**34)))'
    >>>
    runtime {
        docker: "google/cloud-sdk:slim"
        memory: "1 GB"
        maxRetries: 1
    }
}

When "Retry with more memory" is selected in Terra, the task isn't retried.

Here's an example workflow, shared with Terra support:

  • workspace-id: d6d96bf4-7662-4cb2-85a6-fcbf92d692b5
  • submission-id: 8e9a98ad-1d9d-496e-90b9-061b0b33fc0f

Looking at the log files in the "execution directory" linked from the Job Manager, it appears that stderr no longer contains the "13 Killed" messages. However, the Job Manager "backend log" still contains the message indicating that the job should be retried with more memory.

Comments

  • Comment author
    Anthony DiCi

    Hi Khalid,

    Thank you for writing in about this issue. I appreciate the details you've included so far. Can you share the workspace where you are seeing this issue with Terra Support by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.

    1. Toggle the "Share with support" button to "Yes"
    2. Click Save

    Please provide us with

    1. A link to your workspace

    We’ll be happy to take a closer look as soon as we can!

    Kind regards,

    Anthony

  • Comment author
    Khalid Shakir

    The workspace for ID d6d96bf4-7662-4cb2-85a6-fcbf92d692b5 above is https://app.terra.bio/#workspaces/bican_um1/pipeline_comparison_testing

  • Comment author
    Pamela Bretscher

    Hi Khalid,

    I apologize that we did not respond to this ticket sooner. Thank you for your patience and understanding. On an initial look at the workflow you mentioned, it does look like two of the tasks were retried with more memory, so the feature appears to be working as expected.

    However, you're correct that the other two tasks did not retry. To retry with more memory, Cromwell looks for the keyword "killed" in the stderr file, and as you pointed out, that keyword was missing. I'm going to bring this to engineers to see if they can look into why that might have happened.

    Kind regards,
    Pamela
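
For readers following the thread, the stderr check described above can be sketched roughly as follows. This is only an illustration of the idea, not Cromwell's actual implementation: the function name, the key list, and the case-insensitive match are all assumptions made for the example.

```python
# Hypothetical key list: the thread only mentions the keyword "killed".
RETRY_KEYS = ("killed",)


def should_retry_with_more_memory(stderr_text, keys=RETRY_KEYS):
    """Return True if any retry key appears in the task's stderr text.

    The match is case-insensitive here (an assumption), so that both
    "killed" and "Killed" are recognized.
    """
    lowered = stderr_text.lower()
    return any(key.lower() in lowered for key in keys)


if __name__ == "__main__":
    # stderr that still carries the "13 Killed" message from the post
    print(should_retry_with_more_memory("13 Killed"))
    # stderr with only the task's own echo output (value is made up)
    print(should_retry_with_more_memory("MEM_UNIT=GB"))
```

Under this sketch, a task whose stderr lacks the keyword is never flagged for a retry, even if the backend log reports that it ran out of memory. That matches the behavior described in the original post.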
