PAPI Error Code 13

Post author
Daniel Boiarsky

Hi,

Looking for some direction as to how to address the error below. Thanks for the help!

The job was stopped before the command finished. PAPI error code 13. Execution failed: generic::internal: action 14: waiting for container: container is still running, possibly due to low system resources

Comments

39 comments

  • Comment author
    Jason Cerrato

    Hi Daniel Boiarsky,

    We would be happy to take a closer look at this. If your workspace isn't in an authorization domain, can you share it with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace (see the icon with the three dots at the top-right)?

    1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press enter
    2. Click Save

    Let us know the workspace name, as well as the relevant submission and workflow IDs.

    Many thanks,

    Jason

    0
  • Comment author
    Daniel Boiarsky

    Hi,

    Thanks for getting back to me.

    Workflow ID: 27615c89-b1ec-41e7-81e6-01a0c03530a4

    A second example:
    Workflow ID: 4b0ce5a8-9b12-4b43-a5ca-75a8fdcc5e4a
    Submission ID: 32a16812-712e-4870-a47d-fdf79a8661eb

    Thanks for your help!
    0
  • Comment author
    Jason Cerrato

    Hi Daniel Boiarsky,

    Would you possibly be able to share the workflow with jcerrato@broadinstitute.org so I can take a closer look at it?

    You can share it by clicking the link to "Source," and adding me to the list of users with access.

    Kind regards,

    Jason

    0
  • Comment author
    Daniel Boiarsky

    just shared with you. thanks!

    0
  • Comment author
    Jason Cerrato

    Hi Daniel Boiarsky,

    Would you be willing to add a small change to your WDL to tidy up your monitoring script, to make sure it isn't the cause of the error? Something like:

    ${monitoringScript} > monitoring.log &
    MONITORING_PID=$! # <= record the PID of the monitoring script
    [...] # <= the task's main command runs here
    kill $MONITORING_PID # <= stop the monitoring script once the command finishes
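
    In context, that pattern would sit inside the task's command block along these lines (a minimal sketch; the task name, inputs, and docker image below are illustrative, not taken from your workflow):

    task run_with_monitoring {
      File monitoringScript
      File input_bam

      command {
        # start the monitoring script in the background and record its PID
        ${monitoringScript} > monitoring.log &
        MONITORING_PID=$!

        # ... the task's main command on ${input_bam} goes here ...

        # stop the monitoring script so no background process keeps the container alive
        kill $MONITORING_PID
      }

      runtime {
        docker: "ubuntu:18.04"   # illustrative image
      }

      output {
        File monitoring_log = "monitoring.log"
      }
    }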

    Kind regards,

    Jason

    0
  • Comment author
    Daniel Boiarsky

    That worked! Thank you very much for the help!

    0
  • Comment author
    Jason Cerrato

    Glad to hear! If we can be of any further assistance, please let us know!

    0
  • Comment author
    Daniel Boiarsky

    Actually, that didn't solve the problem. See below:

    https://job-manager.dsde-prod.broadinstitute.org/jobs/9d3ddde9-8be0-4660-9474-5cd2dd02e312

    0
  • Comment author
    Jason Cerrato

    Hi Daniel,

    Thanks for letting us know—we'll take another look.

    Jason

    0
  • Comment author
    Jason Cerrato

    Hi Daniel Boiarsky,

    Would you be willing to share the monitoring script with us so we can do testing on our side to see if this is playing a role in the failures? You should be able to drop files onto the text box to attach them, or if you don't want them publicly shared you can email it to jcerrato@broadinstitute.org.

    If you are able, would you also be able to do a run of the workflow without the monitoring script at all on a sample that previously failed to see if you get a success?

    Kind regards,

    Jason

    0
  • Comment author
    Daniel Boiarsky

    I ran the workflow without the monitoring script and still received the error: https://job-manager.dsde-prod.broadinstitute.org/jobs/96773e8f-c668-495e-8192-8e4be5304458

    0
  • Comment author
    Jason Cerrato

    Hi Daniel,

    Thank you for trying that—that helps. We've started investigating this issue in coordination with Google, and they've suggested that using more memory might resolve it. They are currently investigating whether anything else on their end could be causing the issue. If progress on this is pressing, it may be worth trying with more memory to see if you are able to get consistent successes. Otherwise, I'll let you know what updates we hear from them.
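
    For reference, the memory request is the memory attribute in the task's runtime block; a minimal sketch of raising it (the docker image and the exact values here are illustrative, not from your workflow):

    runtime {
      docker: "ubuntu:18.04"   # illustrative image
      memory: "16 GB"          # e.g. raised from "4 GB"
    }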

    Kind regards,

    Jason

    0
  • Comment author
    Daniel Boiarsky

    Hi Jason,

    Thanks for looking into this issue in coordination with Google. Before reaching out to you and your team, I had tried to resolve the issue by significantly increasing the memory, but that didn't work. Please do let me know if Google is able to resolve the issue.

    Thanks!

    0
  • Comment author
    Jake Conway

    Hey Jason,

    I work in the same lab as Daniel and have been running into the same issue independently. I also use a different billing project than Daniel, if that helps. I've been getting the error "The job was stopped before the command finished. PAPI error code 13. Execution failed: generic::internal: action 14: waiting for container: container is still running, possibly due to low system resources" for multiple methods across multiple workspaces, so I don't think it's a workflow-specific issue. These workflows have run fine in the past.

    Any guidance to get around the issue would be appreciated!

    0
  • Comment author
    Jason Cerrato

    Hi Jake,

    Thanks for those details. I'll pass these along to the Google support representative as well.

    Daniel Boiarsky do you know by how much you increased the memory, so I can also forward this information? What amount used to work and what amount more did you try?

    Kind regards,

    Jason

    0
  • Comment author
    Jake Conway

    Hey Jason,

    I tried increasing memory too. For a task that I've run thousands of times with 4 GB of memory, I tried up to 30 GB. I also increased disk space to well beyond what is required and still got the same error. I tried running these workflows under different workspaces and billing projects as well. No luck.

    One weird thing is that right after I posted this, about 1/3 of my pairs ran successfully, but the rest have consistently returned this error multiple times. Not sure what would have caused that small window of success. I noticed that the VM localizes everything normally until the last file. The last file (a BAM index file, ~2 MB, in the case of one particular workflow) takes an hour or more to localize, and then once the Docker container starts, the task fails and produces this error.

    - Jake

    0
  • Comment author
    Jason Cerrato

    Hi Jake,

    Those details are both interesting and, I'm sure, helpful for the Google engineers investigating. Can you provide the submission and workflow IDs for that job?

    Kind regards,

    Jason

    0
  • Comment author
    Kyle Vernest

    Hi Jake,

    It's Kyle, the Product Manager from the Cromwell team. Can you email me (kvernest@broadinstitute.org) the workflow ID of the run with limited memory that failed, along with the workflow ID of the second run with 30 GB of memory, and I'll create a ticket with Google to investigate further.

    Thanks,

    Kyle

    0
  • Comment author
    Jake Conway

    Hey Jason and Kyle,

    Here is the submission ID for the lower memory (4GB) run: 7fa40bdf-eb43-47c8-9ed2-349f3c29c3ea

    One of the workflow IDs with this error is: de0c3d4b-b93d-4bb0-b861-482d5d4ba67b

    Submission ID for intermediate memory (12 GB): 8f124bc8-5773-403e-9261-447f676bab05

    example of workflow with this error at 12 GB of memory: 9d2f6a24-9e83-4f1b-847a-ac1c170f2f69

    Submission ID for 30 GB of memory: d7917a11-832a-4226-aa70-7ac6af21a8e3

    example of workflow with this error at 30 GB of memory: d174bbde-b53b-4d08-8d9e-418aef4833a0

    Thanks for looking into this!

    0
  • Comment author
    Jason Cerrato

    Hi all,

    Let us know if resolving the maxed quota and/or using non-preemptibles resolves the issue for you.
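
    For reference, switching a task to non-preemptible VMs is typically done through the preemptible runtime attribute; a minimal sketch (the rest of your runtime block would stay as it is):

    runtime {
      preemptible: 0   # 0 = don't attempt preemptible VMs; run on a standard VM
    }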

    Many thanks,

    Jason

    0
  • Comment author
    Jason Cerrato

    Hi Jake Conway,

    You previously mentioned running a workflow with 30 GB that failed when it used to succeed with 4 GB. That job was this one, correct?

    Submission ID for 30 GB of memory: d7917a11-832a-4226-aa70-7ac6af21a8e3

    example of workflow with this error at 30 GB of memory: d174bbde-b53b-4d08-8d9e-418aef4833a0

    Looking at the metadata for this job, it looks like it was run in vanallen-firecloud-dfci, which we communicated about having hit a quota. Would you be able to try this test in vanallen-melanoma-wgs to see if you still get a failure with 4 GB or 30 GB?

    Kind regards,

    Jason

    0
  • Comment author
    Jake Conway

    Hey Jason, 

    Here is a submission ID under vanallen-melanoma-wgs at 30 GB: 5d2b82fa-fb86-4998-bbdf-56796f45b4d2

    An example workflow with error: 5f001429-008d-4734-bab2-4ad4568ff28a

    Submission ID under vanallen-melanoma-wgs at 4GB: ba9a40a2-d2b5-4a3c-8766-15205d13fcc3

    An example workflow with error: 1a73b01f-f271-416e-9a3f-c518b02a9013

    Best,

    Jake

    0
  • Comment author
    Jake Conway

    Just to make sure the error is still ongoing, here is a submission and workflow ID from just a few minutes ago.

    Submission ID: 9a484149-393b-4204-b9a7-51ff73672b06

    Workflow ID: 08cce805-c41c-40de-aa10-31f85575d14e

    Best,

    Jake

    0
  • Comment author
    Jason Cerrato

    Hi Jake,

    That's perfect, thank you. Would you also happen to have a couple example workflow IDs where this job succeeded in the past using 4GB? I think it will help us have the complete picture of this job and better allow us to communicate the clear difference in outcome in our discussions with Google.

    Many thanks,

    Jason

    0
  • Comment author
    Jake Conway

    Hey Jason,

    I couldn't find any examples of runs with 4 GB. I might have deleted the workspaces where I used 4 GB. I do have examples using intermediate memory (12 GB and 20 GB).

    Submission ID: 96eee71d-66b3-4301-a3b4-56ba68581507 and example workflow ID: d4b1f9e7-f489-486a-8fc1-341d891beb4e

    Submission ID: f5419fdb-b697-47e4-a714-4affed30468e and example workflow ID: 7d45bed1-aaa7-4b3e-9b54-c31ef5718454

    Best,

    Jake

    0
  • Comment author
    Jason Cerrato

    Hi Jake,

    Thanks for those. I'll pass these details on to Google for further investigation.

    Kind regards,

    Jason

    0
  • Comment author
    Jason Cerrato

    Hi Jake Conway and Daniel Boiarsky,

    Google has informed us that they've rolled back the change that they believe to be the root of the issue and will continue investigating. Would either of you be able to run your jobs again and confirm whether they are successful (and not subject to quota issues)?

    Many thanks,

    Jason

    0
  • Comment author
    Jake Conway

    Hey Jason Cerrato,

    Running now. Will report back within the hour.

    Thanks,

    Jake

    0
  • Comment author
    Jake Conway

    Hey Jason,

    Now I'm getting: The job was stopped before the command finished. PAPI error code 2. Execution failed: generic::unknown: action 23: preparing standard output: creating logs root: mkdir /var/lib/pipelines/google/logs/action/23: no space left on device.

    I even tried bumping the disk space up to 500 GB. There's no way it should need this much, since it's just two whole-exome BAMs.
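
    For reference, that disk bump corresponds to the disks attribute in the task's runtime block, roughly along these lines (the disk type here is illustrative):

    runtime {
      disks: "local-disk 500 HDD"   # requests a 500 GB data disk
    }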

    Best,

    Jake

    0
  • Comment author
    Jason Cerrato

    Hi Jake,

    We'll be happy to take a look. Can you provide your submission and workflow IDs for further investigation?

    Many thanks,

    Jason

    0
