PAPI Error Code 13
Hi,
Looking for some direction as to how to address the error below. Thanks for the help!
The job was stopped before the command finished. PAPI error code 13. Execution failed: generic::internal: action 14: waiting for container: container is still running, possibly due to low system resources
Comments
Hi Daniel Boiarsky,
We would be happy to take a closer look at this. If your workspace isn't in an authorization domain, can you share it with GROUP_FireCloud-Support@firecloud.org? Click the Share option in your workspace (the icon with the three dots at the top-right), then:
1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press Enter
2. Click Save
Let us know the workspace name, as well as the relevant submission and workflow IDs.
Many thanks,
Jason
Hi Daniel Boiarsky,
Would you possibly be able to share the workflow with jcerrato@broadinstitute.org so I can take a closer look at it?
You can share it by clicking the link to "Source," and adding me to the list of users with access.
Kind regards,
Jason
Just shared it with you. Thanks!
Hi Daniel Boiarsky,
Would you be willing to add a small cleanup step to your WDL so we can rule out the monitoring script as the cause of the error? Something like the sketch below:
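This is a rough sketch of what I mean, assuming the monitoring script is launched in the background from inside the task's command block; the task name, the monitor_script input, and the runtime values are placeholders for your own:

```wdl
version 1.0

task run_with_monitoring {
  input {
    File monitor_script   # placeholder for your actual monitoring script
  }
  command <<<
    # Start the monitoring script in the background and record its PID
    bash ~{monitor_script} > monitoring.log 2>&1 &
    MONITOR_PID=$!

    # ... your existing analysis commands go here ...

    # Stop the monitoring process before the command block exits so nothing
    # keeps running inside the container after the task is done
    kill "$MONITOR_PID" 2>/dev/null || true
  >>>
  runtime {
    docker: "ubuntu:18.04"
    memory: "4 GB"
    disks: "local-disk 50 HDD"
  }
}
```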
Kind regards,
Jason
That worked! Thank you very much for the help!
Glad to hear! If we can be of any further assistance, please let us know!
Actually, that didn't solve the problem. See below:
https://job-manager.dsde-prod.broadinstitute.org/jobs/9d3ddde9-8be0-4660-9474-5cd2dd02e312
Hi Daniel,
Thanks for letting us know—we'll take another look.
Jason
Hi Daniel Boiarsky,
Would you be willing to share the monitoring script with us so we can test on our side and see whether it plays a role in the failures? You should be able to drop the file onto the text box to attach it, or, if you'd rather not share it publicly, you can email it to jcerrato@broadinstitute.org.
Would you also be able to run the workflow without the monitoring script at all, on a sample that previously failed, to see if it succeeds?
Kind regards,
Jason
I ran the workflow without the monitoring script and still received the error: https://job-manager.dsde-prod.broadinstitute.org/jobs/96773e8f-c668-495e-8192-8e4be5304458
Hi Daniel,
Thank you for trying that; it helps. We've started investigating this issue in coordination with Google, and they've suggested that using more memory might resolve it. They are also checking whether anything else on their end could be causing the failures. If progress on this is pressing, it may be worth trying with more memory to see if you can get consistent successes. Otherwise, I'll keep you posted on any updates we hear from them.
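For reference, this only requires changing the runtime block of the failing task, along these lines (the values below are placeholders, so adjust them to your task):

```wdl
runtime {
  docker: "ubuntu:18.04"       # whatever image your task already uses
  memory: "16 GB"              # try something well above the value that currently fails
  disks: "local-disk 100 HDD"
}
```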
Kind regards,
Jason
Hi Jason,
Thanks for looking into this issue in coordination with Google. Before reaching out to you and your team, I had tried to resolve it by significantly increasing the memory, but that didn't work. Please do let me know if Google is able to resolve the issue.
Thanks!
Hey Jason,
I work in the same lab as Daniel and have been running into the same issue independently. I also use a different billing project than Daniel, if that helps. I've been getting the error "The job was stopped before the command finished. PAPI error code 13. Execution failed: generic::internal: action 14: waiting for container: container is still running, possibly due to low system resources" for multiple methods across multiple workspaces, so I don't think it's a workflow-specific issue. These workflows have run fine in the past.
Any guidance to get around the issue would be appreciated!
Hi Jake,
Thanks for those details. I'll pass these along to the Google support representative as well.
Daniel Boiarsky, do you know by how much you increased the memory, so I can forward that information as well? What amount used to work, and how much more did you try?
Kind regards,
Jason
Hey Jason,
I tried increasing memory too. For a task that I've run thousands of times with 4GB of memory, I tried up to 30GB. I also increased disk space to way beyond what is required, and still get the same error. I tried running these workflows under different workspaces and billing projects as well. No luck.
One weird thing is that right after I posted this, about 1/3 of my pairs ran successfully, but the rest have consistently returned this error multiple times. I'm not sure what caused that small window of success. I've noticed that the VM localizes everything normally until the last file. That last file (a BAM index file, ~2 MB, in the case of one particular workflow) takes an hour or more to localize, and then once the Docker container is started the task fails and produces this error.
- Jake
Hi Jake,
Those details are both interesting and, I'm sure, helpful for the Google engineers investigating. Can you provide the submission and workflow IDs for that job?
Kind regards,
Jason
Hi Jake,
It's Kyle the Product Manager from the Cromwell team. Can you email me (kvernest@broadinstitute.org) with the workflow IDs of the run with limited memory that failed, and then the second workflow ID with the 30 GB of memory, and I'll create a ticket with Google to investigate further.
Thanks,
Kyle
Hey Jason and Kyle,
Here is the submission ID for the lower memory (4GB) run: 7fa40bdf-eb43-47c8-9ed2-349f3c29c3ea
One of the workflow IDs with this error is: de0c3d4b-b93d-4bb0-b861-482d5d4ba67b
Submission ID for intermediate memory (12 GB): 8f124bc8-5773-403e-9261-447f676bab05
example of workflow with this error at 12 GB of memory: 9d2f6a24-9e83-4f1b-847a-ac1c170f2f69
Submission ID for 30 GB of memory: d7917a11-832a-4226-aa70-7ac6af21a8e3
example of workflow with this error at 30 GB of memory: d174bbde-b53b-4d08-8d9e-418aef4833a0
Thanks for looking into this!
Hi all,
Let us know if resolving the maxed-out quota and/or using non-preemptible VMs resolves the issue for you.
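For the non-preemptible test, setting preemptible to 0 in the task's runtime block requests a standard (non-preemptible) VM; for example (placeholder values):

```wdl
runtime {
  docker: "ubuntu:18.04"
  memory: "4 GB"
  preemptible: 0   # 0 means run on a standard, non-preemptible VM
}
```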
Many thanks,
Jason
Hi Jake Conway,
You previously mentioned running a workflow with 30 GB that failed when it used to succeed with 4 GB. That job was this one, correct?
Submission ID for 30 GB of memory: d7917a11-832a-4226-aa70-7ac6af21a8e3
example of workflow with this error at 30 GB of memory: d174bbde-b53b-4d08-8d9e-418aef4833a0
Looking at the metadata for this job, it appears to have been run in vanallen-firecloud-dfci, which, as we discussed, had hit its quota. Would you be able to try this test in vanallen-melanoma-wgs to see if you still get a failure with 4 GB or 30 GB?
Kind regards,
Jason
Hey Jason,
Here is a submission ID under vanallen-melanoma-wgs at 30GB: 5d2b82fa-fb86-4998-bbdf-56796f45b4d2
An example workflow with error: 5f001429-008d-4734-bab2-4ad4568ff28a
Submission ID under vanallen-melanoma-wgs at 4GB: ba9a40a2-d2b5-4a3c-8766-15205d13fcc3
An example workflow with error: 1a73b01f-f271-416e-9a3f-c518b02a9013
Best,
Jake
Just to confirm the error is still occurring, here are a submission and workflow ID from just a few minutes ago.
Submission ID: 9a484149-393b-4204-b9a7-51ff73672b06
Workflow ID: 08cce805-c41c-40de-aa10-31f85575d14e
Best,
Jake
Hi Jake,
That's perfect, thank you. Would you also happen to have a couple of example workflow IDs where this job succeeded in the past with 4GB? That would give us the complete picture of this job and help us communicate the clear difference in outcome in our discussions with Google.
Many thanks,
Jason
Hey Jason,
I couldn't find any examples with 4GB; I might have deleted the workspaces where I used it. I do have examples using intermediate memory (12GB and 20GB).
Submission ID: 96eee71d-66b3-4301-a3b4-56ba68581507 and example workflow ID: d4b1f9e7-f489-486a-8fc1-341d891beb4e
Submission ID: f5419fdb-b697-47e4-a714-4affed30468e and example workflow ID: 7d45bed1-aaa7-4b3e-9b54-c31ef5718454
Best,
Jake
Hi Jake,
Thanks for those. I'll pass these details on to Google for further investigation.
Kind regards,
Jason
Hi Jake Conway and Daniel Boiarsky,
Google has informed us that they've rolled back the change that they believe to be the root of the issue and will continue investigating. Would either of you be able to run your jobs again and confirm whether they are successful (and not subject to quota issues)?
Many thanks,
Jason
Hey Jason Cerrato,
Running now. Will report back within the hour.
Thanks,
Jake
Hey Jason,
Now I'm getting: "The job was stopped before the command finished. PAPI error code 2. Execution failed: generic::unknown: action 23: preparing standard output: creating logs root: mkdir /var/lib/pipelines/google/logs/action/23: no space left on device."
I even tried bumping the disk space up to 500GB. There's no way it should need that much, since it's just two whole-exome BAMs.
Best,
Jake
Hi Jake,
We'll be happy to take a look. Can you provide your submission and workflow IDs for further investigation?
Many thanks,
Jason