PAPI Error Code 13

Post author
Daniel Boiarsky

Hi,

Looking for some direction as to how to address the error below. Thanks for the help!

The job was stopped before the command finished. PAPI error code 13. Execution failed: generic::internal: action 14: waiting for container: container is still running, possibly due to low system resources

Comments

39 comments

  • Comment author
    Jason Cerrato

    Hi Daniel Boiarsky,

    We would be happy to take a closer look at this. If your workspace isn't in an authorization domain, can you share it with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace (see the icon with the three dots at the top-right)?

    1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press enter
    2. Click Save

    Let us know the workspace name, as well as the relevant submission and workflow IDs.

    Many thanks,

    Jason

    0
  • Comment author
    Daniel Boiarsky

    Hi,

    Thanks for getting back to me.

    Workflow ID: 27615c89-b1ec-41e7-81e6-01a0c03530a4

    A second example:
    Workflow ID: 4b0ce5a8-9b12-4b43-a5ca-75a8fdcc5e4a
    Submission ID: 32a16812-712e-4870-a47d-fdf79a8661eb

    Thanks for your help!
    0
  • Comment author
    Jason Cerrato

    Hi Daniel Boiarsky,

    Would you possibly be able to share the workflow with jcerrato@broadinstitute.org so I can take a closer look at it?

    You can share it by clicking the link to "Source," and adding me to the list of users with access.

    Kind regards,

    Jason

    0
  • Comment author
    Daniel Boiarsky

    just shared with you. thanks!

    0
  • Comment author
    Jason Cerrato

    Hi Daniel Boiarsky,

    Would you be willing to add a small change to your WDL to tidy up your monitoring script, to make sure it isn't the cause of the error? Something like:

    ${monitoringScript} > monitoring.log &
    MONITORING_PID=$! # <= record the PID of the monitoring script
    [...] # <= the task's main command runs here
    kill $MONITORING_PID # <= stop the monitoring script once the command finishes
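
    In context, that pattern would sit inside the task's command block along these lines (a minimal sketch; the task name, inputs, and docker image below are illustrative, not taken from your workflow):

    task run_with_monitoring {
      File monitoringScript
      File input_bam

      command {
        # start the monitoring script in the background and record its PID
        ${monitoringScript} > monitoring.log &
        MONITORING_PID=$!

        # ... the task's main command on ${input_bam} goes here ...

        # stop the monitoring script so no background process keeps the container alive
        kill $MONITORING_PID
      }

      runtime {
        docker: "ubuntu:18.04"   # illustrative image
      }

      output {
        File monitoring_log = "monitoring.log"
      }
    }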

    Kind regards,

    Jason

    0
  • Comment author
    Daniel Boiarsky

    That worked! Thank you very much for the help!

    0
  • Comment author
    Jason Cerrato

    Glad to hear! If we can be of any further assistance, please let us know!

    0
  • Comment author
    Daniel Boiarsky

    Actually, that didn't solve the problem. See below:

    https://job-manager.dsde-prod.broadinstitute.org/jobs/9d3ddde9-8be0-4660-9474-5cd2dd02e312

    0
  • Comment author
    Jason Cerrato

    Hi Daniel,

    Thanks for letting us know—we'll take another look.

    Jason

    0
  • Comment author
    Jason Cerrato

    Hi Daniel Boiarsky,

    Would you be willing to share the monitoring script with us so we can do testing on our side to see if this is playing a role in the failures? You should be able to drop files onto the text box to attach them, or if you don't want them publicly shared you can email it to jcerrato@broadinstitute.org.

    If you are able, would you also be able to do a run of the workflow without the monitoring script at all on a sample that previously failed to see if you get a success?

    Kind regards,

    Jason

    0
  • Comment author
    Daniel Boiarsky

    I ran the workflow without the monitoring script and still received the error: https://job-manager.dsde-prod.broadinstitute.org/jobs/96773e8f-c668-495e-8192-8e4be5304458

    0
  • Comment author
    Jason Cerrato

    Hi Daniel,

    Thank you for trying that—that helps. We've started investigating this issue in coordination with Google, and they've suggested that using more memory might resolve it. They are currently investigating whether anything else on their end could be causing the issue. If progress on this is pressing, it may be worth trying with more memory to see if you are able to get consistent successes. Otherwise, I'll let you know what updates we hear from them.
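
    For reference, the memory request is the memory attribute in the task's runtime block; a minimal sketch of raising it (the docker image and the exact values here are illustrative, not from your workflow):

    runtime {
      docker: "ubuntu:18.04"   # illustrative image
      memory: "16 GB"          # e.g. raised from "4 GB"
    }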

    Kind regards,

    Jason

    0
  • Comment author
    Daniel Boiarsky

    Hi Jason,

    Thanks for looking into this issue in coordination with Google. Before reaching out to you and your team, I had tried to resolve the issue by significantly increasing the memory, but that didn't work. Please do let me know if Google is able to resolve the issue.

    Thanks!

    0
  • Comment author
    Jake Conway

    Hey Jason,

    I work in the same lab as Daniel and have been running into the same issue independently. I also use a different billing project than Daniel, if that helps. I've been getting the error "The job was stopped before the command finished. PAPI error code 13. Execution failed: generic::internal: action 14: waiting for container: container is still running, possibly due to low system resources" for multiple methods across multiple workspaces, so I don't think it's a workflow-specific issue. These workflows have run fine in the past.

    Any guidance to get around the issue would be appreciated!

    0
  • Comment author
    Jason Cerrato

    Hi Jake,

    Thanks for those details. I'll pass these along to the Google support representative as well.

    Daniel Boiarsky do you know by how much you increased the memory, so I can also forward this information? What amount used to work and what amount more did you try?

    Kind regards,

    Jason

    0
  • Comment author
    Jake Conway

    Hey Jason,

    I tried increasing memory too. For a task that I've run thousands of times with 4 GB of memory, I tried up to 30 GB. I also increased disk space to well beyond what is required and still got the same error. I tried running these workflows under different workspaces and billing projects as well. No luck.

    One weird thing is that right after I posted this, about 1/3 of my pairs ran successfully, but the rest have consistently returned this error multiple times. Not sure what would have caused that small window of success. I noticed that the VM localizes everything normally until the last file. The last file (a BAM index file, ~2 MB, in the case of one particular workflow) takes an hour or more to localize, and then once the Docker container starts, the task fails and produces this error.

    - Jake

    0
  • Comment author
    Jason Cerrato

    Hi Jake,

    Those details are both interesting and, I'm sure, helpful for the Google engineers investigating. Can you provide the submission and workflow IDs for that job?

    Kind regards,

    Jason

    0
  • Comment author
    Kyle Vernest

    Hi Jake,

    It's Kyle, the Product Manager from the Cromwell team. Can you email me (kvernest@broadinstitute.org) the workflow ID of the run with limited memory that failed, along with the workflow ID of the second run with 30 GB of memory, and I'll create a ticket with Google to investigate further.

    Thanks,

    Kyle

    0
  • Comment author
    Jake Conway

    Hey Jason and Kyle,

    Here is the submission ID for the lower memory (4GB) run: 7fa40bdf-eb43-47c8-9ed2-349f3c29c3ea

    One of the workflow IDs with this error is: de0c3d4b-b93d-4bb0-b861-482d5d4ba67b

    Submission ID for intermediate memory (12 GB): 8f124bc8-5773-403e-9261-447f676bab05

    example of workflow with this error at 12 GB of memory: 9d2f6a24-9e83-4f1b-847a-ac1c170f2f69

    Submission ID for 30 GB of memory: d7917a11-832a-4226-aa70-7ac6af21a8e3

    example of workflow with this error at 30 GB of memory: d174bbde-b53b-4d08-8d9e-418aef4833a0

    Thanks for looking into this!

    0
  • Comment author
    Jason Cerrato

    Hi all,

    Let us know if resolving the maxed quota and/or using non-preemptibles resolves the issue for you.
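
    For reference, switching a task to non-preemptible VMs is typically done through the preemptible runtime attribute; a minimal sketch (the rest of your runtime block would stay as it is):

    runtime {
      preemptible: 0   # 0 = don't attempt preemptible VMs; run on a standard VM
    }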

    Many thanks,

    Jason

    0
  • Comment author
    Jason Cerrato

    Hi Jake Conway,

    You previously mentioned running a workflow with 30 GB that failed when it used to succeed with 4 GB. That job was this one, correct?

    Submission ID for 30 GB of memory: d7917a11-832a-4226-aa70-7ac6af21a8e3

    example of workflow with this error at 30 GB of memory: d174bbde-b53b-4d08-8d9e-418aef4833a0

    Looking at the metadata for this job, it looks like it was run in vanallen-firecloud-dfci, which we communicated about having hit a quota. Would you be able to try this test in vanallen-melanoma-wgs to see if you still get a failure with 4 GB or 30 GB?

    Kind regards,

    Jason

    0
  • Comment author
    Jake Conway

    Hey Jason, 

    Here is a submission ID under vanallen-melanoma-wgs at 30 GB: 5d2b82fa-fb86-4998-bbdf-56796f45b4d2

    An example workflow with error: 5f001429-008d-4734-bab2-4ad4568ff28a

    Submission ID under vanallen-melanoma-wgs at 4GB: ba9a40a2-d2b5-4a3c-8766-15205d13fcc3

    An example workflow with error: 1a73b01f-f271-416e-9a3f-c518b02a9013

    Best,

    Jake

    0
  • Comment author
    Jake Conway

    Just to make sure the error is still ongoing, here is a submission and workflow ID from just a few minutes ago.

    Submission ID: 9a484149-393b-4204-b9a7-51ff73672b06

    Workflow ID: 08cce805-c41c-40de-aa10-31f85575d14e

    Best,

    Jake

    0
  • Comment author
    Jason Cerrato

    Hi Jake,

    That's perfect, thank you. Would you also happen to have a couple example workflow IDs where this job succeeded in the past using 4GB? I think it will help us have the complete picture of this job and better allow us to communicate the clear difference in outcome in our discussions with Google.

    Many thanks,

    Jason

    0
  • Comment author
    Jake Conway

    Hey Jason,

    I couldn't find any examples of runs with 4 GB. I might have deleted the workspaces where I used 4 GB. I do have examples using intermediate memory (12 GB and 20 GB).

    Submission ID: 96eee71d-66b3-4301-a3b4-56ba68581507 and example workflow ID: d4b1f9e7-f489-486a-8fc1-341d891beb4e

    Submission ID: f5419fdb-b697-47e4-a714-4affed30468e and example workflow ID: 7d45bed1-aaa7-4b3e-9b54-c31ef5718454

    Best,

    Jake

    0
  • Comment author
    Jason Cerrato

    Hi Jake,

    Thanks for those. I'll pass these details on to Google for further investigation.

    Kind regards,

    Jason

    0
  • Comment author
    Jason Cerrato

    Hi Jake Conway and Daniel Boiarsky,

    Google has informed us that they've rolled back the change that they believe to be the root of the issue and will continue investigating. Would either of you be able to run your jobs again and confirm whether they are successful (and not subject to quota issues)?

    Many thanks,

    Jason

    0
  • Comment author
    Jake Conway

    Hey Jason Cerrato,

    Running now. Will report back within the hour.

    Thanks,

    Jake

    0
  • Comment author
    Jake Conway

    Hey Jason,

    Now I'm getting: The job was stopped before the command finished. PAPI error code 2. Execution failed: generic::unknown: action 23: preparing standard output: creating logs root: mkdir /var/lib/pipelines/google/logs/action/23: no space left on device.

    I even tried bumping the disk space up to 500 GB. There's no way it should need this much, since it's just two whole-exome BAMs.
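
    For reference, that disk bump corresponds to the disks attribute in the task's runtime block, roughly along these lines (the disk type here is illustrative):

    runtime {
      disks: "local-disk 500 HDD"   # requests a 500 GB data disk
    }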

    Best,

    Jake

    0
  • Comment author
    Jason Cerrato

    Hi Jake,

    We'll be happy to take a look. Can you provide your submission and workflow IDs for further investigation?

    Many thanks,

    Jason

    0
