Failed run at container starting step
Hi,
I got the following type of error (not seen before) for a few shards in my pipeline in the rjxmicrobiome workspace, method Micropilot_v2_mOTUS_v2.
Any ideas?
message: Task workflowMotus.motus:10:1 failed. The job was stopped before the command finished. PAPI error code 2. Execution failed: action 13: starting container: running container: running ["docker" "run" "-d" "--env-file" "/tmp/environment419495526" "--network" "pipelines" "-v" "/mnt/disks/local-disk:/cromwell_root:rslave" "-v" "/var/lib/pipelines/google:/google:ro,rslave" "google/cloud-sdk:slim" "/bin/sh" "-c" "python -c 'import base64; print(base64.b64decode(\"IyEvYmluL2Jhc2gKCmZvciBpIGluICQoc2VxIDMpOyBkbwogICgKICAgIHJtIC1mICRIT01FLy5jb25maWcvZ2Nsb3VkL2djZSAmJiBnc3V0aWwgIGNwIGdzOi8vZmMtc2VjdXJlLTgwMmJmODgwLTE2YjEtNGExMC1hZDg5LWRhOThmNzk5MTliOC8wMDIwMzM1MS1kZmY0LTQzYWUtOTY2Ny00MGVkNzA1NGU4ZDMvd29ya2Zsb3dNb3R1cy82NDRhMjI0My1iOGYwLTQxYTMtOTE5YS0xNTYxODczMzc1YzkvY2FsbC1xY1F1YWxpdHlIdW1hbi9zaGFyZC0xMC9hdHRlbXB0LTIvQkktMTYtMDI1Ny5hZGFwdGVyVHJpbW1lZC4xX2tuZWFkZGF0YV9wYWlyZWRfMS5mYXN0cS5neiAvY3JvbXdlbGxfcm9vdC9mYy1zZWN1cmUtODAyYmY4ODAtMTZiMS00YTEwLWFkODktZGE5OGY3OTkxOWI4LzAwMjAzMzUxLWRmZjQtNDNhZS05NjY3LTQwZWQ3MDU0ZThkMy93b3JrZmxvd01vdHVzLzY0NGEyMjQzLWI4ZjAtNDFhMy05MTlhLTE1NjE4NzMzNzVjOS9jYWxsLXFjUXVhbGl0eUh1bWFuL3NoYXJkLTEwL2F0dGVtcHQtMi9CSS0xNi0wMjU3LmFkYXB0ZXJUcmltbWVkLjFfa25lYWRkYXRhX3BhaXJlZF8xLmZhc3RxLmd6ID4gZ3N1dGlsX291dHB1dC50eHQgMj4mMQojIFJlY29yZCB0aGUgZXhpdCBjb2RlIG9mIHRoZSBnc3V0aWwgY29tbWFuZCB3aXRob3V0IHByb2plY3QgZmxhZwpSQ19HU1VUSUw9JD8KaWYgWyAiJFJDX0dTVVRJTCIgIT0gIjAiIF07IHRoZW4KICBwcmludGYgJyVzICVzXG4nICIkKGRhdGUgLXUgJyslWS8lbS8lZCAlSDolTTolUycpIiBybVwgLWZcIFwkSE9NRS8uY29uZmlnL2djbG91ZC9nY2VcIFwmXCZcIGdzdXRpbFwgXCBjcFwgZ3M6Ly9mYy1zZWN1cmUtODAyYmY4ODAtMTZiMS00YTEwLWFkODktZGE5OGY3OTkxOWI4LzAwMjAzMzUxLWRmZjQtNDNhZS05NjY3LTQwZWQ3MDU0ZThkMy93b3JrZmxvd01vdHVzLzY0NGEyMjQzLWI4ZjAtNDFhMy05MTlhLTE1NjE4NzMzNzVjOS9jYWxsLXFjUXVhbGl0eUh1bWFuL3NoYXJkLTEwL2F0dGVtcHQtMi9CSS0xNi0wMjU3LmFkYXB0ZXJUcmltbWVkLjFfa25lYWRkYXRhX3BhaXJlZF8xLmZhc3RxLmd6XCAvY3JvbXdlbGxfcm9vdC9mYy1zZWN1cmUtODAyYmY4ODAtMTZiMS00YTEwLWFkODktZGE5OGY3OTkxOWI4LzAwMjAzMzUxLWRmZjQtNDNhZS05NjY3LTQwZWQ3MDU0ZThkMy
93b3JrZmxvd01vdHVzLzY0NGEyMjQzLWI4ZjAtNDFhMy05MTlhLTE1NjE4NzMzNzVjOS9jYWxsLXFjUXVhbGl0eUh1bWFuL3NoYXJkLTEwL2F0dGVtcHQtMi9CSS0xNi0wMjU3LmFkYXB0ZXJUcmltbWVkLjFfa25lYWRkYXRhX3BhaXJlZF8xLmZhc3RxLmd6XCBmYWlsZWQKICAjIFByaW50IHRoZSByZWFzb24gb2YgdGhlIGZhaWx1cmUKICBjYXQgZ3N1dGlsX291dHB1dC50eHQKCiAgIyBDaGVjayBpZiBpdCBtYXRjaGVzIHRoZSBCdWNrZXRJc1JlcXVlc3RlclBheXNFcnJvck1lc3NhZ2UKICBpZiBncmVwIC1xICJCdWNrZXQgaXMgcmVxdWVzdGVyIHBheXMgYnVja2V0IGJ1dCBubyB1c2VyIHByb2plY3QgcHJvdmlkZWQuIiBnc3V0aWxfb3V0cHV0LnR4dDsgdGhlbgogICAgcHJpbnRmICclcyAlc1xuJyAiJChkYXRlIC11ICcrJVkvJW0vJWQgJUg6JU06JVMnKSIgUmV0cnlpbmdcIHdpdGhcIHVzZXJcIHByb2plY3QKICAgIHJtIC1mICRIT01FLy5jb25maWcvZ2Nsb3VkL2djZSAmJiBnc3V0aWwgLXUgcmp4bWljcm9iaW9tZSBjcCBnczovL2ZjLXNlY3VyZS04MDJiZjg4MC0xNmIxLTRhMTAtYWQ4OS1kYTk4Zjc5OTE5YjgvMDAyMDMzNTEtZGZmNC00M2FlLTk2NjctNDBlZDcwNTRlOGQzL3dvcmtmbG93TW90dXMvNjQ0YTIyNDMtYjhmMC00MWEzLTkxOWEtMTU2MTg3MzM3NWM5L2NhbGwtcWNRdWFsaXR5SHVtYW4vc2hhcmQtMTAvYXR0ZW1wdC0yL0JJLTE2LTAyNTcuYWRhcHRlclRyaW1tZWQuMV9rbmVhZGRhdGFfcGFpcmVkXzEuZmFzdHEuZ3ogL2Nyb213ZWxsX3Jvb3QvZmMtc2VjdXJlLTgwMmJmODgwLTE2YjEtNGExMC1hZDg5LWRhOThmNzk5MTliOC8wMDIwMzM1MS1kZmY0LTQzYWUtOTY2Ny00MGVkNzA1NGU4ZDMvd29ya2Zsb3dNb3R1cy82NDRhMjI0My1iOGYwLTQxYTMtOTE5YS0xNTYxODczMzc1YzkvY2FsbC1xY1F1YWxpdHlIdW1hbi9zaGFyZC0xMC9hdHRlbXB0LTIvQkktMTYtMDI1Ny5hZGFwdGVyVHJpbW1lZC4xX2tuZWFkZGF0YV9wYWlyZWRfMS5mYXN0cS5negogIGVsc2UKICAgIGV4aXQgIiRSQ19HU1VUSUwiCiAgZmkKZWxzZQogIGV4aXQgMApmaQogICkKICBSQz0kPwogIGlmIFsgIiRSQyIgPSAiMCIgXTsgdGhlbgogICAgYnJlYWsKICBmaQogIGlmIFsgJGkgLWx0IDMgXTsgdGhlbgogICAgcHJpbnRmICclcyAlc1xuJyAiJChkYXRlIC11ICcrJVkvJW0vJWQgJUg6JU06JVMnKSIgV2FpdGluZ1wgNVwgc2Vjb25kc1wgYW5kXCByZXRyeWluZwogICAgc2xlZXAgNQogIGZpCmRvbmUKZXhpdCAiJFJDIg==\"));' > /tmp/186edd92-1537-4b7d-9210-1ee73c34502b.sh && chmod u+x /tmp/186edd92-1537-4b7d-9210-1ee73c34502b.sh && sh /tmp/186edd92-1537-4b7d-9210-1ee73c34502b.sh"]: exit status 125 (standard error: "docker: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused 
\"process_linux.go:258: applying cgroup configuration for process caused \\\"failed to write 7071 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/docker/f901fdd0393e9c46d91c97d5dab10713ff6ca3877c3f1c3417b88f1a88946e69/cgroup.procs: invalid argument\\\"\".\n")
Comments
Hi Damian,
We will look into it and get back to you as soon as possible!
Hi Sushma,
Thank you. Another piece of info for this run: two shards were running for way too long (10h vs. 1h), and upon inspection I saw they hadn't moved beyond the container initiation step. So 4 shards in total showed weird behavior (the two that failed above and two that got stuck). I aborted that run, and after re-running everything is fine. It would still be interesting to know what the glitch was.
Damian
Great, thank you for the additional information! Would you also be able to confirm the date of this submission?
Damian,
The team has reported that PAPIv2 sometimes doesn't correctly detect preemptions, so Cromwell, the execution engine, treats a preempted task as failed when in fact it was only preempted. The workaround for now is to add a `maxRetries` field to the `runtime` section of the task while we work with the Google team on a permanent fix for this issue.
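As a rough sketch of that workaround, the `runtime` section of the affected task might look something like the fragment below. The Docker image, the `preemptible` count, and the retry count of 2 are placeholder assumptions for illustration, not values taken from your method; only the `maxRetries` line is the actual change being suggested.

```wdl
runtime {
  # Placeholder image: keep whatever your task already uses here.
  docker: "your-registry/your-image:tag"

  # Assumed example value: number of attempts on preemptible VMs
  # (only relevant if your task already requests preemptibles).
  preemptible: 3

  # The workaround: let Cromwell retry the task when it is reported
  # as failed. The value 2 is an arbitrary example.
  maxRetries: 2
}
```

With a setting like this, a shard that Cromwell reports as failed should be resubmitted up to that many times instead of failing the workflow on the first error. Note that these retries would also apply to genuine failures, so a small value is probably the safer choice.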
Thanks.