Ongoing workflow failures with error “key not found”
Starting on October 15, workflows in Terra that use images from Docker Hub began experiencing sporadic failures due to ongoing Docker Hub instability.
Affected workflows fail with an error message referencing "key not found" or similar. Because the failures are sporadic, a workflow may run successfully one time and then fail with this error the next. The Terra team is unable to do anything on our end to improve the situation with Docker Hub, and we cannot estimate when normal operations will resume.
However, we can suggest a workaround: use images from other repositories in your workflows. User reports are unanimous that after removing Docker Hub images, workflows proceed as expected. Please note that your "Proxy Group" (listed under Profile) will need read access on the Google bucket where the GCR image is hosted. Here is an article that outlines the steps for pushing a Docker image to GCR. As a bonus, pulling images from GCR is also faster and cheaper!
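In the meantime, the basic steps look roughly like the following (a sketch, not the full article; PROJECT_ID and the python:2.7 image are placeholders for your own Google project and image):

    # Let gcloud configure Docker credentials for gcr.io (one-time setup)
    gcloud auth configure-docker

    # Pull the image from Docker Hub once, re-tag it under your GCR project, and push
    docker pull python:2.7
    docker tag python:2.7 gcr.io/PROJECT_ID/python:2.7
    docker push gcr.io/PROJECT_ID/python:2.7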
Note that if your workflow contains multiple tasks and/or subworkflows, there will be more than one place to update. Even very generic, commonly used images like docker: "python:2.7" come from Docker Hub and will need to be replaced.
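For example, a task that previously pulled python:2.7 from Docker Hub would have its runtime block updated along these lines (the gcr.io path below is a placeholder for wherever you host your copy of the image):

    runtime {
      # Before (pulled from Docker Hub): docker: "python:2.7"
      # After (pulled from your own GCR repository):
      docker: "gcr.io/PROJECT_ID/python:2.7"
    }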
Finally, we can confirm that in cases where a Task appears to continue running even though the Workflow is reported as failed, no charges will be incurred.
Updates directly from Docker Hub can be found here: status.docker.com
Comments
8 comments
Hello All,
The 10/15 Docker Hub outage had an impact on Cromwell that extended into 10/16 and 10/17. This impact has now been resolved.
Please confirm if you are able to successfully run Workflows.
Working fine for me! Thanks!
Hi James - thank you for the feedback, and apologies for the inconvenience.
Hi, still getting errors (see below)
done: true
error:
  code: 2
  details: []
  message: "Execution failed: pulling image: docker pull: running ["docker" "pull" "cellranger:3.0.2"]: exit status 1 (standard error: "Error response from daemon: pull access denied for cellranger, repository does not exist or may require 'docker login': denied: requested access to the resource is denied")"
metadata:
  @type: "type.googleapis.com/google.genomics.v2alpha1.Metadata"
  createTime: "2019-10-18T14:17:15.849863Z"
  endTime: "2019-10-18T14:19:04.808435277Z"
  events (6, most recent first):
    2019-10-18T14:19:04.808435277Z  Worker released (instance "google-pipelines-worker-239222145a5c2a5edd4aa7e6d1097164", zone "us-west1-b")
    2019-10-18T14:19:03.532399522Z  Execution failed: pulling image: docker pull: running ["docker" "pull" "cellranger:3.0.2"]: exit status 1 (standard error: "Error response from daemon: pull access denied for cellranger, repository does not exist or may require 'docker login': denied: requested access to the resource is denied") [code UNKNOWN]
    2019-10-18T14:18:25.017346022Z  Started pulling "cellranger:3.0.2"
    2019-10-18T14:18:23.226581956Z  Stopped pulling "gcr.io/google.com/cloudsdktool/cloud-sdk:264.0.0-slim"
    2019-10-18T14:18:02.126281492Z  Started pulling "gcr.io/google.com/cloudsdktool/cloud-sdk:264.0.0-slim"
    2019-10-18T14:17:16.761218289Z  Worker "google-pipelines-worker-239222145a5c2a5edd4aa7e6d1097164" assigned in "us-west1-b" (machineType "custom-1-2048")
  labels:
    cromwell-workflow-id: "cromwell-0f658af3-51d4-4ada-a5ee-b0a51bd8c02a"
    terra-submission-id: "terra-16957d9a-9f58-4c2d-aed1-175cd42bc496"
    wdl-task-name: "generate-bcl-csv"
Try fixing your docker parameter in the WDL:
docker: "cumulusprod/cellranger:3.0.2"
Hi Adam - thanks for your post.
While it definitely looks like something is going wrong with your workflow, the "key not found" error is not present, so I would suggest starting a new thread for it.
I think I may be having a problem related to this issue. On 10/16 and 10/17, I started three jobs (each running a workflow on each of 5 inputs) and got a "key not found" error and Done/Failed status for each. I only realized later that I had been incurring high compute costs since then (with no other analyses running), and tracked it to 15 Compute Engine VMs that appear to have been running since I started those jobs. I can see these in Google's console, but evidently I don't have permission to delete them (though I own the workspace that created them and the job history shows each process as "Failed").
Viewing the Job Manager for each job, I can see that some steps in each job still appear to be running, but I see no way to kill them from the interface. What would be the best way to abort these jobs/VMs? Thanks in advance for any advice.
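For reference, I can list the stray VMs with gcloud along these lines (a rough sketch; the exact filter may need adjusting):

    # List stray Pipelines worker VMs (names match the pattern seen in the error metadata above)
    gcloud compute instances list --filter="name~^google-pipelines-worker"
    # Deleting one would be the following, but that seems to need permissions I don't have on this project:
    # gcloud compute instances delete INSTANCE_NAME --zone=ZONE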
Hi Charlie,
If you can believe this, I received the Zendesk email notification for your message at 12:35 PM on January 3, 2020.
I'm guessing you figured out the issue by now, but I wanted to apologize for the lack of response.
Adam