Ongoing workflow failures with error “key not found”
Starting on October 15, workflows in Terra that use images from Docker Hub began experiencing sporadic failures due to ongoing Docker Hub instability.
Affected workflows fail with an error message referencing "key not found" or similar. Because the failures are sporadic, a workflow may run successfully one time and then fail with this error the next. The Terra team is unable to do anything on our end to improve the situation with Docker Hub, and we cannot estimate when normal operations will resume.
However, we can suggest a workaround: use images from other repositories in your workflows. User reports are unanimous that after removing Docker Hub images, workflows proceed as expected. Please note that your "Proxy Group" (listed under Profile) will need read access on the Google bucket where the GCR image is hosted. Here is an article that outlines the steps for pushing a Docker image to GCR. As a bonus, pulling images from GCR is also faster and cheaper!
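In the meantime, the basic steps look roughly like the following (a sketch, not the full article; PROJECT_ID and the python:2.7 image are placeholders for your own Google project and image):

    # Let gcloud configure Docker credentials for gcr.io (one-time setup)
    gcloud auth configure-docker

    # Pull the image from Docker Hub once, re-tag it under your GCR project, and push
    docker pull python:2.7
    docker tag python:2.7 gcr.io/PROJECT_ID/python:2.7
    docker push gcr.io/PROJECT_ID/python:2.7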
Note that if your workflow contains multiple tasks and/or subworkflows, there will be more than one place to update. Even very generic, commonly used images like docker: "python:2.7" come from Docker Hub and will need to be replaced.
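For example, a task that previously pulled python:2.7 from Docker Hub would have its runtime block updated along these lines (the gcr.io path below is a placeholder for wherever you host your copy of the image):

    runtime {
      # Before (pulled from Docker Hub): docker: "python:2.7"
      # After (pulled from your own GCR repository):
      docker: "gcr.io/PROJECT_ID/python:2.7"
    }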
Finally, we can confirm that in cases where a Task appears to continue running even though the Workflow is reported as failed, no charges will be incurred.
Updates directly from Docker Hub can be found here: status.docker.com
Comments
8 comments
Hello All,
The 10/15 Docker Hub outage had an impact on Cromwell that extended into 10/16 and 10/17. This impact has now been resolved.
Please confirm if you are able to successfully run Workflows.
Working fine for me! Thanks!
Hi James - thank you for the feedback, and apologies for the inconvenience.
Hi, still getting errors (see below)
done: true
error:
  code: 2
  details: []
  message: "Execution failed: pulling image: docker pull: running ["docker" "pull" "cellranger:3.0.2"]: exit status 1 (standard error: "Error response from daemon: pull access denied for cellranger, repository does not exist or may require 'docker login': denied: requested access to the resource is denied")"
metadata:
  @type: "type.googleapis.com/google.genomics.v2alpha1.Metadata"
  createTime: "2019-10-18T14:17:15.849863Z"
  endTime: "2019-10-18T14:19:04.808435277Z"
  events (6, most recent first):
    2019-10-18T14:19:04.808435277Z  Worker released (instance "google-pipelines-worker-239222145a5c2a5edd4aa7e6d1097164", zone "us-west1-b")
    2019-10-18T14:19:03.532399522Z  Execution failed: pulling image: docker pull: running ["docker" "pull" "cellranger:3.0.2"]: exit status 1 (standard error: "Error response from daemon: pull access denied for cellranger, repository does not exist or may require 'docker login': denied: requested access to the resource is denied") [code UNKNOWN]
    2019-10-18T14:18:25.017346022Z  Started pulling "cellranger:3.0.2"
    2019-10-18T14:18:23.226581956Z  Stopped pulling "gcr.io/google.com/cloudsdktool/cloud-sdk:264.0.0-slim"
    2019-10-18T14:18:02.126281492Z  Started pulling "gcr.io/google.com/cloudsdktool/cloud-sdk:264.0.0-slim"
    2019-10-18T14:17:16.761218289Z  Worker "google-pipelines-worker-239222145a5c2a5edd4aa7e6d1097164" assigned in "us-west1-b" (machineType "custom-1-2048")
  labels:
    cromwell-workflow-id: "cromwell-0f658af3-51d4-4ada-a5ee-b0a51bd8c02a"
    terra-submission-id: "terra-16957d9a-9f58-4c2d-aed1-175cd42bc496"
    wdl-task-name: "generate-bcl-csv"
Try fixing your docker parameter in the WDL:
docker: "cumulusprod/cellranger:3.0.2"
Hi Adam - thanks for your post.
While it definitely looks like something is going wrong with your workflow, the "key not found" error is not present, so I would suggest starting a new thread for it.
I think I may be having a problem related to this issue. On 10/16 and 10/17, I started three jobs (each running a workflow on each of 5 inputs) and got a "key not found" error and Done/Failed status for each. I only realized later that I had been incurring high compute costs since then (with no other analyses running), and tracked it to 15 Compute Engine VMs that appear to have been running since I started those jobs. I can see these in Google's console, but evidently I don't have permission to delete them (though I own the workspace that created them and the job history shows each process as "Failed").
Viewing the Job Manager for each job, I can see that some steps in each job still appear to be running, but I see no way to kill them from the interface. What would be the best way to abort these jobs/VMs? Thanks in advance for any advice.
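For reference, I can list the stray VMs with gcloud along these lines (a rough sketch; the exact filter may need adjusting):

    # List stray Pipelines worker VMs (names match the pattern seen in the error metadata above)
    gcloud compute instances list --filter="name~^google-pipelines-worker"
    # Deleting one would be the following, but that seems to need permissions I don't have on this project:
    # gcloud compute instances delete INSTANCE_NAME --zone=ZONE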
Hi Charlie,
If you can believe this, I received the Zendesk email notification for your message at 12:35 PM on January 3, 2020.
I'm guessing you figured out the issue by now, but I wanted to apologize for the lack of response.
Adam