Ongoing workflow failures with error “key not found”

Post author
Sushma Chaluvadi

Starting on October 15, workflows in Terra that use images from Docker Hub began experiencing sporadic failures due to ongoing Docker Hub instability.

Affected workflows fail with an error message referencing "key not found" or similar. Because the failures are sporadic, a workflow may run successfully one time and then fail with the error the next. The Terra team is unable to do anything to improve the situation with Docker Hub, and we cannot give an estimate of when normal operations will resume.

However, we can suggest a workaround: use images from other repositories, such as Google Container Registry (GCR), in your workflows. User reports are unanimous that workflows proceed as expected once Docker Hub images are removed. Please note that your "Proxy Group" (listed under Profile) will need read access to the Google bucket where the GCR image is hosted. There is an article that outlines the steps for pushing a Docker image to GCR. It also happens that pulling images from GCR is faster and cheaper!
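As a rough sketch, mirroring a Docker Hub image into GCR looks like the following. This assumes you have the `gcloud` and `docker` CLIs installed and are authenticated; `my-project` is a placeholder for your own Google Cloud project ID:

```shell
# One-time setup: let Docker authenticate to GCR using your gcloud credentials
gcloud auth configure-docker

# Pull the image from Docker Hub while it is reachable,
# re-tag it under your GCR project path, and push the copy
docker pull python:2.7
docker tag python:2.7 gcr.io/my-project/python:2.7
docker push gcr.io/my-project/python:2.7
```

After the push, reference `gcr.io/my-project/python:2.7` in your workflow instead of the Docker Hub name.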

Note that if your workflow contains multiple tasks and/or subworkflows, there will be more than one place to update. Even very generic, commonly used images like docker: "python:2.7" come from Docker Hub and will need to be replaced.
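For example, swapping the runtime image in a WDL task might look like this sketch (the GCR path uses a placeholder project, `my-project`):

```wdl
task example {
  command {
    python --version
  }
  runtime {
    # Before (pulled from Docker Hub, currently unstable):
    #   docker: "python:2.7"
    # After (a mirrored copy in GCR; "my-project" is a placeholder):
    docker: "gcr.io/my-project/python:2.7"
  }
}
```

Remember to make the same change in every task and subworkflow that declares a `docker:` runtime attribute.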

Finally, we can confirm that no charge will be incurred in cases where a task appears to continue running even though the workflow presents as failed.

Updates directly from Docker Hub can be found at status.docker.com

Comments

8 comments

  • Comment author
    Sushma Chaluvadi

Hello All,

The 10/15 Docker Hub outage had an impact on Cromwell that extended into 10/16 and 10/17. This impact has now been resolved.

Please confirm that you are able to successfully run workflows.

  • Comment author
    James Gatter

    Working fine for me! Thanks!

  • Comment author
    Adam Nichols

    Hi James - thank you for the feedback, and apologies for the inconvenience.

  • Comment author
    Adam Haber
    • Edited

Hi, still getting errors (see below):

done: true
error: code 2
details: Array[0] []
message: "Execution failed: pulling image: docker pull: running ["docker" "pull" "cellranger:3.0.2"]: exit status 1 (standard error: "Error response from daemon: pull access denied for cellranger, repository does not exist or may require 'docker login': denied: requested access to the resource is denied\n")"
metadata: @type "type.googleapis.com/google.genomics.v2alpha1.Metadata" createTime "2019-10-18T14:17:15.849863Z" endTime "2019-10-18T14:19:04.808435277Z"


    events Array[6] [{"description":"Worker released","details":{"@type":"type.googleapis.com/google.genomics.v2alpha1.WorkerReleasedEvent","instance":"google-pipelines-worker-239222145a5c2a5edd4aa7e6d1097164","zone":"us-west1-b"},"timestamp":"2019-10-18T14:19:04.808435277Z"},{"description":"Execution failed: pulling image: docker pull: running [\"docker\" \"pull\" \"cellranger:3.0.2\"]: exit status 1 (standard error: \"Error response from daemon: pull access denied for cellranger, repository does not exist or may require 'docker login': denied: requested access to the resource is denied\\n\")","details":{"@type":"type.googleapis.com/google.genomics.v2alpha1.FailedEvent","cause":"Execution failed: pulling image: docker pull: running [\"docker\" \"pull\" \"cellranger:3.0.2\"]: exit status 1 (standard error: \"Error response from daemon: pull access denied for cellranger, repository does not exist or may require 'docker login': denied: requested access to the resource is denied\\n\")","code":"UNKNOWN"},"timestamp":"2019-10-18T14:19:03.532399522Z"},{"description":"Started pulling \"cellranger:3.0.2\"","details":{"@type":"type.googleapis.com/google.genomics.v2alpha1.PullStartedEvent","imageUri":"cellranger:3.0.2"},"timestamp":"2019-10-18T14:18:25.017346022Z"},{"description":"Stopped pulling \"gcr.io/google.com/cloudsdktool/cloud-sdk:264.0.0-slim\"","details":{"@type":"type.googleapis.com/google.genomics.v2alpha1.PullStoppedEvent","imageUri":"gcr.io/google.com/cloudsdktool/cloud-sdk:264.0.0-slim"},"timestamp":"2019-10-18T14:18:23.226581956Z"},{"description":"Started pulling \"gcr.io/google.com/cloudsdktool/cloud-sdk:264.0.0-slim\"","details":{"@type":"type.googleapis.com/google.genomics.v2alpha1.PullStartedEvent","imageUri":"gcr.io/google.com/cloudsdktool/cloud-sdk:264.0.0-slim"},"timestamp":"2019-10-18T14:18:02.126281492Z"},{"description":"Worker \"google-pipelines-worker-239222145a5c2a5edd4aa7e6d1097164\" assigned in 
\"us-west1-b\"","details":{"@type":"type.googleapis.com/google.genomics.v2alpha1.WorkerAssignedEvent","instance":"google-pipelines-worker-239222145a5c2a5edd4aa7e6d1097164","machineType":"custom-1-2048","zone":"us-west1-b"},"timestamp":"2019-10-18T14:17:16.761218289Z"}]


    labels Object {"cromwell-workflow-id":"cromwell-0f658af3-51d4-4ada-a5ee-b0a51bd8c02a","terra-submission-id":"terra-16957d9a-9f58-4c2d-aed1-175cd42bc496","wdl-task-name":"generate-bcl-csv"}

  • Comment author
    James Gatter

    Try fixing your docker parameter in the WDL:

    docker: "cumulusprod/cellranger:3.0.2"

  • Comment author
    Adam Nichols

    Hi Adam - thanks for your post.

While it definitely looks like something is going wrong with your workflow, the "key not found" error is not present, so I would suggest making a new thread.

  • Comment author
    Charlie Hatton

    I think I may be having a problem related to this issue. On 10/16 and 10/17, I started three jobs (each running a workflow on each of 5 inputs) and got a "key not found" error and Done/Failed status for each. I only realized later that I've been incurring high computing costs since then (with no other analyses running), and tracked it to 15 Compute Engine VMs that appear to have been running since I started those jobs. I can see these in Google's console, but evidently I don't have permission to delete them (though I own the workspace that created them and the job history shows each process as "Failed").

    Viewing the Job Manager for each job, I see that some steps in each job do appear to still be running -- but I see no way to kill those from the interface. What would be the best way to abort these jobs/VMs? Thanks in advance for any advice.

  • Comment author
    Adam Nichols

    Hi Charlie,

    If you can believe this, I received the Zendesk email notification for your message at 12:35 PM on January 3, 2020.

    I'm guessing you figured out the issue by now, but I wanted to apologize for the lack of response.

    Adam

