Terra is a cloud-native platform for biomedical researchers to access data, run analysis tools, and collaborate.
Terra powers important scientific projects such as FireCloud, AnVIL, and DataSTAGE.

Workflow error, key not found

Comments

  • James Gatter

    I'm also getting this error for a completely unrelated workflow. Docker had an outage yesterday and their website still seems somewhat unresponsive. That's just my suspicion, anyway. Following this.

  • Justin Rhoades

    I was wondering the same thing, but I believe the task where I get this error pulls its Docker image from GCR. I wonder whether it's still related somehow.

  • James Gatter

    Huh, interesting. I can also still pull the Docker image locally. This seems like something we'll have to wait out.

  • Samuel Freeman

    I'm also getting this error with a Docker image that is pulled from Docker Hub. The image was pulled and the jobs ran for a while, but they all eventually failed with the same "key not found" error.

  • jtsuji

    I'm also getting this error from workflows that use images on Docker Hub.

  • Samuel Freeman

    Additionally, even though the workflows have failed, the tasks are still running, and I am unable to abort any of them because the workflow is listed as failed. Some of my tasks have completed all the way through delocalization but are still listed as running. I'm unsure whether I am still being charged for compute on these tasks, since I can't stop them.

  • Adam Nichols

    Hi all – Cromwell developer here.

    We've never seen this error before either and are investigating possible Docker Hub issues, especially in light of their major outage yesterday.

  • Gilad Evrony

    Any updates on this issue? The same thing is happening to me. Terra is not usable until this is fixed.

  • fleharty

    I'm also experiencing this issue, and my workflow does NOT use Docker Hub. I'm using GCR and Quay.io.

    ** Turns out I was wrong: my workflow does reference Docker Hub in a subworkflow. **

  • Adam Nichols

    If you submit new workflows, do they all still get stuck, or is the problem more that existing workflows never finish?

    I'd appreciate any information you can provide on this.

  • Gilad Evrony

    As you can see, some tasks fail and others succeed even though they use the same Docker image, so I suspect the issue may be on Cromwell's side. Please help... we are totally stuck and unable to use Terra.

  • Adam Nichols

    We hear you! Unfortunately, it looks like any workflow that uses a Docker Hub image is experiencing sporadic failures due to Docker Hub unreliability. This is not under our control to fix, though we can suggest using more reliable image repositories like GCR or Quay.io in the future.

    We worked with Mark Fleharty above to determine that there is actually a Docker Hub image referenced deep down in his subworkflows; the same may be true for you.
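Because a Docker Hub image can sit in an imported subworkflow rather than in the top-level WDL, it can help to list every docker runtime value across a workflow and its local imports. The Python sketch below is a rough, regex-based illustration, not a WDL parser: it assumes one-line import "..." statements and string-literal docker: "..." runtime entries, skips remote (URL) imports, and cannot resolve images passed in through variables or workflow inputs. The function name collect_docker_images is hypothetical.

```python
import re
from pathlib import Path

# Rough patterns, not a real WDL parser: they assume one-line import
# statements and string-literal docker values in runtime sections.
IMPORT_RE = re.compile(r'^\s*import\s+"([^"]+)"', re.MULTILINE)
DOCKER_RE = re.compile(r'^\s*docker\s*:\s*"([^"]+)"', re.MULTILINE)


def collect_docker_images(wdl_path, seen=None):
    """Map each literal docker image to the WDL files that reference it,
    following local imports (subworkflows) recursively."""
    path = Path(wdl_path).resolve()
    seen = set() if seen is None else seen
    images = {}
    if path in seen:  # guard against import cycles and repeated imports
        return images
    seen.add(path)
    text = path.read_text()
    for image in DOCKER_RE.findall(text):
        images.setdefault(image, []).append(path.name)
    for target in IMPORT_RE.findall(text):
        if "://" in target:  # remote http(s) imports are skipped in this sketch
            continue
        nested = collect_docker_images(path.parent / target, seen)
        for image, files in nested.items():
            images.setdefault(image, []).extend(files)
    return images
```

Calling collect_docker_images("main.wdl") on a local copy of a workflow returns a dictionary such as {"python:2.7": ["sub.wdl"], ...}, which makes an image buried in a subworkflow easy to spot.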

  • Adam Nichols

    See status.docker.com

  • Samuel Freeman

    When I submitted these workflows, they were able to start running a task, but the workflow later failed with the "key not found" error. For all of my workflows, the task runs and eventually completes through delocalization, but the task seems to be stuck in the "running" state because the workflow has already failed.

  • Gilad Evrony

    Thanks, but I don't think this is a good solution. Almost every GATK workflow, including those in featured workspaces, uses Docker Hub images. If this issue persists, you would need to rewrite most of your Terra and GATK workflows to move them to GCR or other registries.

    Also, I suspect something other than sporadic Docker Hub issues. That explanation doesn't make sense because, per my pending comment above, some tasks that use the exact same Docker image consistently work fine, while other WDL tasks that use that same image consistently fail.

  • Adam Nichols

    Sporadic means that the same Docker image sometimes succeeds and sometimes fails, depending on whether the Docker Hub API responds quickly enough (which is random).
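To make this meaning of "sporadic" concrete, here is a toy Python model, not a description of how Cromwell or Docker Hub actually behave: every lookup of the same image draws an independent random response time, and the lookup counts as a failure if that time exceeds a fixed timeout. The exponential latency distribution, the timeout value, and the function name simulate_lookups are all illustrative assumptions.

```python
import random


def simulate_lookups(n, timeout_s, mean_latency_s, seed=0):
    """Toy model of n lookups of the *same* image.

    Each lookup gets an independent, exponentially distributed response
    time; it is recorded as "timeout" if that time exceeds timeout_s.
    All parameters are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n):
        latency = rng.expovariate(1.0 / mean_latency_s)
        outcomes.append("ok" if latency <= timeout_s else "timeout")
    return outcomes


results = simulate_lookups(n=20, timeout_s=5.0, mean_latency_s=3.0)
print(results.count("ok"), "succeeded,", results.count("timeout"), "timed out")
```

Because failures in this model are independent of which task makes the lookup, it predicts a random mix of successes and failures for one image rather than a fixed, task-specific pattern.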

  • Liudmila Elagina

    We are experiencing the same failures.

    The Docker image used in the method is not from Docker Hub but from Google Container Registry (us.gcr.io/broad-gatk/gatk:4.1.2.0).

  • Liudmila Elagina

  • Adam Nichols

    Thanks for the screenshot. While you're definitely using GCR for the GATK Docker image, many workflows reference additional hard-coded images that are not obvious from looking at the inputs.

    In the case of joint genotyping, the copy I have also uses docker: "python:2.7" for one of its tasks, and that image comes from Docker Hub.
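Whether a reference such as "python:2.7" resolves to Docker Hub follows Docker's naming convention: if the part before the first "/" contains a "." or ":" or is "localhost", it is treated as a registry host; otherwise the image comes from Docker Hub (docker.io). The short Python sketch below applies that rule, ignoring a few edge cases, to flag Docker Hub images. The helper names registry_of and uses_docker_hub are hypothetical, and the code does not check that a reference is otherwise well formed.

```python
# Helper names are hypothetical; the rule mirrors Docker's short-name convention.
DOCKER_HUB = "docker.io"


def registry_of(image):
    """Return the registry host an image reference resolves to.

    A first path component containing "." or ":" (or equal to "localhost")
    is a registry host; anything else, including bare names such as
    "python:2.7", resolves to Docker Hub.
    """
    first, sep, _ = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        # "index.docker.io" is a legacy alias for Docker Hub.
        return DOCKER_HUB if first == "index.docker.io" else first
    return DOCKER_HUB


def uses_docker_hub(image):
    return registry_of(image) == DOCKER_HUB


for ref in ["python:2.7", "us.gcr.io/broad-gatk/gatk:4.1.2.0", "broadinstitute/gatk:4.1.2.0"]:
    print(f"{ref}: {registry_of(ref)}")
# python:2.7: docker.io
# us.gcr.io/broad-gatk/gatk:4.1.2.0: us.gcr.io
# broadinstitute/gatk:4.1.2.0: docker.io
```

This is why a bare name like "python:2.7" counts as a Docker Hub image even though no registry appears in the WDL.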

  • Liudmila Elagina

    Thank you for pointing that out. I didn't realize that a Docker Hub image is used there.

    I agree with GE that relying on Docker Hub is not sustainable, especially for workflows in the featured workspaces.

  • Liudmila Elagina

    It shows the workflow as still running even though it has clearly failed. There is no option to abort it either. Is it still incurring cost?

  • Sushma Chaluvadi

    Hello all,

    This is our Featured Post, which summarizes the issue Adam describes here. Going forward, we will update the Featured article so that all information is in one place.

  • Adam Nichols

    No additional cost: the tasks exit in the normal amount of time and do not incur any more expense than usual. The still-running status is simply an artifact of the workflow status failing to update.

  • Gilad Evrony

    I'm concerned this won't get resolved, because the status.docker.com page indicates that, as far as Docker is concerned, everything is back to normal. Are you all talking to Docker to make sure they know about this problem?

    I would also urge you to investigate further on the Cromwell side. I am 100% confident that the issue is not sporadic. It is systematic. Certain tasks always work perfectly fine. Other tasks fail every time. I think it is highly unlikely that the issue is on Docker's side: how would Docker Hub know which task is pulling the image?

  • Gilad Evrony

    Hi all, I also discovered something important. My jobs that failed with this error are not just shown as running; they are ACTUALLY still running. I can see this clearly because they continue to write to and update their output logs several hours after the job status changed to failed. I can send an example, but again, I urge you to take a deeper look at what is going on and not just blame Docker Hub.

  • Gilad Evrony

    Furthermore, the fact that the workflows are actually still running despite this error and despite being marked as Failed (this is the case for 5 of the workflows that had this error; they are all still running) indicates that the Docker image was successfully pulled from Docker Hub.

    So the issue must lie in the communication between Docker Hub and Terra, with Terra falsely flagging a problem pulling the Docker images.

  • Hattie Chung

    I have a Jupyter notebook runtime based on a custom Docker image that has worked fine for the past few months. Today, when I tried to launch a new runtime instance for the notebook with this image, I got Jupyter errors. However, I have no issues creating a new runtime instance using the default Docker image. Is there an ongoing issue with pulling from Docker Hub?

    + JUPYTER_NOTEBOOK_FRONTEND_CONFIG=notebook.json
    + docker cp /etc/notebook.json jupyter-server:/etc/jupyter/nbconfig/
    no such directory

  • Gilad Evrony

    Any update on this issue that has completely disabled Terra?

  • Adam Nichols

    Sorry, we do not have an ETA.

    Your options are to:

    1. Wait until the issue with Docker Hub images passes
    2. Use images from GCR, Quay, or another registry; users report a 100% success rate after migrating
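For option 2, one common way to move an image off Docker Hub is to pull it, retag it under another registry, and push it there with the standard docker pull, docker tag, and docker push commands, then point the WDL docker: lines at the new name. The Python sketch below only builds those commands. The destination prefix "us.gcr.io/my-project" is a hypothetical placeholder, the function name mirror_plan is made up, and you would need push access to, and to be authenticated with, the target registry before running anything.

```python
def mirror_plan(image, target_prefix):
    """Build the new image name and the docker CLI commands (as argv lists)
    needed to copy a tagged Docker Hub image under target_prefix.
    target_prefix is a hypothetical destination such as a GCR project path."""
    if "@" in image:
        # Digest references cannot be used as tag targets; not handled here.
        raise ValueError("digest references are not supported in this sketch")
    name = image
    for hub_prefix in ("docker.io/", "index.docker.io/"):
        if name.startswith(hub_prefix):
            name = name[len(hub_prefix):]
    new_ref = f"{target_prefix.rstrip('/')}/{name}"
    commands = [
        ["docker", "pull", image],
        ["docker", "tag", image, new_ref],
        ["docker", "push", new_ref],
    ]
    return new_ref, commands


new_ref, commands = mirror_plan("python:2.7", "us.gcr.io/my-project")
for argv in commands:
    print(" ".join(argv))
# docker pull python:2.7
# docker tag python:2.7 us.gcr.io/my-project/python:2.7
# docker push us.gcr.io/my-project/python:2.7
```

Each argv list could be passed to subprocess.run(argv, check=True) once you are authenticated. Keeping the same repository path and tag under the new prefix turns the WDL update into a mechanical find-and-replace.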

  • Ruchi Munshi

    @GE @Samuel Freeman, would you mind sharing the workspaces where you see the tasks still running? I'm trying to recreate what you see, which may take longer than using your workspace as an example to extract information and post an update here.

    Edit: Alternatively, if you share which Docker Hub images you always see succeed and which ones you always see fail, I can run a test workflow in a workspace and share the results with you to see whether that's a reproducible observation.
