Workflow error, key not found
I'm trying to rerun a pair set through the Mutect2-GATK4 workflow I've run several times before and am getting a workflow error that states "key not found: DockerInfoRequest(DockerImageIdentifierWithoutHash(None,None,python,2.7),List(PipelinesApiDockerCredentials(...". I also see that the workflow shows as failed but the task in job manager shows it as still running. My questions are 1) can you check if these failed workflows still have tasks running or not and 2) what does this error message mean and how can I debug it? Thanks for your help! The workspace should already be shared.
Workspace: blood-biopsy/early_stage_BC_whole_genome_analysis
Submission ID: e689bc7c-acc6-4168-bcab-a23300b8888f or 8d529a76-8e62-48b2-a79d-a7f731af7f18
Best,
Justin
Comments
I'm also getting this for a completely unrelated workflow. Docker Hub went down yesterday and their website still seems somewhat unresponsive. That's just my suspicion anyway. Following this.
I was also wondering that but I believe the task where I get this error is pulling a docker image from GCR. Wondering if it's still related somehow.
Huh, interesting. Also I can still pull the docker image locally. This seems to be something that we'll have to wait on.
I'm also getting this error with a docker that is being pulled from docker hub. The docker was pulled and the jobs ran for a while but they all eventually failed with the same "key not found" error.
I'm also getting this error from workflows that use Docker images hosted on Docker Hub.
Additionally, even though the workflows are failing, the tasks are still running and I am unable to abort any of them because the workflow is listed as failed. Some of my tasks have completed all the way through delocalization, but the tasks are still listed as running. I'm unsure whether I am still being charged for compute on these tasks even though I can't stop them.
Hi all – Cromwell developer here.
We've never seen this error before either and are investigating in the direction of possible Docker Hub issues, especially in light of their major outage yesterday.
Any updates on this issue? Same thing is happening to me. Terra is not usable until this is fixed.
I'm also experiencing this issue, and my workflow does NOT use Docker Hub. I'm using GCR, and quay.io.
** Turns out I was wrong, my workflow does reference Docker Hub in a sub workflow **
If you submit new workflows, do they all still get stuck – or is the problem more along the lines of existing workflows never finishing?
Appreciate any info you can provide along these lines.
As you can see, based on the fact that the same Docker image is used by tasks that fail and by tasks that work, I suspect the issue might be something on the Cromwell side. Please help... we are totally stuck and unable to use Terra.
We hear you! Unfortunately, it looks like any workflow that uses a Docker Hub image is experiencing sporadic failures due to Docker Hub unreliability. This is not under our control to fix, though we can suggest using more reliable image repositories like GCR or Quay.io in the future.
We worked with Mark Fleharty above to determine that there is actually a Docker Hub image referenced deep down in his subworkflows, the same may be true for you.
See status.docker.com
When I submitted these workflows, they were able to start running a task, but the workflow later failed with the "key not found" error. For all of my workflows, the task runs and eventually completes through delocalization, but the task seems to be stuck in the "running" state because the workflow has already failed.
Thanks, but I don't think this is a good solution. Almost every GATK workflow, including the featured workspaces, uses Docker Hub images. You would need to completely rewrite most of the Terra and GATK workflows to move them to GCR or another registry if this issue persists.
Also, I suspect something other than just sporadic Docker Hub issues. That explanation doesn't fit because, per my pending comment above, some tasks that use the exact same Docker image work consistently, while other WDL tasks that use that same image fail every time.
Sporadic means, sometimes the same Docker image succeeds and sometimes it fails – based on whether the Docker Hub API returns quickly enough (which is random).
We are experiencing the same failures.
The docker image used in the method is not from DockerHub but Google Cloud Container Registry (us.gcr.io/broad-gatk/gatk:4.1.2.0)
Thanks for the screenshot – while you're definitely using GCR for the GATK Docker, many workflows reference additional, hard-coded images that are not obvious from looking at the inputs.
In the case of joint genotyping, the copy I have available to me also uses docker: "python:2.7" for one of its tasks, which would be coming from Docker Hub.
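To illustrate, here is a simplified sketch of how such a hard-coded reference looks inside a subworkflow. This is not the actual joint genotyping task; the task name and body are made up for illustration, but the runtime line is the kind of reference that resolves against Docker Hub even when the top-level inputs only mention GCR:

task CombineIntervals {   # hypothetical task name, for illustration only
  File intervals

  command {
    python -c "print('processing ${intervals}')"
  }

  runtime {
    # Unqualified image name, so Cromwell resolves it against Docker Hub;
    # this is the "python,2.7" that appears in the key-not-found error.
    docker: "python:2.7"
    # A mirrored copy such as "us.gcr.io/<your-project>/python:2.7"
    # (path illustrative) would avoid Docker Hub entirely.
    memory: "3 GB"
  }

  output {
    String done = read_string(stdout())
  }
}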
Thank you for pointing that out. I didn't realize that a Docker Hub image is used there.
I agree with GE that it is not sustainable to use Docker Hub, especially for workflows in the featured workspaces.
It shows that the workflow is still running while it has clearly failed. There is no option to abort either. Are these tasks accruing cost?
Hello All,
This is our Featured Post, which contains a summary of the issue described here by Adam. Going forward, we will be updating the Featured article so that all information is collected in one place.
No additional cost, the tasks exit in the normal amount of time and do not incur any more expense than usual. The still-running status is simply an artifact of the workflow status failing to update.
I'm concerned this won't get resolved. Because the status.docker.com page indicates that as far as they are concerned, everything is back to normal. Are you all speaking to docker to make sure they know about this problem?
I would also urge you to investigate more from the Cromwell side. I am 100% confident that the issue is not sporadic; it is systematic. Certain tasks always work perfectly fine, while other tasks fail every time. I think it is highly unlikely the issue is on Docker Hub's side, because how would Docker Hub know which task is pulling the image?
Hi all, I also discovered something important. My jobs that failed with this error are not just showing as if they are running -- they are ACTUALLY still running. I can clearly see this because they continue to write and update their output logs several hours after the job status shows up as failed. I can send examples, but again, I urge you to take a deeper look at what is going on and not just blame Docker Hub.
Furthermore, the fact that the workflows are actually still running despite this error and the jobs being marked as Failed (and this is the case in 5 of the workflows that had this error - they are all still running) indicates that the Docker image was successfully pulled from Docker Hub.
So the issue must be something in the communication between docker hub and Terra, and Terra falsely flagging an issue in pulling the docker images.
I had a Jupyter notebook runtime based on a custom Docker image that worked fine for the past few months. Today, when I tried to launch a new runtime instance for the notebook with this Docker image, I get errors from Jupyter. However, I have no issues creating a new runtime instance using the default Docker image. Is there an ongoing issue with pulling from Docker Hub?
+ JUPYTER_NOTEBOOK_FRONTEND_CONFIG=notebook.json
+ docker cp /etc/notebook.json jupyter-server:/etc/jupyter/nbconfig/
no such directory
Any update on this issue that has completely disabled Terra?
Sorry, we do not have an ETA.
Your options are to wait for Docker Hub to stabilize, or to switch the affected workflows to images hosted in another registry such as GCR or Quay.io.
@GE @Samuel Freeman -- Would you mind sharing the workspaces where you see the tasks still running? I'm trying to recreate what you see, but that may take longer than using your workspace as an example, extracting information from it, and posting an update here.
Edit: Or, if you share which Docker Hub images you always see succeed and which ones you always see fail, I can run a test workflow in a workspace and share the results with you to see whether that observation is reproducible.
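For reference, the kind of test I have in mind is a minimal probe workflow along these lines (the image string is a placeholder; I would substitute whichever Docker Hub images you report as always succeeding or always failing):

workflow DockerHubProbe {
  # Placeholder image; swap in the Docker Hub image you see failing.
  String image = "python:2.7"

  call Probe { input: image = image }
}

task Probe {
  String image

  command {
    echo "container started successfully"
  }

  runtime {
    # The probe does nothing except force Cromwell to look up and pull
    # the requested image, which is the step that appears to be failing.
    docker: image
  }

  output {
    String result = read_string(stdout())
  }
}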