Workflow error, key not found
I'm trying to rerun a pair set through the Mutect2-GATK4 workflow I've run several times before and am getting a workflow error that states "key not found: DockerInfoRequest(DockerImageIdentifierWithoutHash(None,None,python,2.7),List(PipelinesApiDockerCredentials(...". I also see that the workflow shows as failed but the task in job manager shows it as still running. My questions are 1) can you check if these failed workflows still have tasks running or not and 2) what does this error message mean and how can I debug it? Thanks for your help! The workspace should already be shared.
Workspace: blood-biopsy/early_stage_BC_whole_genome_analysis
Submission ID: e689bc7c-acc6-4168-bcab-a23300b8888f or 8d529a76-8e62-48b2-a79d-a7f731af7f18
Best,
Justin
Comments
I'm also getting this for a completely unrelated workflow. Docker Hub went down yesterday and their website still seems somewhat unresponsive. That's just my suspicion anyway. Following this.
I was also wondering that but I believe the task where I get this error is pulling a docker image from GCR. Wondering if it's still related somehow.
Huh, interesting. Also I can still pull the docker image locally. This seems to be something that we'll have to wait on.
I'm also getting this error with a docker that is being pulled from docker hub. The docker was pulled and the jobs ran for a while but they all eventually failed with the same "key not found" error.
I'm also getting this error from workflows that use Docker images hosted on Docker Hub.
Additionally, even though the workflows are failing, the tasks are still running and I am unable to abort any of them because the workflow is listed as failed. Some of my tasks have completed all the way through delocalization, but the tasks are still listed as running. I'm unsure whether I am still being charged for compute on these tasks even though I can't stop them.
Hi all – Cromwell developer here.
We've never seen this error before either and are investigating in the direction of possible Docker Hub issues, especially in light of their major outage yesterday.
Any updates on this issue? Same thing is happening to me. Terra is not usable until this is fixed.
I'm also experiencing this issue, and my workflow does NOT use Docker Hub. I'm using GCR, and quay.io.
** Turns out I was wrong, my workflow does reference Docker Hub in a sub workflow **
If you submit new workflows, do they all still get stuck – or is the problem more along the lines of existing workflows never finishing?
Appreciate any info you can provide along these lines.
As you can see, based on the fact that the same Docker image is used by tasks that fail and by tasks that work, I suspect the issue might be something on the Cromwell side. Please help... we are totally stuck and unable to use Terra.
We hear you! Unfortunately, it looks like any workflow that uses a Docker Hub image is experiencing sporadic failures due to Docker Hub unreliability. This is not under our control to fix, though we can suggest using more reliable image repositories like GCR or Quay.io in the future.
We worked with Mark Fleharty above to determine that there is actually a Docker Hub image referenced deep down in his subworkflows, the same may be true for you.
See status.docker.com
When I submitted these workflows, they were able to start running a task, but the workflow later failed with the "key not found" error. For all of my workflows, the task runs and eventually completes through delocalization, but the task seems to be stuck in the "running" state because the workflow has already failed.
Thanks, but I don't think this is a good solution. Almost every GATK workflow, including the featured workspaces, uses Docker Hub images. You would need to completely rewrite most of the Terra and GATK workflows to move them to GCR or another registry if this issue persists.
Also, I suspect something other than just sporadic Docker Hub issues. That explanation doesn't fit because, per my pending comment above, some tasks that use the exact same Docker image work consistently, while other WDL tasks that use that same image fail every time.
Sporadic means, sometimes the same Docker image succeeds and sometimes it fails – based on whether the Docker Hub API returns quickly enough (which is random).
We are experiencing the same failures.
The docker image used in the method is not from DockerHub but Google Cloud Container Registry (us.gcr.io/broad-gatk/gatk:4.1.2.0)
Thanks for the screenshot – while you're definitely using GCR for the GATK Docker, many workflows reference additional, hard-coded images that are not obvious from looking at the inputs.
In the case of joint genotyping, the copy I have available to me also uses docker: "python:2.7" for one of its tasks, which would be coming from Docker Hub.
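To illustrate, here is a simplified sketch of how such a hard-coded reference looks inside a subworkflow. This is not the actual joint genotyping task; the task name and body are made up for illustration, but the runtime line is the kind of reference that resolves against Docker Hub even when the top-level inputs only mention GCR:

task CombineIntervals {   # hypothetical task name, for illustration only
  File intervals

  command {
    python -c "print('processing ${intervals}')"
  }

  runtime {
    # Unqualified image name, so Cromwell resolves it against Docker Hub;
    # this is the "python,2.7" that appears in the key-not-found error.
    docker: "python:2.7"
    # A mirrored copy such as "us.gcr.io/<your-project>/python:2.7"
    # (path illustrative) would avoid Docker Hub entirely.
    memory: "3 GB"
  }

  output {
    String done = read_string(stdout())
  }
}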
Thank you for pointing that out. I didn't realize that a Docker Hub image is used there.
I agree with GE that it is not sustainable to use Docker Hub, especially for workflows in the featured workspaces.
It shows that the workflow is still running while it has clearly failed. There is no option to abort either. Are these tasks accruing cost?
Hello All,
This is our Featured Post, which contains a summary of the issue described here by Adam. Going forward, we will be updating the Featured article so that all information is collected in one place.
No additional cost, the tasks exit in the normal amount of time and do not incur any more expense than usual. The still-running status is simply an artifact of the workflow status failing to update.
I'm concerned this won't get resolved. Because the status.docker.com page indicates that as far as they are concerned, everything is back to normal. Are you all speaking to docker to make sure they know about this problem?
I would also urge you to investigate more from the Cromwell side. I am 100% confident that the issue is not sporadic; it is systematic. Certain tasks always work perfectly fine, while other tasks fail every time. I think it is highly unlikely the issue is on Docker Hub's side, because how would Docker Hub know which task is pulling the image?
Hi all, I also discovered something important. My jobs that failed with this error are not just showing as if they are running -- they are ACTUALLY still running. I can clearly see this because they continue to write and update their output logs several hours after the job status shows up as failed. I can send examples, but again, I urge you to take a deeper look at what is going on and not just blame Docker Hub.
Furthermore, the fact that the workflows are actually still running despite this error and the jobs being marked as Failed (and this is the case in 5 of the workflows that had this error - they are all still running) indicates that the Docker image was successfully pulled from Docker Hub.
So the issue must be something in the communication between docker hub and Terra, and Terra falsely flagging an issue in pulling the docker images.
I had a Jupyter notebook runtime based on a custom Docker image that worked fine for the past few months. Today, when I tried to launch a new runtime instance for the notebook with this Docker image, I get errors from Jupyter. However, I have no issues creating a new runtime instance using the default Docker image. Is there an ongoing issue with pulling from Docker Hub?
+ JUPYTER_NOTEBOOK_FRONTEND_CONFIG=notebook.json
+ docker cp /etc/notebook.json jupyter-server:/etc/jupyter/nbconfig/
no such directory
Any update on this issue that has completely disabled Terra?
Sorry, we do not have an ETA.
Your options are to wait for Docker Hub to stabilize, or to switch the affected workflows to images hosted in another registry such as GCR or Quay.io.
@GE @Samuel Freeman -- Would you mind sharing the workspaces where you see the tasks still running? I'm trying to recreate what you see, but that may take longer than using your workspace as an example, extracting information from it, and posting an update here.
Edit: Or, if you share which Docker Hub images you always see succeed and which ones you always see fail, I can run a test workflow in a workspace and share the results with you to see whether that observation is reproducible.
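For reference, the kind of test I have in mind is a minimal probe workflow along these lines (the image string is a placeholder; I would substitute whichever Docker Hub images you report as always succeeding or always failing):

workflow DockerHubProbe {
  # Placeholder image; swap in the Docker Hub image you see failing.
  String image = "python:2.7"

  call Probe { input: image = image }
}

task Probe {
  String image

  command {
    echo "container started successfully"
  }

  runtime {
    # The probe does nothing except force Cromwell to look up and pull
    # the requested image, which is the step that appears to be failing.
    docker: image
  }

  output {
    String result = read_string(stdout())
  }
}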