GPU task error: "Could not load UVM kernel module. Is nvidia-modprobe installed?"
I am trying to run a task that uses a GPU. The docker image is based on "nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04" and the runtime block looks like this:
```
runtime {
    docker: "quay.io/aryeelab/guppy-gpu"
    disks: "local-disk ${disk_size} HDD"
    gpuType: "nvidia-tesla-k80"
    gpuCount: 1
    zones: "us-central1-c"
}
```
The task fails with this error in the log:
```
...
2019/04/30 05:15:39 I: Switching to status: running-docker
2019/04/30 05:15:39 I: Calling SetOperationStatus(running-docker)
2019/04/30 05:15:39 I: SetOperationStatus(running-docker) succeeded
2019/04/30 05:15:39 I: Setting these data volumes on the docker container: [-v /tmp/ggp-497335614:/tmp/ggp-497335614 -v /mnt/local-disk:/cromwell_root]
2019/04/30 05:15:39 I: Running command: nvidia-docker run -v /tmp/ggp-497335614:/tmp/ggp-497335614 -v /mnt/local-disk:/cromwell_root -e glob-cc114658a8822d7c830bde57175c0d3a.list=/cromwell_root/glob-cc114658a8822d7c830bde57175c0d3a.list -e preprocess_flowcell.basecall_and_demultiplex.fast5_zip-0=/cromwell_root/fc-80f4cc33-3098-4ff0-a07a-46412f1d1df5/fast5_raw/test-run-1.zip -e stderr=/cromwell_root/stderr -e __extra_config_gcs_path=gs://cromwell-auth-aryee-merkin/f6a177d3-77cc-47a6-9f47-d716623932f0_auth.json -e stdout=/cromwell_root/stdout -e guppy_basecaller/sequencing_summary.txt=/cromwell_root/guppy_basecaller/sequencing_summary.txt -e guppy_basecaller.log=/cromwell_root/guppy_basecaller.log -e exec=/cromwell_root/script -e rc=/cromwell_root/rc -e glob-cc114658a8822d7c830bde57175c0d3a/=/cromwell_root/glob-cc114658a8822d7c830bde57175c0d3a/* quay.io/aryeelab/guppy-gpu@sha256:34bb50445f8975438407c439e04a1ab5a48b35e728739fdf8647b343eb7b7e69 /tmp/ggp-497335614
2019/04/30 05:15:40 E: command failed: nvidia-docker | 2019/04/30 05:15:40 Error: Could not load UVM kernel module. Is nvidia-modprobe installed? (exit status 1)
```
I would like to know:
- has anyone seen this specific error before, or
- is there a way to work out exactly what machine type / image is being used to run this task, so that I can fire one up and try to debug further myself?
Thanks
Comments
Thank you for the question; I have reached out to our development team for more information.
Hello,
Thanks for the information you provided. Based on a round of very informal research, I would suggest checking out https://github.com/NVIDIA/nvidia-docker for support with this tool.
A recommendation I picked up from trawling their old issues is to run a WDL task that prints the result of
`nvidia-modprobe --version` [1]
to check that the program is installed, and also
`lspci | grep -i vga` [2]
to see whether there is actually a GPU installed in your machine.
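If it helps, a minimal sketch of such a diagnostic task might look like the following (draft-2 WDL; the base image, GPU type and zone are borrowed from your original question, the task/workflow names are just placeholders, and the `|| echo` fallbacks are only there so a non-matching check doesn't fail the task with a non-zero exit code):

```
task get_machine_info {
    command {
        # Check whether nvidia-modprobe is available inside the container
        nvidia-modprobe --version || echo "nvidia-modprobe not found"
        # Check whether a GPU shows up on the PCI bus
        # (lspci comes from pciutils, which the image needs to provide)
        lspci | grep -i vga || echo "no VGA/GPU device reported by lspci"
    }
    runtime {
        docker: "nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04"
        gpuType: "nvidia-tesla-k80"
        gpuCount: 1
        zones: "us-central1-c"
    }
    output {
        File info = stdout()
    }
}

workflow cuda_test {
    call get_machine_info
}
```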
- Adam
[1] https://github.com/NVIDIA/nvidia-docker/issues/396
[2] https://github.com/NVIDIA/nvidia-docker/issues/319
@AdamNichols - Thanks for the detailed information.
As for the other part of @aryee's question, about seeing what type of machine was used to run this job, this information can currently be found in the FireCloud UI.
If you navigate into the individual workflow and then into the specific call in question, you'll see an `Operation` link. Click on that link to get the metadata for the instance that ran the call.
If the concern is about the presence or absence of GPUs on that instance, look for the `acceleratorCount` and `acceleratorType` values in those operation details.
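If you prefer the command line, the same metadata can usually be pulled with gcloud. This is a sketch that assumes you have the operation ID shown by that `Operation` link and that the alpha genomics component is installed; since the exact field layout differs between PAPI versions, grepping for the accelerator fields is the lazy-but-safe approach:

```
# <operation-id> is the ID from the Operation link in the FireCloud UI
gcloud alpha genomics operations describe <operation-id> | grep -i accelerator
```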
That's pretty cool; I did not know you could get the operation description that easily!
@AdamNichols - I tried submitting a simpler workflow requesting a GPU with just the two commands you suggested.
The workflow fails without any task logs with this error:
```
message: Workflow failed
causedBy:
  message: Task cuda_test.get_machine_info:NA:3 failed. The job was stopped before the command finished. PAPI error code 10. 13: VM ggp-8124558412071415724 shut down unexpectedly.
```
This is the workflow: https://portal.firecloud.org/#workspaces/aryee-merkin/cuda_test/monitor/eaa7a302-fa54-45e8-91e5-891786e4c119/ec922cd4-08eb-49c3-94ff-f68e001722da
Update: I think there are two separate problems here. I resubmitted the job above (without any changes) and this time it got past the PAPI error 10.13 stage, but then it failed again with the nvidia-docker error I had initially. It seems it's not even able to load the docker image. I verified in the operations log (through FireCloud) that the machine does have a GPU attached.
We are undergoing a transition from FireCloud to Terra today, but I created a ticket for follow-up tomorrow so we can see if the problem is related to the docker image itself. Would you mind sharing your workspace with the following group?
GROUP_FireCloud-Support@firecloud.org
Thank you,
Great - Thanks. I've shared the workspace with GROUP_FireCloud-Support@firecloud.org.
Example failed job:
https://job-manager.dsde-prod.broadinstitute.org/jobs/485faec9-7fc0-485b-9872-429e988d2da1
Hi @aryee, can we test that the permissions on your account and project are set up correctly by attempting to run the following workflow? It should require no inputs. It's one of our test cases, and I've just verified that it works when I run it, so I'd like to check that you're at least able to run the "GPU hello world" workflow:
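Roughly speaking, it's a workflow along these lines: it just runs `nvidia-smi` in a CUDA image with a GPU requested in the runtime block. The sketch below is only illustrative, and the image, GPU type and zone are placeholders rather than the exact values from our test case:

```
task hello_gpu {
    command {
        # nvidia-smi prints the driver version and any attached GPU(s)
        nvidia-smi
    }
    runtime {
        docker: "nvidia/cuda:9.0-base"
        gpuType: "nvidia-tesla-k80"
        gpuCount: 1
        zones: "us-central1-c"
    }
    output {
        File report = stdout()
    }
}

workflow gpu_hello_world {
    call hello_gpu
}
```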
That workflow fails for me with this in the task stdout:
This is the job: https://portal.firecloud.org/#workspaces/aryee-merkin/cuda_test/monitor/3942d48e-bef6-4d04-ab20-60b105a117a2/c58a3938-3f82-4094-9141-b756070d7160
Oh, oops! I bet that our hard-coded project name is causing the problem:
Would you be able to swap that out for your google project name and try again?
Update: I've copied and tried the WDL that Adam suggested above and have recreated your issue. I suspect that this is an issue with the driver versions being used. I'm continuing to investigate from here.
PS: Don't worry about trying to run the WDL I posted above; I think you already confirmed that your instances are being set up correctly, which is all that WDL would really confirm. Thanks!
Hi aryee -
I've done a little experimentation myself and talked to a few other users who have been using GPUs in their workflows. It appears that the way to make GPU support work is to use PAPIv2 rather than PAPIv1 (PAPI is the Google interface we use to submit jobs into their cloud, and v2 is the new version of that interface). Right now FireCloud has opt-in beta support that allows projects to use PAPIv2 rather than PAPIv1, but within a couple of weeks that's going to change and everybody will be using v2 by default.
So there are two options here:
- wait a couple of weeks until PAPIv2 becomes the default for everybody, or
- ask us to add your project to the PAPIv2 beta/whitelist now.
I hope that all made sense - if you want to explore the PAPIv2 whitelist option let us know and I'll see if I can work out how to connect you with that scheme.
Thanks -
Chris
It would be great to be added to the PAPIv2 beta/whitelist. I'm not running any production workflows and can tolerate the risk.
Thanks!
Martin
Hi Martin -
Chris is working out how to make the request. We really appreciate the feedback; I am sure other users trying to spin up VMs with GPUs will benefit from your request.
Adelaide
Hi Martin -
I just want to confirm this with you one last time before we push the button on this.
Thanks -
Chris
Let’s do it for aryee-merkin. Thanks!
Hi Martin,
I updated the aryee-merkin project to use PAPIv2.
Let us know if you have any problems.
Thanks,
Doug
Great - That did the trick. Our GPU tasks are now running happily in Terra.
Thanks very much,
Martin