GPU task error: "Could not load UVM kernel module. Is nvidia-modprobe installed?"

Post author
Martin Aryee

I am trying to run a task that uses a GPU. The Docker image is based on "nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04", and the runtime block looks like this:

runtime {
  docker: "quay.io/aryeelab/guppy-gpu"
  disks: "local-disk ${disk_size} HDD"
  gpuType: "nvidia-tesla-k80"
  gpuCount: 1
  zones: "us-central1-c"
}

The task fails with this error in the log: 

. . .
. . .
2019/04/30 05:15:39 I: Switching to status: running-docker
2019/04/30 05:15:39 I: Calling SetOperationStatus(running-docker)
2019/04/30 05:15:39 I: SetOperationStatus(running-docker) succeeded
2019/04/30 05:15:39 I: Setting these data volumes on the docker container: [-v /tmp/ggp-497335614:/tmp/ggp-497335614 -v /mnt/local-disk:/cromwell_root]
2019/04/30 05:15:39 I: Running command: nvidia-docker run -v /tmp/ggp-497335614:/tmp/ggp-497335614 -v /mnt/local-disk:/cromwell_root -e glob-cc114658a8822d7c830bde57175c0d3a.list=/cromwell_root/glob-cc114658a8822d7c830bde57175c0d3a.list -e preprocess_flowcell.basecall_and_demultiplex.fast5_zip-0=/cromwell_root/fc-80f4cc33-3098-4ff0-a07a-46412f1d1df5/fast5_raw/test-run-1.zip -e stderr=/cromwell_root/stderr -e __extra_config_gcs_path=gs://cromwell-auth-aryee-merkin/f6a177d3-77cc-47a6-9f47-d716623932f0_auth.json -e stdout=/cromwell_root/stdout -e guppy_basecaller/sequencing_summary.txt=/cromwell_root/guppy_basecaller/sequencing_summary.txt -e guppy_basecaller.log=/cromwell_root/guppy_basecaller.log -e exec=/cromwell_root/script -e rc=/cromwell_root/rc -e glob-cc114658a8822d7c830bde57175c0d3a/=/cromwell_root/glob-cc114658a8822d7c830bde57175c0d3a/* quay.io/aryeelab/guppy-gpu@sha256:34bb50445f8975438407c439e04a1ab5a48b35e728739fdf8647b343eb7b7e69 /tmp/ggp-497335614
2019/04/30 05:15:40 E: command failed: nvidia-docker
2019/04/30 05:15:40 Error: Could not load UVM kernel module. Is nvidia-modprobe installed? (exit status 1)

I would like to know if

  1. anyone has seen this specific error before, or
  2. there's a way to work out exactly what machine type / image is being used to run this task, so that I can fire one up and try to debug further myself.

Thanks

Comments

19 comments

  • Comment author
    Adelaide Rhodes

    Thank you for the question. I have reached out to our development team for more information.

  • Comment author
    Adam Nichols

    Hello,

    Thanks for the information you provided. Based on a round of very informal research, I would suggest checking out https://github.com/NVIDIA/nvidia-docker for support with this tool.

    A recommendation I picked up from trawling their old issues is to run a WDL task that prints the result of

    nvidia-modprobe --version [1]

    to check that the program is installed and also

    lspci | grep -i vga [2]

    to see whether there is actually a GPU installed in your machine.
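
    For example, a minimal diagnostic task along these lines (a rough, untested sketch; it reuses the CUDA base image and GPU settings you mentioned, adds `nvidia-smi -L` as an extra optional check, and note that `lspci` may require the pciutils package to be present in the image) should surface that information in the task's stdout/stderr:

    task gpu_diagnostics {
      command <<<
        # Is nvidia-modprobe available inside the container?
        nvidia-modprobe --version || true
        # Is a GPU device actually visible on the machine?
        lspci | grep -i vga || true
        # If the NVIDIA driver was mounted correctly by nvidia-docker,
        # this should list the attached GPU(s).
        nvidia-smi -L || true
      >>>

      runtime {
        docker: "nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04"
        gpuType: "nvidia-tesla-k80"
        gpuCount: 1
        zones: "us-central1-c"
      }
    }

    workflow gpu_diagnostics_wf {
      call gpu_diagnostics
    }

    The `|| true` on each line is just so one failing check doesn't hide the output of the others.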

    - Adam

     

    [1] https://github.com/NVIDIA/nvidia-docker/issues/396

    [2] https://github.com/NVIDIA/nvidia-docker/issues/319

  • Comment author
    Adelaide Rhodes

    @AdamNichols - Thanks for the detailed information.

    As for the other part of @aryee's question, about seeing what type of machine was used to run this job: this information is currently available in the FireCloud UI.

    If you navigate into the individual workflow, then navigate into the specific call in question, you'll see an `Operation` link. Click on that link to get the metadata for the instance that ran the call.

    If the concern is about the presence/absence of GPUs on this instance, look for the `acceleratorCount` and `acceleratorType` values in the operation details.

  • Comment author
    Adam Nichols

    That's pretty cool, I did not know you could get the operation description that easily!

  • Comment author
    Martin Aryee

    @AdamNichols - I tried submitting a simpler workflow requesting a GPU with just the two commands you suggested:

    workflow cuda_test {
      call get_machine_info
    }

    task get_machine_info {
      command <<<
        nvidia-modprobe --version
        lspci | grep -i vga
      >>>

      runtime {
        docker: "nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04"
        gpuType: "nvidia-tesla-k80"
        gpuCount: 1
        zones: "us-central1-c"
      }

      output {
      }
    }

     

    The workflow fails, without producing any task logs, with this error:

    message: Workflow failed
    causedBy:
      message: Task cuda_test.get_machine_info:NA:3 failed. The job was stopped before the command finished. PAPI error code 10. 13: VM ggp-8124558412071415724 shut down unexpectedly.

     

    This is the workflow: https://portal.firecloud.org/#workspaces/aryee-merkin/cuda_test/monitor/eaa7a302-fa54-45e8-91e5-891786e4c119/ec922cd4-08eb-49c3-94ff-f68e001722da

  • Comment author
    Martin Aryee

    Update: I think there are two separate problems here. I resubmitted the job above (without any changes) and this time it got past the PAPI error 10.13 stage, but then it fails again with the nvidia-docker error I had initially. It seems it's not even able to load the Docker image. I verified in the operations log (through FireCloud) that the machine does have a GPU attached:

    "resources": {
      "acceleratorCount": "1",
      "acceleratorType": "nvidia-tesla-k80",
      "bootDiskSizeGb": 10,
      "disks": [{
        "autoDelete": false,
        "mountPoint": "/cromwell_root",
        "name": "local-disk",
        "readOnly": false,
        "sizeGb": 10,
        "source": "",
        "type": "PERSISTENT_SSD"
      }],
      "minimumCpuCores": 1,
      "minimumRamGb": 2,
      "noAddress": false,
      "preemptible": false,
      "zones": ["us-central1-c"]
    }
  • Comment author
    Adelaide Rhodes

    We are undergoing a transition from FireCloud to Terra today, but I created a ticket for follow-up tomorrow so we can see whether the problem is related to the Docker image itself. Would you mind sharing your workspace with the following group?

    GROUP_FireCloud-Support@firecloud.org

    Thank you,

  • Comment author
    Martin Aryee

    Great - Thanks. I've shared the workspace with GROUP_FireCloud-Support@firecloud.org.

    Example failed job:
    https://job-manager.dsde-prod.broadinstitute.org/jobs/485faec9-7fc0-485b-9872-429e988d2da1

  • Comment author
    Chris Llanwarne

    Hi @aryee, can we test that the permissions on your account and project are set up correctly by attempting to run the following workflow? It should require no inputs. It's one of our test cases, and I've just verified that it works when I run it, so I'd like to check that you're at least able to run the "GPU hello world" workflow:

     

    task task_with_gpu {
      String gpuTypeInput

      command {
        curl "https://www.googleapis.com/compute/v1/projects/broad-dsde-cromwell-dev/zones/us-central1-c/instances/$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google")?fields=guestAccelerators" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -H 'Accept: application/json' --compressed
      }

      output {
        Object metadata = read_json(stdout())
        Int gpuCount = metadata.guestAccelerators[0].acceleratorCount
        String gpuType = metadata.guestAccelerators[0].acceleratorType
      }

      runtime {
        gpuCount: 1
        gpuType: gpuTypeInput
        docker: "google/cloud-sdk:slim"
        zones: ["us-central1-c"]
      }
    }

    workflow gpu_on_papi {
      call task_with_gpu as task_with_tesla_k80 { input: gpuTypeInput = "nvidia-tesla-k80" }
      call task_with_gpu as task_with_tesla_p100 { input: gpuTypeInput = "nvidia-tesla-p100" }

      output {
        Int tesla80GpuCount = task_with_tesla_k80.gpuCount
        String tesla80GpuType = task_with_tesla_k80.gpuType
        Int tesla100GpuCount = task_with_tesla_p100.gpuCount
        String tesla100GpuType = task_with_tesla_p100.gpuType
      }
    }
  • Comment author
    Martin Aryee

    That workflow fails for me with this in the task stdout:

    {
      "error": {
        "errors": [
          {
            "domain": "global",
            "reason": "forbidden",
            "message": "Required 'compute.instances.get' permission for 'projects/broad-dsde-cromwell-dev/zones/us-central1-c/instances/ggp-9384032044027324051'"
          }
        ],
        "code": 403,
        "message": "Required 'compute.instances.get' permission for 'projects/broad-dsde-cromwell-dev/zones/us-central1-c/instances/ggp-9384032044027324051'"
      }
    }

     

    This is the job: https://portal.firecloud.org/#workspaces/aryee-merkin/cuda_test/monitor/3942d48e-bef6-4d04-ab20-60b105a117a2/c58a3938-3f82-4094-9141-b756070d7160

  • Comment author
    Chris Llanwarne

    Oh, oops! I bet that our hard-coded project name is causing the problem:

    broad-dsde-cromwell-dev

    Would you be able to swap that out for your Google project name and try again?
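
    For what it's worth, one way to avoid hard-coding a project at all (a rough sketch I haven't run here; the PROJECT/INSTANCE/ZONE variable names are just placeholders, and it assumes the account running the task can read instances in its own project) is to ask the GCE metadata server for the project ID and zone as well, e.g. a command block along these lines:

    command {
      # Look up the project, instance name and zone from the GCE metadata
      # server so nothing is tied to a particular Google project.
      PROJECT=$(curl -s "http://metadata.google.internal/computeMetadata/v1/project/project-id" -H "Metadata-Flavor: Google")
      INSTANCE=$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google")
      ZONE=$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor: Google" | awk -F/ '{print $NF}')
      # Same Compute Engine API call as above, but fully self-describing.
      curl -s "https://www.googleapis.com/compute/v1/projects/$PROJECT/zones/$ZONE/instances/$INSTANCE?fields=guestAccelerators" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -H 'Accept: application/json' --compressed
    }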

  • Comment author
    Chris Llanwarne

    Update: I've copied and tried the WDL that Adam suggested above and have recreated your issue. I suspect that this is an issue with the driver versions being used. I'm continuing to investigate from here.

     

    PS: Don't worry about trying to run the WDL I posted above, I think you already confirmed that your instances are being set up correctly, which is all that WDL would really confirm. Thanks!

  • Comment author
    Chris Llanwarne

    Hi aryee -

    I've done a little experimentation myself and talked to a few other users who have been using GPUs in their workflows. It appears that the way to make GPU support work is to use PAPIv2 rather than PAPIv1 (PAPI is the Google interface we use to submit jobs into their cloud, and v2 is the new version of that interface). Right now FireCloud has opt-in beta support that allows projects to use PAPIv2 rather than PAPIv1, but within a couple of weeks that's going to change and everybody will be using v2 by default.

    So there are two options here - 

    1. You can be added to the PAPIv2 whitelist, which means that all of your workflows will start being submitted to PAPIv2 immediately, and you'll get a head start on being able to use all of the features which are "PAPIv2 only". In theory this is a transparent change to you (apart from the addition of new features of course!), but there's a chance you'll see slightly different error messages than you're used to, and maybe you'll find a few bugs which we're still working to iron out.
    2. You can wait a few more weeks until you get upgraded onto PAPIv2 along with everybody else.

    I hope that all made sense - if you want to explore the PAPIv2 whitelist option, let us know and I'll see if I can work out how to connect you with that scheme.
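
    For reference, the WDL itself shouldn't need to change when a project moves to PAPIv2 - the same gpuType/gpuCount runtime attributes apply. As a hedged sketch (assuming a recent enough Cromwell; treat the attribute and value below as something to verify against the Cromwell documentation), PAPIv2 also understands an optional nvidiaDriverVersion runtime attribute:

    runtime {
      docker: "quay.io/aryeelab/guppy-gpu"
      gpuType: "nvidia-tesla-k80"
      gpuCount: 1
      # Optional, PAPIv2 only: pin a specific NVIDIA driver version.
      # "390.46" is a placeholder; confirm supported values in the Cromwell docs.
      nvidiaDriverVersion: "390.46"
      zones: "us-central1-c"
    }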

    Thanks -

    Chris

  • Comment author
    Martin Aryee

    It would be great to be added to the PAPIv2 beta/whitelist. I'm not running any production workflows and can tolerate the risk.

    Thanks!

    Martin

     

  • Comment author
    Adelaide Rhodes

    Hi Martin -

    Chris is working out how to make the request. We really appreciate the feedback. I am sure other users trying to run VMs with GPUs will benefit from your request.

    Adelaide

  • Comment author
    Chris Llanwarne

    Hi Martin -

    I just want to confirm this with you one last time before we push the button on this.

    • The change will update all workspaces in the project to use PAPIv2 instead of PAPIv1. If that's a shared project, you may want to check with any collaborators that they're happy with the change.
    • I can apply this to either the aryee-merkin or the aryee-lab project; let me know which one you'd like me to apply it to.

    Thanks - 

    Chris

  • Comment author
    Martin Aryee

    Let’s do it for aryee-merkin. Thanks!

  • Comment author
    Doug Voet

    Hi Martin, 

    I updated the aryee-merkin project to use PAPIv2.

    Let us know if you have any problems.

    Thanks,

    Doug

  • Comment author
    Martin Aryee

    Great - That did the trick. Our GPU tasks are now running happily in Terra.

    Thanks very much,

    Martin

     

