Make --gpus work with nvidia gpus #21180
Conversation
This feels very nvidia-specific, but I'm not sure how to make it GPU-agnostic.
@Gnyblast Please try this out.
Ephemeral COPR build failed. @containers/packit-build please check.
cmd/podman/common/create.go
gpuFlagName := "gpus"
createFlags.String(gpuFlagName, "", "GPU devices to add to the container ('all' to pass all GPUs)")
_ = cmd.RegisterFlagCompletionFunc(gpuFlagName, completion.AutocompleteNone)
_ = createFlags.MarkHidden("gpus")
Why hide this flag if we're now supporting it?
cmd/podman/containers/create.go
@@ -121,6 +121,10 @@ func commonFlags(cmd *cobra.Command) error {
	if cmd.Flags().Changed("image-volume") {
		cliVals.ImageVolume = cmd.Flag("image-volume").Value.String()
	}

	if cmd.Flags().Changed("gpus") {
		cliVals.Devices = append(cliVals.Devices, "nvidia.com/gpu="+cmd.Flag("gpus").Value.String())
Huh, did docker and nvidia really hardcode "GPU = nvidia"? The docs seem to say so...
But I can't find a reference to nvidia.com in the docker sources; are we sure that's not actually in the nvidia container plugin?
Looking a bit more closely, this seems to map to a gpu capability: https://github.com/docker/cli/blob/master/opts/gpus_test.go#L24
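For reference, the test that link points at exercises docker's --gpus option parser. A minimal, hypothetical Go sketch of that kind of mapping (the type and function names here are my own, not docker's actual code) shows how a --gpus value can be expressed as a vendor-neutral "gpu" capability request, with no nvidia name in it:

// Hypothetical sketch, not docker's actual code: a --gpus value such as
// "all", "2", or "device=0,1" maps to a generic gpu capability request.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type gpuRequest struct {
	Count        int      // -1 means "all"
	DeviceIDs    []string // explicit device selection
	Capabilities []string // always includes "gpu"
}

func parseGpus(value string) gpuRequest {
	req := gpuRequest{Capabilities: []string{"gpu"}}
	switch {
	case value == "all":
		req.Count = -1
	case strings.HasPrefix(value, "device="):
		req.DeviceIDs = strings.Split(strings.TrimPrefix(value, "device="), ",")
	default:
		if n, err := strconv.Atoi(value); err == nil {
			req.Count = n
		}
	}
	return req
}

func main() {
	fmt.Printf("%+v\n", parseGpus("all"))        // {Count:-1 DeviceIDs:[] Capabilities:[gpu]}
	fmt.Printf("%+v\n", parseGpus("device=0,1")) // {Count:0 DeviceIDs:[0 1] Capabilities:[gpu]}
}

The point of the sketch is that on docker's side the value appears to map to a generic gpu capability; the nvidia.com/gpu naming used in this PR comes from the CDI layer rather than from docker itself.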
Shall I compile this version from source and try it?
Yes. We have other hidden Docker-compatibility flags that we rework so that Podman handles them, without encouraging users to use those flags, which is why I kept --gpus hidden. Nvidia seems to document the Podman way of doing this.
Note this only touches the CLI level; #21156 explicitly talks about using the API, so I don't think this can work.
This argument is correct: these applications are using podman.sock to communicate, so from the looks of it they talk to the API. Having it work from the CLI is good, but not something that will solve my problem.
OK, we could move it to the API level. But @Luap99's question still hasn't been answered.
If you mean this:
As far as I can see, the way you mapped it in the CLI code looks OK, since you basically append the option argument to the end of the device list.
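Since the appended value is a fully-qualified CDI device name (nvidia.com/gpu=<value>) rather than a /dev path, a rough sketch of the distinction may help. This is only an illustration of the CDI naming convention (vendor.com/class=name), not podman's actual validation code:

// Illustration only: telling a CDI device name apart from a plain device path.
package main

import (
	"fmt"
	"strings"
)

// isCDIDeviceName reports whether a device value looks like a fully-qualified
// CDI name of the form <vendor>/<class>=<name>, e.g. "nvidia.com/gpu=all",
// as opposed to a path like "/dev/fuse".
func isCDIDeviceName(device string) bool {
	if strings.HasPrefix(device, "/") {
		return false
	}
	slash := strings.Index(device, "/")
	eq := strings.Index(device, "=")
	return slash > 0 && eq > slash
}

func main() {
	fmt.Println(isCDIDeviceName("nvidia.com/gpu=all")) // true
	fmt.Println(isCDIDeviceName("/dev/fuse"))          // false
}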
OK, I documented --gpus (it is no longer hidden).
Thanks @rhatdan, thanks a lot.
Build is broken; looks like a docs error. On the flag itself: it looks like there will be a hard requirement that a CDI config for the associated GPU be present. We'll have to document that.
I'll wait for the build issue to get fixed before going forward and trying to test it, since obviously I will not be able to compile it otherwise.
You should be fine, actually - that's just our CI making sure we don't commit without working manpages. Local builds with make should still work.
For the CLI it works:

guney@test:~$ ./podman run --rm --gpus all ubuntu:22.04 nvidia-smi -L
GPU 0: NVIDIA A40 (UUID: GPU-393d7bf9-e15f-c3ec-4216-b16ce9040376)
GPU 1: NVIDIA A40 (UUID: GPU-51e6cab9-7be7-14f0-0bde-939a9f1f76b6)
GPU 2: NVIDIA A40 (UUID: GPU-17cc7282-7b75-873a-78b4-48fb6448b4e4)
GPU 3: NVIDIA A40 (UUID: GPU-3f54228f-419b-6dfc-d0cc-1ee4902f1deb)

For the API I couldn't test it yet, because it looks like the podman/docker API changed the way it responds to a specific request; it was returning status code 200 for it before.

This is from Podman 4.5.1:

guney@test:/usr/local/bin$ curl -XPOST -I --unix-socket /run/user/${UID}/podman/podman.sock -H content-type:application/json "http://d/v1.41/images/create?tag=latest&fromImage=localhost%2Ftest-image"
HTTP/1.1 200 OK
Api-Version: 1.41
Libpod-Api-Version: 4.5.1
Server: Libpod/4.5.1 (linux)
X-Reference-Id: 0xc0001aa100
Date: Wed, 10 Jan 2024 10:38:53 GMT
Transfer-Encoding: chunked

This is from the podman 5.0.0-dev that I compiled:

guney@test:/usr/local/bin$ curl -XPOST -I --unix-socket /run/user/${UID}/podman/podman.sock -H content-type:application/json "http://d/v1.41/images/create?tag=latest&fromImage=localhost%2Ftest-image"
HTTP/1.1 500 Internal Server Error
Api-Version: 1.41
Libpod-Api-Version: 5.0.0-dev
Server: Libpod/5.0.0-dev (linux)
X-Reference-Id: 0xc0000f6c10
Date: Wed, 10 Jan 2024 11:27:15 GMT
Transfer-Encoding: chunked
Somewhat documented here:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
https://stackoverflow.com/questions/25185405/using-gpu-from-a-docker-container

Fixes: containers#21156

Don't have access to nvidia GPUs, relying on contributor testing.

Signed-off-by: Daniel J Walsh <[email protected]>
I am not sure why it would fail via the API; it should be following the same route.
Looks like I am getting the same error from both.
Hi @rhatdan, it's not that the API is not working; the API started to return proper response codes since this commit:

And that change breaks how my client handles this request. I have another question:

device_requests=[
    docker.types.DeviceRequest(count=-1, capabilities=[['gpu']])
]

The above code is how, say, Python attaches a GPU to the container via the API. Is this the development you did on the API side?
Does your machine have a GPU? Did you follow the steps here, https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html, to generate the CDI specification file that gets used when GPUs are attached to the container?
No, my laptop does not have GPUs; the errors are what I expected. I was just showing that the APIs seem to be returning the same error with the nvidia.com/gpu modification.
Yes, but the REST API also returns HTTP status codes, which changed with the above commit. Anyway, the other thing I noticed is that the REST API SDKs are actually using something called DeviceRequests.
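For context, the field the SDKs populate corresponds to DeviceRequests under HostConfig in a Docker-compatible container-create payload. Roughly, each entry has this shape (a sketch based on the Docker Engine API documentation, not podman's own types):

// Sketch of a single device request as it travels in the create payload
// (HostConfig.DeviceRequests). The docker.types.DeviceRequest call above
// fills exactly these fields.
package api

type DeviceRequest struct {
	Driver       string            `json:"Driver"`       // e.g. "nvidia"; may be empty
	Count        int               `json:"Count"`        // -1 means "all available"
	DeviceIDs    []string          `json:"DeviceIDs"`    // explicit IDs/UUIDs instead of Count
	Capabilities [][]string        `json:"Capabilities"` // e.g. [["gpu"]]
	Options      map[string]string `json:"Options"`      // driver-specific options
}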
If podman run --gpus ... works, then podman run --remote would work when the gpus flag is passed to the server. So I don't see why this would not work if you pass the flag via the API.
There is nothing in this PR that would affect the endpoint, but podman 5.0 is the active upstream repo, meaning that if you build podman you are testing against a 5.0 endpoint.
@mheon @Luap99 @giuseppe @vrothberg PTAL. I don't see any other changes; can we get this in?
LGTM
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: giuseppe, rhatdan.
@Gnyblast @cgwalters @Luap99 PTAL
Well, this does not fix the API and as such does not fix #21156. I am not against merging, but I am just pointing out that it does not address the issue and likely only ever works for the CLI case.
I'm confirming that this development works with:

device_requests=[
    docker.types.DeviceRequest(count=-1, capabilities=[['gpu']])
]

I'm also not sure how the Docker or Podman SDK converts this code into something the API can read, or whether it gets translated to the API as --gpus.
@Luap99 Remind me how this does not fix the API; the gpus value is being passed to the server side and handled there, as I recall.
Merged c41c30b into containers:main
Because docker sends the data via DeviceRequests in the request body, not as a --gpus flag, so the CLI-side handling never sees it.
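So an API-level fix would have to happen where the create payload is decoded, translating DeviceRequests entries into device names the backend already understands. A purely hypothetical sketch (not code from this PR), reusing the DeviceRequest shape outlined above and hardcoding the same nvidia.com/gpu CDI prefix the CLI path uses:

// Hypothetical server-side translation: DeviceRequests carrying a "gpu"
// capability become CDI device names such as "nvidia.com/gpu=all" or
// "nvidia.com/gpu=<id>". Not code from this PR.
func devicesFromGPURequests(reqs []DeviceRequest) []string {
	var devices []string
	for _, req := range reqs {
		if !hasGPUCapability(req.Capabilities) {
			continue
		}
		if len(req.DeviceIDs) > 0 {
			for _, id := range req.DeviceIDs {
				devices = append(devices, "nvidia.com/gpu="+id)
			}
			continue
		}
		if req.Count == -1 {
			devices = append(devices, "nvidia.com/gpu=all")
		}
	}
	return devices
}

func hasGPUCapability(caps [][]string) bool {
	for _, set := range caps {
		for _, c := range set {
			if c == "gpu" {
				return true
			}
		}
	}
	return false
}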
Just dropping this here for reference. NVIDIA has actively discouraged the use of
Somewhat documented here:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
https://stackoverflow.com/questions/25185405/using-gpu-from-a-docker-container
Fixes: #21156
Don't have access to nvidia GPUs; relying on contributor testing.
Does this PR introduce a user-facing change?