Add option in toolkit container to enable CDI in runtime #838
base: main
Conversation
tools/container/runtime/runtime.go (outdated)

```diff
@@ -89,6 +90,13 @@ func Flags(opts *Options) []cli.Flag {
 			EnvVars: []string{"NVIDIA_RUNTIME_SET_AS_DEFAULT", "CONTAINERD_SET_AS_DEFAULT", "DOCKER_SET_AS_DEFAULT"},
 			Hidden:  true,
 		},
+		&cli.BoolFlag{
+			Name: "runtime-enable-cdi",
```
Let the flag be passed as `--enable-cdi`. The envvar and the Option variable can have the `Runtime`/`RUNTIME` prefix. This is consistent with the other flags like `socket`, `restart-mode`, etc.

Suggested change:

```diff
-			Name: "runtime-enable-cdi",
+			Name: "enable-cdi",
```
Let's try to be consistent with the option for the `nvidia-ctk runtime configure` command.
The problem is that we already have a `cdi-enabled` / `enable-cdi` flag defined in the toolkit options which triggers the generation of a CDI spec:

nvidia-container-toolkit/tools/container/toolkit/toolkit.go, lines 173 to 179 in 8d869ac:

```go
&cli.BoolFlag{
	Name:        "cdi-enabled",
	Aliases:     []string{"enable-cdi"},
	Usage:       "enable the generation of a CDI specification",
	Destination: &opts.cdiEnabled,
	EnvVars:     []string{"CDI_ENABLED", "ENABLE_CDI"},
},
```
Any ideas on what the name of the new flag should be?
Thanks for the clarification.

One question I would have is whether we need a separate flag. Would it not make sense for the same flag / envvar (e.g. `CDI_ENABLED`) to also trigger enabling CDI in the runtime? Do older versions of containerd complain about invalid config options?

If we feel we want finer control, then I'm OK with using `RUNTIME_ENABLE_CDI` or `ENABLE_CDI_IN_RUNTIME` (maybe the latter).
tools/container/runtime/runtime.go (outdated)

```diff
@@ -89,6 +90,13 @@ func Flags(opts *Options) []cli.Flag {
 			EnvVars: []string{"NVIDIA_RUNTIME_SET_AS_DEFAULT", "CONTAINERD_SET_AS_DEFAULT", "DOCKER_SET_AS_DEFAULT"},
 			Hidden:  true,
 		},
+		&cli.BoolFlag{
+			Name:  "runtime-enable-cdi",
+			Usage: "Enable CDI in the configured runtime",
```
Suggested change:

```diff
-			Usage: "Enable CDI in the configured runtime",
+			Usage: "Enable Container Device Interface (CDI) in the configured runtime",
```
```diff
@@ -163,3 +163,7 @@ func (c *ConfigV1) GetRuntimeConfig(name string) (engine.RuntimeConfig, error) {
 		tree: runtimeData,
 	}, nil
 }
+
+func (c *ConfigV1) EnableCDI() {
+	c.Set("enable_cdi", true)
```
I think we should be able to cast `c` to a `Config` type and call `EnableCDI` here instead of reimplementing it.
Some minor comments.
This change adds an EnableCDI method to the container engine config files and updates the `nvidia-ctk runtime configure` command to use this new method. Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
@klueska as discussed today, I have rebased this PR.
I think it should be sufficient to have `CDI_ENABLED` imply `RUNTIME_ENABLE_CDI`, with an explicit setting of `RUNTIME_ENABLE_CDI` overriding the inferred value.
```go
// EnableCDI sets the enable_cdi field in the Containerd config to true.
func (c *Config) EnableCDI() {
	config := *c.Tree
	config.SetPath([]string{"plugins", c.CRIRuntimePluginName, "enable_cdi"}, true)
```
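For reference, with the common CRI plugin name `io.containerd.grpc.v1.cri` (the actual value comes from `c.CRIRuntimePluginName`, so this is an illustration rather than the only possible output), the `SetPath` call above corresponds to a containerd v2-style config fragment along these lines:

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
```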
nitpick: add an explicit log for this enablement, something like `c.Logger.Infof("enabled CDI in %v", c.CRIRuntimePluginName)`
If we want to add a debug or log statement, it should be at the call site.
```go
func (c *ConfigV1) EnableCDI() {
	config := *c.Tree
	config.SetPath([]string{"plugins", "cri", "containerd", "enable_cdi"}, true)
```
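Similarly, the v1 code path above corresponds to a config fragment like the following (a sketch inferred from the `SetPath` call, not copied from containerd documentation):

```toml
[plugins.cri.containerd]
  enable_cdi = true
```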
here too: `c.Logger.Infof("enabled CDI in cri and containerd")`
If we want to add a debug or log statement, it should be at the call site.
@elezar thanks for writing this up. I am aligned with this and agree that we can have the `CDI_ENABLED` setting imply `RUNTIME_ENABLE_CDI`.
This change also enables CDI in the configured runtime when the toolkit is installed with CDI enabled. Signed-off-by: Evan Lezar <[email protected]>
This change adds a basic unit test for the nvidia-ctk-installer command. Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
```go
if !c.IsSet("enable-cdi-in-runtime") {
	opts.EnableCDI = to.CDI.Enabled
}
```
Now that I think about it again, we don't need any operator change with this in place. As you said in a prior comment: "I think it should be sufficient to have CDI_ENABLED imply RUNTIME_ENABLE_CDI with an explicit setting of RUNTIME_ENABLE_CDI overriding the inferred value."

So by default, if a user sets `cdi.enabled=true` in the operator, we enable CDI in the runtime (e.g. containerd) while allowing them to opt out of this behavior by manually configuring the `RUNTIME_ENABLE_CDI` environment variable in the toolkit.
So `cdi.default` becomes a no-op?
No. This PR does not change the semantics of the `cdi.default` field. When `cdi.default=true`, the "default" `nvidia` runtime class will be configured in "cdi" mode, meaning that any GPUs injected by the `nvidia` runtime class will have been done via CDI.

In the GPU Operator, `cdi.enabled=true` triggers the creation of an additional NVIDIA runtime class, named `nvidia-cdi`, and configures additional envvars in the toolkit / device plugin so that CDI specs get generated (amongst other things). If users want to leverage CDI for device injection, they can use the `nvidia-cdi` runtime class in their pod spec. If users want CDI to be used by default, then they would set `cdi.default=true`.

This PR makes it so that setting `cdi.enabled=true` in the operator also enables CDI in the runtime (e.g. containerd) without requiring any additional operator changes.
@elezar the changes you added lgtm. Can we merge this PR?
These changes add an option to enable CDI in container engines (containerd, crio, docker) from the toolkit container. This can be triggered through the `--enable-cdi-in-runtime` command-line option or the `RUNTIME_ENABLE_CDI` environment variable.