Add option in toolkit container to enable CDI in runtime #838

Open

cdesiniotis wants to merge 7 commits into main from enable-cdi-toolkit-container
Conversation

@cdesiniotis (Contributor) commented Dec 18, 2024

These changes add an option to enable CDI in container engines (containerd, crio, docker) from the toolkit container.

This can be triggered through the --enable-cdi-in-runtime command-line option or the RUNTIME_ENABLE_CDI environment variable.
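
For context, a minimal sketch of how such a flag could be declared with urfave/cli, reusing the names discussed in this PR; the exact diff, field names, and defaults may differ:

// Sketch only: an additional boolean toolkit flag wired to the
// RUNTIME_ENABLE_CDI environment variable.
&cli.BoolFlag{
	Name:        "enable-cdi-in-runtime",
	Usage:       "Enable Container Device Interface (CDI) in the configured runtime",
	Destination: &opts.EnableCDI,
	EnvVars:     []string{"RUNTIME_ENABLE_CDI"},
},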

@cdesiniotis force-pushed the enable-cdi-toolkit-container branch 2 times, most recently from 182d161 to 0aaafb4 on December 18, 2024 20:28
@@ -89,6 +90,13 @@ func Flags(opts *Options) []cli.Flag {
EnvVars: []string{"NVIDIA_RUNTIME_SET_AS_DEFAULT", "CONTAINERD_SET_AS_DEFAULT", "DOCKER_SET_AS_DEFAULT"},
Hidden: true,
},
&cli.BoolFlag{
Name: "runtime-enable-cdi",

Contributor

Let the flag be passed as --enable-cdi. The envvar and the Option variable can have the Runtime/RUNTIME prefix. This is consistent with the other flags like socket, restart-mode, etc.

Suggested change
Name: "runtime-enable-cdi",
Name: "enable-cdi",

Member

Let's try to be consistent with the option for the nvidia-ctk runtime configure command.

Contributor Author

The problem is that we already have a cdi-enabled / enable-cdi flag defined in the toolkit options which triggers the generation of a CDI spec:

&cli.BoolFlag{
	Name:        "cdi-enabled",
	Aliases:     []string{"enable-cdi"},
	Usage:       "enable the generation of a CDI specification",
	Destination: &opts.cdiEnabled,
	EnvVars:     []string{"CDI_ENABLED", "ENABLE_CDI"},
},

Any ideas on what the name of the new flag should be?

Member

Thanks for the clarification.

One question I would have is whether we need a separate flag. Would it not make sense for the same flag / envvar (e.g. CDI_ENABLED) to also be used to trigger enabling CDI in the runtime? Do older versions of containerd complain about invalid config options?

If we feel we want finer control, then I'm OK with using RUNTIME_ENABLE_CDI or ENABLE_CDI_IN_RUNTIME (maybe the latter).

@@ -89,6 +90,13 @@ func Flags(opts *Options) []cli.Flag {
EnvVars: []string{"NVIDIA_RUNTIME_SET_AS_DEFAULT", "CONTAINERD_SET_AS_DEFAULT", "DOCKER_SET_AS_DEFAULT"},
Hidden: true,
},
&cli.BoolFlag{
Name: "runtime-enable-cdi",
Usage: "Enable CDI in the configured runtime",

Contributor

Suggested change
Usage: "Enable CDI in the configured runtime",
Usage: "Enable Container Device Interface (CDI) in the configured runtime",

@@ -163,3 +163,7 @@ func (c *ConfigV1) GetRuntimeConfig(name string) (engine.RuntimeConfig, error) {
tree: runtimeData,
}, nil
}

func (c *ConfigV1) EnableCDI() {
c.Set("enable_cdi", true)

Member

I think we should be able to cast c to a Config type and call EnableCDI here instead of reimplementing it.
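
A minimal sketch of that suggestion, assuming ConfigV1 is declared as a plain type conversion of Config (an assumption; the actual type definitions in the PR may differ):

func (c *ConfigV1) EnableCDI() {
	// Sketch only: delegate to the shared Config implementation via a
	// type conversion instead of re-implementing the key lookup here.
	(*Config)(c).EnableCDI()
}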

pkg/config/engine/api.go (outdated comment, resolved)
@elezar (Member) previously requested changes Dec 20, 2024

Some minor comments.

@elezar force-pushed the enable-cdi-toolkit-container branch from 0aaafb4 to d4b8bc5 on January 16, 2025 13:58
@elezar force-pushed the enable-cdi-toolkit-container branch from d4b8bc5 to edeeaa4 on January 16, 2025 14:39
@elezar force-pushed the enable-cdi-toolkit-container branch from edeeaa4 to 67d6718 on January 27, 2025 12:29
cdesiniotis and others added 4 commits January 27, 2025 15:51
This change adds an EnableCDI method to the container engine config files and updates the 'nvidia-ctk runtime configure' command to use this new method.

Signed-off-by: Christopher Desiniotis <[email protected]>
@elezar force-pushed the enable-cdi-toolkit-container branch from 67d6718 to 2b417c1 on January 27, 2025 14:52
@elezar dismissed their stale review January 27, 2025 15:01

Changes applied.

@elezar (Member) commented Jan 27, 2025

@klueska as discussed today, I have rebased this PR.

@elezar (Member) commented Jan 28, 2025

As discussed yesterday, we should have the cdi.enabled = true option in the GPU Operator CRD also trigger enabling CDI in runtimes where relevant (e.g. Containerd). In most cases this should not conflict with using the nvidia-container-runtime.cdi runtime because:

  • The nvidia-container-runtime.cdi is configured to react to the nvidia.cdi.k8s.io/ annotation prefixes.
  • The k8s-device-plugin uses the nvidia.cdi.k8s.io/ annotation prefix.
  • The nvidia-container-runtime.cdi is configured to use the management.nvidia.com/gpu CDI kind by default.

Note that the following edge case exists:

  • If the k8s-device-plugin is configured to inject CDI devices using the CRI fields (not possible using the current GPU Operator API), certain kubelet versions will map these requests to cdi.k8s.io/ annotations in addition to passing the requested CDI devices in the CRI field. This should still not be an issue unless a user has explicitly added the cdi.k8s.io/ annotation prefix to the list of allowed annotations.

I think it should be sufficient to have CDI_ENABLED imply RUNTIME_ENABLE_CDI, with an explicit setting of RUNTIME_ENABLE_CDI overriding the inferred value.
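
A sketch of how that precedence could be wired up when resolving the toolkit options (assuming c is the cli context, to holds the toolkit options, and opts the runtime options; the exact wiring in this PR may differ):

// Sketch only: CDI_ENABLED implies enabling CDI in the runtime unless the
// runtime-specific flag/envvar was set explicitly by the user.
if !c.IsSet("enable-cdi-in-runtime") {
	opts.EnableCDI = to.CDI.Enabled
}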

// EnableCDI sets the enable_cdi field in the Containerd config to true.
func (c *Config) EnableCDI() {
	config := *c.Tree
	config.SetPath([]string{"plugins", c.CRIRuntimePluginName, "enable_cdi"}, true)

nitpick: add an explicit log for this enablement, something like:
c.Logger.Infof("enabled CDI in %v ", c.CRIRuntimePluginName)

Contributor Author

If we want to add a debug or log statement, it should be at the call site.


func (c *ConfigV1) EnableCDI() {
	config := *c.Tree
	config.SetPath([]string{"plugins", "cri", "containerd", "enable_cdi"}, true)

Here too: c.Logger.Infof("enabled CDI in cri and containerd")

Contributor Author

If we want to add a debug or log statement, it should be at the call site.

@cdesiniotis (Contributor Author)

> As discussed yesterday, we should have the cdi.enabled = true option in the GPU Operator CRD also trigger enabling CDI in runtimes where relevant (e.g. Containerd). In most cases this should not conflict with using the nvidia-container-runtime.cdi runtime because:
>
>   • The nvidia-container-runtime.cdi is configured to react to the nvidia.cdi.k8s.io/ annotation prefixes.
>   • The k8s-device-plugin uses the nvidia.cdi.k8s.io/ annotation prefix.
>   • The nvidia-container-runtime.cdi is configured to use the management.nvidia.com/gpu CDI kind by default.
>
> Note that the following edge case exists:
>
>   • If the k8s-device-plugin is configured to inject CDI devices using the CRI fields (not possible using the current GPU Operator API), certain kubelet versions will map these requests to cdi.k8s.io/ annotations in addition to passing the requested CDI devices in the CRI field. This should still not be an issue unless a user has explicitly added the cdi.k8s.io/ annotation prefix to the list of allowed annotations.
>
> I think it should be sufficient to have CDI_ENABLED imply RUNTIME_ENABLE_CDI, with an explicit setting of RUNTIME_ENABLE_CDI overriding the inferred value.

@elezar thanks for writing this up. I am aligned with this and agree that we can have the cdi.enabled=true option also trigger the enablement of CDI in runtimes. I will work on the operator changes.

This change also enables CDI in the configured runtime when the toolkit
is installed with CDI enabled.

Signed-off-by: Evan Lezar <[email protected]>
This change adds a basic unit test for the nvidia-ctk-installer command.

Signed-off-by: Evan Lezar <[email protected]>
@elezar force-pushed the enable-cdi-toolkit-container branch from 8ad9c6a to d8cd543 on January 29, 2025 14:37
Comment on lines +112 to +114
if !c.IsSet("enable-cdi-in-runtime") {
	opts.EnableCDI = to.CDI.Enabled
}

Contributor Author

Now that I think about it again, we don't need any operator change with this in place. As you said in a prior comment:

> I think it should be sufficient to have CDI_ENABLED imply RUNTIME_ENABLE_CDI, with an explicit setting of RUNTIME_ENABLE_CDI overriding the inferred value.

So by default, if a user sets cdi.enabled=true in the operator, we enable CDI in the runtime (e.g. containerd) while allowing them to opt out of this behavior by manually configuring the RUNTIME_ENABLE_CDI environment variable in the toolkit.

Contributor

So cdi.default becomes a no-op?

@cdesiniotis (Contributor Author) commented Jan 29, 2025

No. This PR does not change the semantics of the cdi.default field. When cdi.default=true, the "default" nvidia runtime class will be configured in "cdi" mode, meaning that any GPUs injected by the nvidia runtime class are injected via CDI.

In the GPU Operator, cdi.enabled=true triggers the creation of an additional NVIDIA runtime class, named nvidia-cdi, and configures additional envvars in the toolkit / device-plugin so that CDI specs get generated (amongst other things). If users want to leverage CDI for device injection, they can use the nvidia-cdi runtime class in their pod spec. If users want CDI to be used by default, then they would set cdi.default=true.

This PR makes it so that setting cdi.enabled=true in the operator also enables CDI in the runtime (e.g. containerd) without requiring any additional operator changes.

@cdesiniotis (Contributor Author)

@elezar the changes you added lgtm. Can we merge this PR?
