Add option in toolkit container to enable CDI in runtime #838

Open

cdesiniotis wants to merge 7 commits into main from enable-cdi-toolkit-container
Conversation

@cdesiniotis (Contributor) commented Dec 18, 2024

These changes add an option to enable CDI in container engines (containerd, crio, docker) from the toolkit container.

This can be triggered through the --enable-cdi-in-runtime command-line option or the RUNTIME_ENABLE_CDI environment variable.
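
For context, a minimal sketch of how such a flag could be declared with urfave/cli, reusing the names discussed in this PR; the exact diff, field names, and defaults may differ:

// Sketch only: an additional boolean toolkit flag wired to the
// RUNTIME_ENABLE_CDI environment variable.
&cli.BoolFlag{
	Name:        "enable-cdi-in-runtime",
	Usage:       "Enable Container Device Interface (CDI) in the configured runtime",
	Destination: &opts.EnableCDI,
	EnvVars:     []string{"RUNTIME_ENABLE_CDI"},
},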

@cdesiniotis force-pushed the enable-cdi-toolkit-container branch 2 times, most recently from 182d161 to 0aaafb4 on December 18, 2024 20:28
@@ -89,6 +90,13 @@ func Flags(opts *Options) []cli.Flag {
EnvVars: []string{"NVIDIA_RUNTIME_SET_AS_DEFAULT", "CONTAINERD_SET_AS_DEFAULT", "DOCKER_SET_AS_DEFAULT"},
Hidden: true,
},
&cli.BoolFlag{
Name: "runtime-enable-cdi",

Contributor

Let the flag be passed as --enable-cdi. The envvar and the Option variable can have the Runtime/RUNTIME prefix. This is consistent with the other flags like socket, restart-mode, etc.

Suggested change
Name: "runtime-enable-cdi",
Name: "enable-cdi",

Member

Let's try to be consistent with the option for the nvidia-ctk runtime configure command.

Contributor Author

The problem is that we already have a cdi-enabled / enable-cdi flag defined in the toolkit options which triggers the generation of a CDI spec:

&cli.BoolFlag{
	Name:        "cdi-enabled",
	Aliases:     []string{"enable-cdi"},
	Usage:       "enable the generation of a CDI specification",
	Destination: &opts.cdiEnabled,
	EnvVars:     []string{"CDI_ENABLED", "ENABLE_CDI"},
},

Any ideas on what the name of the new flag should be?

Member

Thanks for the clarification.

One question I would have is whether we need a separate flag. Would it not make sense for the same flag / envvar (e.g. CDI_ENABLED) to also be used to trigger enabling CDI in the runtime? Do older versions of containerd complain about invalid config options?

If we feel we want finer control, then I'm OK with using RUNTIME_ENABLE_CDI or ENABLE_CDI_IN_RUNTIME (maybe the latter).

@@ -89,6 +90,13 @@ func Flags(opts *Options) []cli.Flag {
EnvVars: []string{"NVIDIA_RUNTIME_SET_AS_DEFAULT", "CONTAINERD_SET_AS_DEFAULT", "DOCKER_SET_AS_DEFAULT"},
Hidden: true,
},
&cli.BoolFlag{
Name: "runtime-enable-cdi",
Usage: "Enable CDI in the configured runtime",

Contributor

Suggested change
Usage: "Enable CDI in the configured runtime",
Usage: "Enable Container Device Interface (CDI) in the configured runtime",

@@ -163,3 +163,7 @@ func (c *ConfigV1) GetRuntimeConfig(name string) (engine.RuntimeConfig, error) {
tree: runtimeData,
}, nil
}

func (c *ConfigV1) EnableCDI() {
c.Set("enable_cdi", true)

Member

I think we should be able to cast c to a Config type and call EnableCDI here instead of reimplementing it.
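
A minimal sketch of that suggestion, assuming ConfigV1 is declared as a plain type conversion of Config (an assumption; the actual type definitions in the PR may differ):

func (c *ConfigV1) EnableCDI() {
	// Sketch only: delegate to the shared Config implementation via a
	// type conversion instead of re-implementing the key lookup here.
	(*Config)(c).EnableCDI()
}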

pkg/config/engine/api.go (outdated comment, resolved)
@elezar (Member) previously requested changes Dec 20, 2024

Some minor comments.

@elezar force-pushed the enable-cdi-toolkit-container branch from 0aaafb4 to d4b8bc5 on January 16, 2025 13:58
@elezar force-pushed the enable-cdi-toolkit-container branch from d4b8bc5 to edeeaa4 on January 16, 2025 14:39
@elezar force-pushed the enable-cdi-toolkit-container branch from edeeaa4 to 67d6718 on January 27, 2025 12:29
cdesiniotis and others added 4 commits January 27, 2025 15:51
This change adds an EnableCDI method to the container engine config files and updates the 'nvidia-ctk runtime configure' command to use this new method.

Signed-off-by: Christopher Desiniotis <[email protected]>
@elezar force-pushed the enable-cdi-toolkit-container branch from 67d6718 to 2b417c1 on January 27, 2025 14:52
@elezar dismissed their stale review January 27, 2025 15:01

Changes applied.

@elezar (Member) commented Jan 27, 2025

@klueska as discussed today, I have rebased this PR.

@elezar (Member) commented Jan 28, 2025

As discussed yesterday, we should have the cdi.enabled = true option in the GPU Operator CRD also trigger enabling CDI in runtimes where relevant (e.g. Containerd). In most cases this should not conflict with using the nvidia-container-runtime.cdi runtime because:

  • The nvidia-container-runtime.cdi is configured to react to the nvidia.cdi.k8s.io/ annotation prefixes.
  • The k8s-device-plugin uses the nvidia.cdi.k8s.io/ annotation prefix.
  • The nvidia-container-runtime.cdi is configured to use the management.nvidia.com/gpu CDI kind by default.

Note that the following edge case exists:

  • If the k8s-device-plugin is configured to inject CDI devices using the CRI fields (not possible using the current GPU Operator API), certain kubelet versions will map these requests to cdi.k8s.io/ annotations in addition to passing the requested CDI devices in the CRI field. This should still not be an issue unless a user has explicitly added the cdi.k8s.io/ annotation prefix to the list of allowed annotations.

I think it should be sufficient to have CDI_ENABLED imply RUNTIME_ENABLE_CDI, with an explicit setting of RUNTIME_ENABLE_CDI overriding the inferred value.
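
A sketch of how that precedence could be wired up when resolving the toolkit options (assuming c is the cli context, to holds the toolkit options, and opts the runtime options; the exact wiring in this PR may differ):

// Sketch only: CDI_ENABLED implies enabling CDI in the runtime unless the
// runtime-specific flag/envvar was set explicitly by the user.
if !c.IsSet("enable-cdi-in-runtime") {
	opts.EnableCDI = to.CDI.Enabled
}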

// EnableCDI sets the enable_cdi field in the Containerd config to true.
func (c *Config) EnableCDI() {
	config := *c.Tree
	config.SetPath([]string{"plugins", c.CRIRuntimePluginName, "enable_cdi"}, true)

nitpick: add an explicit log for this enablement, something like:
c.Logger.Infof("enabled CDI in %v ", c.CRIRuntimePluginName)

Contributor Author

If we want to add a debug or log statement, it should be at the call site.


func (c *ConfigV1) EnableCDI() {
	config := *c.Tree
	config.SetPath([]string{"plugins", "cri", "containerd", "enable_cdi"}, true)

Here too: c.Logger.Infof("enabled CDI in cri and containerd")

Contributor Author

If we want to add a debug or log statement, it should be at the call site.

@cdesiniotis (Contributor Author)

> As discussed yesterday, we should have the cdi.enabled = true option in the GPU Operator CRD also trigger enabling CDI in runtimes where relevant (e.g. Containerd). In most cases this should not conflict with using the nvidia-container-runtime.cdi runtime because:
>
>   • The nvidia-container-runtime.cdi is configured to react to the nvidia.cdi.k8s.io/ annotation prefixes.
>   • The k8s-device-plugin uses the nvidia.cdi.k8s.io/ annotation prefix.
>   • The nvidia-container-runtime.cdi is configured to use the management.nvidia.com/gpu CDI kind by default.
>
> Note that the following edge case exists:
>
>   • If the k8s-device-plugin is configured to inject CDI devices using the CRI fields (not possible using the current GPU Operator API), certain kubelet versions will map these requests to cdi.k8s.io/ annotations in addition to passing the requested CDI devices in the CRI field. This should still not be an issue unless a user has explicitly added the cdi.k8s.io/ annotation prefix to the list of allowed annotations.
>
> I think it should be sufficient to have CDI_ENABLED imply RUNTIME_ENABLE_CDI, with an explicit setting of RUNTIME_ENABLE_CDI overriding the inferred value.

@elezar thanks for writing this up. I am aligned with this and agree that we can have the cdi.enabled=true option also trigger the enablement of CDI in runtimes. I will work on the operator changes.

This change also enables CDI in the configured runtime when the toolkit
is installed with CDI enabled.

Signed-off-by: Evan Lezar <[email protected]>
This change adds a basic unit test for the nvidia-ctk-installer command.

Signed-off-by: Evan Lezar <[email protected]>
@elezar force-pushed the enable-cdi-toolkit-container branch from 8ad9c6a to d8cd543 on January 29, 2025 14:37
Comment on lines +112 to +114
if !c.IsSet("enable-cdi-in-runtime") {
	opts.EnableCDI = to.CDI.Enabled
}

Contributor Author

Now that I think about it again, we don't need any operator change with this in place. As you said in a prior comment:

> I think it should be sufficient to have CDI_ENABLED imply RUNTIME_ENABLE_CDI, with an explicit setting of RUNTIME_ENABLE_CDI overriding the inferred value.

So by default, if a user sets cdi.enabled=true in the operator, we enable CDI in the runtime (e.g. containerd) while allowing them to opt out of this behavior by manually configuring the RUNTIME_ENABLE_CDI environment variable in the toolkit.

Contributor

So cdi.default becomes a no-op?

@cdesiniotis (Contributor Author) commented Jan 29, 2025

No. This PR does not change the semantics of the cdi.default field. When cdi.default=true, the "default" nvidia runtime class will be configured in "cdi" mode, meaning that any GPUs injected by the nvidia runtime class are injected via CDI.

In the GPU Operator, cdi.enabled=true triggers the creation of an additional NVIDIA runtime class, named nvidia-cdi, and configures additional envvars in the toolkit / device-plugin so that CDI specs get generated (amongst other things). If users want to leverage CDI for device injection, they can use the nvidia-cdi runtime class in their pod spec. If users want CDI to be used by default, then they would set cdi.default=true.

This PR makes it so that setting cdi.enabled=true in the operator also enables CDI in the runtime (e.g. containerd) without requiring any additional operator changes.

@cdesiniotis (Contributor Author)

@elezar the changes you added lgtm. Can we merge this PR?
