sync sandboxes and containers after starting the pre-installed plugins #43

Iceber · 2023-05-18T06:23:05Z

For plugins registered with nri.sock, NRI will send the full set of pods and containers after starting.

Lines 405 to 430 in 8329599

    
           conn, err := l.Accept() 
        
           if err != nil { 
        
           	log.Infof(ctx, "stopped accepting plugin connections (%v)", err) 
        
           	return 
        
           } 
        
           p, err := r.newExternalPlugin(conn) 
        
           if err != nil { 
        
           	log.Errorf(ctx, "failed to create external plugin: %v", err) 
        
           	continue 
        
           } 
        
           if err := p.start(r.name, r.version); err != nil { 
        
           	log.Errorf(ctx, "failed to start external plugin: %v", err) 
        
           	continue 
        
           } 
        
           r.Lock() 
        
           err = r.syncFn(ctx, p.synchronize) 
        
           if err != nil { 
        
           	log.Infof(ctx, "failed to synchronize plugin: %v", err) 
        
           } else { 
        
           	r.plugins = append(r.plugins, p) 
        
           	r.sortPlugins() 
        
           }

For pre-installed plugins, it is still necessary to synchronize the sandboxes and containers data to the plugin after launch，otherwise the plugin will never be able to get information about existing resources

codecov-commenter · 2023-05-18T06:25:17Z

Codecov Report

Patch coverage: 27.27% and project coverage change: -0.08 ⚠️

Comparison is base (2a8b655) 63.83% compared to head (8fa7ae9) 63.75%.

❗ Current head 8fa7ae9 differs from pull request most recent head 75ad210. Consider uploading reports for the commit 75ad210 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #43      +/-   ##
==========================================
- Coverage   63.83%   63.75%   -0.08%     
==========================================
  Files           9        9              
  Lines        1800     1810      +10     
==========================================
+ Hits         1149     1154       +5     
- Misses        500      503       +3     
- Partials      151      153       +2

Impacted Files	Coverage Δ
pkg/adaptation/adaptation.go	`70.27% <27.27%> (-1.62%)`	⬇️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

klihub

Thanks for the fix ! LGTM.

Iceber · 2023-05-18T10:21:27Z

@klihub I changed the call to syncFn after each pre-install plugin launch to only call syncFn once after all plugins are launched, which avoids unnecessary multiple calls to syncFn in the case of a large number of plugins
https://github.com/containerd/nri/compare/104c8e844981be230d53ffac4f9eee5d1db0634e..6f7af44d11899f94a6f619bab0bba6b012bd8e4e

PTAL, Thanks

klihub · 2023-05-18T10:32:48Z

@klihub I changed the call to syncFn after each pre-install plugin launch to only call syncFn once after all plugins are launched, which avoids unnecessary multiple calls to syncFn in the case of a large number of plugins
https://github.com/containerd/nri/compare/104c8e844981be230d53ffac4f9eee5d1db0634e..6f7af44d11899f94a6f619bab0bba6b012bd8e4e

PTAL, Thanks

True, better to do it that way.
LGTM.

mikebrow

see questions

mikebrow · 2023-05-23T21:02:46Z

pkg/adaptation/adaptation.go

+		for _, plugin := range plugins {
+			us, err := plugin.synchronize(ctx, pods, containers)
+			if err != nil {
+				return nil, fmt.Errorf("failed to sync NRI Plugin %q: %w", plugin.name(), err)


should this be a warning instead and continue?

There are two possible approaches to consider: skipping the synchronization if it fails or returning an error.

I chose to return an error for the following reason: pre-install plugins and external plugins are different in some ways.
The pre-install plugin binary is launched by the NRI adaptation, so it is difficult for users to know whether the binary plugin has been launched successfully, apart from the logs.
Therefore, users may overlook the fact that the plugin binary is not actually working. That's why I require that all pre-install plugins requested by the user must be launched(started, synced) successfully.
This is similar to starting and returning an error if it fails.

nri/pkg/adaptation/adaptation.go

Lines 336 to 338 in fa3ab8f

if err := p.start(r.name, r.version); err != nil {

return err

}

On the other hand, external plugins are usually registered by external binaries, and we could perceive more clearly whether the NRI registration and synchronization are successful or not.

I suppose two other patterns would be:
3) to introduce config/management for requiring/ordering (add requires logic), auto restarting etc.. for plugin start && synch
4) attempt to start and synch all and report all errors

that said should we stop any that we started if there's an error..

to introduce config/management for requiring/ordering (add requires logic), auto restarting etc.. for plugin start && synch

Could you introduce more details? thanks

current containerd plugins have a requires section for specifying which plugins must be init'd.. before initing a plugin

we probably need an additional state beyond init for the containerd plugins that would cover ready for work.. for example in the cri case ready may mean all containers/pods restarted ready for grpc calls

nri might need a requires list for the nri plugins (unless we go with external management via systemd? or whatnot) and then within that you could have the concept of ready which might at least cover init and sync ^

additionally we might want to have a flag in the plugin config regarding if sync is required for "ready"

@dmcgowan thoughts?

pkg/adaptation/adaptation.go

champtar · 2024-03-16T00:47:22Z

Can we merge this as is ? It does fix a blocking bug

champtar · 2024-05-10T14:26:16Z

ping @mikebrow @dmcgowan this PR is almost 1 year old

mikebrow · 2024-05-22T22:20:26Z

@klihub let's chat with @Iceber if still interested .. maybe at the next or subsequent containerd call

Iceber · 2024-06-04T11:08:19Z

Hi @mikebrow @klihub Do you have any new ideas for this pr?

I think synchronizing all and reporting back all errors is a good way which gives a more complete picture of the errors that may occur

Of course if still think that skipping and printing warnings is a good way to go, I'll adopt it

Also for more complex configurations my thoughts are:

pre-installed binaries are required by default, I don't think it's necessary for the user to place a bunch of plugin binaries in the directory that don't care if they are successfully launched or not
you can enable/disable some pre-installed plugins by configuring the plugin name, so that only some plugins can be started, or some plugins won't be started at all
I think this feature can be placed in another pr

ci fails from errors.Join, It comes from 1.20, we might consider upgrading go.mod to 1.20

klihub · 2024-06-04T17:17:04Z

ci fails from errors.Join, It comes from 1.20, we might consider upgrading go.mod to 1.20

I filed PR #88 to bump the minimum golang requirement to 1.20. It should be fine for all of our downstream packages. You can then rebase this PR if #88 gets merged.

klihub · 2024-06-04T17:35:35Z

Hi @mikebrow @klihub Do you have any new ideas for this pr?

I think synchronizing all and reporting back all errors is a good way which gives a more complete picture of the errors that may occur

I think synchronizing pre-launched plugins (pre-installed plugins in your terminology) would be essential to make sure that we have an identical handshake/initial message flow, regardless of whether a plugin is pre-launched or not. I made a mistake and that is why I completely overlooked that pre-launched plugins do not get properly synchronized during registration. It was not intentional. It would be better to send an empty sync message than no message at all. There are packages out there which do expect the synchronization message to come before they consider themselves ready to start processing other NRI messages.

That said, I think synchronizing pre-launched plugins is only/really relevant for the case when an already active runtime (with existing containers) gets restarted (after all, on initial startup the set of pods and containers to sync with the plugins is empty). So a related question is whether the runtime itself is able to properly determine/rediscover the state of pods and containers by the time pre-launched plugins are being started and what to do if it is not. And this needs to be checked/tested both with containerd and cri-o because they might behave differently. I have a vague recollection that containerd itself would behave slightly differently for 2.0 than 1.7, but I haven't had time right no to check/test this in practice.

Iceber · 2024-06-06T09:01:48Z

That said, I think synchronizing pre-launched plugins is only/really relevant for the case when an already active runtime (with existing containers) gets restarted (after all, on initial startup the set of pods and containers to sync with the plugins is empty). So a related question is whether the runtime itself is able to properly determine/rediscover the state of pods and containers by the time pre-launched plugins are being started and what to do if it is not. And this needs to be checked/tested both with containerd and cri-o because they might behave differently. I have a vague recollection that containerd itself would behave slightly differently for 2.0 than 1.7, but I haven't had time right no to check/test this in practice.

yeah, this is very important, and can be very bad if the synchronization is incomplete with pods and containers.

In containerd 2.0, the recover function initializes the data before the nri is initialized.
https://github.com/containerd/containerd/blob/45bc430dd12539ee93f908750ce24e52e1122e9a/internal/cri/server/service.go#L257-L300

mikebrow

I like this change just some comments to increase logging to help the plugin admin, and also continue vs return early for failure to new/start a plugin.

** In 1.1 let's consider a more formal process with a manager plugin or 3) introduce config/management for requiring/ordering (add requires logic), auto restarting etc.. for plugin start && synch

pkg/adaptation/adaptation.go

mikebrow · 2024-09-18T22:45:54Z

@Iceber wdyt about the logging changes..

Iceber · 2024-09-19T01:35:04Z

@Iceber wdyt about the logging changes..

Sorry I missed the email on this pr, I'll take care of it soon

Signed-off-by: Iceber Gu <[email protected]>

Iceber · 2024-09-19T08:41:42Z

@mikebrow I've added a new commit for better viewing of the changes

I've returned plugins that failed to start to syncFn via syncPlugins. after all, those plugins that didn't start are also sync failures/unsyncable plugins

mikebrow

LGTM

Thanks!

mikebrow · 2024-09-20T16:45:43Z

@klihub did you want to take one more look with the additional changes?

klihub · 2024-09-20T16:49:11Z

Yes, I'll do that.

klihub · 2024-09-20T17:30:11Z

Yes, I'll do that.

It'll take me a bit more time though. I'm not at the keyboard ATM.

klihub

It looks pretty good. I spotted a few wrongly formatted log messages (error wrapping vs. default value format %verb), and one oddly returned (value, error) combo.

Then I have a question to @mikebrow about what we should do if a pre-installed plugin failed to start up or get synchronized. The original code errors out and this PR keeps that behavior intact. But now I started to have second thoughts about that. Do we want to prevent the runtime from starting up in such a case, or just log errors, ignore failing plugins and keep going ?

If we get the few log formatting problems fixed, maybe also change that odd return value combo to the standard convention of 'return a nil-value with a non-nil error', then I think this is good to go in.

But if @mikebrow is of the opinion that we should err on the side of caution and ignore failing plugins instead of preventing the runtime from starting up, then we should also change that behavior.

klihub · 2024-09-20T18:30:08Z

pkg/adaptation/adaptation.go

 		p, err := r.newLaunchedPlugin(r.pluginPath, id, name, configs[i])
 		if err != nil {
-			return fmt.Errorf("failed to start NRI plugin %q: %w", name, err)
+			errs = append(errs, fmt.Errorf("[%s] %w", name, err))
+			log.Warnf(noCtx, "failed to initialize pre-installed NRI plugin %q: %w", name, err)


With log.Warnf() we should use %v (value in default format) instead of %w (wrap error, but only understood by fmt.Errorf()).

klihub · 2024-09-20T18:30:45Z

pkg/adaptation/adaptation.go

 		}

 		if err := p.start(r.name, r.version); err != nil {
-			return err
+			errs = append(errs, fmt.Errorf("[%s] %w", name, err))
+			log.Warnf(noCtx, "failed to start pre-installed NRI plugin %q: %w", name, err)


Ditto for this one.

klihub · 2024-09-20T18:31:24Z

pkg/adaptation/adaptation.go

+			if err != nil {
+				plugin.stop()
+				errs = append(errs, fmt.Errorf("[%s] %w", plugin.name(), err))
+				log.Warnf(noCtx, "failed to synchronize pre-installed NRI plugin %q: %w", plugin.name(), err)


Same here, log.*f() should use %v, not %w.

klihub · 2024-09-20T18:41:03Z

pkg/adaptation/adaptation.go

+			updates = append(updates, us...)
+			log.Infof(noCtx, "pre-installed NRI plugin %q synchronization success", plugin.name())
+		}
+		return updates, errors.Join(errs...)


I think it would be cleaner/clearer if we'd always returned either updates, nil, or nil, errors.Join(non_nil_errors). IOW, if errors.Join() returns non-nil, I think we should return nil for updates.

If I understand @mikebrow correctly, syncPlugins returns successful updates and failed plugins&errors, and it's up to syncFn to decide whether or not to ignore a particular wrong plugin

klihub · 2024-09-20T18:43:56Z

pkg/adaptation/adaptation.go

+		return updates, errors.Join(errs...)
+	}
+	if err := r.syncFn(noCtx, syncPlugins); err != nil {
+		return fmt.Errorf("failed to synchronize pre-installed NRI Plugins: %w", err)


This is another thing I now start to have 2nd thoughts about. Do we really want to prevent the runtime from starting up, if any pre-installed plugin fails to start or synchronize ? Or would it be better to just log the errors, ignore failed plugins and continue... Any thoughts on that @mikebrow ?

klihub · 2024-09-21T07:01:26Z

ping @Iceber Apart from a few small nits it looks very good.

mikebrow · 2024-09-23T17:51:07Z

It looks pretty good. I spotted a few wrongly formatted log messages (error wrapping vs. default value format %verb), and one oddly returned (value, error) combo.

Then I have a question to @mikebrow about what we should do if a pre-installed plugin failed to start up or get synchronized. The original code errors out and this PR keeps that behavior intact. But now I started to have second thoughts about that. Do we want to prevent the runtime from starting up in such a case, or just log errors, ignore failing plugins and keep going ?

If we get the few log formatting problems fixed, maybe also change that odd return value combo to the standard convention of 'return a nil-value with a non-nil error', then I think this is good to go in.

But if @mikebrow is of the opinion that we should err on the side of caution and ignore failing plugins instead of preventing the runtime from starting up, then we should also change that behavior.

In r.next 1.1 I'd like us to consider adding plugin config to describe ordering(after:plugin)/dependencies(requires:plugin)/runtime required(MUST) type information
that we would use to order the plugin bring up and whether or not a plugin is required to run a container/pod which would give us a good reason, for example, to not report CRI plugin success and thus block pods on the node. Until then I'd prefer to log errors and continue for failing plugins vs bring the runtime down when there is an unknown / unhandled issue with a plugin.

We simply can't know if the plugin is required always or if it is only required for certain nodes or if the error is related to some unknown resource device dependency that is only met on some of the worker nodes. This gives the admin the ability to test out deployment of new plugins on a running machine, remove the failing ones report logs etc.

Would be easy enough to add the MUST run flag now if you like. Dependency ordering stuff would be harder.

klihub · 2024-09-23T18:30:25Z

In r.next 1.1 I'd like us to consider adding plugin config to describe ordering(after:plugin)/dependencies(requires:plugin)/runtime required(MUST) type information that we would use to order the plugin bring up and whether or not a plugin is required to run a container/pod which would give us a good reason, for example, to not report CRI plugin success and thus block pods on the node. Until then I'd prefer to log errors and continue for failing plugins vs bring the runtime down when there is an unknown / unhandled issue with a plugin.

We simply can't know if the plugin is required always or if it is only required for certain nodes or if the error is related to some unknown resource device dependency that is only met on some of the worker nodes. This gives the admin the ability to test out deployment of new plugins on a running machine, remove the failing ones report logs etc.

Sounds like a good plan to me.

@Iceber Could you update the PR to still log errors from plugin startup or synchronization failures, but return no error so we don't prevent the runtime from starting up.

Signed-off-by: Iceber Gu <[email protected]>

Iceber · 2024-09-24T06:30:51Z

@mikebrow @klihub updated to print logs only, PTAL, Thanks

klihub

LGTM. Thank you @Iceber !

Iceber requested a review from klihub May 18, 2023 06:23

Iceber force-pushed the pre_installed_plugin branch from 16171e9 to 104c8e8 Compare May 18, 2023 06:25

Iceber changed the title ~~sync sandboxs and containers after starting the pre-installed plugin~~ sync sandboxes and containers after starting the pre-installed plugin May 18, 2023

klihub approved these changes May 18, 2023

View reviewed changes

Iceber force-pushed the pre_installed_plugin branch 2 times, most recently from 6f7af44 to 75ad210 Compare May 18, 2023 10:17

Iceber requested a review from klihub May 18, 2023 10:21

Iceber changed the title ~~sync sandboxes and containers after starting the pre-installed plugin~~ sync sandboxes and containers after starting the pre-installed plugins May 18, 2023

klihub approved these changes May 18, 2023

View reviewed changes

Iceber mentioned this pull request May 19, 2023

fix bug NRI plugin isn't invoked after containerd restart because local state is transient containerd/containerd#8523

Closed

mikebrow reviewed May 23, 2023

View reviewed changes

mikebrow mentioned this pull request May 23, 2024

How can I ensure that containerd cannot create a container if the NRI plugin is unavailable? #85

Open

Iceber force-pushed the pre_installed_plugin branch from 75ad210 to f914a93 Compare June 4, 2024 11:03

Iceber force-pushed the pre_installed_plugin branch from f914a93 to 708b260 Compare June 12, 2024 07:22

mikebrow added this to the 1.0 milestone Aug 1, 2024

mikebrow added the v1.0 required label Aug 1, 2024

klihub requested a review from mikebrow August 19, 2024 07:14

mikebrow requested changes Aug 19, 2024

View reviewed changes

pkg/adaptation/adaptation.go Outdated Show resolved Hide resolved

pkg/adaptation/adaptation.go Outdated Show resolved Hide resolved

pkg/adaptation/adaptation.go Outdated Show resolved Hide resolved

sync sandboxes and containers after starting the pre-installed plugins

b881bc4

Signed-off-by: Iceber Gu <[email protected]>

Iceber force-pushed the pre_installed_plugin branch from 708b260 to 1dae14b Compare September 19, 2024 08:38

mikebrow approved these changes Sep 20, 2024

View reviewed changes

klihub requested changes Sep 20, 2024

View reviewed changes

Iceber force-pushed the pre_installed_plugin branch from bb4431b to 15a1a8f Compare September 23, 2024 08:04

ignore pre-installed plugins that did not sync successfully

eada085

Signed-off-by: Iceber Gu <[email protected]>

Iceber force-pushed the pre_installed_plugin branch from 15a1a8f to eada085 Compare September 24, 2024 06:29

klihub self-requested a review September 24, 2024 07:01

klihub approved these changes Sep 24, 2024

View reviewed changes

mikebrow merged commit dda9682 into containerd:main Sep 24, 2024
9 checks passed

	conn, err := l.Accept()
	if err != nil {
	log.Infof(ctx, "stopped accepting plugin connections (%v)", err)
	return
	}

	p, err := r.newExternalPlugin(conn)
	if err != nil {
	log.Errorf(ctx, "failed to create external plugin: %v", err)
	continue
	}

	if err := p.start(r.name, r.version); err != nil {
	log.Errorf(ctx, "failed to start external plugin: %v", err)
	continue
	}

	r.Lock()

	err = r.syncFn(ctx, p.synchronize)
	if err != nil {
	log.Infof(ctx, "failed to synchronize plugin: %v", err)
	} else {
	r.plugins = append(r.plugins, p)
	r.sortPlugins()
	}

	if err := p.start(r.name, r.version); err != nil {
	return err
	}

sync sandboxes and containers after starting the pre-installed plugins #43

sync sandboxes and containers after starting the pre-installed plugins #43

Conversation

Iceber commented May 18, 2023 • edited Loading

codecov-commenter commented May 18, 2023 • edited Loading

Codecov Report

klihub left a comment

Choose a reason for hiding this comment

Iceber commented May 18, 2023

klihub commented May 18, 2023

mikebrow left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mikebrow Jun 1, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

champtar commented Mar 16, 2024

champtar commented May 10, 2024

mikebrow commented May 22, 2024

Iceber commented Jun 4, 2024 • edited Loading

klihub commented Jun 4, 2024 • edited Loading

klihub commented Jun 4, 2024 • edited Loading

Iceber commented Jun 6, 2024

mikebrow left a comment • edited Loading

Choose a reason for hiding this comment

mikebrow commented Sep 18, 2024

Iceber commented Sep 19, 2024

Iceber commented Sep 19, 2024

mikebrow left a comment

Choose a reason for hiding this comment

mikebrow commented Sep 20, 2024

klihub commented Sep 20, 2024

klihub commented Sep 20, 2024

klihub left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

klihub commented Sep 21, 2024

mikebrow commented Sep 23, 2024

klihub commented Sep 23, 2024

Iceber commented Sep 24, 2024

klihub left a comment

Choose a reason for hiding this comment

Iceber commented May 18, 2023 •

edited

Loading

codecov-commenter commented May 18, 2023 •

edited

Loading

mikebrow Jun 1, 2023 •

edited

Loading

Iceber commented Jun 4, 2024 •

edited

Loading

klihub commented Jun 4, 2024 •

edited

Loading

klihub commented Jun 4, 2024 •

edited

Loading

mikebrow left a comment •

edited

Loading

klihub left a comment •

edited

Loading