sync sandboxes and containers after starting the pre-installed plugins #43
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## main #43 +/- ##
==========================================
- Coverage 63.83% 63.75% -0.08%
==========================================
Files 9 9
Lines 1800 1810 +10
==========================================
+ Hits 1149 1154 +5
- Misses 500 503 +3
- Partials 151 153 +2
16171e9
to
104c8e8
Compare
Thanks for the fix! LGTM.
6f7af44
to
75ad210
Compare
@klihub Instead of calling syncFn after each pre-installed plugin launch, I now call syncFn only once, after all plugins are launched. This avoids unnecessary repeated calls to syncFn when there are many plugins. PTAL, thanks.
True, better to do it that way.
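As an illustration of the change described above, here is a minimal sketch of launching every plugin first and synchronizing once at the end. The `plugin` and `syncer` types and the `launchThenSync` function are hypothetical stand-ins, not the actual NRI adaptation types; only the call pattern is the point.

```go
package main

import "fmt"

// plugin is a hypothetical stand-in for a pre-installed NRI plugin.
type plugin struct{ name string }

// start pretends to launch the plugin binary.
func (p *plugin) start() error { return nil }

// syncer is a hypothetical stand-in for syncFn; it counts its invocations.
type syncer struct{ calls int }

// syncAll synchronizes all given plugins in a single pass.
func (s *syncer) syncAll(plugins []*plugin) error {
	s.calls++
	for _, p := range plugins {
		fmt.Printf("synchronizing %s\n", p.name)
	}
	return nil
}

// launchThenSync starts every plugin, then triggers exactly one sync,
// instead of one sync per launched plugin.
func launchThenSync(names []string, s *syncer) error {
	var started []*plugin
	for _, n := range names {
		p := &plugin{name: n}
		if err := p.start(); err != nil {
			return fmt.Errorf("failed to start plugin %q: %w", n, err)
		}
		started = append(started, p)
	}
	return s.syncAll(started)
}

func main() {
	s := &syncer{}
	if err := launchThenSync([]string{"a", "b", "c"}, s); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("sync calls:", s.calls)
}
```

With N plugins the sync callback runs once rather than N times, which matters if each call has to collect the full pod and container state from the runtime.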
see questions
pkg/adaptation/adaptation.go
Outdated
for _, plugin := range plugins {
    us, err := plugin.synchronize(ctx, pods, containers)
    if err != nil {
        return nil, fmt.Errorf("failed to sync NRI Plugin %q: %w", plugin.name(), err)
Should this be a warning instead, and then continue?
There are two possible approaches to consider: skipping the plugin if synchronization fails, or returning an error.
I chose to return an error because pre-installed plugins and external plugins differ in some ways.
A pre-installed plugin binary is launched by the NRI adaptation, so apart from the logs it is difficult for users to know whether the plugin was launched successfully. Users may therefore overlook the fact that the plugin binary is not actually working. That is why I require that every pre-installed plugin requested by the user be launched (started and synced) successfully.
This is similar to how a failed start already returns an error:
nri/pkg/adaptation/adaptation.go
Lines 336 to 338 in fa3ab8f
if err := p.start(r.name, r.version); err != nil {
    return err
}
External plugins, on the other hand, are usually registered by external binaries, so it is easier to see whether NRI registration and synchronization succeeded.
I suppose two other patterns would be:
3) introduce config/management for requiring/ordering (add requires logic), auto-restarting, etc. for plugin start && sync
4) attempt to start and sync all plugins, and report all errors
That said, should we stop any plugins we already started if there's an error?
- to introduce config/management for requiring/ordering (add requires logic), auto restarting etc.. for plugin start && synch
Could you give more details? Thanks.
Current containerd plugins have a requires section that specifies which plugins must be initialized before a given plugin is initialized.
We probably need an additional state beyond init for the containerd plugins that would cover "ready for work". For example, in the CRI case, ready may mean all containers/pods are restarted and ready for gRPC calls.
NRI might need a requires list for the NRI plugins (unless we go with external management via systemd or whatnot), and then within that you could have the concept of ready, which might at least cover init and sync ^
Additionally, we might want a flag in the plugin config stating whether sync is required for "ready".
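To make the requires idea concrete, here is a rough sketch of how a start order could be derived from per-plugin dependency lists. Everything in it is an assumption for illustration: the `pluginConfig` type, its `Requires` and `SyncRequired` fields, and the `startOrder` function do not exist in NRI or containerd.

```go
package main

import "fmt"

// pluginConfig is a hypothetical per-plugin configuration entry.
type pluginConfig struct {
	Name         string
	Requires     []string // plugins that must be started first (assumed field)
	SyncRequired bool     // whether a successful sync is part of "ready" (assumed field)
}

// startOrder returns plugin names ordered so that every plugin comes after
// the plugins it requires. It fails on unknown dependencies and on cycles.
func startOrder(cfgs []pluginConfig) ([]string, error) {
	byName := make(map[string]pluginConfig, len(cfgs))
	for _, c := range cfgs {
		byName[c.Name] = c
	}

	const (
		unvisited = iota
		visiting
		done
	)
	state := make(map[string]int, len(cfgs))
	var order []string

	var visit func(name string) error
	visit = func(name string) error {
		c, ok := byName[name]
		if !ok {
			return fmt.Errorf("unknown plugin %q", name)
		}
		switch state[name] {
		case done:
			return nil
		case visiting:
			return fmt.Errorf("dependency cycle at plugin %q", name)
		}
		state[name] = visiting
		for _, dep := range c.Requires {
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = done
		order = append(order, name)
		return nil
	}

	for _, c := range cfgs {
		if err := visit(c.Name); err != nil {
			return nil, err
		}
	}
	return order, nil
}

func main() {
	order, err := startOrder([]pluginConfig{
		{Name: "b", Requires: []string{"a"}, SyncRequired: true},
		{Name: "a"},
	})
	fmt.Println(order, err)
}
```

The `SyncRequired` flag is only carried as data here; how it would feed into a "ready" state is left open, as it is in the discussion.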
@dmcgowan thoughts?
Can we merge this as is? It does fix a blocking bug.
75ad210
to
f914a93
Compare
Hi @mikebrow @klihub, do you have any new ideas for this PR? I think synchronizing all plugins and reporting back all errors is a good approach, since it gives a more complete picture of the errors that may occur. Of course, if you still think that skipping and printing warnings is the better way to go, I'll adopt it. Also, for more complex configurations my thoughts are:
ci fails from |
I filed PR #88 to bump the minimum golang requirement to 1.20. It should be fine for all of our downstream packages. You can then rebase this PR if #88 gets merged.
I think synchronizing pre-launched plugins (pre-installed plugins in your terminology) would be essential to make sure that we have an identical handshake/initial message flow, regardless of whether a plugin is pre-launched or not. I made a mistake, and that is why I completely overlooked that pre-launched plugins do not get properly synchronized during registration. It was not intentional. It would be better to send an empty sync message than no message at all. There are packages out there which do expect the synchronization message to come before they consider themselves ready to start processing other NRI messages.
That said, I think synchronizing pre-launched plugins is only really relevant for the case when an already active runtime (with existing containers) gets restarted (after all, on initial startup the set of pods and containers to sync with the plugins is empty). So a related question is whether the runtime itself is able to properly determine/rediscover the state of pods and containers by the time pre-launched plugins are being started, and what to do if it is not. This needs to be checked/tested with both containerd and cri-o because they might behave differently. I have a vague recollection that containerd itself would behave slightly differently for 2.0 than 1.7, but I haven't had time right now to check/test this in practice.
Yeah, this is very important: incomplete synchronization of pods and containers can be very bad. In containerd 2.0, the
f914a93
to
708b260
Compare
I like this change. Just some comments: increase logging to help the plugin admin, and continue rather than return early on failure to new/start a plugin.
** In 1.1 let's consider a more formal process with a manager plugin, or 3) introduce config/management for requiring/ordering (add requires logic), auto-restarting, etc. for plugin start && sync
@Iceber wdyt about the logging changes?
Sorry, I missed the email on this PR. I'll take care of it soon.
Signed-off-by: Iceber Gu <[email protected]>
708b260
to
1dae14b
Compare
@mikebrow I've added a new commit to make the changes easier to review. Plugins that failed to start are now returned to syncFn via syncPlugins; after all, plugins that didn't start are also sync failures/unsyncable plugins.
LGTM
Thanks!
@klihub did you want to take one more look with the additional changes?
Yes, I'll do that.
It'll take me a bit more time though. I'm not at the keyboard ATM.
It looks pretty good. I spotted a few wrongly formatted log messages (error wrapping vs. default value format %verb), and one oddly returned (value, error) combo.
Then I have a question for @mikebrow about what we should do if a pre-installed plugin fails to start up or get synchronized. The original code errors out and this PR keeps that behavior intact. But now I have started to have second thoughts about that. Do we want to prevent the runtime from starting up in such a case, or just log errors, ignore failing plugins and keep going?
If we get the few log formatting problems fixed, maybe also change that odd return value combo to the standard convention of 'return a nil-value with a non-nil error', then I think this is good to go in.
But if @mikebrow is of the opinion that we should err on the side of caution and ignore failing plugins instead of preventing the runtime from starting up, then we should also change that behavior.
pkg/adaptation/adaptation.go
Outdated
p, err := r.newLaunchedPlugin(r.pluginPath, id, name, configs[i])
if err != nil {
    return fmt.Errorf("failed to start NRI plugin %q: %w", name, err)
    errs = append(errs, fmt.Errorf("[%s] %w", name, err))
    log.Warnf(noCtx, "failed to initialize pre-installed NRI plugin %q: %w", name, err)
With log.Warnf() we should use %v (value in default format) instead of %w (wrap error, but only understood by fmt.Errorf()).
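The difference between the two verbs can be seen with the standard `fmt` package alone. The sketch below assumes the logger formats its message the way `fmt.Sprintf` does, which is the usual behavior for printf-style loggers but is not verified here for the logger used in this PR. The `render` and `isMangled` helpers are made up for the demonstration; the verb is passed in as a parameter only so both variants can be compared side by side.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errBoom = errors.New("boom")

// render formats a log-style message with the given verb for the error,
// using fmt.Sprintf as a stand-in for a printf-style logger.
func render(verb string) string {
	return fmt.Sprintf("plugin %q: "+verb, "demo", errBoom)
}

// isMangled reports whether the output contains fmt's bad-verb marker for %w.
func isMangled(s string) bool {
	return strings.Contains(s, "%!w(")
}

func main() {
	// %w outside fmt.Errorf is not a valid verb, so fmt emits a
	// "%!w(...)" placeholder instead of the error text.
	fmt.Println(render("%w"), "mangled:", isMangled(render("%w")))

	// %v prints the error in its default format.
	fmt.Println(render("%v"), "mangled:", isMangled(render("%v")))

	// fmt.Errorf is where %w belongs: it wraps the error for errors.Is/As.
	wrapped := fmt.Errorf("[%s] %w", "demo", errBoom)
	fmt.Println(errors.Is(wrapped, errBoom))
}
```

With a constant format string, `go vet` typically reports this misuse as well, so it can be caught before review.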
pkg/adaptation/adaptation.go
Outdated
}

if err := p.start(r.name, r.version); err != nil {
    return err
    errs = append(errs, fmt.Errorf("[%s] %w", name, err))
    log.Warnf(noCtx, "failed to start pre-installed NRI plugin %q: %w", name, err)
Ditto for this one.
pkg/adaptation/adaptation.go
Outdated
if err != nil {
    plugin.stop()
    errs = append(errs, fmt.Errorf("[%s] %w", plugin.name(), err))
    log.Warnf(noCtx, "failed to synchronize pre-installed NRI plugin %q: %w", plugin.name(), err)
Same here, log.*f() should use %v, not %w.
pkg/adaptation/adaptation.go
Outdated
    updates = append(updates, us...)
    log.Infof(noCtx, "pre-installed NRI plugin %q synchronization success", plugin.name())
}
return updates, errors.Join(errs...)
I think it would be cleaner/clearer if we always returned either updates, nil or nil, errors.Join(non_nil_errors). IOW, if errors.Join() returns non-nil, I think we should return nil for updates.
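A minimal sketch of the suggested return convention follows. The `update` type and the `syncOne` helper are hypothetical placeholders; only `errors.Join` (available from Go 1.20, and returning nil when it is given no non-nil errors) is a real standard-library call.

```go
package main

import (
	"errors"
	"fmt"
)

// update is a hypothetical placeholder for a container update.
type update struct{ id string }

// syncOne is a hypothetical per-plugin sync that fails for the plugin "broken".
func syncOne(name string) ([]update, error) {
	if name == "broken" {
		return nil, errors.New("sync timed out")
	}
	return []update{{id: name + "-ctr"}}, nil
}

// syncPlugins returns either (updates, nil) or (nil, joined errors),
// never a non-nil value together with a non-nil error.
func syncPlugins(names []string) ([]update, error) {
	var (
		updates []update
		errs    []error
	)
	for _, n := range names {
		us, err := syncOne(n)
		if err != nil {
			errs = append(errs, fmt.Errorf("[%s] %w", n, err))
			continue
		}
		updates = append(updates, us...)
	}
	// errors.Join returns nil if errs holds no non-nil errors.
	if err := errors.Join(errs...); err != nil {
		return nil, err
	}
	return updates, nil
}

func main() {
	fmt.Println(syncPlugins([]string{"a", "b"}))
	fmt.Println(syncPlugins([]string{"a", "broken"}))
}
```

The trade-off is that partial results are discarded on any failure, so a caller that wants to keep successful updates while skipping failed plugins would need a different return shape.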
If I understand @mikebrow correctly, syncPlugins returns the successful updates together with the failed plugins and their errors, and it's up to syncFn to decide whether or not to ignore a particular failed plugin.
    return updates, errors.Join(errs...)
}
if err := r.syncFn(noCtx, syncPlugins); err != nil {
    return fmt.Errorf("failed to synchronize pre-installed NRI Plugins: %w", err)
This is another thing I am now starting to have second thoughts about. Do we really want to prevent the runtime from starting up if any pre-installed plugin fails to start or synchronize? Or would it be better to just log the errors, ignore failed plugins and continue? Any thoughts on that @mikebrow?
ping @Iceber Apart from a few small nits it looks very good.
bb4431b
to
15a1a8f
Compare
In r.next 1.1 I'd like us to consider adding plugin config to describe ordering (after:plugin), dependencies (requires:plugin), and runtime-required (MUST) type information. We simply can't know whether a plugin is always required, only required on certain nodes, or whether the error is related to some unknown resource/device dependency that is only met on some of the worker nodes. This gives the admin the ability to test out deployment of new plugins on a running machine, remove the failing ones, report logs, etc. It would be easy enough to add the MUST run flag now if you like. The dependency ordering stuff would be harder.
Sounds like a good plan to me. @Iceber Could you update the PR to still log errors from plugin startup or synchronization failures, but return no error, so we don't prevent the runtime from starting up?
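The requested behavior (log failures, skip the failing plugin, never fail runtime startup) could look roughly like the sketch below. The `plugin` type, its methods, and `startPreInstalled` are hypothetical simplifications, and the standard `log` package stands in for the project's logger.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errDemo is a made-up failure used to exercise the error paths.
var errDemo = errors.New("demo failure")

// plugin is a hypothetical stand-in for a pre-installed NRI plugin whose
// start and synchronize outcomes are fixed up front.
type plugin struct {
	name     string
	startErr error
	syncErr  error
	stopped  bool
}

func (p *plugin) start() error       { return p.startErr }
func (p *plugin) synchronize() error { return p.syncErr }
func (p *plugin) stop()              { p.stopped = true }

// startPreInstalled starts and synchronizes each candidate. Failures are
// logged and the plugin is skipped; no error is returned, so a broken
// plugin cannot prevent the runtime from starting up.
func startPreInstalled(candidates []*plugin) []*plugin {
	var active []*plugin
	for _, p := range candidates {
		if err := p.start(); err != nil {
			log.Printf("failed to start pre-installed NRI plugin %q: %v", p.name, err)
			continue
		}
		if err := p.synchronize(); err != nil {
			// A plugin that started but could not be synchronized is stopped again.
			p.stop()
			log.Printf("failed to synchronize pre-installed NRI plugin %q: %v", p.name, err)
			continue
		}
		active = append(active, p)
	}
	return active
}

func main() {
	active := startPreInstalled([]*plugin{
		{name: "ok"},
		{name: "no-start", startErr: errDemo},
		{name: "no-sync", syncErr: errDemo},
	})
	for _, p := range active {
		fmt.Println("active:", p.name)
	}
}
```

Since failures are only visible in the logs with this approach, the log messages need to name the plugin and the failing step clearly.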
Signed-off-by: Iceber Gu <[email protected]>
15a1a8f
to
eada085
Compare
LGTM. Thank you @Iceber!
For plugins registered with nri.sock, NRI will send the full set of pods and containers after starting.
nri/pkg/adaptation/adaptation.go
Lines 405 to 430 in 8329599
For pre-installed plugins, it is still necessary to synchronize the sandbox and container data to the plugin after launch; otherwise the plugin will never get information about existing resources.