CSI: make gRPC client creation more robust #12057
The plugin supervisor registers the plugin in the `Poststart` hook, so the task itself should be running. If the plugin can't communicate with us after 30s, exit and mark the task as unhealthy so that it can be restarted.
LGTM; just the one question about kill
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Related to #11784. It doesn't fix it but makes debugging it more legible.
I've broken this up into mostly bite-sized commits for review.
Nomad communicates with CSI plugin tasks via gRPC. The plugin
supervisor hook uses this to ping the plugin for health checks which
it emits as task events. After the first successful health check the
plugin supervisor registers the plugin in the client's dynamic plugin
registry, which in turn creates a CSI plugin manager instance that has
its own gRPC client for fingerprinting the plugin and sending mount
requests.
If the plugin manager instance fails to connect to the plugin on its
first attempt, it exits. The plugin supervisor hook is unaware that the
connection failed so long as its own pings continue to work. A
transient failure during plugin startup may therefore mislead the plugin
supervisor hook into thinking the plugin is up (so there's no need to
restart the allocation) even though no fingerprinter was started.
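The failure mode described above can be modeled in a few lines. This is only a sketch: `connectFunc`, `flakyConnect`, `startManagerOnce`, and `supervisorProbe` are hypothetical names standing in for the real gRPC dial and probe calls, and none of them are Nomad's actual API.

```go
package main

import "errors"

// connectFunc is a hypothetical stand-in for dialing the plugin's gRPC socket.
type connectFunc func() error

// flakyConnect returns a connectFunc that fails its first `failures` calls
// and succeeds afterwards, imitating a plugin that is slow to start.
func flakyConnect(failures int) connectFunc {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("connection refused")
		}
		return nil
	}
}

// startManagerOnce models the single-attempt behavior: if the first
// connection fails, the manager exits and no fingerprinter runs.
func startManagerOnce(connect connectFunc) bool {
	return connect() == nil
}

// supervisorProbe models the supervisor hook's own, independent ping.
func supervisorProbe(connect connectFunc) bool {
	return connect() == nil
}
```

With one transient failure on the manager's connection, `startManagerOnce(flakyConnect(1))` reports that no fingerprinter started, while `supervisorProbe(flakyConnect(0))` still reports the plugin as healthy, so nothing triggers a restart.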
* Give the plugin manager instance the ability to retry the gRPC client
  connection until success.
* Add a 30s timeout to the plugin supervisor so that it doesn't poll
  forever waiting for a plugin that will never come back up.
Minor improvements:
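* The plugin supervisor hook creates a new gRPC client for its health checks
  and then throws it away. Instead, reuse the client as we do for the
  plugin manager.
* Clarify that the gRPC client's connection timeout applies to the
  connection and not the rest of the client lifetime.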