-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSI: make readiness probe timeout configurable to support faulty plugins #13179
Comments
Hi @calvarado2004 can you provide the specific log line you're seeing? It sounds like you're saying that after the plugin container starts that it takes 15min for the plugin to successfully respond to probes. But if that's the case, there's almost certainly something broken with your plugin's setup (or possibly the Portworx plugin has a bad bug; it definitely should not be taking 15min to start!). If so, can you please provide the full plugin allocation logs? |
It's been a while since we've used Portworx but IIRC, when you run the portworx container on a machine that's never had it, it runs a full initialization process that involves unpacking an OCI image onto the host and also identifying and provisioning the disks that it was told to use for storage. I vaguely remember going through this process on our clients which were 8 core 32GB nodes and I remember the initial setup taking some minutes to actually become "ready" (it's not ready until some threshold of nodes have joined each other to create a cluster). I can totally see this taking quite a bit of time in more resource constrained environments (though, 15 minutes does seem like it would be on the far end of the spectrum lol). That said, after the initial run on the host, subsequent runs/restarts of portworx take considerably less time to start up. It may be worth first running the Nomad job without the CSI registration set in the job spec to let it do all the initial start-up stuff and then once the cluster finishes coming online, stop the job and resubmit it with the CSI stanza specified so that it registers. |
That sounds like a reasonable workaround @RickyGrassmuck. Thanks! But we should probably make the plugin supervisor timeout (ref |
@RickyGrassmuck Yes that's still accurate, thanks for adding that context here. It normally only takes a few minutes to pull the image/unpack/start PX up depending on network speed. @tgross is anyone working on this issue? If not, I can assist here. Where were you thinking of making the timeout configurable? In the CSI Plugin stanza? |
It's on our short-term todo list but no one is immediately assigned to it. I'd be happy to review a PR if you're interested.
Yeah, that seems like the right place for it and that struct is available where the readiness probe timeout happens without too much trouble. |
Sounds good, I'll send a PR soon. |
#13340 has been merged and will ship in Nomad 1.3.2. Thanks @ggriffiths! |
Awesome! Can we backport this to 1.1.x and 1.2.x as well @tgross? We hit this on Nomad 1.1.14 originally in our lab and on 1.2.6 with a customer cluster. |
We don't generally backport new features. But I get that it makes compatibility with Portworx tricky. Lemme see what it would take to get the backport thru (it's a headache because the code changes out from under us). |
That makes sense. Hopefully we can get an exception here as the 30 second timeout was a breaking change for drivers that take longer than that to start up. |
Ah yeah that's right, the timeout was backported, so that makes it a regression to fix. I'll make the automated backport, and then follow-up with a PR to fixup the changelog entry to make it clear we're fixing a regression. |
Awesome, thank you @tgross! |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Output from
nomad version
Nomad v1.3.1 (2b054e3)
Operating system and Environment details
Ubuntu Linux 20.04
Issue
CSI Plugin on Nomad 1.3 have a very short timeout. Affects Portworx which can take 15 minutes to be installed.
Reproduction steps
Try to install Portworx on Nomad as a job following this guide:
https://docs.portworx.com/install-with-other/nomad/installation/install-as-a-nomad-job/
Expected Result
Portworx running on Nomad
Actual Result
CSI plugin timeout
The text was updated successfully, but these errors were encountered: