[question] How to force CSI Node and Plugin jobs to start before any other workload #8994

After an infrastructure restart, workloads are scheduled onto nodes where the CSI Node task is not yet running, because the plugin takes some time to download and start. We also get a lot of errors about exhausted write claims. If the node and controller plugins are running before all other tasks, all is good, but if a plugin fails or takes too long to start, everything gets pretty unstable.

We are using the azuredisk CSI driver, mounting Azure managed disks to Nomad clients.

Comments
If the CSI Node task isn't running, then it won't have fingerprinted with the servers and shouldn't be showing up as a valid scheduling target. Are the plugins being shut down cleanly so that they deregister themselves with the servers?
In your plugin configuration on the clients you can either:
See my first answer above. The CSI plugin doesn't get registered until after the plugin container is running.
Unfortunately that's not going to work. The directory with the CSI control socket needs to be within the allocation directory that Nomad creates just before the container is built.
Nice insight, thanks. Controller job:
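Something along these lines (a sketch only: the image tag, datacenter, args, and resource sizes are placeholders, and the Azure credential configuration the azuredisk driver needs is omitted):

```hcl
job "plugin-azure-disk-controller" {
  datacenters = ["dc1"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        # Illustrative image and flags; check the azuredisk-csi-driver
        # releases for the tag and arguments that match your environment.
        image = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "azuredisk"
        type      = "controller"
        # Nomad bind-mounts a per-allocation directory at this path inside
        # the container; the plugin creates its control socket there.
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
```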
Node job:
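And the node plugin as a system job, so it lands on every client as soon as the client registers (again a sketch; node plugins need privileged mode to mount block devices):

```hcl
job "plugin-azure-disk-nodes" {
  datacenters = ["dc1"]

  # A system job runs on every eligible client, which also gets the node
  # plugin started as early as possible after a restart.
  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"
        privileged = true

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${attr.unique.hostname}",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "azuredisk"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
```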
We will try to implement a controlled shutdown: drain without forcing system jobs out, and after that stop these jobs. In your opinion, will that solve the plugin de-registration issue? Sometimes the plugin stats get strange, with odd numbers like more nodes expected than nodes present. Regarding azuredisk, you still do not have any documentation. It is very frustrating to only see documentation for k8s and have to port everything to Nomad. Please add more examples.
I don't have an Azure environment handy to verify them, but those jobs look reasonable. Depending on the size of your deployment, I'd consider running multiple controllers spread across hosts so that you're less likely to have one of them on a node that you're in the middle of draining.
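One way to express that, reusing the controller group sketched above (the count and the distinct_hosts constraint are illustrative, not the only way to spread controllers):

```hcl
group "controller" {
  # Two controller instances, each on a different client, so draining any
  # single node still leaves a healthy controller registered.
  count = 2

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  # ... same plugin task as in the controller job above ...
}
```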
It should. See this note in the
For your dev environment:
When the servers come up (or if they're left running), they won't have had a heartbeat from the client agents in however many hours, and will consider them lost. That'll result in them trying to reschedule all the workloads, including the CSI plugins. If you want to shut down the whole cluster, you'll definitely need to shut down all the volume-consuming workloads first so that the volume claims get released. You should avoid #8609 because you're shutting down rather than draining off to another client. That being said, taking off my Nomad developer hat for a moment: if you're running on cloud infrastructure anyway, it might be worth your while to automate recreating the developer environment from scratch every day. I've done that sort of thing to prevent developers from leaving all kinds of undocumented things lying around development environments that later bite you when you get to production.
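If the dev servers really are left running while the clients are down overnight, one knob worth knowing about is the server's heartbeat_grace setting. A dev-only sketch (the value is illustrative and would be a bad idea in production, since it delays detection of genuinely failed nodes):

```hcl
server {
  enabled = true

  # Extra time the servers allow after a missed client heartbeat before
  # marking the node lost and rescheduling its allocations.
  heartbeat_grace = "12h"
}
```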
We have a couple of open bugs on that (#8948, #8628, #8034). There's also some eventual consistency involved, because it takes a while for plugins to come alive: #7296. I've got some time coming up in the next couple of weeks to work on some of those items.
Please keep in mind that Nomad's CSI integration is still marked as beta, so you'll need to do a bit of lifting here. There are ~100 or so different CSI drivers, so we won't be able to provide documentation for them all. There have been a few other folks who've given it a go, if you search through the issues for storage and Azure.
As per our research, we implemented some guardrails in our pipeline to stop the jobs and check that the disks detach normally. Having a controlled shutdown and startup did the trick. My colleagues and I will continue to ask questions and contribute code or documentation so that it gets easier for the next guy. Thanks for all your help, you are doing a great job.
Ha ha! I totally missed that 😆
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.