Nomad server panic with SIGSEGV signal when using CSI volumes. #11174
Comments
Hi @CarbonCollins, thanks for the report. I tried to reproduce this issue, but no luck so far. It seems like your server data may be corrupted somehow?
To give more details, the panic happens when Nomad tries to remove CSI plugin information from its internal database. To do this, it iterates over the list of registered plugins and removes the ones that don't have a node or controller associated with them. The problem is that altering the table in the same transaction in which an iteration is happening can cause the iterator state to become invalid.
Consul actually had a similar error reported a while back (Nomad and Consul use the same database internally) and provided a fix. Would you be able to build and test Nomad from this branch that contains an updated version of
Also, if you don't have any sensitive information in your cluster, would you mind sending us one of your server's
Thank you!
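The failure mode described above, deleting from a table while an iterator over it is still live, can be illustrated with a small language-neutral sketch. This is not Nomad's code and does not use its database library: `Plugin`, `unsafe_gc`, `safe_gc`, and `sample_table` are hypothetical names, a plain Python dict stands in for the plugin table, and Python reports the problem as a `RuntimeError` rather than a SIGSEGV. The second function shows one common way to avoid the problem (collect first, delete afterwards), which is an assumption for illustration and not necessarily how the fix mentioned above works.

```python
from dataclasses import dataclass


@dataclass
class Plugin:
    """Hypothetical stand-in for a registered CSI plugin record."""
    plugin_id: str
    nodes: int = 0
    controllers: int = 0

    def is_unused(self) -> bool:
        # A plugin with no node and no controller is a candidate for removal.
        return self.nodes == 0 and self.controllers == 0


def unsafe_gc(table: dict) -> None:
    """Deletes entries while still iterating over the same table."""
    for plugin_id, plugin in table.items():
        if plugin.is_unused():
            del table[plugin_id]  # mutates the table mid-iteration


def safe_gc(table: dict) -> list:
    """Collects the IDs to delete first, then deletes after iteration ends."""
    to_delete = [pid for pid, p in table.items() if p.is_unused()]
    for pid in to_delete:
        del table[pid]
    return to_delete


def sample_table() -> dict:
    """Builds a small example table with one used and two unused plugins."""
    plugins = [
        Plugin("nfs", nodes=2, controllers=1),
        Plugin("iscsi"),
        Plugin("old"),
    ]
    return {p.plugin_id: p for p in plugins}


if __name__ == "__main__":
    try:
        unsafe_gc(sample_table())
    except RuntimeError as err:
        print("unsafe_gc failed:", err)

    table = sample_table()
    print("removed:", safe_gc(table), "remaining:", list(table))
```

Separating the scan from the deletes means the iterator never observes a table that changed underneath it, at the cost of holding the list of IDs in memory for the duration of the cleanup.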
Oops, GitHub closed this by accident.
No worries. Priority number 1 is getting things running 😄
Hmm... that's interesting. I wonder if this could be the cause of things like #10927, where Nomad thinks that the volume is still being used. A plugin allocation count is also used to determine whether the plugin is still in use, so a miscount there could also cause your initial problem, where the plugin gets garbage collected while the list of plugins is being iterated over.
I think at this point it won't make a difference. I wanted to see if it would help prevent the
I will close this issue for now, since you were able to recreate your cluster, but feel free to reach out if you ever hit that
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
1.1.4
Operating system and Environment details
OS: Raspbian 10 (buster)
Arch: armv7l
Kernel: 5.10.60-v7l+
Running on RasPi 4 with SSD boot drive.
Issue
I’m having trouble with my 3 Nomad servers: each of them refuses to run for more than a few seconds before panicking. I recently started doing some testing with CSI volumes, which as far as I can tell is what’s causing the panic, as it’s the last error in the logs before the SIGSEGV signal is thrown.
My client nodes seem to be fine (apart from the fact that there are no active servers right now).
The 3 servers boot and seem to fail with very similar errors (see server logs below)
Reproduction steps
Unfortunately, I do not know exactly how to reproduce this, as I did not notice the issue until I was failing to submit jobs. I had recently restarted the machines (not all at once, but in succession through an Ansible playbook). Since I noticed the error, it has occurred every time, shortly after the server starts.
Expected Result
Nomad server to start without exiting with SIGSEGV panic
Actual Result
Nomad server runs for a few seconds while it initialises before panicking with a runtime error and then exiting with the SIGSEGV signal.
Job file (if appropriate)
CSI controller job:
CSI node job
These jobs are slight variations from the democratic-csi docs page: https://github.com/democratic-csi/democratic-csi/blob/master/docs/nomad.md
Nomad Server logs (if appropriate)
Nomad Server config
All servers are Debian 10 armv7l
I have excluded the vault, consul, advertise, and ports stanzas from the above config.
Nomad Client logs (if appropriate)
no logs for client
Nomad Client config
The clients have a range of different architectures (amd64, armv7l, and aarch64) and operating systems (Debian 10, Manjaro, Manjaro ARM, and macOS Big Sur)
I have excluded the vault, consul, advertise, ports, host volumes, and network stanzas from the above config.
Other
I originally raised a discussion on the forum, so for cross-linking purposes: https://discuss.hashicorp.com/t/nomad-servers-crash-within-a-few-seconds-of-starting-with-sigsegv-panic/29397