"disable_host_node_id": false does not work as advertised on Linux #4914
Comments
Bump `shirou/gopsutil` to include shirou/gopsutil#603. This gives a consistent node ID even when a machine is reinstalled, when using `"disable_host_node_id": false`. It fixes hashicorp#4914 and allows keeping the same node ID even when reinstalling a node from scratch. However, the ID is only stable within a single OS (installing Windows will change the node ID, but that seems acceptable).
This seems like a sensible upgrade. It works on more machines, so there is no reason not to merge that change. I'm still trying to get my head around the expectations and deployment modes for all of these different UUID options. I'm struggling to see what value host node IDs actually have over the random one we generate and store in the agent state. Can you fill me in on why they are important in your use case? It seems to me the only time they are beneficial is if you wipe away all the Consul agent state on a host and then start Consul again, keeping the same hostname and IP. Is that a real thing you need to do? If not, what value does a stable host ID provide?
@banks In order to be able to rename a node (something we want to do, because our original naming uses the FQDN, which is not compatible with DNS labels), we need a constant ID. But the change I introduced in #4415 had another nice effect: on our platform, we re-install bare-metal nodes very often while keeping the same name (a re-install takes less than 1 hour), and after #4415 I figured out that nodes had conflicts when being re-installed. I then found two bugs:
Hence this PR, to mitigate the issue and be closer to the specification. Note that on Windows, I found that the value returned is not stable either (it depends on the installation). In the end we use the same algorithm as Consul, but with the ID generated by Chef from the value returned by dmidecode, since during our process nodes go from CentOS 6 to CentOS 7 (thus, we are sure to have a constant ID during the full re-install process). So we are now doing something like this:

    require 'digest'

    def generate_node_id_for_consul
      # Hardware UUID as reported by dmidecode (exposed by Ohai)
      uuid = node['dmi']['system']['uuid']
      raise "Cannot get node['dmi']['system']['uuid']" unless uuid
      raise "Wrong format for UUID #{uuid}" unless uuid =~ /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i
      # Same algorithm as Consul
      h = Digest::SHA512.hexdigest(uuid.downcase)
      "#{h[0..7]}-#{h[8..11]}-#{h[12..15]}-#{h[16..19]}-#{h[20..31]}"
    end

But this is an issue for anyone who re-installs bare-metal nodes often while keeping the same node name.
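To make the hashing step easier to try outside Chef, here is a minimal standalone sketch of the same idea. It assumes the UUID is passed in as a plain string; the helper name `node_id_from_uuid`, the `UUID_FORMAT` constant and the sample UUID are made up for illustration and are not part of Consul or Chef.

```ruby
require 'digest'

# Hypothetical helper, for illustration only: derives a node ID from a
# hardware UUID string, mirroring the Chef snippet quoted in this thread.
UUID_FORMAT = /\A[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\z/i

def node_id_from_uuid(uuid)
  unless uuid.is_a?(String) && uuid.match?(UUID_FORMAT)
    raise ArgumentError, "Wrong format for UUID #{uuid.inspect}"
  end

  # Lowercase first so that upper- and lowercase spellings of the same
  # hardware UUID hash to the same node ID.
  h = Digest::SHA512.hexdigest(uuid.downcase)
  "#{h[0..7]}-#{h[8..11]}-#{h[12..15]}-#{h[16..19]}-#{h[20..31]}"
end

# Made-up UUID, written in both cases.
upper = node_id_from_uuid('4C4C4544-0042-3010-8051-B4C04F564432')
lower = node_id_from_uuid('4c4c4544-0042-3010-8051-b4c04f564432')
puts upper
puts upper == lower # prints "true"
```

The result has the usual 8-4-4-4-12 UUID layout, so it can be used wherever a node ID is expected, and it does not depend on the case in which the hardware UUID is reported.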
@banks To add a few examples of this kind of issue: we also have issues with IPs. For instance, imagine a machine renamed with the following pattern (imagine those changes are quite quick, for instance during hardware validation):

machine1.inventory.acme.com -> machine1.preprod.acme.com -> machine1.prod.acme.com

If we cannot rename, we have issues with IPs (if the IP does not change): Consul does not like it very much when machines keep their IPs but are actually different nodes, so this mechanism allows us to avoid such issues.

Another typical use case we have is the following:

HW with serial00001, aka node machine1.prod -> the machine breaks
Someone fixes serial00001 -> the machine is back in the game...

Hence the review #5008, in order to be able to choose the "best" use case for your DC and operations.
On Linux, the file `/sys/class/dmi/id/product_uuid` cannot be read when Consul does not run as root. Thus, the constant UUID does not work, as it uses `kernel.random.boot_id` instead (a value that is constant only for a given boot).

The big advantage of the product UUID is that it provides a constant identifier, even with Windows machines (since `product_uuid` was computed the same way).
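The failure mode described above can be checked on a given host with a small script. This is only a sketch of the behaviour as described in this issue, not gopsutil's actual code: the function name `read_host_id` is hypothetical, and `/proc/sys/kernel/random/boot_id` is the file behind the `kernel.random.boot_id` sysctl.

```ruby
# Illustration only: try the DMI product UUID first, then fall back to the
# boot ID when the first file cannot be read (for example, when not root).
PRODUCT_UUID_PATH = '/sys/class/dmi/id/product_uuid'
BOOT_ID_PATH = '/proc/sys/kernel/random/boot_id' # sysctl kernel.random.boot_id

# Returns [value, path] for the first readable, non-empty source, or nil.
def read_host_id(paths = [PRODUCT_UUID_PATH, BOOT_ID_PATH])
  paths.each do |path|
    begin
      value = File.read(path).strip
      return [value, path] unless value.empty?
    rescue SystemCallError
      # Permission denied, missing file, etc.: try the next source.
      next
    end
  end
  nil
end

id, source = read_host_id
if id
  puts "host id #{id} (from #{source})"
  warn 'boot_id changes at every reboot, so this ID is not stable' if source == BOOT_ID_PATH
else
  puts 'no host id source readable'
end
```

Run as an unprivileged user on Linux, this should report the boot ID as the source, which is the situation this issue describes; run as root, it should report the product UUID.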
Another problem is that the `/sys/class/dmi/id/product_uuid` implementation changes from kernel to kernel, as explained here: https://lkml.org/lkml/2012/3/26/331
From CentOS 6 to CentOS 7, for instance, this value is going to change (so it can create issues when quickly re-installing or upgrading the OS from CentOS 6 to 7).

I made a fix to the underlying library: shirou/gopsutil#603. At least it allows a fixed UUID for all Linux machines (we do not consider the "upgrade" from Linux to Windows here).
I propose to upgrade the lib shirou/gopsutil once shirou/gopsutil#603 is merged, so it will work as advertised for Linux hosts.
@mkeeler what do you think?