disable_host_node_id": false does not work as advertised on Linux #4914

Closed
pierresouchay opened this issue Nov 7, 2018 · 3 comments · Fixed by #4926
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner type/enhancement Proposed improvement or new feature

Comments

@pierresouchay
Contributor

On Linux, the file /sys/class/dmi/id/product_uuid cannot be read when Consul does not run as root.

Thus, the constant UUID does not work, as it then uses kernel.random.boot_id (a value that is constant only for a given boot).
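
To make the fallback described above concrete, here is a minimal sketch of a host-ID lookup that tries several files in order. It is an illustration only, not the actual gopsutil or Consul code: `read_host_id` and `LINUX_CANDIDATES` are hypothetical names, and the only assumptions taken from the system are the two paths (the `kernel.random.boot_id` sysctl is exposed as `/proc/sys/kernel/random/boot_id`).

```ruby
# Hypothetical sketch, NOT the gopsutil/Consul implementation.
# Candidate files, in order of preference. product_uuid is usually
# readable by root only; boot_id is world-readable but changes at
# every boot.
LINUX_CANDIDATES = {
  product_uuid: '/sys/class/dmi/id/product_uuid',
  boot_id: '/proc/sys/kernel/random/boot_id'
}.freeze

# Returns [id, source] for the first candidate that can be read and is
# not empty, or nil when none of them is usable.
def read_host_id(candidates = LINUX_CANDIDATES)
  candidates.each do |source, path|
    begin
      value = File.read(path).strip
      return [value.downcase, source] unless value.empty?
    rescue SystemCallError
      # Errno::EACCES (non-root process) or Errno::ENOENT (no such
      # file): silently try the next candidate.
      next
    end
  end
  nil
end
```

Run as a non-root user, a lookup like this would typically report `:boot_id` as the source, which is exactly why the resulting node ID does not survive a reboot.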

The big advantage of this value is to provide a constant identifier, even with Windows machines (since product_uuid was computed the same way).

Another problem is that the /sys/class/dmi/id/product_uuid implementation changes from kernel to kernel, as explained here: https://lkml.org/lkml/2012/3/26/331. From CentOS 6 to CentOS 7, for instance, this value changes, which can create issues when quickly re-installing or upgrading the OS from CentOS 6 to 7.

I made a fix to the underlying library: shirou/gopsutil#603. At least it gives a fixed UUID for all Linux machines (we do not consider the "upgrade" from Linux to Windows here).

I propose to upgrade the lib shirou/gopsutil once shirou/gopsutil#603 is merged, so it will work as advertised for Linux hosts.

@mkeeler what do you think?


pierresouchay added a commit to pierresouchay/consul that referenced this issue Nov 8, 2018
Bump `shirou/gopsutil` to include shirou/gopsutil#603

This will allow to have consistent node-id even when machine is reinstalled
when using `"disable_host_node_id": false`

It will fix hashicorp#4914 and allow having
the same node-id even when reinstalling a node from scratch. However,
it is only compatible with a single OS (installing to Windows will change
the node-id, but it seems acceptable).
@banks
Member

banks commented Nov 16, 2018

This seems like a sensible upgrade. It works on more machines. No reason not to want to merge that change.

I'm still trying to get my head around the expectations and deployment modes for all of these different UUID options.

I'm struggling to see actually what value host node IDs have over the random one we generate and store in the agent state.

Can you fill me in on why they are important in your use-case? It seems to me the only time they are beneficial is if you wipe away all the consul agent state on a host and then start Consul again, but keeping the same hostname and IP? Is that a real thing you need to do? If not then what value does a stable host ID provide?

@banks banks added type/enhancement Proposed improvement or new feature theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner labels Nov 16, 2018
@pierresouchay
Contributor Author

pierresouchay commented Nov 16, 2018

@banks In order to be able to rename a node (something we want to do, because our original naming is using the FQDN which is not compatible with DNS labels), we need constantId, but the change I introduced in #4415 had another nice effect:

On our platform, we re-install bare-metal nodes very often while keeping the same name (a re-install takes less than 1 hour), so after #4415 I figured out that nodes had conflicts when being re-installed. I then found two bugs:

  • /sys/class/dmi/id/product_uuid not being stable across kernel versions (centos6 vs centos7)
  • /sys/class/dmi/id/product_uuid not being readable by non-root process

Hence this PR, to mitigate the issue and be closer to the specification.

Note that on Windows, I figured out that the value returned is not stable either (it depends on the installation).

We ended up using the same algorithm as Consul, but generated by Chef from the value returned by dmidecode, because our nodes go from CentOS 6 to CentOS 7 during our process (this way we are sure to keep a constant ID during the whole re-install process). So we are now doing something like this:

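require 'digest'

# Chef helper: `node` is the Chef node object, whose DMI data comes from dmidecode.
def generate_node_id_for_consul
  uuid = node['dmi']['system']['uuid']
  raise "Cannot get node['dmi']['system']['uuid']" unless uuid
  # Anchor both ends so that trailing garbage is rejected as well
  raise "Wrong format for UUID #{uuid}" unless uuid =~ /\A[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\z/i
  # Same algorithm as Consul: SHA-512 of the lowercased UUID,
  # with the first 32 hex characters laid out as a UUID
  h = Digest::SHA512.hexdigest(uuid.downcase)
  "#{h[0..7]}-#{h[8..11]}-#{h[12..15]}-#{h[16..19]}-#{h[20..31]}"
end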
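
For anyone who wants to try the derivation outside of Chef, the same steps can be sketched as a standalone function. This is only an illustration of the snippet above: `node_id_from_uuid` and `UUID_PATTERN` are hypothetical names, and the sample UUIDs are made up.

```ruby
require 'digest'

# Hypothetical names, for illustration only.
# \h matches one hexadecimal digit (upper or lower case).
UUID_PATTERN = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/

# Derives a UUID-shaped node ID from a hardware UUID, following the
# same steps as the Chef helper above: lowercase, SHA-512, then the
# first 32 hex characters grouped as 8-4-4-4-12.
def node_id_from_uuid(uuid)
  unless uuid.is_a?(String) && uuid.match?(UUID_PATTERN)
    raise ArgumentError, "Wrong format for UUID #{uuid.inspect}"
  end
  h = Digest::SHA512.hexdigest(uuid.downcase)
  "#{h[0..7]}-#{h[8..11]}-#{h[12..15]}-#{h[16..19]}-#{h[20..31]}"
end

# Made-up sample values: the same hardware always maps to the same
# node ID, whatever the letter case reported by the firmware.
a = node_id_from_uuid('4C4C4544-0042-4A10-8052-B4C04F4D3732')
b = node_id_from_uuid('4c4c4544-0042-4a10-8052-b4c04f4d3732')
puts a == b
```

Because the input is the hardware UUID rather than anything stored on disk, the result does not depend on the agent state or on the installed OS, as long as the tool reading the UUID reports it consistently.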

But this is an issue for anyone who re-installs bare-metal nodes often while keeping the same node name.

@pierresouchay
Contributor Author

@banks To add a few examples of this kind of issue, we also have issues with IPs.

For instance, imagine this machine being renamed with the following pattern (and imagine those changes happen quite quickly, for instance during the hardware validation process):

machine1.inventory.acme.com -> machine1.preprod.acme.com -> machine1.prod.acme.com

If we cannot rename, we have issues with IPs (when the IP does not change): Consul does not handle it well when machines keep their IPs but are actually different nodes. This mechanism allows us to avoid such issues.

Another typical use-case we have is the following:

HW with serial00001 aka node machine1.prod -> the machine breaks
We use a new HW temporarily with serial00042 aka node machine1.prod

Someone fixes serial00001 -> the machine is back in the game...

Hence the review #5008 in order to be able to choose the "best" use case for your DC and operations

mkeeler pushed a commit that referenced this issue Jan 10, 2019
…4926)

Bump `shirou/gopsutil` to include shirou/gopsutil#603

This will allow to have consistent node-id even when machine is reinstalled
when using `"disable_host_node_id": false`

It will fix #4914 and allow having
the same node-id even when reinstalling a node from scratch. However,
it is only compatible with a single OS (installing to Windows will change
the node-id, but it seems acceptable).