
consul-esm 0.6.0 & 0.6.1 don't actually check services from all namespaces #143

Closed
hsharma8693 opened this issue Jun 13, 2022 · 5 comments · Fixed by #191

@hsharma8693

Describe the bug
consul-esm 0.6.0 and 0.6.1 do not actually check services from all namespaces.

To Reproduce
Following the reproduction steps the customer suggested, I was able to reproduce the issue and observe the same behavior.

  • Register a node with meta external-node=true (e.g. using terraform, or manually via api)
resource "consul_node" "test" {

name = "external-test"

address = "127.0.0.1"

meta = {

"external-node" = "true"

}

}

  • Register a service against this node within a consul namespace (e.g. with terraform, or manually via api); nothing is actually running on localhost:1234
resource "consul_service" "test" {
name = "test"
node = consul_node.test
port = 1234

check {
check_id = "test:test"
name = "test"
status = "passing"
tcp = "localhost:1234"
interval = "30s"
timeout = "5s"
}
}

  • Run consul-esm with a consul policy that gives it read privileges to all namespaces, and all services in all namespaces (e.g. using a global-management token)

  • Expect the external node to be marked as critical, due to the test service not getting any response to probes on localhost:1234. Observe that the check never changes state from passing, and that logs do not show consul-esm adding a check for the test service at all.
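
For reference, a minimal sketch (not part of the original report) that confirms this per-namespace visibility using the official Go API client, github.com/hashicorp/consul/api. It assumes Consul Enterprise with namespaces enabled, the illustrative "test-ns" namespace from the Terraform above, and a management token in CONSUL_HTTP_TOKEN:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Honors CONSUL_HTTP_ADDR / CONSUL_HTTP_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List every namespace the token can see (Consul Enterprise only).
	namespaces, _, err := client.Namespaces().List(nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, ns := range namespaces {
		// Fetch all health checks, explicitly scoped to this namespace.
		checks, _, err := client.Health().State(api.HealthAny, &api.QueryOptions{Namespace: ns.Name})
		if err != nil {
			log.Fatal(err)
		}
		for _, c := range checks {
			fmt.Printf("ns=%s node=%s check=%s status=%s\n", ns.Name, c.Node, c.CheckID, c.Status)
		}
	}
}

Run against this repro, "test:test" shows up (stuck in passing) only under "test-ns"; the same query without an explicit Namespace never returns it, which matches what consul-esm sees.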

Expected behavior

More checks should be detected, since the customer has other services registered in non-default namespaces. They added a service registration with an initial check status of passing for a service that is not actually running, so they expected consul-esm to quickly switch the check to critical, but that is not happening.

Environment:

  • Product Version: consul-esm 0.6.0 & 0.6.1

Server configuration file(s) and the output of any relevant configuration:
Attached in repro steps

Is there a workaround? If so, how satisfied is the customer with the workaround?
NA

Additional context

The customer highlighted the following observations from their previous ticket #53909, where the same functionality (checks for all services across namespaces) was discussed; with consul-esm 0.6.0 and 0.6.1 it is still not addressed. They also provided a couple of pointers into the codebase.

I think what's actually been implemented is the ability for the consul-esm leader to find instances of the consul-esm service across all namespaces, for the purpose of dividing the check workload across all (healthy) instances:

https://github.com/hashicorp/consul-esm/blob/6302b2f315802949f7fec72b35fe4c738eaebffd/leader.go#L279 enumerates all namespaces looking for instances of a.Config.Service, which defaults to "consul-esm" (https://github.com/hashicorp/consul-esm/blob/main/config.go#L113).
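
A rough sketch of that leader-side enumeration (a sketch of the described behavior, not the actual leader.go code):

package main

import "github.com/hashicorp/consul/api"

// findESMInstances illustrates the pattern described above: look in every
// namespace for healthy instances of the ESM service itself (a.Config.Service,
// "consul-esm" by default) so the check workload can be divided among them.
func findESMInstances(client *api.Client, service string) ([]*api.ServiceEntry, error) {
	namespaces, _, err := client.Namespaces().List(nil)
	if err != nil {
		return nil, err
	}

	var instances []*api.ServiceEntry
	for _, ns := range namespaces {
		entries, _, err := client.Health().Service(service, "", true, &api.QueryOptions{Namespace: ns.Name})
		if err != nil {
			return nil, err
		}
		instances = append(instances, entries...)
	}
	return instances, nil
}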

However, https://github.com/hashicorp/consul-esm/blob/6302b2f315802949f7fec72b35fe4c738eaebffd/agent.go#L459, which does the work of checking the catalog for services to be monitored, has an explicit comment:

"All ESM health checks are node checks and in the 'default' namespace"

There, the API call that fetches the list of services is made without specifying a namespace, so it defaults to the "default" namespace. It is also not carried out inside a loop that would allow it to check more than one namespace. I think lines ~459-470 (of git master) need to be wrapped in a loop iterating over each namespace, as sketched below.
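
As a rough illustration of that proposal (function and variable names are placeholders, not consul-esm's actual identifiers), the unscoped query could be wrapped in the same kind of namespace loop the leader already uses:

package main

import "github.com/hashicorp/consul/api"

// fetchChecksAllNamespaces is a hypothetical reworking of the agent-side
// query: instead of one unscoped call, which the server resolves against the
// token's default namespace, iterate over every namespace the token can read.
func fetchChecksAllNamespaces(client *api.Client) (api.HealthChecks, error) {
	namespaces, _, err := client.Namespaces().List(nil)
	if err != nil {
		return nil, err
	}

	var all api.HealthChecks
	for _, ns := range namespaces {
		checks, _, err := client.Health().State(api.HealthAny, &api.QueryOptions{Namespace: ns.Name})
		if err != nil {
			return nil, err
		}
		all = append(all, checks...)
	}
	return all, nil
}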

Given the changelog comment
"Add support for Consul Namespaces and controlling which namespaces ESM monitors. [GH-115]"
I think it was intended that services from all namespaces be enumerated and subsequently monitored, but it was not implemented correctly.

Regards,
Himanshu Sharma
Sr. Support Engineer | Consul
HashiCorp

@eikenb eikenb added the bug label Jun 13, 2022
@eikenb eikenb added this to the v0.7.0 milestone Sep 14, 2022

eikenb commented Nov 15, 2022

Hey @hsharma8693!

Looking at ESM/Namespaces/ACLs, it looks like Consul-ESM might be designed to be run in a cluster, where you have a master/control ESM instance in the global namespace that detects and hands out the work to separate ESM instances, each running with a Token that was created in its Namespace.

Tokens created in a Namespace use that Namespace as their "default". So that comment line...

"All ESM health checks are node checks and in the 'default' namespace"

... means the "default" namespace granted by the Token (which falls back to global if no Namespace was given for the Token).
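
For illustration, a minimal sketch of minting such a namespaced token with the Go API client; "team-a" and "esm-policy" are made-up names, and the policy is assumed to already exist in that namespace:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Run with a management token in CONSUL_HTTP_TOKEN.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	token, _, err := client.ACL().TokenCreate(&api.ACLToken{
		Description: "esm instance scoped to team-a",
		Policies:    []*api.ACLTokenPolicyLink{{Name: "esm-policy"}},
		Namespace:   "team-a", // the token lives in, and defaults to, this namespace
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("secret:", token.SecretID)
}

An ESM instance started with this token would then monitor checks in "team-a" by default.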

Do you know if your customer has tried that setup?

Based on the notes it doesn't look like it. It looks like they are trying to use one global ESM instance to watch all Namespaces.

Thanks!


optiz0r commented Nov 15, 2022

Yes, I want to check services across all namespaces that the ACL token has permissions for. Currently ESM will form a cluster from instances of itself across all namespaces, but then only run checks for services registered in the default namespace (the one in which the ACL token is defined).

@eikenb eikenb added enhancement and removed bug labels Nov 16, 2022

eikenb commented Nov 16, 2022

Thanks for that @optiz0r!

I was looking at this as a bug report, so knowing it is actually working as intended, and that this is more a change in behavior than a fix, is a big help in making sense of things and figuring out what needs to be done.


optiz0r commented Nov 16, 2022 via email


eikenb commented Nov 16, 2022

It makes sense to me given the history. Namespace support was written as part of a larger patch set by a community member, was probably targeted at their use case, and that didn't get communicated well in the documentation.

I can see it as a bug from an external point of view, and I didn't mean to imply any sort of deception. It's just that, upon gaining a better understanding of the internals, it seemed more of a communications/documentation issue, and I appreciate the confirmation.
