-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add collect_server_info config option #9298
Conversation
envoy/datadog_checks/envoy/envoy.py
Outdated
if not self.collect_server_info: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems conflicting if metadata is enabled but collect_server_info is disabled. Either we can log a debug message (like "Skipping server info collection because collect_server_info
was disabled") or we can use this as a tracker to intuitively stop attempting to collect metadata between check runs if it's unreachable? Perhaps there is a specific message we're seeing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the case where someone might have enable_metadata_collection: true
in their agent config and collect_server_info:false
for the envoy check? I didn't know about the agent config option, that is a good point.
The idea is to bypass metadata collection due to known limitations on some Envoy deployments, avoid log spam if user knows that this is expected to fail (because it would be configured manually), while not needing to disable metadata collection all together (perhaps they want to collect it for other checks?). Displaying a debug statement sounds like a good idea.
There is no specific message, the endpoint is just unreachable. So although providing an intuitive solution that would not require any further user configuration might be nice, it would be difficult to avoid false positives where we cannot reach the endpoint for some other reason besides the underlying known limitations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO if the endpoint is unreachable, then we can check the response code and log a message in the first occurrence, then skip it after. It will attempt on the first run (after restart) so customers don't need to actively maintain a config option, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this approach is fine but if a user's envoy temporarily fails then it will raise a false positive, no? We would need to see repeated failures, rather than going off of first failure. Even then, repeated failures are not indicators of not wanting to collect_server_info.
If envoy is down, check fails, so does metadata, and thus it gets disabled. When envoy goes back up, check will still be running but not collecting metadata anymore, right?
Its essentially a tradeoff between doing it behind the scenes and risk catching false positives for less friction of configuring your instances. If the cases for false positives seem rare to you, then I would say it makes sense to go with your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I see the collect_metadata()
function is called first in the check()
. If it's moved after the stats_url, then an unavailable instance would continue not cause the collect_server_info to be set to False. But I agree that this approach is safer since we're letting the user control it.
My main concern was if this may lead to more integrations needing a config option to disable metadata collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
content looks fine 👍
I have tested flag this on Istio/Kubernetes with version datadoghq/agent:7.27.0
But I am still seeing attempts to contact a server_info endpoint. Furthermore, it ignores my stats_url, which I would assume being a base_url for other endpoints.
|
Hi @boeboe what is the version of the |
What does this PR do?
Add a config option to Envoy check to disable checking the
/server_info
endpoint to collect metadata. This cannot be accessed in some configurations of Envoy, like consul connect or kubernetes deployments that only expose the/stats
endpoint.Motivation
Users will be spammed with logs otherwise:
Additional Notes
Review checklist (to be filled by reviewers)
changelog/
andintegration/
labels attached