Add collect_server_info config option #9298

Merged
merged 5 commits into from
May 6, 2021

Conversation

luisgonzalex
Contributor

@luisgonzalex luisgonzalex commented May 5, 2021

What does this PR do?

Add a config option to the Envoy check that disables querying the /server_info endpoint to collect metadata. This endpoint cannot be reached in some Envoy configurations, such as Consul Connect or Kubernetes deployments that only expose the /stats endpoint.

Motivation

Users will be spammed with logs otherwise:

2021-04-27 20:44:34 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:706dc0e2f7ee4a62 | (envoy.py:156) | Envoy endpoint `http://172.17.0.9:90/server_info` responded with HTTP status code 404

Additional Notes

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • PR title must be written as a CHANGELOG entry (see why)
  • File changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have changelog/ and integration/ labels attached

@luisgonzalex luisgonzalex changed the title add collect_server_info config option Add collect_server_info config option May 5, 2021
@luisgonzalex luisgonzalex marked this pull request as ready for review May 5, 2021 18:04
@luisgonzalex luisgonzalex requested review from a team as code owners May 5, 2021 18:04
Comment on lines 148 to 149
if not self.collect_server_info:
    return
Contributor

It seems contradictory if metadata collection is enabled but collect_server_info is disabled. We could either log a debug message (like "Skipping server info collection because collect_server_info was disabled") or use this flag as a tracker to automatically stop attempting to collect metadata between check runs when the endpoint is unreachable. Is there a specific message we're seeing?

Contributor Author

Do you mean the case where someone might have enable_metadata_collection: true in their agent config and collect_server_info: false for the Envoy check? I didn't know about the agent config option; that is a good point.

The idea is to bypass metadata collection because of known limitations on some Envoy deployments and to avoid log spam when the user knows the request is expected to fail (because the option would be configured manually), without needing to disable metadata collection altogether (perhaps they want to collect it for other checks?). Logging a debug statement sounds like a good idea.

There is no specific message; the endpoint is just unreachable. So although an automatic solution that requires no further user configuration might be nice, it would be difficult to avoid false positives where we cannot reach the endpoint for some reason other than the known limitations.
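The behavior discussed above (skip the /server_info request when the option is off, and say so at debug level) can be sketched roughly as follows. This is a minimal illustration, not the merged code: the class name `EnvoySketch`, the injected `fetch_server_info` callable, and the default of `True` are assumptions made for the example.

```python
import logging

log = logging.getLogger("envoy_sketch")


class EnvoySketch:
    """Hypothetical stand-in for the Envoy check; not the real class."""

    def __init__(self, instance, fetch_server_info):
        # Assumption: the option defaults to True so existing setups keep
        # collecting metadata unless the user opts out.
        self.collect_server_info = instance.get("collect_server_info", True)
        # Hypothetical callable that would query /server_info.
        self._fetch_server_info = fetch_server_info

    def collect_metadata(self):
        if not self.collect_server_info:
            # Debug message suggested in the review thread.
            log.debug(
                "Skipping server info collection because "
                "collect_server_info was disabled"
            )
            return None
        return self._fetch_server_info()
```

With `collect_server_info: false` in the instance, `collect_metadata()` returns early and never issues the request, so no 404 is logged.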

Contributor

IMO if the endpoint is unreachable, we can check the response code and log a message on the first occurrence, then skip it afterwards. It would try again on the first run after a restart, so customers don't need to actively maintain a config option. wdyt?

Contributor Author

@luisgonzalex luisgonzalex May 5, 2021

I think this approach is fine, but if a user's Envoy temporarily fails, it will produce a false positive, no? We would need to see repeated failures rather than going off the first failure. Even then, repeated failures do not indicate that the user wants collect_server_info disabled.

If Envoy is down, the check fails, so does metadata collection, and thus it gets disabled. When Envoy comes back up, the check will still be running but no longer collecting metadata, right?

It's essentially a tradeoff between handling it behind the scenes, at the risk of catching false positives, and the friction of configuring your instances. If the cases for false positives seem rare to you, then I would say it makes sense to go with your suggestion.
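To make the false-positive concern concrete, here is a rough sketch of the automatic alternative that was discussed and not adopted. The `ServerInfoTracker` class and its `max_failures` threshold are hypothetical and exist only for this illustration.

```python
class ServerInfoTracker:
    """Hypothetical helper: stop querying /server_info after repeated failures."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.consecutive_failures = 0
        self.enabled = True

    def record(self, status_code):
        # Once disabled, nothing re-enables collection until a restart
        # creates a fresh tracker.
        if not self.enabled:
            return
        if status_code == 200:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.enabled = False


# "Disable on first failure" corresponds to max_failures=1.
tracker = ServerInfoTracker(max_failures=1)
tracker.record(503)  # Envoy temporarily down
tracker.record(200)  # Envoy is back, but this result is ignored
```

After the single 503, `tracker.enabled` stays `False` even though Envoy has recovered, which is the scenario described above. Raising `max_failures` reduces that risk but cannot distinguish a long outage from an endpoint that is never exposed.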

Contributor

@ChristineTChen ChristineTChen May 6, 2021

Ahh, I see that collect_metadata() is called first in check(). If it's moved after the stats_url request, then an unavailable instance would not cause collect_server_info to be set to False. But I agree that this approach is safer since we're letting the user control it.

My main concern was whether this might lead to more integrations needing a config option to disable metadata collection.

Contributor

@apigirl apigirl left a comment

content looks fine 👍

@luisgonzalex luisgonzalex merged commit a9bec64 into master May 6, 2021
@luisgonzalex luisgonzalex deleted the lg/envoy-collect-si branch May 6, 2021 21:09
@luisgonzalex luisgonzalex mentioned this pull request May 7, 2021
4 tasks
@boeboe

boeboe commented May 21, 2021

I have tested this flag on Istio/Kubernetes with version datadoghq/agent:7.27.0

annotations:
  ad.datadoghq.com/fortio.check_names: '["envoy"]'
  ad.datadoghq.com/fortio.init_configs: '[{}]'
  ad.datadoghq.com/fortio.instances: '[{"stats_url": "http://%%host%%:15090/stats/prometheus", "collect_server_info": false, "parse_unknown_metrics": true}]'

But I am still seeing attempts to contact a server_info endpoint. Furthermore, it ignores my stats_url, which I would assume to be the base URL for other endpoints.

2021-05-21 13:15:22 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:15:37 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:15:52 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:16:07 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:16:22 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:16:31 UTC | CORE | INFO | (pkg/logs/input/file/scanner.go:268 in restartTailerAfterFileRotation) | Log rotation happened to  /var/log/pods/fortio_fortio-client-689fc8dc56-dkgfk_ede56b01-6f35-43d1-9e7a-73173f0fdecb/istio-proxy/0.log
2021-05-21 13:16:31 UTC | CORE | INFO | (pkg/logs/input/file/tailer_nix.go:29 in setup) | Opening /var/log/pods/fortio_fortio-client-689fc8dc56-dkgfk_ede56b01-6f35-43d1-9e7a-73173f0fdecb/istio-proxy/0.log for tailer key /var/log/pods/fortio_fortio-client
2021-05-21 13:16:37 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:16:52 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404

@luisgonzalex
Contributor Author

luisgonzalex commented May 21, 2021

Hi @boeboe, what version of the Envoy check are you running? This feature was added in 1.22.0, so you would need to follow these steps to update the check version. Additionally, the check does not support the /stats/prometheus endpoint; you can only use the /stats endpoint. Hope this helps. If you have additional issues, feel free to open an issue via our repo's GitHub issues.
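The URLs in the two log excerpts in this thread are consistent with the server info URL being resolved relative to stats_url, for example with `urljoin` from the Python standard library. That is an assumption drawn from the logged URLs, not something the thread confirms about envoy.py, and the helper name `server_info_url` is hypothetical.

```python
from urllib.parse import urljoin


def server_info_url(stats_url):
    # Assumption: "server_info" is resolved relative to stats_url, so it
    # replaces only the last path segment rather than the whole path.
    return urljoin(stats_url, "server_info")


# A stats_url ending in /stats resolves to /server_info at the root.
print(server_info_url("http://172.17.0.9:90/stats"))

# A stats_url ending in /stats/prometheus resolves under /stats/ instead.
print(server_info_url("http://100.100.96.159:15090/stats/prometheus"))
```

Under that assumption, stats_url is not treated as a base URL: the last path segment is swapped out, which would explain why `/stats/prometheus` led to requests for `/stats/server_info`.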
