Add collect_server_info config option #9298

luisgonzalex · 2021-05-05T17:36:48Z

What does this PR do?

Add a config option to the Envoy check that disables querying the /server_info endpoint for metadata collection. This endpoint is not accessible in some Envoy configurations, such as Consul Connect or Kubernetes deployments that only expose the /stats endpoint.

Motivation

Without this option, users are spammed with log messages like this one:

2021-04-27 20:44:34 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:706dc0e2f7ee4a62 | (envoy.py:156) | Envoy endpoint `http://172.17.0.9:90/server_info` responded with HTTP status code 404

Additional Notes

Review checklist (to be filled by reviewers)

Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
PR title must be written as a CHANGELOG entry (see why)
Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
PR must have changelog/ and integration/ labels attached

ChristineTChen · 2021-05-05T18:46:36Z

envoy/datadog_checks/envoy/envoy.py

+        if not self.collect_server_info:
+            return


It seems conflicting if metadata is enabled but collect_server_info is disabled. Either we can log a debug message (like "Skipping server info collection because collect_server_info was disabled"), or we can track failures and automatically stop attempting to collect metadata between check runs if the endpoint is unreachable. Perhaps there is a specific message we're seeing?

Do you mean the case where someone has enable_metadata_collection: true in their agent config and collect_server_info: false for the Envoy check? I didn't know about the agent config option; that is a good point.

The idea is to bypass metadata collection because of known limitations on some Envoy deployments and to avoid log spam when the user knows the request is expected to fail (the option would be configured manually), without needing to disable metadata collection altogether (perhaps they want to collect it for other checks). Logging a debug statement sounds like a good idea.
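A minimal sketch of the guard plus debug message discussed here. The class, the `_fetch_server_info` helper, and the default of `True` are assumptions for illustration; only the option name, the early return, and the message text come from this thread.

```python
import logging

log = logging.getLogger("envoy_sketch")


class EnvoySketch:
    """Hypothetical stand-in for the check; not the real Envoy check class."""

    def __init__(self, instance):
        # Assumption: collection stays enabled unless the user opts out.
        self.collect_server_info = instance.get("collect_server_info", True)
        self.requests_made = []

    def _fetch_server_info(self):
        # Placeholder for the HTTP request to the server info endpoint.
        self.requests_made.append("server_info")
        return {"version": "unknown"}

    def collect_metadata(self):
        if not self.collect_server_info:
            # Explain the skip at debug level instead of failing noisily.
            log.debug("Skipping server info collection because collect_server_info was disabled")
            return None
        return self._fetch_server_info()
```

With `collect_server_info` set to false the endpoint is never requested, so the 404 message is not logged, while agent-level metadata collection stays untouched for other checks.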

There is no specific message; the endpoint is just unreachable. An automatic solution that requires no further user configuration would be nice, but it would be difficult to avoid false positives where we cannot reach the endpoint for some reason other than the known limitations.

IMO if the endpoint is unreachable, we can check the response code, log a message on the first occurrence, and skip it afterwards. It would attempt again on the first run after a restart, so customers don't need to actively maintain a config option. wdyt?

I think this approach is fine, but if a user's Envoy temporarily fails it will raise a false positive, no? We would need to see repeated failures rather than going off the first failure. Even then, repeated failures do not indicate that the user wants collect_server_info disabled.

If Envoy is down, the check fails, so does metadata collection, and thus it gets disabled. When Envoy comes back up, the check will still be running but no longer collecting metadata, right?

It's essentially a tradeoff: handling it behind the scenes means less friction when configuring instances, at the risk of catching false positives. If false positives seem rare to you, then I would say it makes sense to go with your suggestion.
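The tradeoff can be made concrete with a small sketch of the automatic alternative. `ServerInfoProbe` and `max_consecutive_failures` are hypothetical names; nothing like this was merged in this PR.

```python
class ServerInfoProbe:
    """Hypothetical tracker that stops probing after repeated failures."""

    def __init__(self, max_consecutive_failures=1):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.enabled = True

    def record(self, status_code):
        """Record the status code of one server info request."""
        if not self.enabled:
            return  # already disabled until the next restart
        if status_code == 200:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_consecutive_failures:
            self.enabled = False


# Disabling on the first failure: one temporary outage turns collection off for good.
first_failure = ServerInfoProbe(max_consecutive_failures=1)
first_failure.record(503)
first_failure.record(200)  # Envoy recovered, but the probe stays disabled

# Requiring repeated failures survives a blip, but is still only a heuristic.
repeated = ServerInfoProbe(max_consecutive_failures=3)
repeated.record(503)
repeated.record(200)
```

Either threshold infers intent from failures, which is the false-positive risk raised above; an explicit option avoids the guess at the cost of one more setting.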

Ahh, I see that collect_metadata() is called first in check(). If it were moved after the stats_url request, an unavailable instance would not cause collect_server_info to be set to False. But I agree that this approach is safer since we're letting the user control it.
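A sketch of the ordering described in this comment, assuming the automatic approach had been chosen. `run_check`, its callables, and the `state` dict are hypothetical; the real check() is not reproduced here.

```python
def run_check(fetch_stats, fetch_server_info_status, state):
    """Hypothetical check body: request stats first, probe server info second."""
    try:
        stats = fetch_stats()
    except ConnectionError:
        # Envoy itself is unreachable: the check fails here, before the
        # server info probe could be mistaken for a missing endpoint.
        return None
    if state["collect_server_info"] and fetch_server_info_status() == 404:
        # Stats are reachable but server info is not: stop probing.
        state["collect_server_info"] = False
    return stats
```

Because the stats request runs first, an outage returns early and leaves the flag alone; only an instance that serves stats but answers 404 for server info flips it.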

My main concern was whether this may lead to more integrations needing a config option to disable metadata collection.

apigirl

content looks fine 👍

boeboe · 2021-05-21T13:20:17Z

I have tested this flag on Istio/Kubernetes with datadoghq/agent:7.27.0

annotations:
  ad.datadoghq.com/fortio.check_names: '["envoy"]'
  ad.datadoghq.com/fortio.init_configs: '[{}]'
  ad.datadoghq.com/fortio.instances: '[{"stats_url": "http://%%host%%:15090/stats/prometheus", "collect_server_info": false, "parse_unknown_metrics": true}]'

But I am still seeing attempts to contact a server_info endpoint. Furthermore, it ignores my stats_url, which I would assume to be the base URL for other endpoints.

2021-05-21 13:15:22 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:15:37 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:15:52 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:16:07 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:16:22 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:16:31 UTC | CORE | INFO | (pkg/logs/input/file/scanner.go:268 in restartTailerAfterFileRotation) | Log rotation happened to  /var/log/pods/fortio_fortio-client-689fc8dc56-dkgfk_ede56b01-6f35-43d1-9e7a-73173f0fdecb/istio-proxy/0.log
2021-05-21 13:16:31 UTC | CORE | INFO | (pkg/logs/input/file/tailer_nix.go:29 in setup) | Opening /var/log/pods/fortio_fortio-client-689fc8dc56-dkgfk_ede56b01-6f35-43d1-9e7a-73173f0fdecb/istio-proxy/0.log for tailer key /var/log/pods/fortio_fortio-client
2021-05-21 13:16:37 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404
2021-05-21 13:16:52 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:124 in LogMessage) | envoy:ba219c52ec6b4d2 | (envoy.py:156) | Envoy endpoint `http://100.100.96.159:15090/stats/server_info` responded with HTTP status code 404

luisgonzalex · 2021-05-21T13:28:46Z

Hi @boeboe, which version of the Envoy check are you running? This feature was added in 1.22.0, so you would need to follow these steps to update the check version. Additionally, the check does not support the /stats/prometheus endpoint; you can only use the /stats endpoint. Hope this helps. If you have additional issues, feel free to open an issue in our repo's GitHub issues.
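The 404 URLs reported in this thread are consistent with the server info URL being resolved relative to stats_url rather than to the host root. The sketch below assumes that derivation (the thread does not confirm it) and uses a hypothetical helper, `server_info_url`; the first stats_url is likewise assumed, while the second combines the host from the logs with the path from the annotation.

```python
from urllib.parse import urljoin


def server_info_url(stats_url):
    # Assumption: "server_info" is resolved as a relative reference, so it
    # replaces only the last path segment of stats_url.
    return urljoin(stats_url, "server_info")


# A stats_url ending in /stats resolves to the admin-level endpoint.
print(server_info_url("http://172.17.0.9:90/stats"))

# A stats_url ending in /stats/prometheus keeps the /stats/ prefix.
print(server_info_url("http://100.100.96.159:15090/stats/prometheus"))
```

Under that assumption, stats_url is treated as a sibling of server_info rather than as a base URL, which would explain the /stats/server_info requests when the configured path is /stats/prometheus.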

add collect_server_info config option:

d08b76a

luisgonzalex added the changelog/Added label May 5, 2021

ghost added documentation integration/envoy labels May 5, 2021

luisgonzalex removed the documentation label May 5, 2021

sync config models

ab22c74

luisgonzalex added the documentation label May 5, 2021

add test for new config option, lint prev changes

005f9e8

luisgonzalex changed the title ~~add collect_server_info config option~~ Add collect_server_info config option May 5, 2021

luisgonzalex marked this pull request as ready for review May 5, 2021 18:04

luisgonzalex requested review from a team as code owners May 5, 2021 18:04

ChristineTChen reviewed May 5, 2021

luisgonzalex added 2 commits May 5, 2021 14:46

add debug log message

f76a906

fix style

bc031b4

apigirl approved these changes May 6, 2021

ChristineTChen approved these changes May 6, 2021

luisgonzalex merged commit a9bec64 into master May 6, 2021

luisgonzalex deleted the lg/envoy-collect-si branch May 6, 2021 21:09

luisgonzalex mentioned this pull request May 7, 2021

Add Envoy troubleshooting #9316

Merged

4 tasks

HadhemiDD mentioned this pull request May 25, 2023

Disable server info and version collection when collect_server_info is false #14610

Merged

5 tasks
