[envoy integration] Metrics missing #12855

Closed
Shuanglu opened this issue Sep 2, 2022 · 13 comments

@Shuanglu

Shuanglu commented Sep 2, 2022

Note: If you have a feature request, you should contact support so the request can be properly tracked.

Output of the info page

root@datadog-cluster-agent-69bc84c5c-rrkch:/# datadog-cluster-agent status
Getting the status from the agent.
2022-09-02 07:00:49 UTC | CLUSTER | WARN | (pkg/util/log/log.go:591 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec

===============================
Datadog Cluster Agent (v1.22.0)
===============================

  Status date: 2022-09-02 07:00:49.867 UTC (1662102049867)
  Agent start: 2022-08-30 08:45:54.797 UTC (1661849154797)
  Pid: 1
  Go Version: go1.17.11
  Build arch: amd64
  Agent flavor: cluster_agent
  Check Runners: 4
  Log Level: WARN

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System time: 2022-09-02 07:00:49.867 UTC (1662102049867)

  Hostnames
  =========
    ec2-hostname: ****
    host_aliases: [***]
    hostname: ****
    instance-id: ***
    socket-fqdn: datadog-cluster-agent-69bc84c5c-rrkch
    socket-hostname: datadog-cluster-agent-69bc84c5c-rrkch
    hostname provider: container
    unused hostname providers:
      aws: Unable to determine hostname from EC2: Get "http://169.254.169.254/latest/meta-data/instance-id": dial tcp 169.254.169.254:80: connect: connection refused
      azure: azure_hostname_style is set to 'os'
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: GCE metadata API error: Get "http://169.254.169.254/computeMetadata/v1/instance/hostname": dial tcp 169.254.169.254:80: connect: connection refused

  Metadata
  ========

Leader Election
===============
  Leader Election Status:  Running
  Leader Name is: datadog-cluster-agent-69bc84c5c-r6r98
  Last Acquisition of the lease: Fri, 26 Aug 2022 14:02:50 UTC
  Renewed leadership: Fri, 02 Sep 2022 07:00:41 UTC
  Number of leader transitions: 13 transitions

Custom Metrics Server
=====================

  Data sources
  ------------
  URL: https://api.datadoghq.com

  
  ConfigMap name: default/datadog-custom-metrics
  External Metrics
  ----------------
    Total: 0
    Valid: 0
    

Cluster Checks Dispatching
==========================
  Status: Follower, redirecting to leader at 10.42.224.6

Admission Controller
====================
  
    Webhooks info
    -------------
      MutatingWebhookConfigurations name: datadog-webhook
      Created at: 2022-06-01T07:04:25Z
      ---------
        Name: datadog.webhook.config
        CA bundle digest: 4a037a372da419e0
        Object selector: &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{LabelSelectorRequirement{Key:admission.datadoghq.com/enabled,Operator:NotIn,Values:[false],},},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: default/datadog-cluster-agent-admission-controller - Port: 443 - Path: /injectconfig
      ---------
        Name: datadog.webhook.tags
        CA bundle digest: 4a037a372da419e0
        Object selector: &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{LabelSelectorRequirement{Key:admission.datadoghq.com/enabled,Operator:NotIn,Values:[false],},},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: default/datadog-cluster-agent-admission-controller - Port: 443 - Path: /injecttags
  
    Secret info
    -----------
    Secret name: webhook-certificate
    Secret namespace: default
    Created at: 2022-06-01T07:04:25Z
    CA bundle digest: 4a037a372da419e0
    Duration before certificate expiration: 6528h3m34.106622362s

=========
Collector
=========

  Running Checks
  ==============
    
    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 16,860
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2022-09-02 07:00:42 UTC (1662102042000)
      Last Successful Execution Date : 2022-09-02 07:00:42 UTC (1662102042000)
      
    
    orchestrator
    ------------
      Instance ID: orchestrator:*** [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/orchestrator.d/conf.yaml.default
      Total Runs: 25,290
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2022-09-02 07:00:47 UTC (1662102047000)
      Last Successful Execution Date : 2022-09-02 07:00:47 UTC (1662102047000)
      
=========
Forwarder
=========

  Transactions
  ============
    Cluster: 0
    ClusterRole: 0
    ClusterRoleBinding: 0
    CronJob: 0
    DaemonSet: 0
    Deployment: 0
    Dropped: 0
    HighPriorityQueueFull: 0
    Ingress: 0
    Job: 0
    Node: 0
    PersistentVolume: 0
    PersistentVolumeClaim: 0
    Pod: 0
    ReplicaSet: 0
    Requeued: 300
    Retried: 94
    RetryQueueSize: 0
    Role: 0
    RoleBinding: 0
    Service: 0
    ServiceAccount: 0
    StatefulSet: 0

  Transaction Successes
  =====================
    Total number: 33719
    Successes By Endpoint:
      check_run_v1: 16,859
      intake: 1
      series_v1: 16,859

  Transaction Errors
  ==================
    Total number: 11
    Errors By Type:
      DNSErrors: 11

  On-disk storage
  ===============
    On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it.

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 1f056

=====================
Orchestrator Explorer
=====================
  Collection Status: Clusterchecks are activated but still warming up, the collection could be running on CLC Runners. To verify that we need the clusterchecks to be warmed up.
  Cluster Name: ***
  Cluster ID: ****
  Container scrubbing: enabled

  ======================
  Orchestrator Endpoints
  ======================
    https://orchestrator.datadoghq.com - API Key ending with: *****

  Status: Follower, cluster agent leader is: datadog-cluster-agent-69bc84c5c-r6r98

Additional environment details (Operating System, Cloud provider, etc):
There is a support case (901101), but it didn't make much progress.

Steps to reproduce the issue:

  1. I have Istio installed in my cluster and I need some metrics at the Envoy level, so I configured the annotations below on the app pods to scrape the Envoy metrics:
        ad.datadoghq.com/istio-proxy.check_names: '["envoy"]'
        ad.datadoghq.com/istio-proxy.init_configs: '[{}]'
        ad.datadoghq.com/istio-proxy.instances: |
            [
              {
                "openmetrics_endpoint": "http://%%host%%:15090/stats/prometheus",
                "histogram_buckets_as_distributions": "true",
                "log_requests": "true",
                "extra_metrics": 
                  [
                    {
                      "envoy_cluster_upstream_rq_time": 
                        {
                          "name": "cluster.upstream_rq_time"
                          "type": "histogram"
                        }
                    }
                  ]
              }
            ]
  2. Send some traffic from one pod to the other. The metrics below are visible at the Envoy metrics endpoint and in Prometheus.

Describe the results you received:
I could find these metrics at the endpoint, but I could not find them in the Datadog metrics explorer. Except for the first one, the others are included in your metrics dict:

  1. cluster.upstream_rq_time
  2. cluster.upstream_cx_rx_bytes_total
  3. cluster.upstream_cx_tx_bytes_total
  4. listener.downstream_cx_length_ms
  5. cluster.upstream_rq_xx (the raw metrics carry specific status codes; I'm guessing the agent will parse them?)
  6. Some metrics have data, but the values differ from the raw metrics or the Prometheus scrapes. Does Datadog do some aggregation in the metrics explorer?
  7. Support asked me to add 'status_url', but I guess it won't work for the v2 integration?
  8. Some metrics' 'type' is different from the type exposed by the pod; for example, 'counter' is converted to 'rate'. Is this expected, or is there a misconfiguration somewhere?

Describe the results you expected:
The agent scrapes those metrics and they appear in the Datadog metrics explorer.

Additional information you deem important (e.g. issue happens only occasionally):

@yzhan289
Contributor

yzhan289 commented Sep 9, 2022

Hi 👋, taking a look at your configuration and the missing metrics, I think the issue is that you are using the OpenMetrics implementation of the check rather than the legacy check. Except for the first one, the other metrics are listed here: https://github.com/DataDog/integrations-core/blob/7.38.2/envoy/metadata.csv.

If you want to collect those legacy metrics, can you take a look at the legacy configuration found here: https://github.com/DataDog/integrations-core/tree/7.33.x/envoy
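
For illustration, a minimal sketch of what legacy-style autodiscovery annotations could look like (the stats_url parameter and the 15000 admin port are assumptions based on the legacy check's example configuration; the Istio sidecar's admin interface may only listen on localhost, so verify against the linked 7.33.x README before relying on this):

    # Sketch only, assuming the legacy envoy check and the Envoy admin /stats endpoint
    ad.datadoghq.com/istio-proxy.check_names: '["envoy"]'
    ad.datadoghq.com/istio-proxy.init_configs: '[{}]'
    ad.datadoghq.com/istio-proxy.instances: |
        [
          {
            "stats_url": "http://%%host%%:15000/stats"
          }
        ]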

@yzhan289
Contributor

Let us know if you run into any issues, but I'll close this issue for now!

@Shuanglu
Author

The 'legacy' metrics you mentioned are listed in 'PROMETHEUS_METRICS_MAP'. Do they still need the 'legacy check'?
In addition, if only the legacy check works, does that mean the v2 integration collects different metrics than v1?

@burningalchemist

@yzhan289 I'm having the same issue, and I suspect the extra_metrics field has no effect. I believe envoy_cluster_upstream_rq_time is important to have as part of the integration, to balance the existing envoy.http.downstream_rq_time, while staying with openmetrics_endpoint. Would you mind reopening the issue?

@Shuanglu in the meantime did you find a solution?

@Shuanglu
Author

Shuanglu commented Nov 4, 2022

> @yzhan289 I'm having the same issue, and I suspect the extra_metrics field has no effect. I believe envoy_cluster_upstream_rq_time is important to have as part of the integration, to balance the existing envoy.http.downstream_rq_time, while staying with openmetrics_endpoint. Would you mind reopening the issue?
>
> @Shuanglu in the meantime did you find a solution?

nope... currently we use dogstatsd to submit our metrics...

@burningalchemist

@Shuanglu Oh I see, thanks. Yeah, that's a blocker for sure. 🤔

@yzhan289
Contributor

yzhan289 commented Nov 4, 2022

Hi, thanks @Shuanglu and @burningalchemist for bringing this up. I'll open a ticket internally to investigate the missing metrics and reopen this ticket. We will update this card if there are any new changes!

yzhan289 reopened this Nov 4, 2022
@yzhan289
Contributor

yzhan289 commented Nov 4, 2022

For listener.downstream_cx_length_ms, are you able to get listener.downstream_cx_length_ms.count? It looks like we are transforming the metrics matching downstream_cx with a .count suffix: https://github.com/DataDog/integrations-core/blob/7.38.2/envoy/datadog_checks/envoy/check.py#L82-L92. It doesn't look like this is happening for any metrics with upstream_cx, so that may be something we need to look into.

@burningalchemist

burningalchemist commented Nov 5, 2022

Hey @yzhan289, yes, listener.downstream_cx_length_ms.count works well 👍 It seems that the upstream_rq_time metric is simply ignored.

I'd also like to check whether we are talking about the same thing, since you're mentioning upstream connection metrics while I'm referring to upstream request metrics. I've shared the link below:
https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats#dynamic-http-statistics

@burningalchemist

Hey @yzhan289, any updates on the issue? 🙂

@yzhan289
Contributor

Hey @burningalchemist, unfortunately we don't have any updates on this.

@burningalchemist

@yzhan289, I've managed to make the extra_metrics field in the annotations work using one of the latest releases, and I also got the required hints from the linked PR. 👍 I think the issue can be closed.

@Shuanglu, let me know if you need any help or something, happy to share. 😃
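
For anyone landing on this later, a rough sketch of the shape such an annotation can take (illustrative only: the endpoint and port, the boolean literal, and the extra_metrics mapping follow the OpenMetrics-based check's documented options as I understand them, so double-check everything against the conf.yaml.example of the envoy check release you run):

    # Sketch only, assuming a recent release of the OpenMetrics-based envoy check
    ad.datadoghq.com/istio-proxy.check_names: '["envoy"]'
    ad.datadoghq.com/istio-proxy.init_configs: '[{}]'
    ad.datadoghq.com/istio-proxy.instances: |
        [
          {
            "openmetrics_endpoint": "http://%%host%%:15090/stats/prometheus",
            "histogram_buckets_as_distributions": true,
            "extra_metrics": [
              {
                "envoy_cluster_upstream_rq_time": {
                  "name": "cluster.upstream_rq_time",
                  "type": "histogram"
                }
              }
            ]
          }
        ]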

@yzhan289
Contributor

@burningalchemist Yay glad to hear! I will close this issue now.
