Datadog kafka_consumer integration flooding network IO #18983

Open
rllanger opened this issue Nov 5, 2024 · 2 comments
rllanger commented Nov 5, 2024

Hi,
We recently encountered an issue with one of our customers' Kafka clusters relating to the Datadog kafka_consumer integration (https://github.com/DataDog/integrations-core/tree/master/kafka_consumer).

The kafka_consumer Datadog module causes a large increase in network traffic in and out of the instances. We identified it as the cause by disabling the integration and restarting the Datadog agent. With the network diagnostic tool 'nethogs' we can see the Datadog agent process receiving large amounts of data (upwards of 500 MiB/s). We have been unable to reproduce the issue in our lab cluster, although that could be due to a lack of test data / consumer groups / general load.

We've encountered the issue on these versions:
We are running confluent kafka 7.7.1 (Apache Kafka® 3.7)
Datadog agent_version: 7.57.2
kafka_consumer (4.6.0)

We've tried updating to the latest versions and still see the same issue:
Datadog agent_version: 7.58.2
kafka_consumer (5.0.0)

I’ve included some examples of the kafka_consumer output from datadog-agent status for instances where the traffic is abnormally high:
Cluster1:

    kafka_consumer (4.6.0)
    ----------------------
      Instance ID: kafka_consumer:5835d1972801c612 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
      Total Runs: 123
      Metric Samples: Last Run: 40,000, Total: 4,873,019
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 7.449s
      Last Execution Date : 2024-11-03 15:24:46 UTC (1730647486000)
      Last Successful Execution Date : 2024-11-03 15:24:46 UTC (1730647486000)

Cluster2:

    kafka_consumer (5.0.0)
    ----------------------
      Instance ID: kafka_consumer:98057d281cbc9d57 [WARNING]
      Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
      Total Runs: 1
      Metric Samples: Last Run: 50,000, Total: 50,000
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 32.77s
      Last Execution Date : 2024-11-05 13:16:57 UTC (1730812617000)
      Last Successful Execution Date : 2024-11-05 13:16:57 UTC (1730812617000)

      Warning: Context limit reached. Skipping highwater offset collection.
      Warning: Discovered 75257 metric contexts - this exceeds the maximum number of 50000 contexts permitted by the
                check. Please narrow your target by specifying in your kafka_consumer.yaml the consumer groups, topics
                and partitions you wish to monitor.

/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml:

init_config:
  max_partition_contexts: 50000
instances:
  - kafka_connect_str: localhost:9092
    monitor_unlisted_consumer_groups: true
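For a back-of-the-envelope sense of where the 75,257 contexts in the warning above can come from, the check produces roughly two per-consumer series (consumer offset and lag) for every (group, topic, partition) tuple, plus one highwater series per (topic, partition). This is a sketch under that assumption; the exact per-context accounting may differ by integration version, and the input numbers below are purely illustrative:

```python
def estimate_contexts(group_topic_partitions: int, topic_partitions: int) -> int:
    """Rough metric-context count for the kafka_consumer check:
    2 series (offset + lag) per consumed (group, topic, partition) tuple,
    plus 1 highwater series per (topic, partition) on the cluster.
    """
    return 2 * group_topic_partitions + topic_partitions

# Illustrative input: 250 consumer groups each reading 50 topics of
# 3 partitions, on a cluster with 2,000 topic-partitions total.
print(estimate_contexts(250 * 50 * 3, 2000))  # 77000
```

With `monitor_unlisted_consumer_groups: true`, every group on the cluster contributes tuples to the first term, which is why the context count grows quickly on busy clusters.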
HadhemiDD (Contributor) commented

The issue here is the kafka_consumer data volume: you are collecting every consumer group exposed by the cluster with a single Datadog agent. This check is producing over 75,000 metric contexts, so max_partition_contexts would have to be increased further, but that means higher resource consumption for the agent and more network traffic.
One way to reduce this is to target specific consumer group names (regex or exact match).
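A narrowed configuration along those lines might look like the sketch below. The group and topic names are illustrative, and the exact keys (in particular the regex-based option) vary by integration version, so check the example conf.yaml shipped with your version:

```yaml
init_config:
  max_partition_contexts: 50000
instances:
  - kafka_connect_str: localhost:9092
    # Instead of monitor_unlisted_consumer_groups: true, list only
    # the groups you care about (exact match); names are examples.
    consumer_groups:
      payments-service: {}        # all topics/partitions for this group
      analytics-service:
        orders: []                # all partitions of the 'orders' topic
    # Or, on versions that support it, match groups by regex:
    # consumer_groups_regex:
    #   'payments-.*': {}
```

This caps the (group, topic, partition) tuples the check enumerates, which is what drives the context count.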


rllanger commented Nov 7, 2024

We didn't see this issue in the affected cluster prior to upgrading from Confluent Kafka 7.3 to 7.7.1.
The kafka_consumer.yaml is unchanged, as is the number of metrics.
An update within Kafka may have affected the consumer group metrics and the network usage involved in collecting them; the integration's traffic increased by more than 10x after 7.7.1. But I have not been able to identify what kind of change would be relevant.

@iliakur iliakur self-assigned this Nov 14, 2024