Datadog kafka_consumer integration flooding network IO #18983

rllanger · 2024-11-05T14:12:00Z

Hi,
We recently encountered an issue on one of our customer's Kafka clusters related to the Datadog kafka_consumer integration (https://github.com/DataDog/integrations-core/tree/master/kafka_consumer).

The kafka_consumer Datadog integration causes a large increase in inbound and outbound network traffic on the instances. We identified it as the cause by disabling the integration and restarting the Datadog agent, after which the traffic dropped. With the network diagnostic tool ‘nethogs’ we can see that the Datadog agent process is receiving a large amount of data (up to +500 MiB/s). We have been unable to reproduce the issue in our lab cluster, although that could be due to a lack of test data, consumer groups, or general load.

We’ve encountered the issue on these versions:
Confluent Kafka 7.7.1 (Apache Kafka® 3.7)
Datadog agent_version: 7.57.2
kafka_consumer (4.6.0)

We’ve tried updating to the latest versions and still see the same issue:
Datadog agent_version: 7.58.2
kafka_consumer (5.0.0)

I’ve included some examples of the kafka_consumer output from datadog-agent status for instances where the traffic is abnormally high:
Cluster1:

    kafka_consumer (4.6.0)
    ----------------------
      Instance ID: kafka_consumer:5835d1972801c612 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
      Total Runs: 123
      Metric Samples: Last Run: 40,000, Total: 4,873,019
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 7.449s
      Last Execution Date : 2024-11-03 15:24:46 UTC (1730647486000)
      Last Successful Execution Date : 2024-11-03 15:24:46 UTC (1730647486000)

Cluster2:

    kafka_consumer (5.0.0)
    ----------------------
      Instance ID: kafka_consumer:98057d281cbc9d57 [WARNING]
      Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
      Total Runs: 1
      Metric Samples: Last Run: 50,000, Total: 50,000
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 32.77s
      Last Execution Date : 2024-11-05 13:16:57 UTC (1730812617000)
      Last Successful Execution Date : 2024-11-05 13:16:57 UTC (1730812617000)

      Warning: Context limit reached. Skipping highwater offset collection.
      Warning: Discovered 75257 metric contexts - this exceeds the maximum number of 50000 contexts permitted by the
                check. Please narrow your target by specifying in your kafka_consumer.yaml the consumer groups, topics
                and partitions you wish to monitor.

/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml:

    init_config:
      max_partition_contexts: 50000
    instances:
      - kafka_connect_str: localhost:9092
        monitor_unlisted_consumer_groups: true

HadhemiDD · 2024-11-07T09:01:16Z

The issue here is related to the volume of kafka_consumer data. You are collecting every consumer group the cluster exposes using a single Datadog agent. The check discovers over 75,000 metric contexts, so max_partition_contexts would have to be increased further to collect them all. That, however, would mean higher resource consumption for the agent and more network traffic.
One way to reduce this is to monitor only specific consumer groups, selected by name (regex or exact match).
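
As a rough illustration of that suggestion, a narrowed configuration could look like the sketch below. The group, topic, and pattern names (`billing-service`, `orders`, `^payments-.*`) are hypothetical placeholders, and the `consumer_groups` / `consumer_groups_regex` layout is assumed from the integration's example config, so verify it against the `conf.yaml.example` shipped with your kafka_consumer version before using it.

```yaml
init_config:
  # Unchanged from the configuration posted above.
  max_partition_contexts: 50000

instances:
  - kafka_connect_str: localhost:9092
    # Stop auto-discovering every consumer group on the cluster.
    monitor_unlisted_consumer_groups: false
    # Exact match: consumer group -> topic -> partitions.
    # All names below are hypothetical examples.
    consumer_groups:
      billing-service:
        orders: [0, 1, 2]
    # Regex match; assumed to be available in your integration version.
    consumer_groups_regex:
      '^payments-.*':
        '^orders.*': [0, 1, 2]
```

Restricting the check this way should keep the discovered context count below the 50000 limit reported in the warning, provided the listed groups and topics together stay under that number.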

rllanger · 2024-11-07T09:08:19Z

We didn't see this issue in the affected cluster before upgrading from Confluent Kafka 7.3 to Confluent Kafka 7.7.1.
The kafka_consumer.yaml is unchanged, and so is the number of metrics.
There could be a change within Kafka that affected the consumer group metrics and the network usage of collecting them. The integration's network usage increased by more than 10x after the upgrade to 7.7.1, but I have not been able to identify which change would be relevant.

iliakur self-assigned this Nov 14, 2024
