Zombie behavior possible in Kafka consumer #2897

Closed
rootpd opened this issue Jun 8, 2017 · 2 comments
Labels
bug unexpected problem or unintended behavior

Comments

@rootpd

rootpd commented Jun 8, 2017

Bug report

Relevant telegraf.conf:

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"] 
  database = "beam" 
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"

[[inputs.kafka_consumer]]
  topics = ["beam_events"]
  zookeeper_peers = ["kafka:2181"]
  zookeeper_chroot = ""
  consumer_group = "beam_consumers"
  offset = "oldest"
  data_format = "influx"

System info:

Docker-based Telegraf running official telegraf:1.3 image.

Steps to reproduce:

  1. Start Kafka+Zookeeper and do not write anything to it yet.
  2. Start telegraf
  3. Telegraf will try to read messages from Kafka (the topic doesn't exist) and fail with the following:
telegraf_1  | 2017/06/08 07:40:11 I! Using config file: /etc/telegraf/telegraf.conf
influxdb_1  | [httpd] 172.18.0.7 - - [08/Jun/2017:07:40:07 +0000] "POST /query?q=CREATE+DATABASE+%22beam%22 HTTP/1.1" 200 62 "-" "-" b199a8cb-4c1d-11e7-8006-000000000000 2065
telegraf_1  | 2017-06-08T07:40:11Z I! Starting Telegraf (version 1.3.1)
telegraf_1  | 2017-06-08T07:40:11Z I! Loaded outputs: influxdb
telegraf_1  | 2017-06-08T07:40:11Z I! Loaded inputs: inputs.kafka_consumer
telegraf_1  | 2017-06-08T07:40:11Z I! Tags enabled: host=e84843934dd9
telegraf_1  | 2017-06-08T07:40:11Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"e84843934dd9", Flush Interval:10s
telegraf_1  | 2017-06-08T07:40:11Z I! Connected to 172.18.0.4:2181
telegraf_1  | 2017-06-08T07:40:11Z I! Authenticated: id=98101336743346179, timeout=4000
telegraf_1  | 2017-06-08T07:40:11Z I! Re-submitting `0` credentials after reconnect
telegraf_1  | 2017-06-08T07:40:11Z I! Started the kafka consumer service, peers: [kafka:2181], topics: [beam_events]
telegraf_1  | 2017-06-08T07:40:11Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming beam_events/-1: zk: node does not exist

Expected behavior:

Telegraf should either attempt to reconnect later or halt completely (so it can be restarted automatically by Docker, a supervisor, or any other tool). It should keep failing until the topic is created by the producer.
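
To make the expected behavior concrete, here is a minimal sketch of "retry later, then halt". It is not Telegraf's code (Telegraf is written in Go). The names `TopicMissingError`, `consume_with_retry`, `connect` and `main` are hypothetical and exist only for this illustration.

```python
import sys
import time


class TopicMissingError(Exception):
    """Hypothetical stand-in for the 'zk: node does not exist' failure."""


def consume_with_retry(connect, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially.

    Returns whatever connect() returns. Once max_attempts is exhausted,
    the last error is re-raised so the caller can halt the process.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except TopicMissingError:
            if attempt == max_attempts:
                raise
            # Wait base_delay * 1, 2, 4, ... seconds between attempts.
            sleep(base_delay * 2 ** (attempt - 1))


def main(connect, **retry_options):
    """Return a process exit status: 0 on success, 1 after giving up."""
    try:
        consume_with_retry(connect, **retry_options)
    except TopicMissingError as err:
        print(f"giving up: {err}", file=sys.stderr)
        return 1  # non-zero status signals failure to whatever supervises us
    return 0
```

A real entry point would pass the result of `main(...)` to `sys.exit`. Once the process exits, the `restart: "unless-stopped"` policy in the compose file below would have Docker start the container again, which is the "halt completely" half of the expectation.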

Actual behavior:

Telegraf keeps running but doesn't try to reconnect or take any other action. Docker Compose reports it as up. It becomes a zombie and has to be restarted manually.

Additional info:

Creating the topic manually helps: Telegraf stops returning the error and reads all pushed messages properly.

This is an excerpt of my docker-compose.yml:

version: "3"

services:
  kafka:
    image: "spotify/kafka"
    hostname: kafka
    environment:
        ADVERTISED_HOST: kafka
        ADVERTISED_PORT: 9092
    ports:
        - "9092"
        - "2181"
    volumes:
        - "kafka-data:/data"

  influxdb:
    image: "influxdb:1.2"
    volumes:
      - "influxdb-data:/var/lib/influxdb"
    ports:
      - "8083:8083"
      - "8086:8086"
    restart: "unless-stopped"

  telegraf:
    image: "telegraf:1.3"
    volumes:
      - "./Docker/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro"
    depends_on:
      - "influxdb"
      - "kafka"
    restart: "unless-stopped"

volumes:
  influxdb-data:
    driver: "local"
  kafka-data:
    driver: "local"
@danielnelson
Contributor

Thanks for the bug report. We just made some changes to the Kafka input on master in #2487, any chance you could test this using the wurstmeister/kafka image?
https://github.com/influxdata/telegraf/blob/master/Makefile#L49-L57

@danielnelson danielnelson added the bug unexpected problem or unintended behavior label Jun 8, 2017
@rootpd
Author

rootpd commented Jul 3, 2017

Hi @danielnelson, sorry for the late response.

I have tested current master with the image you suggested, and the new implementation seems to solve the issue. Instead of zombie behavior, I can see Telegraf correctly returning an error with the message "kafka server: Offset's topic has not yet been created." and halting.

I think we can close the issue. If you need additional info or anything else related, just reopen it.

@rootpd rootpd closed this as completed Jul 3, 2017