
[receiver/kafkametricsreceiver] collector crashes if Kafka is unavailable at startup #8349

Closed
mwear opened this issue Mar 9, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@mwear
Member

mwear commented Mar 9, 2022

Describe the bug
If Kafka is not available when the kafka metrics receiver attempts to start, the collector fails to start.

Steps to reproduce
Configure the kafka metrics receiver and start the collector without a running Kafka instance. Alternatively, you can use this docker-compose example.

If using the example, do the following:

git clone [email protected]:mwear/otel-collector-examples.git
cd otel-collector-examples/kafka-metrics-receiver
docker-compose up

What did you expect to see?
I expected a warning at the bare minimum; ideally, the receiver would keep trying to reconnect with a backoff strategy.

What did you see instead?
The collector exits with an error. Specifically, I saw this:

otel-collector | 2022-03-08T22:00:01.140Z	info	service/service.go:97	Starting receivers...
otel-collector | 2022-03-08T22:00:01.140Z	info	builder/receivers_builder.go:68	Receiver is starting...	{"kind": "receiver", "name": "kafkametrics"}
kafka        | 2022-03-08 22:00:01,516 INFO spawned: 'zookeeper' with pid 8
kafka        | 2022-03-08 22:00:01,518 INFO spawned: 'kafka' with pid 9
otel-collector | Error: cannot start receivers: failed to create client while starting brokers scraper: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
otel-collector | 2022/03/08 22:00:01 collector server run finished with error: cannot start receivers: failed to create client while starting brokers scraper: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
otel-collector exited with code 1

What version did you use?
v0.46.0

What config did you use?

receivers:
  otlp:
    protocols:
      http:
      grpc:
  
  kafkametrics:
    protocol_version: 2.0.0
    brokers: kafka:9092
    scrapers:
      - brokers
      - topics
      - consumers

exporters:
  logging:

processors:
  batch:

service:
  pipelines:
    metrics:
      receivers: [otlp, kafkametrics]
      processors: [batch]
      exporters: [logging]
  telemetry:
    logs:
      level: debug

Environment
The "official" collector contrib docker image

@mwear added the bug (Something isn't working) label on Mar 9, 2022
@jpkrohling
Member

I believe this is a duplicate of #4752 and I'm therefore closing this one. If you don't think this is the same issue, feel free to reopen.

@mwear
Member Author

mwear commented Mar 9, 2022

These issues are for different components. #4752 appears to be an issue filed against the Kafka Exporter. This report is for the Kafka Metrics Receiver.

@mwear
Member Author

mwear commented Mar 14, 2022

After looking into this further, I see two solutions (and there may be others).

Background

The kafkametricsreceiver defines three scrapers (broker scraper, consumer scraper, topic scraper) and uses the scraperhelper (from the collector core repo) to manage them.

The scraperhelper constructs a scraper controller that manages multiple scrapers.

Its start method calls start on each of the individual scrapers. If a scraper returns an error from its start method, it bubbles up and the collector fails to start.
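
For illustration, here is a simplified, hypothetical sketch of that start behavior. The controller, scraper, and startScraping names are stand-ins, not the exact scraperhelper code; the logger and retryInterval fields are only there for the proposed change sketched further below.

package scraperhelper

import (
    "context"
    "time"

    "go.opentelemetry.io/collector/component"
    "go.uber.org/zap"
)

// scraper is a narrowed stand-in for the collector's scraper interface;
// the real interface also carries Scrape and Shutdown.
type scraper interface {
    Start(ctx context.Context, host component.Host) error
}

type controller struct {
    scrapers      []scraper
    startScraping func()        // kicks off the periodic scrape loop
    logger        *zap.Logger   // used by the proposed change further below
    retryInterval time.Duration // used by the proposed change further below
}

// Start aborts collector startup as soon as any scraper fails to start.
func (c *controller) Start(ctx context.Context, host component.Host) error {
    for _, s := range c.scrapers {
        if err := s.Start(ctx, host); err != nil {
            return err // bubbles up and the collector exits with this error
        }
    }
    c.startScraping()
    return nil
}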

We can fix this by logging errors instead of returning them, but ideally the receivers would periodically try to start until the services they monitor are up.

Fix in the Kafka Metrics receiver

A scraper only needs to define a scrape method; a start method is optional. The three scrapers provided in the kafka metrics receiver currently define start methods, but we could rename them to something like initialize and call them from scrape rather than at start time. If the scraper has already been successfully initialized, this method can no-op. If initialization fails, the method can log the error and return early, and the scraper will attempt to reinitialize on the next scrape.
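
A minimal sketch of that pattern for the brokers scraper could look like the following. The struct layout, field names, and the pmetric import path are illustrative (they differ between collector versions); the point is that initialize is retried from scrape instead of being called once at start.

package kafkametricsreceiver

import (
    "context"
    "fmt"

    "github.com/Shopify/sarama"
    "go.opentelemetry.io/collector/pdata/pmetric"
    "go.uber.org/zap"
)

type brokerScraper struct {
    client       sarama.Client
    saramaConfig *sarama.Config
    brokers      []string
    logger       *zap.Logger
}

// initialize replaces the old start method. It no-ops once a client has
// been created successfully, so it is safe to call on every scrape.
func (s *brokerScraper) initialize() error {
    if s.client != nil {
        return nil
    }
    client, err := sarama.NewClient(s.brokers, s.saramaConfig)
    if err != nil {
        return fmt.Errorf("failed to create client: %w", err)
    }
    s.client = client
    return nil
}

func (s *brokerScraper) scrape(context.Context) (pmetric.Metrics, error) {
    // Retry initialization on every scrape until Kafka is reachable,
    // instead of failing collector startup.
    if err := s.initialize(); err != nil {
        s.logger.Warn("kafka brokers not reachable yet, will retry on next scrape", zap.Error(err))
        return pmetric.NewMetrics(), err
    }
    md := pmetric.NewMetrics()
    // ... build broker metrics from s.client.Brokers() ...
    return md, nil
}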

Fix in the Scraper Helper

We could fix this by changing the behavior of the start method in the scraper controller. Instead of returning errors when scrapers can't start, it could log them and retry later. The controller could defer starting the scrape loop until all of the scrapers have successfully started.
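
Roughly, reusing the hypothetical controller type sketched above (including its logger and retryInterval fields), the revised Start might look like this. The background retry loop and its interval are assumptions for illustration, not an existing scraperhelper feature.

// Start no longer bubbles scraper start errors up to the collector.
// Scrapers that fail to start are retried in the background, and the
// periodic scrape loop begins only once all of them have started.
func (c *controller) Start(ctx context.Context, host component.Host) error {
    go func() {
        pending := c.scrapers
        for len(pending) > 0 {
            var failed []scraper
            for _, s := range pending {
                if err := s.Start(ctx, host); err != nil {
                    // Log and retry later instead of failing collector startup.
                    c.logger.Warn("scraper failed to start, will retry", zap.Error(err))
                    failed = append(failed, s)
                }
            }
            pending = failed
            if len(pending) > 0 {
                time.Sleep(c.retryInterval)
            }
        }
        // Defer the scrape loop until every scraper has started.
        c.startScraping()
    }()
    return nil
}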

This change would have an impact on the receivers that use it. Currently that includes: apachereceiver, couchdbreceiver, dockerstatsreceiver, elasticsearchreceiver, googlecloudspannerreceiver, hostmetricsreceiver, kafkametricsreceiver, kubeletstatsreceiver, memcachedreceiver, mongodbatlasreceiver, mongodbreceiver, mysqlreceiver, nginxreceiver, podmanreceiver, postgresqlreceiver, rabbitmqreceiver, redisreceiver, windowsperfcountersreceiver, zookeeperreceiver.

We'd want to ensure that the new behavior would be acceptable for the receivers that currently use the scraper helper.

Next steps?

I wanted to get a discussion going to see what approach we should take to improve the behavior of the kafkametricsreceiver, and potentially other receivers, depending on which route we take. There might be other options worth considering, so if anyone has other ideas, please feel free to suggest them.

P.S. @jpkrohling, can you reopen this issue so we can continue this discussion in the context of the kafka metrics receiver?

@jpkrohling
Member

> can you reopen this issue so we can continue this discussion in the context of the kafka metrics receiver?

Sorry for missing this notification. Thanks @dmitryax for reopening!

@mwear
Member Author

mwear commented Apr 21, 2022

This was addressed in #8817.

@mwear closed this as completed on Apr 21, 2022