Fix Kafka output panic at startup #41824

Merged
merged 8 commits into from
Nov 28, 2024

cmacknz
Member

@cmacknz cmacknz commented Nov 28, 2024

In #40794 the Connectable interface was changed to include a context.Context argument. Any client whose Connect() was not updated to Connect(context.Context) now fails the NetworkClient interface check in

if nc, ok := client.(outputs.NetworkClient); ok {
c = &netClientWorker{
worker: w,
client: nc,
logger: logger,
tracer: tracer,
}

As a result, the Kafka and Redis outputs no longer counted as NetworkClients, which bypassed the initial call to Connect() that was previously made at

err := w.client.Connect(ctx)

For Kafka, because Connect is never called, no producer is created and the output panics on the first call to Publish.
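To illustrate the failure mode, here is a minimal sketch using simplified stand-in types. The `Client`, `staleClient`, `fixedClient`, and `start` names are hypothetical and the interfaces are much smaller than the real ones in the outputs package. It shows how a stale `Connect()` signature makes the type assertion fail silently at runtime, and how a compile-time assertion could catch the drift at build time instead.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Simplified stand-ins for the real interfaces (assumption: the real ones
// have more methods; only the shape relevant to this bug is kept).
type Client interface {
	Publish(event string) error
}

type Connectable interface {
	Connect(ctx context.Context) error
}

type NetworkClient interface {
	Client
	Connectable
}

// staleClient still has the old Connect() signature. It compiles on its own
// but no longer satisfies NetworkClient.
type staleClient struct{ connected bool }

func (c *staleClient) Connect() error { c.connected = true; return nil }
func (c *staleClient) Publish(string) error {
	if !c.connected {
		return errors.New("publish before connect")
	}
	return nil
}

// fixedClient has the updated signature and satisfies NetworkClient.
type fixedClient struct{ connected bool }

func (c *fixedClient) Connect(_ context.Context) error { c.connected = true; return nil }
func (c *fixedClient) Publish(string) error {
	if !c.connected {
		return errors.New("publish before connect")
	}
	return nil
}

// Compile-time guard: the build breaks if fixedClient stops satisfying
// NetworkClient. The same line for staleClient would not compile.
var _ NetworkClient = (*fixedClient)(nil)

// start mimics the worker setup: Connect is only called when the type
// assertion succeeds, so a stale client is published to unconnected.
func start(client Client) error {
	if nc, ok := client.(NetworkClient); ok {
		if err := nc.Connect(context.Background()); err != nil {
			return err
		}
	}
	return client.Publish("event")
}

func main() {
	fmt.Println("stale:", start(&staleClient{}))
	fmt.Println("fixed:", start(&fixedClient{}))
}
```

In this sketch the stale client returns an error where the real Kafka output panicked; the point is only that nothing fails until the first Publish.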

Raising this without tests so reviewers can check that nothing else was missed while I work out a test that isn't totally contrived. The problem is in the publisher pipeline, and the output-specific tests don't hook in at that level, which is how it was originally missed.

The PR has been updated with an integration test to reproduce the problem: #41824 (comment)

@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 28, 2024
@cmacknz cmacknz self-assigned this Nov 28, 2024
@cmacknz cmacknz requested a review from a team as a code owner November 28, 2024 17:16
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz
Member Author

cmacknz commented Nov 28, 2024

No changelog needed because the bug was never released. It is only on main, 8.x, and the 8.17 branch.

@cmacknz cmacknz added backport-8.x Automated backport to the 8.x branch with mergify backport-8.17 Automated backport with mergify labels Nov 28, 2024
@pierrehilbert pierrehilbert requested a review from rdner November 28, 2024 17:20
@cmacknz
Member Author

cmacknz commented Nov 28, 2024

I think the only thing that would catch this is an integration test. I am going to try to write one that looks like

func TestESOutputRecoversFromNetworkError(t *testing.T) {

I don't think we need an actual Kafka for the regression test, just a way to observe that we tried to connect to it. I will either leave port 9092 unbound or make or find a small Kafka broker.

Member

@rdner rdner left a comment

We need tests in a follow-up. Merging this PR should not close the reported issue.

@cmacknz
Member Author

cmacknz commented Nov 28, 2024

I have an integration test locally that uses a mock Kafka broker and catches the problem. I am just polishing it and will push soon.

@cmacknz
Member Author

cmacknz commented Nov 28, 2024

Updated with an integration test that reproduces the problem using Sarama's MockBroker to ensure we can successfully publish a message from mockbeat using the entire Beat pipeline, without needing to configure a real Kafka instance.

@@ -113,7 +113,7 @@ func newKafkaClient(
 	return c, nil
 }

-func (c *client) Connect() error {
+func (c *client) Connect(_ context.Context) error {

If you revert this change and run the integration test, it will fail with the panic in the original issue.

Comment on lines +31 to +34
// https://github.com/elastic/sarama/blob/c7eabfcee7e5bcd7d0071f0ece4d6bec8c33928a/config_test.go#L14-L17
// The version of MockBroker used when this test was written only supports the lowest protocol version by default.
// Version incompatibilities will result in message decoding errors between the mock and the beat.
kafkaVersion = sarama.MinVersion

Discovering the need for this cost me 3 hours 🙃

@cmacknz cmacknz enabled auto-merge (squash) November 28, 2024 22:28
@cmacknz cmacknz merged commit e42589d into elastic:main Nov 28, 2024
139 of 141 checks passed
mergify bot pushed a commit that referenced this pull request Nov 28, 2024
* Make Kafka output satisfy NetworkClient interface.

* Make Redis output satisfy network client.

* Add initial regression integration test.

* Add an integration test to ensure connectivity.

* Fix build error in old integration test.

* Fix redis lint error.

* Fix typo in comment.

* Fix another typo.

(cherry picked from commit e42589d)
@cmacknz cmacknz deleted the fix-kafka-output branch November 29, 2024 01:08
cmacknz added a commit that referenced this pull request Nov 29, 2024
* Make Kafka output satisfy NetworkClient interface.

* Make Redis output satisfy network client.

* Add initial regression integration test.

* Add an integration test to ensure connectivity.

* Fix build error in old integration test.

* Fix redis lint error.

* Fix typo in comment.

* Fix another typo.

(cherry picked from commit e42589d)

Co-authored-by: Craig MacKenzie <[email protected]>
Development

Successfully merging this pull request may close these issues.

[8.17] Kafka output panics unconditionally at startup