KAFKA-18646: Null records in fetch response breaks librdkafka #18726
Conversation
Ensure we always return empty records (including the cases where an error is returned). This behavior regressed in fe56fc9 (KAFKA-18269).
Failures are unrelated to this PR - rerunning to try for a green build anyway.
Thanks @ijuma, LGTM.
Took me a while to find the regression. It looks like empty records were previously set via `FetchResponse.recordsOrFail(partitionData)` in the now-removed `maybeConvertFetchedData` method. With this patch, we will only set records in `processResponseCallback` or when building an error response for a partition (`FetchResponse#partitionResponse`).
Should we add a null check on `records` in `FetchPartitionData` to prevent reintroducing this bug?
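A guard along these lines could catch the bug at construction time. This is a hypothetical sketch, not the real class: Kafka's actual `FetchPartitionData` carries error codes, offsets, aborted transactions, and uses a records abstraction rather than a byte array.

```java
import java.util.Objects;

// Simplified stand-in for a fetch partition result; the real Kafka class
// has many more fields. Illustrates the suggested null check on records.
final class FetchPartitionData {
    private final byte[] records;

    FetchPartitionData(byte[] records) {
        // Fail fast in tests rather than letting a null Records field
        // reach the wire and break clients such as librdkafka.
        this.records = Objects.requireNonNull(
            records, "records must not be null; use an empty batch instead");
    }

    byte[] records() {
        return records;
    }
}
```

Any code path that previously passed `null` would then fail immediately in unit tests instead of serializing a null field in the fetch response.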
@mumrah Thanks for sharing. You've saved me time :)
I completely agree that 4.0 should not break third-party libraries. However, this seems to conflict with the specification.
Is the suggestion to remove nullable from this field? I can try that change and check what fails.
We only changed this one place because it was the only place where it was incorrect outside of tests. This was confirmed after we made the change.
The only test failing is:
@@ -106,7 +106,7 @@
      ]},
      { "name": "PreferredReadReplica", "type": "int32", "versions": "11+", "default": "-1", "ignorable": false, "entityType": "brokerId",
        "about": "The preferred read replica for the consumer to use on its next fetch request."},
      { "name": "Records", "type": "records", "versions": "0+", "nullableVersions": "0+", "about": "The record data."}
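For reference, the fix this PR lands drops `nullableVersions` from the field definition in `FetchResponse.json`, leaving roughly:

```json
{ "name": "Records", "type": "records", "versions": "0+", "about": "The record data."}
```

With `nullableVersions` removed, the generated serialization code rejects a null value for this field instead of encoding it.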
Removing `nullableVersions` would prevent Kafka from "reading" null records. This could potentially pose a risk. Therefore, we might consider a strategy where Kafka strictly enforces that null records are never "written" while still maintaining the ability to "read" them.
In summary, I favor @mumrah's suggestion of adding a null check. Furthermore, we should include a comment explaining the rationale behind preventing Kafka from writing null records.
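The "strict on write, lenient on read" idea being discussed could look roughly like this. The class and method names are purely illustrative; in real Kafka this logic would live in the generated message classes.

```java
// Illustrative sketch of asymmetric handling for the Records field:
// never emit null ourselves, but tolerate it when parsing responses
// from other protocol implementations.
final class RecordsFieldHandling {
    // Write path: enforce that we never put null on the wire.
    static byte[] toWire(byte[] records) {
        if (records == null) {
            throw new IllegalStateException(
                "null records must never be written; use an empty batch");
        }
        return records;
    }

    // Read path: accept null from peers and normalize it to an empty
    // batch, preserving compatibility with any sender that emits null.
    static byte[] fromWire(byte[] records) {
        return records == null ? new byte[0] : records;
    }
}
```

The write side fails loudly (catching regressions like this one in tests), while the read side stays permissive for interoperability.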
Please elaborate on the risk - what is the exact use case where that would happen? I couldn't come up with one.
Keep in mind that released versions of Kafka never return null records - if that ever happened, it would have broken librdkafka. Same for any other implementation of the Kafka protocol.
Please elaborate on the risk - what is the exact use case where that would happen? I couldn't come up with one.
I believe `FetchResponse.json` should be considered part of the public interface. Consequently, modifying the "released spec" carries inherent risks. We cannot guarantee that no external implementations adhere to our specification. For instance, other server implementations might return null records, and after this PR, our 4.0 client would no longer be able to read them.
Keep in mind that released versions of Kafka never return null records
That is true, and it does not violate the spec, right? I mean, "Apache Kafka never returns null even though the spec says it is valid to return null".
I disagree; the actual spec here is that this field can never be null. As I said, there is no Kafka implementation that doesn't work with librdkafka and related clients - they constitute 30-50% of Kafka clients in the wild.
What happened here is that the spec definition was actually incorrect - it specified the field as nullable even though it couldn't be. And that bug meant we accidentally broke a large part of the ecosystem. I don't see the benefit in not fixing the spec.
Again, let's discuss a concrete use case - you're worried that the Java consumer might break because there is an unknown Kafka broker that implements this differently?
Again, let's discuss a concrete use case - you're worried that the Java consumer might break because there is an unknown Kafka broker that implements this differently?
I have no evidence that there is a broker implementation that returns null records. If librdkafka and related clients constitute 30-50% of clients, it is acceptable to fix the specification.
Since this behavior in librdkafka has existed for a long time, and we have not seen an issue with null records prior to the recent changes, I think it is safe to say we have never returned null records.
I think fixing the protocol spec makes sense.
LGTM
The flaky test will be fixed by #18735.
Thanks for the reviews! We're getting close to the end of this KIP-896 saga. :)
I added the following to KIP-896, so there is a record of this:
Ensure we always return empty records (including cases where an error is returned). We also remove `nullable` from `records` since it is effectively expected to be non-null by a large percentage of clients in the wild. This behavior regressed in fe56fc9 (KAFKA-18269). Empty records were previously set via `FetchResponse.recordsOrFail(partitionData)` in the now-removed `maybeConvertFetchedData` method. Added an integration test that fails without this fix and updated many tests to set `records` to `empty` instead of leaving them as `null`. Reviewers: Chia-Ping Tsai <[email protected]>, David Arthur <[email protected]>
…ibrdkafka-compressed-produce-fails

* apache-github/trunk:
  MINOR: prevent exception from HdrHistogram (apache#18674)
  KAFKA-18653: Fix mocks and potential thread leak issues causing silent RejectedExecutionException in share group broker tests (apache#18725)
  KAFKA-18646: Null records in fetch response breaks librdkafka (apache#18726)
  KAFKA-18619: New consumer topic metadata events should set requireMetadata flag (apache#18668)
  KAFKA-18488: Improve KafkaShareConsumerTest (apache#18728)