
KAFKA-18646: Null records in fetch response breaks librdkafka #18726

Merged
3 commits merged into apache:trunk from kafka-18646-null-records-breaks-librdkafka on Jan 29, 2025

Conversation

@ijuma ijuma (Member) commented Jan 28, 2025

Ensure we always return empty records (including cases where an error is returned). We also remove nullable from records since it is effectively expected to be non-null by a large percentage of clients in the wild.

This behavior regressed in fe56fc9 (KAFKA-18269). Empty records were previously set via FetchResponse.recordsOrFail(partitionData) in the now-removed maybeConvertFetchedData method.

Added an integration test that fails without this fix, and also updated many tests to set records to empty instead of leaving them as null.
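For readers tracing the change, the invariant being restored can be sketched roughly as follows. This is a minimal illustration only, assuming the generated `FetchResponseData.PartitionData` accessors and `MemoryRecords.EMPTY`; it is not the literal patch.

```java
// Sketch of the invariant this PR restores (not the actual patch):
// a partition entry in a FetchResponse must carry empty records, never null,
// even when it only reports an error.
import org.apache.kafka.common.message.FetchResponseData;
import org.apache.kafka.common.protocol.Errors;
import org.apache.kafka.common.record.MemoryRecords;

public final class FetchResponseSketch {
    static FetchResponseData.PartitionData errorPartition(int partitionIndex, Errors error) {
        return new FetchResponseData.PartitionData()
                .setPartitionIndex(partitionIndex)
                .setErrorCode(error.code())
                .setRecords(MemoryRecords.EMPTY); // never leave records as null
    }
}
```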

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Ensure we always return empty records (including the cases where an error is returned).
This behavior regressed in fe56fc9 (KAFKA-18269).
@github-actions github-actions bot added core Kafka Broker clients small Small PRs labels Jan 28, 2025
@ijuma ijuma requested a review from chia7712 January 28, 2025 08:08
@ijuma ijuma (Member Author) commented Jan 28, 2025

Failures are unrelated to this PR - rerunning to try for a green build anyway.

@ijuma ijuma requested a review from mumrah January 28, 2025 14:38
@mumrah mumrah (Member) left a comment

Thanks @ijuma, LGTM.

Took me a while to find the regression. Looks like empty records were previously set via FetchResponse.recordsOrFail(partitionData) in the now-removed maybeConvertFetchedData method. With this patch it looks like we will only set records in processResponseCallback or when building an error response for a partition (FetchResponse#partitionResponse)

Should we add a null check on records in FetchPartitionData to prevent reintroducing this bug?
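One possible shape for such a guard, purely as an illustration (the class below is a stand-in, not the real FetchPartitionData constructor):

```java
import java.util.Objects;
import org.apache.kafka.common.record.Records;

// Hypothetical sketch of the suggested null check; field and constructor are stand-ins.
public final class FetchPartitionDataSketch {
    private final Records records;

    public FetchPartitionDataSketch(Records records) {
        // Fail fast at construction time so a null records payload can never
        // reach the fetch response serialization path again.
        this.records = Objects.requireNonNull(records, "records must not be null; use MemoryRecords.EMPTY instead");
    }

    public Records records() {
        return records;
    }
}
```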

@chia7712 chia7712 (Member)

@mumrah Thanks for sharing. You've saved me time :)

Should we add a null check on records in FetchPartitionData to prevent reintroducing this bug?

I completely agree that 4.0 should not break third-party libraries. However, this seems to conflict with the specification, which declares `"nullableVersions": "0+"` for this field. I suggest adding a comment to explain the rationale behind the null check.

@ijuma ijuma (Member Author) commented Jan 28, 2025

Is the suggestion to remove nullable from this field? I can try that change and check what fails.

@github-actions github-actions bot added the kraft label Jan 28, 2025
@ijuma ijuma (Member Author) commented Jan 29, 2025

With this patch it looks like we will only set records in processResponseCallback or when building an error response for a partition (FetchResponse#partitionResponse)

We only changed this one place because it was the only place where it was incorrect outside of tests. This was confirmed after we made records non-nullable in the schema (only tests had to be updated).

@ijuma ijuma (Member Author) commented Jan 29, 2025

The only test failing is:

❌ KafkaAdminClientTest > testAdminClientApisAuthenticationFailure()

@ijuma ijuma (Member Author) commented Jan 29, 2025

@chia7712 @mumrah Let me know if this looks good to you now.

@@ -106,7 +106,7 @@
]},
{ "name": "PreferredReadReplica", "type": "int32", "versions": "11+", "default": "-1", "ignorable": false, "entityType": "brokerId",
"about": "The preferred read replica for the consumer to use on its next fetch request."},
{ "name": "Records", "type": "records", "versions": "0+", "nullableVersions": "0+", "about": "The record data."}
Member
Removing nullableVersions would prevent Kafka from "reading" null records. This could potentially pose a risk. Therefore, we might consider a strategy where Kafka strictly enforces that null records are never "written" while still maintaining the ability to "read" them.

In summary, I favor @mumrah's suggestion of adding a null check. Furthermore, we should include a comment explaining the rationale behind preventing Kafka from writing null records.
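A hypothetical illustration of that asymmetry (the helper names below are invented for the sketch and are not actual Kafka APIs):

```java
import org.apache.kafka.common.record.BaseRecords;
import org.apache.kafka.common.record.MemoryRecords;

// Hypothetical helpers sketching "never write null, tolerate null on read".
final class NullRecordsPolicy {
    private NullRecordsPolicy() { }

    // Write path: reject null outright so the broker never serializes it.
    static BaseRecords requireForWrite(BaseRecords records) {
        if (records == null) {
            throw new IllegalArgumentException("records must not be null when building a FetchResponse");
        }
        return records;
    }

    // Read path: stay lenient and normalize null from a non-conforming peer to empty.
    static BaseRecords normalizeOnRead(BaseRecords records) {
        return records == null ? MemoryRecords.EMPTY : records;
    }
}
```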

Member Author
Please elaborate on the risk - what is the exact use case where that would happen? I couldn't come up with one.

@ijuma ijuma (Member Author) Jan 29, 2025
Keep in mind that released versions of Kafka never return null records - if that ever happened, it would have broken librdkafka. Same for any other implementation of the Kafka protocol.

Member

Please elaborate on the risk - what is the exact use case where that would happen? I couldn't come up with one.

I believe FetchResponse.json should be considered part of the public interface. Consequently, modifying the "released spec" carries inherent risks. We cannot guarantee that no external implementations adhere to our specification. For instance, other server implementations might return null records, and after this PR, our 4.0 client would no longer be able to read them.

Keep in mind that released versions of Kafka never return null records

That is true, and it does not violate the spec, right? I mean, "Apache Kafka never returns null even though the spec says it is valid to return null".

@ijuma ijuma (Member Author) Jan 29, 2025

I disagree; the actual spec here is that this field can never be null. As I said, there is no Kafka implementation that doesn't work with librdkafka and related clients, which constitute 30-50% of Kafka clients in the wild.

What happened here is that the spec definition was actually incorrect - it specified the field as nullable even though it couldn't be. And that bug meant we accidentally broke a large part of the ecosystem. I don't see the benefit in not fixing the spec.

Again, let's discuss a concrete use case - you're worried that the Java consumer might break because there is an unknown Kafka broker that implements this differently?

Member

Again, let's discuss a concrete use case - you're worried that the Java consumer might break because there is an unknown Kafka broker that implements this differently?

I have no evidence that there is a broker implementation that returns null records. If librdkafka and related clients constitute 30-50% of clients, it is acceptable to fix the specification.

Member

Since this behavior in librdkafka has existed for a long time, and we have not seen an issue with null records prior to the recent changes, I think it is safe to say we have never returned null records.

I think fixing the protocol spec makes sense.

@chia7712 chia7712 (Member) left a comment

LGTM

The flaky test will be fixed by #18735.

@ijuma ijuma (Member Author) commented Jan 29, 2025

Thanks for the reviews! We're getting close to the end of this KIP-896 saga. :)

@ijuma ijuma merged commit ca5d2cf into apache:trunk Jan 29, 2025
7 of 9 checks passed
@ijuma ijuma deleted the kafka-18646-null-records-breaks-librdkafka branch January 29, 2025 15:04
@ijuma ijuma (Member Author) commented Jan 29, 2025

I added the following to KIP-896, so there is a record of this:

Finally, we will fix a protocol specification inconsistency: FetchResponse has a records field that is tagged as nullable even though it is always non-null and all librdkafka-based clients (which cover a large percentage of clients in the wild) break if this field is set to null. We adjust the spec to match reality - this way implementors do not have to find out about this requirement during testing with real clients. This small change is not strictly related to this KIP, but it was found during the testing phase of this KIP and hence we included it here.

ijuma added a commit that referenced this pull request Jan 29, 2025
Ensure we always return empty records (including cases where an error is returned).
We also remove `nullable` from `records` since it is effectively expected to be
non-null by a large percentage of clients in the wild.

This behavior regressed in fe56fc9 (KAFKA-18269). Empty records were
previously set via `FetchResponse.recordsOrFail(partitionData)` in the
now-removed `maybeConvertFetchedData` method.

Added an integration test that fails without this fix and also updated many
tests to set `records` to `empty` instead of leaving them as `null`.

Reviewers: Chia-Ping Tsai <[email protected]>, David Arthur <[email protected]>
ijuma added a commit to ijuma/kafka that referenced this pull request Jan 30, 2025
…ibrdkafka-compressed-produce-fails

* apache-github/trunk:
  MINOR: prevent exception from HdrHistogram (apache#18674)
  KAFKA-18653: Fix mocks and potential thread leak issues causing silent RejectedExecutionException in share group broker tests (apache#18725)
  KAFKA-18646: Null records in fetch response breaks librdkafka (apache#18726)
  KAFKA-18619: New consumer topic metadata events should set requireMetadata flag (apache#18668)
  KAFKA-18488: Improve KafkaShareConsumerTest (apache#18728)
Labels
clients core Kafka Broker kraft small Small PRs
3 participants