Spark: Structured Streaming read limit support follow-up #12260

Open
wants to merge 2 commits into
base: main
from

Conversation

wypoon
Contributor

@wypoon wypoon commented Feb 13, 2025

This fixes the TODO in #4479.
Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset(Offset, ReadLimit). In testing this, a bug was found in SparkMicroBatchStream::getDefaultReadLimit() and fixed.

Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset.
In addition, fix a bug.
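
For readers following along, here is a minimal sketch of what "use the ReadLimit passed in" can look like. The types `Limit`, `MaxFiles`, `MaxRows` and `Composite` are hypothetical stand-ins for Spark's `ReadLimit`, `ReadMaxFiles`, `ReadMaxRows` and `CompositeReadLimit`, so the example compiles without Spark on the classpath. The helper names and the "no cap" fallback values are assumptions, not code from this PR.

```java
// Stand-in types for illustration only; they are not Spark's classes.
final class ReadLimitResolution {
  interface Limit {}

  static final class MaxFiles implements Limit {
    final int maxFiles;

    MaxFiles(int maxFiles) {
      this.maxFiles = maxFiles;
    }
  }

  static final class MaxRows implements Limit {
    final long maxRows;

    MaxRows(long maxRows) {
      this.maxRows = maxRows;
    }
  }

  static final class Composite implements Limit {
    final Limit[] limits;

    Composite(Limit... limits) {
      this.limits = limits;
    }
  }

  // Returns the file cap carried by the limit, or Integer.MAX_VALUE if there is none (assumed default).
  static int resolveMaxFiles(Limit limit) {
    if (limit instanceof MaxFiles) {
      return ((MaxFiles) limit).maxFiles;
    }
    if (limit instanceof Composite) {
      for (Limit inner : ((Composite) limit).limits) {
        if (inner instanceof MaxFiles) {
          return ((MaxFiles) inner).maxFiles;
        }
      }
    }
    return Integer.MAX_VALUE;
  }

  // Returns the row cap carried by the limit, or Long.MAX_VALUE if there is none (assumed default).
  static long resolveMaxRows(Limit limit) {
    if (limit instanceof MaxRows) {
      return ((MaxRows) limit).maxRows;
    }
    if (limit instanceof Composite) {
      for (Limit inner : ((Composite) limit).limits) {
        if (inner instanceof MaxRows) {
          return ((MaxRows) inner).maxRows;
        }
      }
    }
    return Long.MAX_VALUE;
  }

  public static void main(String[] args) {
    Limit passedIn = new Composite(new MaxFiles(1), new MaxRows(2));
    System.out.println(resolveMaxFiles(passedIn)); // 1
    System.out.println(resolveMaxRows(passedIn)); // 2
  }
}
```

The point of the sketch is that the caps come from the limit argument itself rather than from fields set in the constructor, so whatever limit the caller hands over is the one that gets applied.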
@github-actions github-actions bot added the spark label Feb 13, 2025
Comment on lines -461 to +505
- readLimits[1] = ReadLimit.maxRows(maxFilesPerMicroBatch);
+ readLimits[1] = ReadLimit.maxRows(maxRecordsPerMicroBatch);
Contributor Author

Bug!

Contributor

Thank you for catching this! It got missed because we don't take the ReadLimit we get from the latestOffset API, but rather use the configs set earlier in the constructor!

- public void testReadStreamOnIcebergTableWithMultipleSnapshots_WithNumberOfFiles_1()
-     throws Exception {
+ public void testReadStreamWithMaxFiles1() throws Exception {
Contributor Author

I renamed a few tests to be more concise. The old names were unwieldy and did not conform to Java style.

Comment on lines +227 to +232
assertThat(
        microBatchCount(
            ImmutableMap.of(
                SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1",
                SparkReadOptions.STREAMING_MAX_ROWS_PER_MICRO_BATCH, "2")))
    .isEqualTo(6);
Contributor Author

This fails without the fix to SparkMicroBatchStream::getDefaultReadLimit(), as Spark then calls SparkMicroBatchStream::latestOffset(Offset, ReadLimit) with a CompositeReadLimit where one of the ReadLimits is a ReadMaxRows(1).
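
To make the failure mode concrete, here is a simplified sketch of the default-limit construction before and after the fix, using the same option values as the test (max files 1, max rows 2). `Limit`, `MaxFiles` and `MaxRows` are hypothetical stand-ins for Spark's read-limit types, and the method names are assumptions; the real `getDefaultReadLimit()` may include conditions that are omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; they are not Spark's or Iceberg's classes.
final class DefaultReadLimitSketch {
  interface Limit {}

  static final class MaxFiles implements Limit {
    final int maxFiles;

    MaxFiles(int maxFiles) {
      this.maxFiles = maxFiles;
    }
  }

  static final class MaxRows implements Limit {
    final long maxRows;

    MaxRows(long maxRows) {
      this.maxRows = maxRows;
    }
  }

  // Pre-fix shape: the row limit is built from the file setting by mistake.
  static List<Limit> buggyDefaultLimits(int maxFilesPerMicroBatch, int maxRecordsPerMicroBatch) {
    List<Limit> limits = new ArrayList<>();
    limits.add(new MaxFiles(maxFilesPerMicroBatch));
    limits.add(new MaxRows(maxFilesPerMicroBatch)); // wrong variable
    return limits;
  }

  // Post-fix shape: the row limit is built from the row setting.
  static List<Limit> fixedDefaultLimits(int maxFilesPerMicroBatch, int maxRecordsPerMicroBatch) {
    List<Limit> limits = new ArrayList<>();
    limits.add(new MaxFiles(maxFilesPerMicroBatch));
    limits.add(new MaxRows(maxRecordsPerMicroBatch));
    return limits;
  }

  // Finds the row cap among the component limits, or Long.MAX_VALUE if none is present.
  static long rowCap(List<Limit> limits) {
    for (Limit limit : limits) {
      if (limit instanceof MaxRows) {
        return ((MaxRows) limit).maxRows;
      }
    }
    return Long.MAX_VALUE;
  }

  public static void main(String[] args) {
    System.out.println(rowCap(buggyDefaultLimits(1, 2))); // 1
    System.out.println(rowCap(fixedDefaultLimits(1, 2))); // 2
  }
}
```

Once the stream honors the limit it is handed, the wrong row cap of 1 changes how many micro-batches are planned, which is why the assertion only passes with the fix.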

@wypoon
Contributor Author

wypoon commented Feb 14, 2025

@singhpk234 @jackye1995 @RussellSpitzer this is a small fix; can you please review?

Contributor

@singhpk234 singhpk234 left a comment

Mostly LGTM, with a minor suggestion. Thanks @wypoon!

Comment on lines 325 to 330
for (int i = 0; i < limits.length; i++) {
  ReadLimit limit = limits[i];
  if (limit instanceof ReadMaxFiles) {
    return ((ReadMaxFiles) limit).maxFiles();
  }
}
Contributor

[minor] Can we use an enhanced for loop here?

Suggested change
- for (int i = 0; i < limits.length; i++) {
-   ReadLimit limit = limits[i];
-   if (limit instanceof ReadMaxFiles) {
-     return ((ReadMaxFiles) limit).maxFiles();
-   }
- }
+ for (ReadLimit limit : limits) {
+   if (limit instanceof ReadMaxFiles) {
+     return ((ReadMaxFiles) limit).maxFiles();
+   }
+ }

Contributor Author

Adopted.

@wypoon
Contributor Author

wypoon commented Feb 14, 2025

Thanks @singhpk234.

Contributor

@singhpk234 singhpk234 left a comment

LGTM, thanks @wypoon!


2 participants