Spark: Structured Streaming read limit support follow-up #12260

wypoon · 2025-02-13T23:36:49Z

This fixes the TODO in #4479.
Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset(Offset, ReadLimit). In testing this, a bug was found in SparkMicroBatchStream::getDefaultReadLimit() and fixed.

Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset. In addition, fix a bug.
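
A minimal sketch of the idea, not the actual patch: the limit Spark passes to latestOffset may be a single limit or a composite of several, so the stream has to search it for the file and row caps. The types below are simplified stand-ins for Spark's ReadLimit, ReadMaxFiles, ReadMaxRows, and CompositeReadLimit. The class and method names (ReadLimitSketch, maxFilesOf, maxRowsOf) are hypothetical, and falling back to the maximum value when a cap is absent is an assumption.

```java
public class ReadLimitSketch {
  // Simplified stand-ins for Spark's streaming read limit types (assumption: not the real classes).
  public interface ReadLimit {}

  public static final class ReadMaxFiles implements ReadLimit {
    private final int maxFiles;
    public ReadMaxFiles(int maxFiles) { this.maxFiles = maxFiles; }
    public int maxFiles() { return maxFiles; }
  }

  public static final class ReadMaxRows implements ReadLimit {
    private final long maxRows;
    public ReadMaxRows(long maxRows) { this.maxRows = maxRows; }
    public long maxRows() { return maxRows; }
  }

  public static final class CompositeReadLimit implements ReadLimit {
    private final ReadLimit[] limits;
    public CompositeReadLimit(ReadLimit... limits) { this.limits = limits; }
    public ReadLimit[] getReadLimits() { return limits; }
  }

  // Returns the file cap carried by the limit, or Integer.MAX_VALUE if none is present.
  public static int maxFilesOf(ReadLimit readLimit) {
    if (readLimit instanceof ReadMaxFiles) {
      return ((ReadMaxFiles) readLimit).maxFiles();
    }
    if (readLimit instanceof CompositeReadLimit) {
      for (ReadLimit limit : ((CompositeReadLimit) readLimit).getReadLimits()) {
        if (limit instanceof ReadMaxFiles) {
          return ((ReadMaxFiles) limit).maxFiles();
        }
      }
    }
    return Integer.MAX_VALUE;
  }

  // Returns the row cap carried by the limit, or Long.MAX_VALUE if none is present.
  public static long maxRowsOf(ReadLimit readLimit) {
    if (readLimit instanceof ReadMaxRows) {
      return ((ReadMaxRows) readLimit).maxRows();
    }
    if (readLimit instanceof CompositeReadLimit) {
      for (ReadLimit limit : ((CompositeReadLimit) readLimit).getReadLimits()) {
        if (limit instanceof ReadMaxRows) {
          return ((ReadMaxRows) limit).maxRows();
        }
      }
    }
    return Long.MAX_VALUE;
  }
}
```

In this sketch, a composite of ReadMaxFiles(1) and ReadMaxRows(2) resolves to a file cap of 1 and a row cap of 2, while a lone ReadMaxFiles leaves the row cap unbounded.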

wypoon · 2025-02-14T00:50:42Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java

-      readLimits[1] = ReadLimit.maxRows(maxFilesPerMicroBatch);
+      readLimits[1] = ReadLimit.maxRows(maxRecordsPerMicroBatch);



wypoon · 2025-02-14T00:52:27Z

...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreamingRead3.java

-  public void testReadStreamOnIcebergTableWithMultipleSnapshots_WithNumberOfFiles_1()
-      throws Exception {
+  public void testReadStreamWithMaxFiles1() throws Exception {


I renamed a few tests to be more concise. The old names were unwieldy and did not conform to Java naming style.

wypoon · 2025-02-14T00:54:50Z

...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreamingRead3.java

+    assertThat(
+            microBatchCount(
+                ImmutableMap.of(
+                    SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1",
+                    SparkReadOptions.STREAMING_MAX_ROWS_PER_MICRO_BATCH, "2")))
+        .isEqualTo(6);


This fails without the fix to SparkMicroBatchStream::getDefaultReadLimit(), because Spark then calls SparkMicroBatchStream::latestOffset(Offset, ReadLimit) with a CompositeReadLimit in which one of the ReadLimits is ReadMaxRows(1) rather than the configured ReadMaxRows(2).
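
To make the failure mode concrete, here is a small sketch that uses plain numbers instead of Spark's ReadLimit objects. The class and method names (DefaultReadLimitBugSketch, buggyDefaults, fixedDefaults) are hypothetical; each method returns the pair {file cap, row cap} that the default read limit would carry for the given settings.

```java
public class DefaultReadLimitBugSketch {
  // Before the fix: the row cap was mistakenly built from the max-files setting.
  public static long[] buggyDefaults(int maxFilesPerMicroBatch, int maxRecordsPerMicroBatch) {
    return new long[] {maxFilesPerMicroBatch, maxFilesPerMicroBatch};
  }

  // After the fix: the row cap comes from the max-records setting.
  public static long[] fixedDefaults(int maxFilesPerMicroBatch, int maxRecordsPerMicroBatch) {
    return new long[] {maxFilesPerMicroBatch, maxRecordsPerMicroBatch};
  }

  public static void main(String[] args) {
    // Same settings as the test above: max files = 1, max rows = 2.
    long[] buggy = buggyDefaults(1, 2);
    long[] fixed = fixedDefaults(1, 2);
    System.out.println("buggy row cap: " + buggy[1]);
    System.out.println("fixed row cap: " + fixed[1]);
  }
}
```

The bug stayed hidden while latestOffset ignored its ReadLimit argument and read the caps from the constructor configs; once the argument is honored, the wrong row cap changes how micro-batches are planned.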

wypoon · 2025-02-14T00:56:22Z

@singhpk234 @jackye1995 @RussellSpitzer this is a small fix; can you please review?

singhpk234

Mostly LGTM with a minor suggestion. Thanks, @wypoon!

singhpk234 · 2025-02-14T05:38:06Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java

+      for (int i = 0; i < limits.length; i++) {
+        ReadLimit limit = limits[i];
+        if (limit instanceof ReadMaxFiles) {
+          return ((ReadMaxFiles) limit).maxFiles();
+        }
+      }


[minor] Can we use an enhanced for loop here?

Suggested change

-      for (int i = 0; i < limits.length; i++) {
-        ReadLimit limit = limits[i];
-        if (limit instanceof ReadMaxFiles) {
-          return ((ReadMaxFiles) limit).maxFiles();
-        }
-      }
+      for (ReadLimit limit : limits) {
+        if (limit instanceof ReadMaxFiles) {
+          return ((ReadMaxFiles) limit).maxFiles();
+        }
+      }

singhpk234 · 2025-02-14T05:41:41Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java

-      readLimits[1] = ReadLimit.maxRows(maxFilesPerMicroBatch);
+      readLimits[1] = ReadLimit.maxRows(maxRecordsPerMicroBatch);


Thank you for catching this! It was missed because we don't use the ReadLimit passed to the latestOffset API, but rather the configs set earlier in the constructor.

wypoon · 2025-02-14T18:06:52Z

Thanks @singhpk234.

singhpk234

LGTM. Thanks, @wypoon!

Spark: Structured Streaming read limit support follow-up

5f2582f

Use the ReadLimit passed in to SparkMicroBatchStream::latestOffset. In addition, fix a bug.

github-actions bot added the spark label Feb 13, 2025


Use enhanced for loop.

7afd085

singhpk234 approved these changes Feb 16, 2025
