FileReader improvements #337
Conversation
final ListObjectsV2Request request = new ListObjectsV2Request().withBucketName(bucketName)
        .withMaxKeys(s3SourceConfig.getInt(FETCH_PAGE_SIZE) * PAGE_SIZE_FACTOR);
final ListObjectsV2Result objectListing = s3Client.listObjectsV2(request);
currentBatch = objectListing.getObjectSummaries()
This is where I thought we could make an improvement, by only querying S3 for objects after the last known processed file.
We could have a config item for the first run, but after that we could track the last processed file and start the next query after it, reducing the number of requests we make to S3 and starting up faster.
At the moment we stream every file in the S3 bucket every time we call fetchObjectSummaries().
This is a known issue and is a little out of scope for this PR. Will add the startAfter
property, as discussed, in the next PRs.
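
For illustration, a minimal sketch of what a startAfter-based listing could look like, assuming the connector keeps track of the key of the last processed object; the lastProcessedKey field below is hypothetical and not part of this PR:

```java
// Hypothetical sketch: resume the listing after the last processed key so we
// do not re-list objects that were already handled on a previous poll.
final ListObjectsV2Request request = new ListObjectsV2Request()
        .withBucketName(bucketName)
        .withMaxKeys(s3SourceConfig.getInt(FETCH_PAGE_SIZE) * PAGE_SIZE_FACTOR);

if (lastProcessedKey != null) {
    // lastProcessedKey is an assumed field, persisted and restored elsewhere.
    request.withStartAfter(lastProcessedKey);
}

final ListObjectsV2Result objectListing = s3Client.listObjectsV2(request);
```

One thing to keep in mind: S3 only honours StartAfter on the first page of a listing; once the response is truncated, the continuation token takes over for subsequent requests.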
How about using the S3ObjectSummaryIterator from my earlier work?
46fab1c
LGTM
        .filter(objectSummary -> !failedObjectKeys.contains(objectSummary.getKey())));
return s3ObjectStream.iterator();
Rather than hardcoding the filter, why not pass a Predicate to do the filtering? Or, outside the S3ObjectSummaryIterator, you could use the Apache Commons IteratorUtils.filteredIterator to apply a predicate to the iterator.
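
For reference, a minimal sketch of the filteredIterator suggestion, assuming commons-collections4 (org.apache.commons.collections4.IteratorUtils) is on the classpath; summaryIterator and failedObjectKeys stand in for whatever the reader already holds:

```java
// Sketch only: wrap an existing Iterator<S3ObjectSummary> and apply the
// failed-key filter from the outside rather than hard-coding it in the reader.
// summaryIterator and failedObjectKeys are assumed to exist in the caller.
final Iterator<S3ObjectSummary> filtered = IteratorUtils.filteredIterator(
        summaryIterator,
        objectSummary -> !failedObjectKeys.contains(objectSummary.getKey()));
```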
I agree with you, and I would like to have it done in a different way, I think. We should not need to pass the FileReader object here and there; we should just keep track of the files that have faulty records and filter them out later.
But this requires a bigger refactoring that is a little out of scope for this PR.
Thanks for bringing this up, though.
I think you can insert what we think might be the final approach into the current FileReader.
Actually, we can filter out bad records and record them inside a Predicate.
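
One possible shape for that idea, sketched with java.util.function.Predicate; every name in this snippet is illustrative rather than taken from the PR:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

import com.amazonaws.services.s3.model.S3ObjectSummary;

// Illustrative sketch: a predicate that filters out keys known to contain
// faulty records and exposes a hook for recording new failures, so the
// FileReader never has to be passed around just for filtering.
final class FailedKeyFilter implements Predicate<S3ObjectSummary> {
    private final Set<String> failedKeys = ConcurrentHashMap.newKeySet();

    void recordFailure(final String key) {
        failedKeys.add(key);
    }

    @Override
    public boolean test(final S3ObjectSummary summary) {
        return !failedKeys.contains(summary.getKey());
    }
}
```

The stream (or a filteredIterator) would then take an instance of this predicate, and whichever component detects a faulty record calls recordFailure on it.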
@Claudenw going to merge this and create an issue for follow-up. I think a wider refactoring sounds plausible in the future, perhaps when this is moved to commons.
Replacing the custom iterator, which has known issues raised in comments on previous PRs, with Java streams.
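
Pieced together from the diff fragments above, the stream-based listing roughly takes the following shape; this is a simplified sketch reusing names from the diff, with pagination handling omitted:

```java
// Simplified sketch of the new approach: list a page of objects, stream the
// summaries, drop keys that previously failed, and hand back an iterator.
final ListObjectsV2Request request = new ListObjectsV2Request()
        .withBucketName(bucketName)
        .withMaxKeys(s3SourceConfig.getInt(FETCH_PAGE_SIZE) * PAGE_SIZE_FACTOR);

final ListObjectsV2Result objectListing = s3Client.listObjectsV2(request);

final Stream<S3ObjectSummary> s3ObjectStream = objectListing.getObjectSummaries()
        .stream()
        .filter(objectSummary -> !failedObjectKeys.contains(objectSummary.getKey()));

return s3ObjectStream.iterator();
```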