
Combined set of changes for discussion, focussed on a new methodology for record grouping #319

Draft
mkeskells wants to merge 7 commits into main
Conversation

@mkeskells (Contributor) commented Oct 29, 2024

This is a very rough draft, and it would eventually be structured into several PRs. It's a merge of some existing open PRs, a few minor fixes, and a more major rework of the way that records are buffered and written to files. It's structured as a subclass to make it easier to reason about, but it may well end up as a separate sink connector, or be merged into the parent.

This PR is not expected to be merged in its current form; bits may be cherry-picked from this merge. It's looking for a review of the approach: whether it would work well in this connector, or whether it's too much of a change and should be forked into a private repo (which would be sad).

In summary, 087b4b2 is focussed on this approach:

  1. When files are full, write them
  2. When records are too old, write the files
  3. Don't hold records in memory if we don't have to
  4. Request commits from Kafka Connect when we have completed the write of a file (with a bit of debouncing), rather than waiting for the commit to be forced by timeout. This makes writing less batched
  5. Avoid OOM issues and reduce memory pressure by bringing forward the writing of files and, if needed, pausing consumption of records until writes have occurred
  6. Provide different writer models (currently just eager or lazy) that give different options on writer behaviour, resource pressure and timing. There are probably more options and tweaking to do here (see the sketch after this list)
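
Items 1 to 5 roughly map onto a flush/commit loop like the minimal sketch below. Everything named here (FlushPolicySketch, BufferedFile, the field names, the one-second debounce window) is an assumption made for illustration rather than the PR's actual classes; only the SinkTaskContext calls (requestCommit, pause, assignment) are real Kafka Connect API, and the real implementation lives in 087b4b2.

```java
// Illustrative sketch only: BufferedFile, the thresholds and the one-second debounce
// are assumptions for this example, not the PR's actual classes. Only the
// SinkTaskContext calls (requestCommit, pause, assignment) are real Kafka Connect API.
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkTaskContext;

final class FlushPolicySketch {
    private final Map<String, BufferedFile> openFiles = new ConcurrentHashMap<>();
    private final SinkTaskContext context;
    private final int maxRecordsPerFile;
    private final long maxRecordAgeMs;
    private final int maxBufferedRecords;
    private long lastCommitRequestMs;

    FlushPolicySketch(final SinkTaskContext context, final int maxRecordsPerFile,
            final long maxRecordAgeMs, final int maxBufferedRecords) {
        this.context = context;
        this.maxRecordsPerFile = maxRecordsPerFile;
        this.maxRecordAgeMs = maxRecordAgeMs;
        this.maxBufferedRecords = maxBufferedRecords;
    }

    /** Called from put(): write any file that is full (1) or whose oldest record is too old (2). */
    void maybeFlush(final long nowMs) {
        int buffered = 0;
        final Iterator<Map.Entry<String, BufferedFile>> it = openFiles.entrySet().iterator();
        while (it.hasNext()) {
            final BufferedFile file = it.next().getValue();
            if (file.recordCount() >= maxRecordsPerFile
                    || nowMs - file.oldestRecordTimestampMs() >= maxRecordAgeMs) {
                file.write();                  // write it ...
                it.remove();                   // ... and stop holding its records in memory (3)
                requestCommitDebounced(nowMs); // ask Connect to commit soon, not on its timeout (4)
            } else {
                buffered += file.recordCount();
            }
        }
        if (buffered > maxBufferedRecords) {
            // Back pressure (5): pause consumption until the backlog of writes has drained;
            // a matching resume() call (not shown) would follow once it has.
            context.pause(context.assignment().toArray(new TopicPartition[0]));
        }
    }

    private void requestCommitDebounced(final long nowMs) {
        if (nowMs - lastCommitRequestMs > 1_000L) { // debounce window chosen arbitrarily here
            context.requestCommit();
            lastCommitRequestMs = nowMs;
        }
    }

    /** Stand-in for whatever per-file buffering the connector actually uses. */
    interface BufferedFile {
        int recordCount();
        long oldestRecordTimestampMs();
        void write();
    }
}
```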

These changes could also be applied to S3 - most of the logic isn't GCS-specific, it's just batching and file writing.

This does change the semantics of the existing connector, at least in the following ways:

  • overwrites of records with the same key no longer occur - there isn't a batch in the same way (I don't need this behaviour, and it could be added)
  • different file names could be produced, so a reset and rewrite could cause duplicates.
    This can already happen with the existing record grouper if there is some reconfiguration after an incomplete write and restart.
    With these changes it could also occur because files are written earlier (based on timing) and the batching produces different groups. I don't need to avoid this, as I can remove duplicates downstream

(All of the above is in 087b4b2d9681578005dc97b61e7e4e2e6229656c.)

Prior commits are from other PRs; they are here because this builds upon them, or for the ease of my development.

Mike Skells added 7 commits October 8, 2024 13:23
from a header value
from a data field
via a custom extractor

Remove a few simple classes and make a DataExtractor to read things from the `sinkRecord`
and a few tidy-ups
Introduce a new property `file.record.grouper.builder` to specify a builder for a grouper
Enable the grouper to define additional properties and associated documentation

Minor refactors of the 'File' common configuration shared between S3 and GCS
introduce some more validators

Add tests for custom record grouper factory
add tests for additional config definition
… RecordGrouper, making it more of a component that tells the caller the group, rather than one that manages the group.

Allow full files to be written in the background
Allow files to be written when we reach a timeout (i.e. a max delay for a record)
Update the Kafka commits when we have written files, rather than waiting for the commit to be forced by timeout
Avoid OOM issues by having back pressure, so that we can flush or cause earlier writes if we have too many buffered records
Provide different writer models so that we can write data before the flush (potentially removing the memory pressure, depending on the writer)
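
The `file.record.grouper.builder` commit above makes the grouper pluggable. As a rough illustration of how such a property might be used, the sketch below builds a connector configuration by hand; the builder class com.example.MyRecordGrouperBuilder and the my.grouper.* key are invented, and the GcsSinkConnector class name is only assumed to be the connector's usual entry point.

```java
// Hypothetical connector configuration using the new `file.record.grouper.builder`
// property. The builder class and the `my.grouper.*` key are invented for
// illustration; a real builder would register its own keys and documentation.
import java.util.HashMap;
import java.util.Map;

final class CustomGrouperConfigExample {
    static Map<String, String> connectorProps() {
        final Map<String, String> props = new HashMap<>();
        props.put("connector.class", "io.aiven.kafka.connect.gcs.GcsSinkConnector");
        props.put("topics", "my-topic");
        props.put("file.record.grouper.builder", "com.example.MyRecordGrouperBuilder");
        // The builder can contribute extra, self-documented settings such as:
        props.put("my.grouper.max.records.per.group", "10000");
        return props;
    }
}
```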
@mkeskells mkeskells requested review from a team as code owners October 29, 2024 17:44
@mkeskells mkeskells marked this pull request as draft October 29, 2024 17:44
@aindriu-aiven (Contributor)
Thanks @mkeskells, I'll take a look through today. I have also reached out to some other long-time members and asked them to take a look!

@mkeskells (Contributor, Author)
@aindriu-aiven happy to get on a call or some other format to discuss this. I am sure that a PR is not the most effective way to progress this.

```java
} else {
    stream = recordGrouper.records().entrySet().stream();
}
stream.forEach(entry -> flushFile(entry.getKey(), entry.getValue()));
```
@aindriu-aiven (Contributor)
Hey @mkeskells,
I have been looking through this PR this morning. At the moment I have a PR open that initiates the upload to S3 on the put() operation, and once records are successfully uploaded they are removed from the record grouping; this is an effort to reduce memory usage and improve performance.
#318

I am trying to see how we can dovetail this with the record streamer you have proposed here, because I really like the addition of this functionality, especially adding a max record age for files. I was previously looking at extending the "rotator" functionality to handle this, and I would be interested in your thoughts.
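
As a rough sketch of the upload-on-put() idea described in this comment (upload groups as soon as they are ready, then drop them from the grouper), assuming invented RecordGrouperLike and Uploader interfaces rather than the actual #318 code:

```java
// Rough sketch of uploading on put() and removing uploaded records from the grouper;
// the RecordGrouperLike and Uploader names are placeholders, not the #318 implementation.
import java.util.Collection;
import java.util.List;

import org.apache.kafka.connect.sink.SinkRecord;

final class UploadOnPutSketch {
    private final RecordGrouperLike grouper;
    private final Uploader uploader;

    UploadOnPutSketch(final RecordGrouperLike grouper, final Uploader uploader) {
        this.grouper = grouper;
        this.uploader = uploader;
    }

    void put(final Collection<SinkRecord> records) {
        records.forEach(grouper::put);
        // Upload any group that is ready, then drop it from the grouper so the
        // records are no longer held in memory.
        for (final String filename : grouper.readyGroups()) {
            final List<SinkRecord> group = grouper.takeGroup(filename);
            uploader.upload(filename, group);
        }
    }

    interface RecordGrouperLike {
        void put(SinkRecord record);
        List<String> readyGroups();
        List<SinkRecord> takeGroup(String filename);
    }

    interface Uploader {
        void upload(String filename, List<SinkRecord> records);
    }
}
```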

@mkeskells (Contributor, Author)
Hi @aindriu-aiven, happy to see how we can do this.
Can we connect and progress this together? I have production deadlines (and I am not using S3).
I did design this change so that I think it should work in a common manner, not bound to a particular storage provider.

As I read this, I think that the eager writing model I have here is roughly what you have built for S3. Is that correct? I wanted to have the non-eager model as well, as that would allow record overwriting for duplicate keys.

I was also trying to ensure that what was written would work well with virtual threads and Loom. I know that we can't use them here yet, until we support a later JVM, but using the FJ pool and CompletableFuture allows this to scale later without having to rewrite continuations, and it's much easier to debug!
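
A minimal sketch of that shape, assuming hypothetical writeFile/recordFileWritten helpers: file writes are submitted as CompletableFuture tasks on the common ForkJoinPool, so a virtual-thread executor could be swapped in later without restructuring the code.

```java
// Illustrative only: shows the background-write shape using CompletableFuture on the
// common ForkJoinPool; writeFile/recordFileWritten are placeholders, not the PR's code.
import java.util.List;
import java.util.concurrent.CompletableFuture;

import org.apache.kafka.connect.sink.SinkRecord;

final class EagerWriterSketch {
    CompletableFuture<Void> writeInBackground(final String filename, final List<SinkRecord> records) {
        // Runs on ForkJoinPool.commonPool() by default; moving to a virtual-thread
        // executor later only means passing an Executor argument here.
        return CompletableFuture
                .runAsync(() -> writeFile(filename, records))
                .thenRun(() -> recordFileWritten(filename));
    }

    private void writeFile(final String filename, final List<SinkRecord> records) {
        // placeholder for the actual GCS/S3 upload
    }

    private void recordFileWritten(final String filename) {
        // placeholder: track completed files so offsets can be committed
    }
}
```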

@mkeskells (Contributor, Author)
It does seem that the goals and the concepts overlap.
I have connected with you on LinkedIn, so that we can discuss how to progress outside of a PR comment, which doesn't seem to be the correct forum.

```java
}

protected static int addFileConfigGroup(final ConfigDef configDef, final String groupFile, final String type,
```
@aindriu-aiven (Contributor)
I also have a PR open, #312, that will split AivenCommonConfig into a Source and a Sink common config, so that we can start building the source connectors to read back the data we have stored here. This is mostly a minor structural change, but I just wanted to give you a heads-up. This info does make sense to be moved to the Sink common config, imo.
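
A hypothetical shape for that split, assuming invented subclass names (only AivenCommonConfig itself is a real class in this repo), with sink-specific file configuration such as addFileConfigGroup moving to the sink side:

```java
// Hypothetical sketch of splitting the common config; the class names below are
// invented for illustration and do not match the actual #312 changes.
import java.util.Map;

import org.apache.kafka.common.config.AbstractConfig;
import org.apache.kafka.common.config.ConfigDef;

abstract class AivenCommonConfigSketch extends AbstractConfig {
    AivenCommonConfigSketch(final ConfigDef definition, final Map<String, String> originals) {
        super(definition, originals);
    }
    // settings shared by every connector (format, compression, ...) stay here
}

abstract class SinkCommonConfigSketch extends AivenCommonConfigSketch {
    SinkCommonConfigSketch(final ConfigDef definition, final Map<String, String> originals) {
        super(definition, originals);
    }
    // file naming / grouping settings such as addFileConfigGroup(...) would move here
}

abstract class SourceCommonConfigSketch extends AivenCommonConfigSketch {
    SourceCommonConfigSketch(final ConfigDef definition, final Map<String, String> originals) {
        super(definition, originals);
    }
    // settings for reading stored objects back would live here
}
```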

@mkeskells (Contributor, Author)

Happy to work with what you are doing. It does seem that the config is duplicated a lot, and some bits don't seem to be used.
What do you suggest as the best way to do this, without causing conflicts that we can avoid, @aindriu-aiven?
