
Add e2e acknowledgment and checkpointing to RDS source #4819

Merged
8 commits merged on Aug 26, 2024

Conversation

@oeyh (Collaborator) commented Aug 9, 2024

Description

This PR adds:

  • e2e acknowledgment to export processing
  • e2e acknowledgment and checkpointing to stream processing

Issues Resolved

Contributes to #4561

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@oeyh marked this pull request as ready for review on August 9, 2024, 15:59
@@ -56,6 +57,12 @@ public class RdsSourceConfig {
    @JsonProperty("acknowledgments")
    private boolean acknowledgments = false;

    @JsonProperty("s3_data_file_acknowledgment_timeout")
    private Duration dataFileAcknowledgmentTimeout = Duration.ofMinutes(5);
Member commented:

What is the average time taken to process a data file? Is this timeout sufficient?

Collaborator Author (@oeyh) replied:

It was pretty fast in my test, but that was not a large dataset. I guess we will need to find out and update accordingly. I will put a larger number here.

I see DocumentDB uses 2 hours by default. I wonder how that number was calculated?
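
A rough sketch of how this timeout could be wired into acknowledgment-set creation, assuming Data Prepper's AcknowledgementSetManager API; the getter name and the callback body are illustrative, not the PR's exact code:

    // Sketch only; the getter name and callback body are assumptions for illustration.
    final Duration ackTimeout = sourceConfig.getDataFileAcknowledgmentTimeout();  // defaults to 5 minutes
    final AcknowledgementSet acknowledgementSet = acknowledgementSetManager.create(
            success -> LOG.debug("Acknowledgment received for data file, success={}", success),
            ackTimeout);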

} else {
    do {
        currentChangeEventStatus = changeEventStatuses.poll();
    } while (!changeEventStatuses.isEmpty());
Member commented:

Is it possible to get stuck in this while loop?

Collaborator Author (@oeyh) replied Aug 16, 2024:

I see. If the add rate is equal to or higher than the poll rate, then yes. I can add another condition to exit the loop and checkpoint after a certain number of polls.
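
A minimal sketch of that bounded drain in plain Java; the bound's name and value are illustrative. Exiting after a fixed number of polls guarantees a checkpoint happens even when statuses are added as fast as they are polled:

    final int maxPollsPerCheckpoint = 1000;  // illustrative bound, not the PR's actual value
    int polls = 0;
    ChangeEventStatus currentChangeEventStatus = null;
    do {
        currentChangeEventStatus = changeEventStatuses.poll();
        polls++;
    } while (!changeEventStatuses.isEmpty() && polls < maxPollsPerCheckpoint);
    if (currentChangeEventStatus != null) {
        streamCheckpointer.checkpoint(currentChangeEventStatus.getBinlogCoordinate());
    }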

do {
    currentChangeEventStatus = changeEventStatuses.poll();
} while (!changeEventStatuses.isEmpty());
streamCheckpointer.checkpoint(currentChangeEventStatus.getBinlogCoordinate());
Member commented:

We should consider adding metrics for checkpoint count.
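
A sketch of what such a metric might look like using Data Prepper's PluginMetrics wrapper around Micrometer; the metric name checkpointCount is an assumption:

    // Registered once, e.g. in the constructor; the metric name is illustrative.
    private final Counter checkpointCounter = pluginMetrics.counter("checkpointCount");

    // ... then wherever a checkpoint is committed:
    streamCheckpointer.checkpoint(currentChangeEventStatus.getBinlogCoordinate());
    checkpointCounter.increment();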

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StreamCheckpointManager {
Member commented:

I don't see this used in the stream worker.

Collaborator Author (@oeyh) replied Aug 16, 2024:

The stream worker here just connects to the binlog stream. The stream events are processed in the BinlogEventListener class; you can find StreamCheckpointManager used there.

dinujoh previously approved these changes Aug 26, 2024
@@ -101,6 +129,26 @@ public void onEvent(com.github.shyiko.mysql.binlog.event.Event event) {
}
}

public void stopClient() {
    try {
        binaryLogClient.disconnect();
Member commented:

You may want to add a debug/info message when the client is disconnected. This will help during performance testing.

Collaborator Author (@oeyh) replied:

Added.
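
A minimal sketch of what the added logging might look like; the exact messages are assumptions:

    public void stopClient() {
        try {
            binaryLogClient.disconnect();
            LOG.info("Binary log client disconnected.");
        } catch (final IOException e) {
            LOG.error("Failed to disconnect binary log client.", e);
        }
    }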

    acknowledgementSet.add(transformedEvent);
}

bufferAccumulator.add(new Record<>(transformedEvent));
Contributor commented:

BufferAccumulator is not thread-safe. I would recommend instantiating it within the run method itself instead of passing it to this thread, or avoiding the accumulator altogether if it is not really required.

Collaborator Author (@oeyh) replied:

Good point! I moved the instantiation into the run method.
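
A sketch of the per-thread instantiation, assuming Data Prepper's BufferAccumulator.create factory; the batch-size and timeout constants are illustrative:

    @Override
    public void run() {
        // Created inside run() so each worker thread gets its own instance and no
        // accumulator is shared across threads. Constant names are assumptions.
        final BufferAccumulator<Record<Event>> bufferAccumulator =
                BufferAccumulator.create(buffer, DEFAULT_BUFFER_BATCH_SIZE, BUFFER_TIMEOUT);
        // ... transform events, then:
        // bufferAccumulator.add(new Record<>(transformedEvent));
    }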

} else {
    exportFileErrorCounter.increment();
    LOG.warn("Negative acknowledgment received for data file {}, retrying", dataFilePartition.getKey());
    sourceCoordinator.giveUpPartition(dataFilePartition);
Contributor (@san81) commented Aug 26, 2024:

It's just me trying to understand the code here, not a real comment about the code.

I would assume the retry on this partition, and keeping track of the retry count, is handled by the sourceCoordinator?

Collaborator Author (@oeyh) replied:

Yes. Once the data file partition is given up, it becomes available to acquire again, and the file load will be retried by whichever node picks up the partition.
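
Putting the thread together, a hedged sketch of the full acknowledgment callback around the diff above; the error counter, log message, and giveUpPartition call mirror the diff, while the success branch and the completePartition call are assumptions about the surrounding code:

    acknowledgementSetManager.create(success -> {
        if (success) {
            sourceCoordinator.completePartition(dataFilePartition);  // assumed success path
        } else {
            exportFileErrorCounter.increment();
            LOG.warn("Negative acknowledgment received for data file {}, retrying", dataFilePartition.getKey());
            // Giving up the partition returns it to the coordinator's pool, so any
            // node (including this one) can acquire it again and retry the load.
            sourceCoordinator.giveUpPartition(dataFilePartition);
        }
    }, acknowledgmentTimeout);  // timeout variable name is illustrative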

@oeyh merged commit d6465ef into opensearch-project:main on Aug 26, 2024
42 of 47 checks passed