Refactoring checkpoint publishing to no longer be a Broadcast Action #2591

Member

@kartg kartg commented Mar 24, 2022

Description

This change adopts the same methodology as RetentionLeaseSyncer. A Transport client action is not required since we can skip the step of routing to the primary shard (the checkpoint publisher is guaranteed to fire only from the primary). Now, checkpoint publishing directly invokes a TransportReplicationAction to perform the operation on the primary and replicas. PublishCheckpointAction has been reworked to be the TransportReplicationAction implementation rather than an ActionType.

We leverage dependency injection to create the checkpoint publisher (and its internal action) at IndicesClusterStateService. This is plumbed through to IndexShard:

IndicesClusterStateService -> IndicesService -> IndexService -> IndexShard

IndexShard creates the refresh listener instance. All other transport layer classes tied to the original broadcast action are no longer required.

Unrelated integration tests use a no-op/empty checkpoint publisher to satisfy their constructor/method argument.
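
For readers unfamiliar with the pattern, here is a minimal sketch of how a publisher can be passed down by constructor injection, with a no-op instance for tests. Every name in it (`SketchCheckpointPublisher`, `EMPTY`, `SketchShard`, `PublisherPlumbingDemo`) is a hypothetical stand-in invented for illustration and is not one of the classes in this PR; the intermediate services in the chain above are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the checkpoint publisher: wraps whatever sends the checkpoint.
class SketchCheckpointPublisher {
    // No-op instance for tests that only need to satisfy a constructor argument.
    static final SketchCheckpointPublisher EMPTY = new SketchCheckpointPublisher(version -> {});

    private final Consumer<Long> sender;

    SketchCheckpointPublisher(Consumer<Long> sender) {
        this.sender = sender;
    }

    void publish(long checkpointVersion) {
        sender.accept(checkpointVersion);
    }
}

// Hypothetical stand-in for the shard at the end of the plumbing chain.
class SketchShard {
    private final SketchCheckpointPublisher publisher;

    SketchShard(SketchCheckpointPublisher publisher) {
        this.publisher = publisher;
    }

    void onNewCheckpoint(long version) {
        publisher.publish(version);
    }
}

class PublisherPlumbingDemo {
    // Returns the checkpoint versions that were actually sent.
    static List<Long> run() {
        List<Long> sent = new ArrayList<>();
        // Production-style wiring: the publisher is created once and passed down.
        new SketchShard(new SketchCheckpointPublisher(sent::add)).onNewCheckpoint(7L);
        // Test-style wiring: the no-op publisher sends nothing.
        new SketchShard(SketchCheckpointPublisher.EMPTY).onNewCheckpoint(8L);
        return sent;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Only the first shard's checkpoint ends up in the list; the shard built with `EMPTY` behaves the same from the caller's point of view but publishes nothing.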

Signed-off-by: Kartik Ganesh [email protected]

Issues Resolved

closes #2199

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kartg kartg requested a review from a team as a code owner March 24, 2022 21:36
@mch2 mch2 self-requested a review March 24, 2022 21:49
@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 053ea1498474eec1a475e0e897815cb3b06cd078
Log 3742

Reports 3742

Member

@mch2 mch2 left a comment

A few nits but otherwise lgtm. Thanks for doing this!

@@ -38,8 +38,8 @@ public void beforeRefresh() throws IOException {

     @Override
     public void afterRefresh(boolean didRefresh) throws IOException {
-        if (shard.routingEntry().primary()) {
-            publisher.publish(shard.getLatestReplicationCheckpoint());
+        if (shard.routingEntry().primary() && shard.indexSettings().isSegrepEnabled()) {
Member

We conditionally wire up the refreshListener in IndexShard if segrep is enabled and the shard is a primary, so I'm not sure we need these checks anymore?

Member Author

Sure, I can remove this. Follow-up question: should we be checking didRefresh here before publishing the checkpoint?

Member

Good call, yes, I think we should. Otherwise we could make multiple calls for the same ReplicationCheckpoint.
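
A minimal sketch of the `didRefresh` guard discussed in this thread, assuming a simplified listener. `SketchRefreshListener` and `DidRefreshDemo` are hypothetical names, and the checkpoint is reduced to a plain `long` version; this is not the listener changed in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical, simplified refresh listener; not the class changed in this PR.
class SketchRefreshListener {
    private final LongSupplier latestCheckpoint;
    private final LongConsumer publisher;

    SketchRefreshListener(LongSupplier latestCheckpoint, LongConsumer publisher) {
        this.latestCheckpoint = latestCheckpoint;
        this.publisher = publisher;
    }

    void afterRefresh(boolean didRefresh) {
        // Skip publishing when nothing was refreshed, so the same
        // checkpoint is not sent to replicas more than once.
        if (didRefresh) {
            publisher.accept(latestCheckpoint.getAsLong());
        }
    }
}

class DidRefreshDemo {
    // Returns the checkpoint versions that were published.
    static List<Long> run() {
        List<Long> published = new ArrayList<>();
        long[] version = {1L};
        SketchRefreshListener listener =
            new SketchRefreshListener(() -> version[0], published::add);

        listener.afterRefresh(true);   // new segments: publish version 1
        listener.afterRefresh(false);  // no-op refresh: publish nothing
        version[0] = 2L;
        listener.afterRefresh(true);   // new segments: publish version 2
        return published;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Without the `didRefresh` check, the second call would publish version 1 again, which is the duplicate call the comment above warns about.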

This change adopts the same methodology as RetentionLeaseSyncer. A Transport client action is not required since we can skip the step of routing to the primary shard (the checkpoint publisher is guaranteed to fire only from the primary). Now, checkpoint publishing directly invokes a TransportReplicationAction to perform the operation on the primary and replicas. PublishCheckpointAction has been reworked to be the TransportReplicationAction implementation rather than an ActionType.

We leverage dependency injection to create the checkpoint publisher (and its internal action) at IndicesClusterStateService. This is plumbed through to IndexShard, which creates the refresh listener instance. All other transport layer classes tied to the original broadcast action are no longer required.

Unrelated integration tests use a no-op/empty checkpoint publisher to satisfy their constructor/method argument.

Signed-off-by: Kartik Ganesh <[email protected]>
@kartg kartg force-pushed the feature/segment-replication branch from 053ea14 to 03bb12d Compare March 25, 2022 00:19

    @Override
    protected void doExecute(Task task, PublishCheckpointRequest request, ActionListener<ReplicationResponse> listener) {
        assert false : "use PublishCheckpointAction#publish";
Contributor

What's the purpose of this function if it throws an assertion error every time?

Member Author

Yeah, the implementation here isn't the clearest. What's going on is:

  1. To allow a primary shard to publish the checkpoint to its replicas, we're leveraging the ReplicationOperation functionality of executing on a primary followed by its replicas.
  2. In order to run a ReplicationOperation, we need to register an action name with the transport layer. TransportReplicationAction (the superclass here) does this in its constructor.
  3. However, TransportReplicationAction also assumes that it will be called via execute/doExecute, which includes rerouting behavior. We don't need the reroute phase here since we are guaranteed to be on the primary shard.
  4. Thus, the purpose of this implementation is to guard against accidental execution via the above code path. Instead, the refresh listener invokes the publish API directly.
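
The four points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `SketchReplicationAction`, `SketchPublishAction`, the action name string, and the demo class are all hypothetical, and the guard throws `UnsupportedOperationException` instead of using `assert false` so that the behaviour does not depend on the JVM running with `-ea`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class standing in for TransportReplicationAction.
abstract class SketchReplicationAction {
    final String actionName;

    SketchReplicationAction(String actionName) {
        // Stand-in for registering the action name with the transport layer.
        this.actionName = actionName;
    }

    // Generic entry point; in the real superclass this path includes a reroute phase.
    final void execute(String request) {
        doExecute(request);
    }

    abstract void doExecute(String request);
}

class SketchPublishAction extends SketchReplicationAction {
    final List<String> log = new ArrayList<>();

    SketchPublishAction() {
        super("sketch/publish_checkpoint"); // hypothetical action name
    }

    @Override
    void doExecute(String request) {
        // Guard against accidental use of the generic (rerouting) code path.
        throw new UnsupportedOperationException("use SketchPublishAction#publish");
    }

    // Direct API: the caller is already on the primary, so no rerouting is needed.
    void publish(String checkpoint) {
        log.add("primary:" + checkpoint);
        log.add("replicas:" + checkpoint);
    }
}

class GuardedEntryPointDemo {
    static boolean genericPathRejected() {
        try {
            new SketchPublishAction().execute("cp-1");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    static List<String> directPath() {
        SketchPublishAction action = new SketchPublishAction();
        action.publish("cp-1");
        return action.log;
    }

    public static void main(String[] args) {
        System.out.println(genericPathRejected());
        System.out.println(directPath());
    }
}
```

The subclass still inherits the constructor-time registration, but its only usable entry point is `publish`, which runs on the primary first and then on the replicas.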

@kartg kartg merged commit e569e6f into opensearch-project:feature/segment-replication Mar 25, 2022
@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 03bb12d
Log 3747

Reports 3747
