Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Segment Replication] Fix for AlreadyClosedException for engine #4743

Merged
merged 2 commits into from
Nov 7, 2022

Conversation

Poojita-Raj
Copy link
Contributor

@Poojita-Raj Poojita-Raj commented Oct 12, 2022

Signed-off-by: Poojita Raj [email protected]

Description

Engine AlreadyClosedException exception takes place when replication is finalized on replica shard and when translog generation rollover takes place at which point we see the engine has already been closed.

We want to make sure we only update the segments when the engine is open. Added an integration test for the same.

Issues Resolved

Resolves #4530

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Poojita-Raj Poojita-Raj requested review from a team and reta as code owners October 12, 2022 00:52
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Copy link
Member

@mch2 mch2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite following some of the test logic. Could you describe a bit more in your commit message when/why the exception occurs and how this fixes it? From reading this and the original issue, it seems this happens when a shard/node is shut down while a replica is currently fetching new segments.

client().prepareIndex(INDEX_NAME).setId("1").setSource("foo", "bar").setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE).get();
refresh(INDEX_NAME);

final int initialDocCount = scaledRandomIntBetween(10000, 200000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a huge number of docs for an integ test. Given the test waits until replicas catch up to proceed with assertions, can we get by with 100-200?

) {
indexer.start(initialDocCount);
waitForDocs(initialDocCount, indexer);
refresh(INDEX_NAME);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we flush here and avoid the flush & refresh on line 266?

logger.info("--> Closing the index ");
client().admin().indices().prepareClose(INDEX_NAME).get();

// Add another node to kick off TransportNodesListGatewayStartedShards which fetches latestReplicationCheckpoint for SegRep enabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite following why we need to start 3rd node for this test? Before this step you have 2 nodes, with 1 primary & one replica shard & close the index. Are you wanting to test if the 3rd node is allocated one of the shards after its opened?

final long incomingGeneration = infos.getGeneration();
readerManager.updateSegments(infos);

// Commit and roll the xlog when we receive a different generation than what was last received.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we use translog instead of xlog?

Signed-off-by: Poojita Raj <[email protected]>
@Poojita-Raj Poojita-Raj force-pushed the alreadyClosed branch 2 times, most recently from 1b9790c to eeeae29 Compare November 4, 2022 22:30
@github-actions
Copy link
Contributor

github-actions bot commented Nov 4, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Nov 4, 2022

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter
Copy link

codecov-commenter commented Nov 4, 2022

Codecov Report

Merging #4743 (1d4a6b4) into main (082f059) will decrease coverage by 0.05%.
The diff coverage is 100.00%.

❗ Current head 1d4a6b4 differs from pull request most recent head b1b064d. Consider uploading reports for the commit b1b064d to get more accurate results

@@             Coverage Diff              @@
##               main    #4743      +/-   ##
============================================
- Coverage     70.97%   70.91%   -0.06%     
+ Complexity    58172    58127      -45     
============================================
  Files          4708     4708              
  Lines        277556   277559       +3     
  Branches      40189    40189              
============================================
- Hits         196992   196836     -156     
- Misses        64462    64608     +146     
- Partials      16102    16115      +13     
Impacted Files Coverage Δ
.../opensearch/index/engine/NRTReplicationEngine.java 78.16% <100.00%> (+0.31%) ⬆️
.../indices/replication/SegmentReplicationTarget.java 91.26% <100.00%> (+0.08%) ⬆️
...java/org/opensearch/client/indices/DataStream.java 0.00% <0.00%> (-76.09%) ⬇️
.../opensearch/client/indices/CloseIndexResponse.java 17.50% <0.00%> (-73.75%) ⬇️
...a/org/opensearch/client/cluster/ProxyModeInfo.java 0.00% <0.00%> (-55.00%) ⬇️
.../org/opensearch/client/indices/AnalyzeRequest.java 27.00% <0.00%> (-51.00%) ⬇️
.../java/org/opensearch/search/dfs/AggregatedDfs.java 51.61% <0.00%> (-45.17%) ⬇️
...opensearch/persistent/PersistentTasksExecutor.java 22.22% <0.00%> (-44.45%) ⬇️
.../admin/cluster/reroute/ClusterRerouteResponse.java 60.00% <0.00%> (-40.00%) ⬇️
...ava/org/opensearch/search/dfs/DfsSearchResult.java 47.87% <0.00%> (-39.37%) ⬇️
... and 479 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

client().admin().indices().prepareClose(INDEX_NAME).get();

logger.info("--> Opening the index");
client().admin().indices().prepareOpen(INDEX_NAME).get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add assertion of doc count after reopening the index ?

CHANGELOG.md Outdated
@@ -210,6 +210,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Fix version check for 2.x release for awareness attribute decommission([#5034](https://github.com/opensearch-project/OpenSearch/pull/5034))
- Fix flaky test ResourceAwareTasksTests on Windows ([#5077](https://github.com/opensearch-project/OpenSearch/pull/5077))
- Length calculation for block based fetching ([#5055](https://github.com/opensearch-project/OpenSearch/pull/5055))
- Fix for AlreadyClosedException for engine ([#4743](https://github.com/opensearch-project/OpenSearch/pull/4743))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - Pls specify this as a segrep only issue.

@Poojita-Raj Poojita-Raj changed the title Fix for AlreadyClosedException for engine [Segment Replication] Fix for AlreadyClosedException for engine Nov 7, 2022
Signed-off-by: Poojita Raj <[email protected]>
@github-actions
Copy link
Contributor

github-actions bot commented Nov 7, 2022

Gradle Check (Jenkins) Run Completed with:

@Poojita-Raj Poojita-Raj merged commit 37d1eba into opensearch-project:main Nov 7, 2022
@Rishikesh1159 Rishikesh1159 added the backport 2.x Backport to 2.x branch label Nov 7, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Nov 7, 2022
* alreadyClosedExceptionFix

Signed-off-by: Poojita Raj <[email protected]>

* adding changelog entry

Signed-off-by: Poojita Raj <[email protected]>

Signed-off-by: Poojita Raj <[email protected]>
(cherry picked from commit 37d1eba)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

# Conflicts:
#	CHANGELOG.md
@github-actions
Copy link
Contributor

github-actions bot commented Nov 7, 2022

Gradle Check (Jenkins) Run Completed with:

Poojita-Raj pushed a commit that referenced this pull request Nov 9, 2022
… (#5116)

* alreadyClosedExceptionFix

Signed-off-by: Poojita Raj <[email protected]>

* adding changelog entry

Signed-off-by: Poojita Raj <[email protected]>

Signed-off-by: Poojita Raj <[email protected]>
(cherry picked from commit 37d1eba)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

# Conflicts:
#	CHANGELOG.md

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] [Segment Replication] Flaky engine AlreadyClosedException exception on closed index action
5 participants