Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes shrink action shard allocation checks #335

Merged
merged 1 commit into from
Apr 18, 2022

Conversation

downsrob
Copy link
Contributor

@downsrob downsrob commented Apr 18, 2022

Signed-off-by: Clay Downs [email protected]

Issue #, if available:
NA

Description of changes:
There were a few issues with Shrink shard allocation which would cause failures when shrinking indices with replicas:

  • During shard allocation dry run, a move shard request was initiated for every shard not present on the target node. This could mean a primary shard and multiple replicas. When multiple requests for the same shard are initiated at one time, the dry run will fail with explanation=a copy of this shard is already allocated to this node. To fix this, this PR only creates an allocation dry run request for one random shard copy for each shard.
  • During shard allocation dry run, allocation requests were still created for shards which had a copy on the target node already, because of this the dry run would fail with explanation=a copy of this shard is already allocated to this node. To fix this, this PR checks if any copy of the shard is present on the target node to initiate a request.
  • While waiting for shard allocation to finish, we previously only checked that primary shards were on the target node, meaning that a replica shard being on the target node instead would not allow shrink. This PR fixes that behavior by checking if any copy of each shard is on the target node and started.

CheckList:

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@downsrob downsrob marked this pull request as ready for review April 18, 2022 17:51
@downsrob downsrob requested review from a team, bowenlan-amzn and thalurur April 18, 2022 17:51
@codecov-commenter
Copy link

Codecov Report

Merging #335 (00ad17e) into main (ab1e821) will decrease coverage by 0.24%.
The diff coverage is 45.00%.

@@             Coverage Diff              @@
##               main     #335      +/-   ##
============================================
- Coverage     75.53%   75.28%   -0.25%     
+ Complexity     2134     2111      -23     
============================================
  Files           260      260              
  Lines         12332    12341       +9     
  Branches       1937     1942       +5     
============================================
- Hits           9315     9291      -24     
- Misses         1963     1989      +26     
- Partials       1054     1061       +7     
Impacted Files Coverage Δ
...atemanagement/step/shrink/WaitForMoveShardsStep.kt 36.95% <28.57%> (ø)
...atemanagement/step/shrink/AttemptMoveShardsStep.kt 55.55% <40.00%> (+0.21%) ⬆️
...xmanagement/indexstatemanagement/util/StepUtils.kt 75.90% <62.50%> (-1.43%) ⬇️
...nt/indexstatemanagement/model/destination/Chime.kt 43.47% <0.00%> (-17.40%) ⬇️
...ent/rollup/action/explain/ExplainRollupResponse.kt 76.19% <0.00%> (-9.53%) ⬇️
...gement/transform/model/ContinuousTransformStats.kt 66.66% <0.00%> (-7.41%) ⬇️
...gement/indexstatemanagement/action/ShrinkAction.kt 63.23% <0.00%> (-5.89%) ⬇️
...dexmanagement/transform/model/TransformMetadata.kt 86.40% <0.00%> (-5.83%) ⬇️
...anagement/indexstatemanagement/model/Transition.kt 61.64% <0.00%> (-4.11%) ⬇️
...arch/indexmanagement/rollup/RollupSearchService.kt 61.11% <0.00%> (-3.71%) ⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ab1e821...00ad17e. Read the comment docs.

@downsrob
Copy link
Contributor Author

Test and build workflow CI failure was due to unrelated flaky test

var numShardsInSync = 0
for (shard: ShardStats in response.shards) {
val routingInfo = shard.shardRouting
val nodeIdShardIsOn = routingInfo.currentNodeId()
val nodeNameShardIsOn = context.clusterService.state().nodes()[nodeIdShardIsOn].name
if (nodeNameShardIsOn.equals(nodeToMoveOnto) && routingInfo.started()) {
shardIdOnNode[shard.shardRouting.id] = true
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put a note here, we want to extract below if (routingInfo.primary()) block into a function checkReplicaInSync for readability.

@downsrob downsrob merged commit 21cfcea into opensearch-project:main Apr 18, 2022
@downsrob downsrob deleted the shrink-fix branch April 18, 2022 18:42
wuychn pushed a commit to ochprince/index-management that referenced this pull request Mar 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants