Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SegmentReplicationIT.testReplicaHasDiffFilesThanPrimary for node-node replication #8912

Merged
merged 2 commits into from
Jul 27, 2023

Conversation

mch2
Copy link
Member

@mch2 mch2 commented Jul 27, 2023

Description

This test is now failing for node-node replication and is a bug. On the primary shard the prepareSegmentReplication method should cancel any ongoing replication if it is running and start a new sync. This is incorrectly using Map.compute which will cancel the existing handler but not replace its entry in the allocationIdToHandlers map. As a result this can leave the copyState map with an entry which holds an index commit while the test is cleaning up. The copyState is only cleared when a handler is cancelled directly or from a cluster state update.

Related Issues

Resolves #7643

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Copy link
Member Author

mch2 commented Jul 27, 2023

Gradle Check (Jenkins) Run Completed with:

#8213

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@dreamer-89
Copy link
Member

Gradle Check (Jenkins) Run Completed with:

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':distribution:bwc:staged:buildBwcLinuxTar'.
> Building 2.9.0 didn't generate expected file /var/jenkins/workspace/gradle-check/search/distribution/bwc/staged/build/bwc/checkout-2.9/distribution/archives/linux-tar/build/distributions/opensearch-min-2.9.0-SNAPSHOT-linux-x64.tar.gz

@Poojita-Raj
Copy link
Contributor

Could we also start specifying number of successful runs we did for flaky tests before merging them in for confidence?

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@dreamer-89
Copy link
Member

Gradle Check (Jenkins) Run Completed with:

ResourceAwareTasksTests.testBasicTaskResourceTracking test failure. #8213

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=140, name=opensearch[TransportTasksActionTests][search][T#1], state=RUNNABLE, group=TGRP-ResourceAwareTasksTests]
	at __randomizedtesting.SeedInfo.seed([1BC85B1C2F3D6C40:136F92396C1BFBE5]:0)
Caused by: java.lang.AssertionError: 
Expected: a value less than or equal to <8200000L>
     but: <8280416L> was greater than <8200000L>
	at __randomizedtesting.SeedInfo.seed([1BC85B1C2F3D6C40]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:964)
	at org.junit.Assert.assertThat(Assert.java:930)
	at org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.assertMemoryUsageWithinLimits(ResourceAwareTasksTests.java:677)
	at 

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

mch2 added 2 commits July 27, 2023 11:29
This test is now failing for node-node replication. On the primary shard the prepareSegmentReplication method should cancel any ongoing replication if it is running and start a new sync.  Thisis incorrectly using Map.compute which will not replace the existing handler entry in the allocationIdToHandlers map. It will only cancel the existing source handler. As a result this can leave the copyState map with an entry and hold an index commit while the test is cleaning up.  The copyState is only cleared when a handler is cancelled directly or from a cluster state update.

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@dreamer-89
Copy link
Member

Gradle Check (Jenkins) Run Completed with:

Not sure of failure, retrying again.

Removing login credentials for https://index.docker.io/v1/
+ docker image prune -f --all
Total reclaimed space: 0B
[Pipeline] }
[Pipeline] // script
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // timeout
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
ERROR: script returned exit code 128
Finished: FAILURE

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov
Copy link

codecov bot commented Jul 27, 2023

Codecov Report

Merging #8912 (dd3ad7a) into main (91bc891) will decrease coverage by 0.10%.
The diff coverage is 81.81%.

@@             Coverage Diff              @@
##               main    #8912      +/-   ##
============================================
- Coverage     71.05%   70.95%   -0.10%     
+ Complexity    57198    57160      -38     
============================================
  Files          4758     4758              
  Lines        269863   269869       +6     
  Branches      39484    39487       +3     
============================================
- Hits         191755   191491     -264     
- Misses        61982    62236     +254     
- Partials      16126    16142      +16     
Files Changed Coverage Δ
...ndices/replication/OngoingSegmentReplications.java 90.47% <81.81%> (-1.84%) ⬇️

... and 453 files with indirect coverage changes

@mch2
Copy link
Member Author

mch2 commented Jul 27, 2023

Could we also start specifying number of successful runs we did for flaky tests before merging them in for confidence?

FWIW I've run this ~500 times locally and don't see it failing in other ways, also have a unit for the specific way this was failing.

@mch2
Copy link
Member Author

mch2 commented Jul 27, 2023

Precommit is failing here on windows with:

> Task :modules:analysis-common:jarHell FAILED
| Output for C:\hostedtoolcache\windows\Java_Adopt_jdk\11.0.20-8\x64\bin\java.exe:OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000090000000, 117440512, 0) failed; error='The paging file is too small for this operation to complete' (DOS error/errno=1455)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 117440512 bytes for Failed to commit area from 0x0000000090000000 to 0x0000000097000000 of length 117440512.
# An error report file with more information is saved as:
# D:\a\OpenSearch\OpenSearch\modules\analysis-common\hs_err_pid3688.log


FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':modules:analysis-common:jarHell'.
> Process 'command 'C:\hostedtoolcache\windows\Java_Adopt_jdk\11.0.20-8\x64\bin\java.exe'' finished with non-zero exit value 1

@mch2 mch2 merged commit 4ad4182 into opensearch-project:main Jul 27, 2023
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Jul 27, 2023
@mch2 mch2 deleted the n2ndiff branch July 27, 2023 22:17
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 27, 2023
…node replication (#8912)

* Fix SegmentReplicationIT.testReplicahasDiffFilesThanPrimary

This test is now failing for node-node replication. On the primary shard the prepareSegmentReplication method should cancel any ongoing replication if it is running and start a new sync.  Thisis incorrectly using Map.compute which will not replace the existing handler entry in the allocationIdToHandlers map. It will only cancel the existing source handler. As a result this can leave the copyState map with an entry and hold an index commit while the test is cleaning up.  The copyState is only cleared when a handler is cancelled directly or from a cluster state update.

Signed-off-by: Marc Handalian <[email protected]>

* PR feedback.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit 4ad4182)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mch2 pushed a commit that referenced this pull request Jul 27, 2023
…node replication (#8912) (#8936)

* Fix SegmentReplicationIT.testReplicahasDiffFilesThanPrimary

This test is now failing for node-node replication. On the primary shard the prepareSegmentReplication method should cancel any ongoing replication if it is running and start a new sync.  Thisis incorrectly using Map.compute which will not replace the existing handler entry in the allocationIdToHandlers map. It will only cancel the existing source handler. As a result this can leave the copyState map with an entry and hold an index commit while the test is cleaning up.  The copyState is only cleared when a handler is cancelled directly or from a cluster state update.



* PR feedback.



---------


(cherry picked from commit 4ad4182)

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
baba-devv pushed a commit to baba-devv/OpenSearch that referenced this pull request Jul 28, 2023
…node replication (opensearch-project#8912)

* Fix SegmentReplicationIT.testReplicahasDiffFilesThanPrimary

This test is now failing for node-node replication. On the primary shard the prepareSegmentReplication method should cancel any ongoing replication if it is running and start a new sync.  Thisis incorrectly using Map.compute which will not replace the existing handler entry in the allocationIdToHandlers map. It will only cancel the existing source handler. As a result this can leave the copyState map with an entry and hold an index commit while the test is cleaning up.  The copyState is only cleared when a handler is cancelled directly or from a cluster state update.

Signed-off-by: Marc Handalian <[email protected]>

* PR feedback.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
baba-devv pushed a commit to baba-devv/OpenSearch that referenced this pull request Jul 29, 2023
…node replication (opensearch-project#8912)

* Fix SegmentReplicationIT.testReplicahasDiffFilesThanPrimary

This test is now failing for node-node replication. On the primary shard the prepareSegmentReplication method should cancel any ongoing replication if it is running and start a new sync.  Thisis incorrectly using Map.compute which will not replace the existing handler entry in the allocationIdToHandlers map. It will only cancel the existing source handler. As a result this can leave the copyState map with an entry and hold an index commit while the test is cleaning up.  The copyState is only cleared when a handler is cancelled directly or from a cluster state update.

Signed-off-by: Marc Handalian <[email protected]>

* PR feedback.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
…node replication (opensearch-project#8912)

* Fix SegmentReplicationIT.testReplicahasDiffFilesThanPrimary

This test is now failing for node-node replication. On the primary shard the prepareSegmentReplication method should cancel any ongoing replication if it is running and start a new sync.  Thisis incorrectly using Map.compute which will not replace the existing handler entry in the allocationIdToHandlers map. It will only cancel the existing source handler. As a result this can leave the copyState map with an entry and hold an index commit while the test is cleaning up.  The copyState is only cleared when a handler is cancelled directly or from a cluster state update.

Signed-off-by: Marc Handalian <[email protected]>

* PR feedback.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Kaushal Kumar <[email protected]>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
…node replication (opensearch-project#8912)

* Fix SegmentReplicationIT.testReplicahasDiffFilesThanPrimary

This test is now failing for node-node replication. On the primary shard the prepareSegmentReplication method should cancel any ongoing replication if it is running and start a new sync.  Thisis incorrectly using Map.compute which will not replace the existing handler entry in the allocationIdToHandlers map. It will only cancel the existing source handler. As a result this can leave the copyState map with an entry and hold an index commit while the test is cleaning up.  The copyState is only cleared when a handler is cancelled directly or from a cluster state update.

Signed-off-by: Marc Handalian <[email protected]>

* PR feedback.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Ivan Brusic <[email protected]>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…node replication (opensearch-project#8912)

* Fix SegmentReplicationIT.testReplicahasDiffFilesThanPrimary

This test is now failing for node-node replication. On the primary shard the prepareSegmentReplication method should cancel any ongoing replication if it is running and start a new sync.  Thisis incorrectly using Map.compute which will not replace the existing handler entry in the allocationIdToHandlers map. It will only cancel the existing source handler. As a result this can leave the copyState map with an entry and hold an index commit while the test is cleaning up.  The copyState is only cleared when a handler is cancelled directly or from a cluster state update.

Signed-off-by: Marc Handalian <[email protected]>

* PR feedback.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] org.opensearch.remotestore.SegmentReplicationRemoteStoreIT.testReplicaHasDiffFilesThanPrimary is flaky
3 participants