[CI] SmokeTestMultiNodeClientYamlTestSuiteIT class failing #119191

Open
elasticsearchmachine opened this issue Dec 21, 2024 · 6 comments
Labels
blocker :StorageEngine/Logs You know, for Logs Team:StorageEngine >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Dec 21, 2024

Build Scans:

Reproduction Line:

./gradlew ":qa:smoke-test-multinode:yamlRestTest" --tests "org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT" -Dtests.method="test {yaml=indices.open/10_basic/?wait_for_active_shards=index-setting is removed}" -Dtests.seed=13E7BBAF2808FC2 -Dtests.locale=yo-NG -Dtests.timezone=Africa/Casablanca -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.io.UncheckedIOException: java.net.ConnectException: Connection refused

Issue Reasons:

  • [main] 26 failures in class org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT (3.7% fail rate in 703 executions)
  • [main] 2 failures in step rocky-9_platform-support-unix (14.3% fail rate in 14 executions)
  • [main] 2 failures in step openjdk21_checkpart1_java-fips-matrix (13.3% fail rate in 15 executions)
  • [main] 15 failures in step part-1 (5.9% fail rate in 253 executions)
  • [main] 2 failures in step part1 (2.0% fail rate in 98 executions)
  • [main] 3 failures in pipeline elasticsearch-periodic-platform-support (20.0% fail rate in 15 executions)
  • [main] 3 failures in pipeline elasticsearch-periodic (20.0% fail rate in 15 executions)
  • [main] 15 failures in pipeline elasticsearch-pull-request (5.9% fail rate in 256 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (2.0% fail rate in 98 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :StorageEngine/Logs You know, for Logs >test-failure Triaged test failures from CI labels Dec 21, 2024
elasticsearchmachine added a commit that referenced this issue Dec 21, 2024
…eIT org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT #119191
@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 16 failures in class org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT (4.1% fail rate in 390 executions)
  • [8.x] 2 failures in step oraclelinux-9_platform-support-unix (22.2% fail rate in 9 executions)
  • [8.x] 2 failures in step openjdk17_checkpart1_java-fips-matrix (22.2% fail rate in 9 executions)
  • [8.x] 2 failures in step part1 (5.0% fail rate in 40 executions)
  • [8.x] 7 failures in step part-1 (12.3% fail rate in 57 executions)
  • [8.x] 2 failures in pipeline elasticsearch-periodic-platform-support (22.2% fail rate in 9 executions)
  • [8.x] 3 failures in pipeline elasticsearch-periodic (33.3% fail rate in 9 executions)
  • [8.x] 2 failures in pipeline elasticsearch-intake (5.0% fail rate in 40 executions)
  • [8.x] 7 failures in pipeline elasticsearch-pull-request (12.7% fail rate in 55 executions)

Build Scans:

@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine elasticsearchmachine added the needs:risk Requires assignment of a risk label (low, medium, blocker) label Dec 21, 2024
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 26 failures in class org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT (3.7% fail rate in 703 executions)
  • [main] 2 failures in step rocky-9_platform-support-unix (14.3% fail rate in 14 executions)
  • [main] 2 failures in step openjdk21_checkpart1_java-fips-matrix (13.3% fail rate in 15 executions)
  • [main] 15 failures in step part-1 (5.9% fail rate in 253 executions)
  • [main] 2 failures in step part1 (2.0% fail rate in 98 executions)
  • [main] 3 failures in pipeline elasticsearch-periodic-platform-support (20.0% fail rate in 15 executions)
  • [main] 3 failures in pipeline elasticsearch-periodic (20.0% fail rate in 15 executions)
  • [main] 15 failures in pipeline elasticsearch-pull-request (5.9% fail rate in 256 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (2.0% fail rate in 98 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Dec 22, 2024
…eIT org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT #119191
@kkrik-es
Contributor

@martijnvg this is still failing; I think it's still indices.create/20_synthetic_source/create index with use_synthetic_source. Wanna take a look?

@martijnvg
Member

This is a different assertion failing, and I suspect it is related to synthetic recovery source:

[2024-12-19T21:33:56,704][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] fatal error in thread [elasticsearch[test-cluster-0][generic][T#3]], exiting
java.lang.AssertionError: seqNo [0] was processed twice in generation [2], with different data. prvOp [Index{id='1', seqNo=0, primaryTerm=1, version=1, autoGeneratedIdTimestamp=-1}], newOp [Index{id='1', seqNo=0, primaryTerm=1, version=1, autoGeneratedIdTimestamp=-1}]
	at org.elasticsearch.index.translog.TranslogWriter.assertNoSeqNumberConflict(TranslogWriter.java:308) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.translog.TranslogWriter.add(TranslogWriter.java:259) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.translog.Translog.add(Translog.java:628) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:1233) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:1085) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1011) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:2027) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:2014) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$indexTranslogOperations$4(RecoveryTarget.java:453) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:356) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:428) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.performTranslogOps(PeerRecoveryTargetService.java:649) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.handleRequest(PeerRecoveryTargetService.java:596) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.handleRequest(PeerRecoveryTargetService.java:588) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRequestHandler.messageReceived(PeerRecoveryTargetService.java:682) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRequestHandler.messageReceived(PeerRecoveryTargetService.java:669) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:90) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.transport.InboundHandler.doHandleRequest(InboundHandler.java:289) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:302) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1044) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1575) ~[?:?]
Caused by: java.lang.RuntimeException: stack capture previous op
	at org.elasticsearch.index.translog.TranslogWriter.assertNoSeqNumberConflict(TranslogWriter.java:315) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.translog.TranslogWriter.add(TranslogWriter.java:259) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.translog.Translog.add(Translog.java:628) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:1233) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:1085) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1011) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:952) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:684) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:661) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.lambda$dispatchedShardOperationOnReplica$5(TransportShardBulkAction.java:617) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:356) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnReplica(TransportShardBulkAction.java:615) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnReplica(TransportShardBulkAction.java:80) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$2.doRun(TransportWriteAction.java:248) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	... 5 more

Let me mute the two tests that use synthetic recovery source.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Dec 23, 2024
@martijnvg
Member

I suspect the TranslogWriter#assertNoSeqNumberConflict(...) assertion trips because during document replication we use the source as it was provided via the index or bulk API, but during shard recovery we use synthetic source, and at the byte level this isn't the same as what was provided at index time.

This is expected with synthetic source. For example, if the original source contained whitespace, that whitespace doesn't exist in the version of the document generated by synthetic source. Field names are re-ordered as well, and depending on synthetic source settings, duplicate array values don't appear in the source generated by synthetic source.
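
To make the byte-level mismatch concrete, here is a minimal Python sketch of the situation described above. It is not Elasticsearch code: `ToyTranslogWriter`, `render_like_synthetic_source`, and `replay_conflicts` are hypothetical names, and re-serialising the JSON with sorted keys and no whitespace is only an assumed approximation of what synthetic source produces.

```python
import json


class ToyTranslogWriter:
    """Toy stand-in for one translog generation; not the real TranslogWriter."""

    def __init__(self):
        self._seen = {}  # seq_no -> raw source bytes of the first operation seen

    def add(self, seq_no, source):
        # Mimics the idea of the check: the same seq_no may be added twice,
        # but only if the bytes are identical.
        previous = self._seen.get(seq_no)
        if previous is not None and previous != source:
            raise AssertionError(
                f"seqNo [{seq_no}] was processed twice with different data"
            )
        self._seen[seq_no] = source


def render_like_synthetic_source(original):
    """Assumed approximation: drop whitespace and re-order field names."""
    parsed = json.loads(original)
    return json.dumps(parsed, sort_keys=True, separators=(",", ":")).encode()


def replay_conflicts(original):
    """Return True if adding the re-rendered source for the same seq_no trips the check."""
    writer = ToyTranslogWriter()
    writer.add(0, original)  # replication path: source as provided by the client
    try:
        writer.add(0, render_like_synthetic_source(original))  # recovery path
    except AssertionError:
        return True
    return False
```

With an original source such as `b'{ "b": 1, "a": 2 }'`, the second `add` carries the same document under the same sequence number but with different bytes, so the toy check fails in the same way the assertion in the stack trace does.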

I don't know yet what change needs to be made to address this assertion failure. Should the assertion be changed to take synthetic source into account? Or something else, like replicating the source generated by synthetic source to replicas instead of the original source provided via the index or bulk API?
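
As a rough illustration of the first option only, the Python sketch below compares two sources structurally instead of byte by byte. It is an assumption about what "taking synthetic source into account" could mean, not a proposed or actual change; `same_operation` is a hypothetical helper.

```python
import json


def same_operation(prev_source: bytes, new_source: bytes) -> bool:
    """Treat two sources as the same operation if they parse to equal JSON.

    Whitespace and field order are ignored because Python dict equality
    does not depend on key order. Array contents are still compared
    element by element.
    """
    if prev_source == new_source:
        return True  # fast path: identical bytes
    return json.loads(prev_source) == json.loads(new_source)
```

A comparison like this would accept `b'{ "b": 1, "a": 2 }'` and `b'{"a":2,"b":1}'` as the same operation, but it would still report a difference when duplicate array values have been dropped, for example `b'{"tags":["x","x"]}'` versus `b'{"tags":["x"]}'`, so it does not cover every difference listed above.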

martijnvg added a commit that referenced this issue Dec 23, 2024
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Dec 23, 2024
Backporting elastic#119206 to 8.x branch.

The assertion that trips is related to synthetic recovery source usage, no need to mute the entire test suite.
Relates to elastic#119191
martijnvg added a commit that referenced this issue Dec 23, 2024
Backporting #119206 to 8.x branch.

The assertion that trips is related to synthetic recovery source usage, no need to mute the entire test suite.
Relates to #119191
@martijnvg martijnvg added blocker and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Dec 24, 2024