IndexFollowingIT.testUpdateRemoteConfigsDuringFollowing fails on master #53225

Closed
mayya-sharipova opened this issue Mar 6, 2020 · 6 comments · Fixed by #53415
Labels
:Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features) · >test-failure (Triaged test failures from CI)

Comments

@mayya-sharipova
Contributor

mayya-sharipova commented Mar 6, 2020

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob+fast+part2/4093/console
Build Scans: https://gradle-enterprise.elastic.co/s/tp2q6hwdkzcom

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ccr:internalClusterTest' \
  --tests "org.elasticsearch.xpack.ccr.IndexFollowingIT.testUpdateRemoteConfigsDuringFollowing" \
  -Dtests.seed=6CB3F4DF8251DFB5 \
  -Dtests.security.manager=true \
  -Dtests.locale=es-AR \
  -Dtests.timezone=Asia/Aden \
  -Dcompiler.java=13

Doesn't reproduce for me locally. There are no other failures of this test this year.

Stack trace:

java.lang.AssertionError: incorrect global checkpoint {"remote_cluster":"leader_cluster","follow_shard_index":"index2","follow_shard_index_uuid":"56HGhnidRZOl-eRO5HtDUw","follow_shard_shard":0,"leader_shard_index":"index1","leader_shard_index_uuid":"mlnxKU3BSPqE8AGe9MQt9A","leader_shard_shard":0,"max_read_request_operation_count":5120,"max_write_request_operation_count":5120,"max_outstanding_read_requests":12,"max_outstanding_write_requests":9,"max_read_request_size":"32mb","max_write_request_size":"9223372036854775807b","max_write_buffer_count":2147483647,"max_write_buffer_size":"512mb","max_retry_delay":"10ms","read_poll_timeout":"10ms","headers":{}}
Expected: <229L>
     but: was <-1L>
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.elasticsearch.xpack.ccr.IndexFollowingIT.lambda$assertTask$63(IndexFollowingIT.java:1476)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:881)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:854)
	at org.elasticsearch.xpack.ccr.IndexFollowingIT.testUpdateRemoteConfigsDuringFollowing(IndexFollowingIT.java:1348)
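For context on the failure mode: the `ESTestCase.assertBusy` frames in the trace come from a helper that retries an assertion until it passes or a timeout elapses, so `Expected: <229L> but: was <-1L>` means the follower's reported global checkpoint never caught up within that window. A simplified, standalone sketch of the polling pattern (stand-in names only, not the actual test-framework code):

```java
import java.util.concurrent.TimeUnit;

// Simplified stand-in for the assertBusy pattern: re-run the assertion until it
// passes or the deadline is reached, then surface the last failure.
public class AssertBusySketch {

    interface CheckedRunnable {
        void run() throws Exception;
    }

    static void assertBusy(CheckedRunnable assertion, long timeout, TimeUnit unit) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (true) {
            try {
                assertion.run();
                return; // assertion finally passed
            } catch (AssertionError | Exception e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // give up and rethrow the last failure, as seen in the trace above
                }
                TimeUnit.MILLISECONDS.sleep(100); // wait a bit and try again (the real helper backs off)
            }
        }
    }
}
```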
@mayya-sharipova mayya-sharipova added >test-failure Triaged test failures from CI :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels Mar 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CCR)

@dnhatn dnhatn self-assigned this Mar 11, 2020
@dnhatn
Member

dnhatn commented Mar 11, 2020

  2> mar 06, 2020 5:25:28 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
  2> WARNING: Uncaught exception in thread: Thread[elasticsearch[followerd3][ccr][T#22],5,TGRP-IndexFollowingIT]
  2> org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [leader_cluster]
  2> 	at __randomizedtesting.SeedInfo.seed([6CB3F4DF8251DFB5]:0)
  2> 	at org.elasticsearch.transport.RemoteClusterService.getRemoteClusterConnection(RemoteClusterService.java:205)
  2> 	at org.elasticsearch.transport.RemoteClusterService.ensureConnected(RemoteClusterService.java:188)
  2> 	at org.elasticsearch.transport.RemoteClusterAwareClient.doExecute(RemoteClusterAwareClient.java:48)
  2> 	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:377)
  2> 	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:661)
  2> 	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.state(AbstractClient.java:691)
  2> 	at org.elasticsearch.xpack.ccr.action.CcrRequests.getIndexMetadata(CcrRequests.java:59)
  2> 	at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor$1.innerUpdateMapping(ShardFollowTasksExecutor.java:144)
  2> 	at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.updateMapping(ShardFollowNodeTask.java:481)
  2> 	at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.lambda$updateMapping$17(ShardFollowNodeTask.java:482)
  2> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688)
  2> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  2> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)

I will take a closer look tomorrow.

dnhatn added a commit that referenced this issue Mar 13, 2020
A remote client can throw a NoSuchRemoteClusterException while fetching 
the cluster state from the leader cluster. We also need to handle that
exception when retrying to add a retention lease to the leader shard.

Closes #53225
dnhatn added backport commits that referenced this issue on Apr 4 and Apr 5, 2020, with the same message as above.
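Side note: a minimal sketch of the retry idea described in the commit message above, with hypothetical stand-in types (`LeaderClient`, `addRetentionLeaseWithRetry`) rather than the actual `ShardFollowTasksExecutor` code. The point is simply that a `NoSuchRemoteClusterException` from the remote client is transient (the remote connection is being rebuilt while its settings change) and should be retried instead of failing the follow task:

```java
import java.util.concurrent.TimeUnit;

// Illustration only: treat NoSuchRemoteClusterException as a transient failure
// and retry the retention-lease request instead of giving up.
public class RetryRetentionLeaseSketch {

    // Stand-in for org.elasticsearch.transport.NoSuchRemoteClusterException.
    static class NoSuchRemoteClusterException extends RuntimeException {}

    interface LeaderClient {
        void addRetentionLease() throws Exception; // may fail while the remote connection is rebuilt
    }

    static void addRetentionLeaseWithRetry(LeaderClient leader, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                leader.addRetentionLease();
                return; // lease added
            } catch (NoSuchRemoteClusterException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts, surface the error
                }
                TimeUnit.MILLISECONDS.sleep(10L * attempt); // simple backoff before retrying
            }
        }
    }
}
```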
@pgomulka
Contributor

pgomulka commented Jul 3, 2020

@dnhatn by any chance, was this not backported to 6.8? Do you think it is worth a backport?
There was a very similar failure in this test on that branch:

java.lang.AssertionError: incorrect global checkpoint {"remote_cluster":"leader_cluster","follow_shard_index":"index2","follow_shard_index_uuid":"6PI3qVcLS12o7i-_cGFeEw","follow_shard_shard":1,"leader_shard_index":"index1","leader_shard_index_uuid":"OiYzpvIbThCOK_FB8uuSCA","leader_shard_shard":1,"max_read_request_operation_count":9016,"max_write_request_operation_count":5120,"max_outstanding_read_requests":12,"max_outstanding_write_requests":9,"max_read_request_size":"30573350b","max_write_request_size":"9223372036854775807b","max_write_buffer_count":2147483647,"max_write_buffer_size":"512mb","max_retry_delay":"10ms","read_poll_timeout":"10ms","headers":{}}
Expected: <171L>
     but: was <-1L>
	at __randomizedtesting.SeedInfo.seed([3C886A4EEAD67EB9:FA33AFE3CB28BE11]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.elasticsearch.xpack.ccr.IndexFollowingIT.lambda$assertTask$53(IndexFollowingIT.java:1386)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:906)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:880)
	at org.elasticsearch.xpack.ccr.IndexFollowingIT.testUpdateRemoteConfigsDuringFollowing(IndexFollowingIT.java:1257)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)

https://gradle-enterprise.elastic.co/s/ysedpzc6m6ilc

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ccr:internalClusterTest' \
  -Dtests.seed=3C886A4EEAD67EB9 \
  -Dtests.class=org.elasticsearch.xpack.ccr.IndexFollowingIT \
  -Dtests.method="testUpdateRemoteConfigsDuringFollowing" \
  -Dtests.security.manager=true \
  -Dtests.locale=ro \
  -Dtests.timezone=SystemV/PST8PDT \
  -Dcompiler.java=12 \
  -Druntime.java=12

Interestingly, SystemV/PST8PDT is used. This is 6.8, so I guess Joda is used, which does not support that timezone (it is only supported by java.time).
Could it be that there are both 7.x and 6.8 nodes in this test?

@pgomulka pgomulka reopened this Jul 3, 2020
@dnhatn
Member

dnhatn commented Jul 4, 2020

I think this is the reason. I will work on a fix.

1> [2020-07-03T06:09:36,080][WARN ][o.e.x.c.a.ShardFollowNodeTask] [followerd3] shard follow task encounter non-retryable error
1> java.util.concurrent.RejectedExecutionException: connect queue is full
1> at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.connect(RemoteClusterConnection.java:445) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.connect(RemoteClusterConnection.java:427) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.transport.RemoteClusterConnection.ensureConnected(RemoteClusterConnection.java:221) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.transport.RemoteClusterService.ensureConnected(RemoteClusterService.java:393) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.transport.RemoteClusterAwareClient.doExecute(RemoteClusterAwareClient.java:50) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:403) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor$1.innerSendShardChangesRequest(ShardFollowTasksExecutor.java:267) [main/:?]
1> at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.sendShardChangesRequest(ShardFollowNodeTask.java:289) [main/:?]
1> at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.lambda$sendShardChangesRequest$4(ShardFollowNodeTask.java:320) [main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
1> at java.lang.Thread.run(Thread.java:835) [?:?]

@dnhatn
Member

dnhatn commented Jul 4, 2020

I've opened #59036.

dnhatn added a commit that referenced this issue Jul 8, 2020
…59036)

The backport in #56073 was supposed to change the max pending listeners
to 1000 and throw EsRejectedExecutionException instead of
RejectedExecutionException when reaching that limit. However, it missed
the latter.

Closes #53225
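Based on that commit message, the distinction between the two exception types is what matters here: the connect-queue rejection needed to surface as `EsRejectedExecutionException` so it could be handled as retryable, whereas the plain `java.util.concurrent.RejectedExecutionException` fell through the check and produced the "non-retryable error" in the log above. A minimal, hypothetical illustration with stand-in classes (not the actual 6.8 code):

```java
import java.util.concurrent.RejectedExecutionException;

// Illustration only: a retry check that recognises the Elasticsearch-specific
// rejection type but not the plain JDK one.
public class RetryableCheckSketch {

    // Stand-in for org.elasticsearch.common.util.concurrent.EsRejectedExecutionException.
    static class EsRejectedExecutionException extends RejectedExecutionException {}

    static boolean shouldRetry(Exception e) {
        return e instanceof EsRejectedExecutionException; // plain RejectedExecutionException falls through
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(new RejectedExecutionException("connect queue is full"))); // false -> treated as fatal
        System.out.println(shouldRetry(new EsRejectedExecutionException()));                      // true  -> retried
    }
}
```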
@dnhatn
Member

dnhatn commented Jul 8, 2020

Fixed in #59036.

@dnhatn dnhatn closed this as completed Jul 8, 2020