IndexFollowingIT.testUpdateRemoteConfigsDuringFollowing fails on master #53225

Closed
mayya-sharipova opened this issue Mar 6, 2020 · 6 comments · Fixed by #53415
Labels
:Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features) · >test-failure (Triaged test failures from CI)

Comments

@mayya-sharipova
Contributor

mayya-sharipova commented Mar 6, 2020

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob+fast+part2/4093/console
Build Scans: https://gradle-enterprise.elastic.co/s/tp2q6hwdkzcom

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ccr:internalClusterTest' \
  --tests "org.elasticsearch.xpack.ccr.IndexFollowingIT.testUpdateRemoteConfigsDuringFollowing" \
  -Dtests.seed=6CB3F4DF8251DFB5 \
  -Dtests.security.manager=true \
  -Dtests.locale=es-AR \
  -Dtests.timezone=Asia/Aden \
  -Dcompiler.java=13

Doesn't reproduce for me locally. There are no other failures of this test this year.

Stack trace:

java.lang.AssertionError: incorrect global checkpoint {"remote_cluster":"leader_cluster","follow_shard_index":"index2","follow_shard_index_uuid":"56HGhnidRZOl-eRO5HtDUw","follow_shard_shard":0,"leader_shard_index":"index1","leader_shard_index_uuid":"mlnxKU3BSPqE8AGe9MQt9A","leader_shard_shard":0,"max_read_request_operation_count":5120,"max_write_request_operation_count":5120,"max_outstanding_read_requests":12,"max_outstanding_write_requests":9,"max_read_request_size":"32mb","max_write_request_size":"9223372036854775807b","max_write_buffer_count":2147483647,"max_write_buffer_size":"512mb","max_retry_delay":"10ms","read_poll_timeout":"10ms","headers":{}}
Expected: <229L>
     but: was <-1L>
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.elasticsearch.xpack.ccr.IndexFollowingIT.lambda$assertTask$63(IndexFollowingIT.java:1476)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:881)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:854)
	at org.elasticsearch.xpack.ccr.IndexFollowingIT.testUpdateRemoteConfigsDuringFollowing(IndexFollowingIT.java:1348)
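For context on the failure mode: the `ESTestCase.assertBusy` frames in the trace come from a helper that retries an assertion until it passes or a timeout elapses, so `Expected: <229L> but: was <-1L>` means the follower's reported global checkpoint never caught up within that window. A simplified, standalone sketch of the polling pattern (stand-in names only, not the actual test-framework code):

```java
import java.util.concurrent.TimeUnit;

// Simplified stand-in for the assertBusy pattern: re-run the assertion until it
// passes or the deadline is reached, then surface the last failure.
public class AssertBusySketch {

    interface CheckedRunnable {
        void run() throws Exception;
    }

    static void assertBusy(CheckedRunnable assertion, long timeout, TimeUnit unit) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (true) {
            try {
                assertion.run();
                return; // assertion finally passed
            } catch (AssertionError | Exception e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // give up and rethrow the last failure, as seen in the trace above
                }
                TimeUnit.MILLISECONDS.sleep(100); // wait a bit and try again (the real helper backs off)
            }
        }
    }
}
```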
@mayya-sharipova mayya-sharipova added >test-failure Triaged test failures from CI :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels Mar 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CCR)

@dnhatn dnhatn self-assigned this Mar 11, 2020
@dnhatn
Member

dnhatn commented Mar 11, 2020

  2> mar 06, 2020 5:25:28 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
  2> WARNING: Uncaught exception in thread: Thread[elasticsearch[followerd3][ccr][T#22],5,TGRP-IndexFollowingIT]
  2> org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [leader_cluster]
  2> 	at __randomizedtesting.SeedInfo.seed([6CB3F4DF8251DFB5]:0)
  2> 	at org.elasticsearch.transport.RemoteClusterService.getRemoteClusterConnection(RemoteClusterService.java:205)
  2> 	at org.elasticsearch.transport.RemoteClusterService.ensureConnected(RemoteClusterService.java:188)
  2> 	at org.elasticsearch.transport.RemoteClusterAwareClient.doExecute(RemoteClusterAwareClient.java:48)
  2> 	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:377)
  2> 	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:661)
  2> 	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.state(AbstractClient.java:691)
  2> 	at org.elasticsearch.xpack.ccr.action.CcrRequests.getIndexMetadata(CcrRequests.java:59)
  2> 	at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor$1.innerUpdateMapping(ShardFollowTasksExecutor.java:144)
  2> 	at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.updateMapping(ShardFollowNodeTask.java:481)
  2> 	at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.lambda$updateMapping$17(ShardFollowNodeTask.java:482)
  2> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688)
  2> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  2> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)

I will take a closer look tomorrow.

dnhatn added a commit that referenced this issue Mar 13, 2020
A remote client can throw a NoSuchRemoteClusterException while fetching 
the cluster state from the leader cluster. We also need to handle that
exception when retrying to add a retention lease to the leader shard.

Closes #53225
dnhatn added backport commits that referenced this issue on Apr 4 and Apr 5, 2020, with the same message as above.
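Side note: a minimal sketch of the retry idea described in the commit message above, with hypothetical stand-in types (`LeaderClient`, `addRetentionLeaseWithRetry`) rather than the actual `ShardFollowTasksExecutor` code. The point is simply that a `NoSuchRemoteClusterException` from the remote client is transient (the remote connection is being rebuilt while its settings change) and should be retried instead of failing the follow task:

```java
import java.util.concurrent.TimeUnit;

// Illustration only: treat NoSuchRemoteClusterException as a transient failure
// and retry the retention-lease request instead of giving up.
public class RetryRetentionLeaseSketch {

    // Stand-in for org.elasticsearch.transport.NoSuchRemoteClusterException.
    static class NoSuchRemoteClusterException extends RuntimeException {}

    interface LeaderClient {
        void addRetentionLease() throws Exception; // may fail while the remote connection is rebuilt
    }

    static void addRetentionLeaseWithRetry(LeaderClient leader, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                leader.addRetentionLease();
                return; // lease added
            } catch (NoSuchRemoteClusterException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts, surface the error
                }
                TimeUnit.MILLISECONDS.sleep(10L * attempt); // simple backoff before retrying
            }
        }
    }
}
```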
@pgomulka
Contributor

pgomulka commented Jul 3, 2020

@dnhatn by any chance, was this not backported to 6.8? Do you think it is worth a backport?
There was a very similar failure in this test on that branch:

java.lang.AssertionError: incorrect global checkpoint {"remote_cluster":"leader_cluster","follow_shard_index":"index2","follow_shard_index_uuid":"6PI3qVcLS12o7i-_cGFeEw","follow_shard_shard":1,"leader_shard_index":"index1","leader_shard_index_uuid":"OiYzpvIbThCOK_FB8uuSCA","leader_shard_shard":1,"max_read_request_operation_count":9016,"max_write_request_operation_count":5120,"max_outstanding_read_requests":12,"max_outstanding_write_requests":9,"max_read_request_size":"30573350b","max_write_request_size":"9223372036854775807b","max_write_buffer_count":2147483647,"max_write_buffer_size":"512mb","max_retry_delay":"10ms","read_poll_timeout":"10ms","headers":{}}
Expected: <171L>
     but: was <-1L>
	at __randomizedtesting.SeedInfo.seed([3C886A4EEAD67EB9:FA33AFE3CB28BE11]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.elasticsearch.xpack.ccr.IndexFollowingIT.lambda$assertTask$53(IndexFollowingIT.java:1386)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:906)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:880)
	at org.elasticsearch.xpack.ccr.IndexFollowingIT.testUpdateRemoteConfigsDuringFollowing(IndexFollowingIT.java:1257)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)

https://gradle-enterprise.elastic.co/s/ysedpzc6m6ilc

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ccr:internalClusterTest' \
  -Dtests.seed=3C886A4EEAD67EB9 \
  -Dtests.class=org.elasticsearch.xpack.ccr.IndexFollowingIT \
  -Dtests.method="testUpdateRemoteConfigsDuringFollowing" \
  -Dtests.security.manager=true \
  -Dtests.locale=ro \
  -Dtests.timezone=SystemV/PST8PDT \
  -Dcompiler.java=12 \
  -Druntime.java=12

Interestingly, SystemV/PST8PDT is used. This is 6.8, so I guess Joda is used, which does not support that timezone (it is only supported by java.time).
Could it be that there are both 7.x and 6.8 nodes in this test?

@pgomulka pgomulka reopened this Jul 3, 2020
@dnhatn
Member

dnhatn commented Jul 4, 2020

I think this is the reason. I will work on a fix.

1> [2020-07-03T06:09:36,080][WARN ][o.e.x.c.a.ShardFollowNodeTask] [followerd3] shard follow task encounter non-retryable error
1> java.util.concurrent.RejectedExecutionException: connect queue is full
1> at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.connect(RemoteClusterConnection.java:445) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.connect(RemoteClusterConnection.java:427) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.transport.RemoteClusterConnection.ensureConnected(RemoteClusterConnection.java:221) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.transport.RemoteClusterService.ensureConnected(RemoteClusterService.java:393) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.transport.RemoteClusterAwareClient.doExecute(RemoteClusterAwareClient.java:50) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:403) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor$1.innerSendShardChangesRequest(ShardFollowTasksExecutor.java:267) [main/:?]
1> at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.sendShardChangesRequest(ShardFollowNodeTask.java:289) [main/:?]
1> at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.lambda$sendShardChangesRequest$4(ShardFollowNodeTask.java:320) [main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.11-SNAPSHOT.jar:6.8.11-SNAPSHOT]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
1> at java.lang.Thread.run(Thread.java:835) [?:?]

@dnhatn
Member

dnhatn commented Jul 4, 2020

I've opened #59036.

dnhatn added a commit that referenced this issue Jul 8, 2020
…59036)

The backport in #56073 was supposed to change the max pending listeners
to 1000 and throw EsRejectedExecutionException instead of
RejectedExecutionException when reaching that limit. However, it missed
the latter.

Closes #53225
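Based on that commit message, the distinction between the two exception types is what matters here: the connect-queue rejection needed to surface as `EsRejectedExecutionException` so it could be handled as retryable, whereas the plain `java.util.concurrent.RejectedExecutionException` fell through the check and produced the "non-retryable error" in the log above. A minimal, hypothetical illustration with stand-in classes (not the actual 6.8 code):

```java
import java.util.concurrent.RejectedExecutionException;

// Illustration only: a retry check that recognises the Elasticsearch-specific
// rejection type but not the plain JDK one.
public class RetryableCheckSketch {

    // Stand-in for org.elasticsearch.common.util.concurrent.EsRejectedExecutionException.
    static class EsRejectedExecutionException extends RejectedExecutionException {}

    static boolean shouldRetry(Exception e) {
        return e instanceof EsRejectedExecutionException; // plain RejectedExecutionException falls through
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(new RejectedExecutionException("connect queue is full"))); // false -> treated as fatal
        System.out.println(shouldRetry(new EsRejectedExecutionException()));                      // true  -> retried
    }
}
```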
@dnhatn
Member

dnhatn commented Jul 8, 2020

Fixed in #59036.

@dnhatn dnhatn closed this as completed Jul 8, 2020