
[CI] TokenBackwardsCompatibilityIT failure in 6.x #37379

Closed

gwbrown opened this issue Jan 11, 2019 · 12 comments
Labels
:Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) >test-failure Triaged test failures from CI

gwbrown commented Jan 11, 2019

This failure appears to be pretty rare, but happened twice in 6.x intake this morning:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/960/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/966/console (maybe? This one might be caused by something else) [edit: Don't think this one is connected, see below]

Reproduce line (I'm having trouble with my local environment at the moment, so I can't confirm whether this reproduces, but I don't think it does):

./gradlew :x-pack:qa:rolling-upgrade:without-system-key:v5.6.15#twoThirdsUpgradedTestRunner \
  -Dtests.seed=9E7FA2C9D621808E \
  -Dtests.class=org.elasticsearch.upgrades.TokenBackwardsCompatibilityIT \
  -Dtests.method="testMixedCluster" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-BH \
  -Dtests.timezone=CET \
  -Dcompiler.java=11 \
  -Druntime.java=8

Stack trace:

org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:45085], URI [_xpack/security/oauth2/token], status line [HTTP/1.1 400 Bad Request]
{"error":"invalid_grant","error_description":"token has already been refreshed"}
	at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:929)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229)
	at org.elasticsearch.upgrades.TokenBackwardsCompatibilityIT.testMixedCluster(TokenBackwardsCompatibilityIT.java:114)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:45085], URI [_xpack/security/oauth2/token], status line [HTTP/1.1 400 Bad Request]
{"error":"invalid_grant","error_description":"token has already been refreshed"}
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:548)
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:533)
	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
	... 1 more
	Suppressed: org.apache.http.ConnectionClosedException: Connection closed
		at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:344)
		at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
		... 10 more

This also seems to cause some trouble in the cluster, which in turn causes other tests to fail.
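For context, the failing request is the refresh-token grant against the OAuth2 token API. A rough sketch of the sequence the test performs (credentials are placeholders; endpoint and port taken from the stack trace above):

# 1. Create an access token plus refresh token (user/password are hypothetical).
curl -u test_user:x-pack-test-password -H 'Content-Type: application/json' \
  'http://127.0.0.1:45085/_xpack/security/oauth2/token' \
  -d '{"grant_type": "password", "username": "test_user", "password": "x-pack-test-password"}'

# 2. Refresh it. Refreshing the same refresh token a second time is what yields
#    {"error":"invalid_grant","error_description":"token has already been refreshed"}.
curl -u test_user:x-pack-test-password -H 'Content-Type: application/json' \
  'http://127.0.0.1:45085/_xpack/security/oauth2/token' \
  -d '{"grant_type": "refresh_token", "refresh_token": "<refresh token from step 1>"}'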

Possibly related to #33197 and/or #31195

gwbrown added the >test-failure and :Security/Authentication labels on Jan 11, 2019
elasticmachine commented

Pinging @elastic/es-security


gwbrown commented Jan 11, 2019

After looking more at these, especially https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/966/console, I'm inclined to say that there might be another problem here. I'm going to leave this open, as this test failure may be valid, but I'm not sure it's the only problem: it looks like at least one of the nodes simply died in the middle of some of these tests.

jkakavas commented

These two reproduce locally, but the failure doesn't seem to be related to the token refresh; rather, no master can be elected in the mixed cluster:

|    [2019-01-14T09:38:53,098][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-14T09:38:53,612][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-14T09:38:58,228][INFO ][o.e.d.z.ZenDiscovery     ] [upgraded-node-1] failed to send join request to master [{node-2}{aIvm3YvcRsGaZ6uRbbI7Eg}{1glrAnPPTEKyoYgGfe1PFw}{127.0.0.1}{127.0.0.1:33229}{testattr=test, ml.max_open_jobs=10, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
|    [2019-01-14T09:39:23,612][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-14T09:39:23,614][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-14T09:39:24,123][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-14T09:39:54,124][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-14T09:39:54,124][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]

The token refresh error is seen in the relevant logs (i.e. 335), but the suppressed exception doesn't make sense to me:

> Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:38358], URI [_xpack/security/oauth2/token], status line [HTTP/1.1 400 Bad Request]
> {"error":"invalid_grant","error_description":"token has already been refreshed"}
> 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:548)
> 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:533)
> 	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
> 	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
> 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
> 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
> 	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
> 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
> 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
> 	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
> 	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
> 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
> 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
> 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
> 	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
> 	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
> 	... 1 more
> 	Suppressed: org.apache.http.ConnectionClosedException: Connection closed
> 		at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:344)
> 		at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
> 		... 10 more
> Throwable #2: java.io.IOException: listener timeout after waiting for [30000] ms
> 	at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:908)
> 	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229)
> 	at org.elasticsearch.test.rest.ESRestTestCase.wipeSnapshots(ESRestTestCase.java:497)
> 	at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:469)
> 	at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:263)
> 	at java.lang.Thread.run(Thread.java:748)

Locally, these tests fail before the test even tries to refresh the token, and I see similar MasterNotDiscoveredException messages in the CI logs.

I'll keep looking, but it could be that the refresh token error is a red herring.
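For anyone else reproducing: a quick sanity check for whether a master was ever elected is to ask any node of the test cluster directly while the test hangs (the port below is a placeholder; the test framework assigns one per run):

curl 'http://127.0.0.1:9200/_cat/master?v'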


matriv commented Jan 15, 2019

Happened again today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1002/console
and reproduced locally on 6.x:

|    [2019-01-15T15:28:19,561][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:28:22,279][WARN ][o.e.d.z.ZenDiscovery     ] [upgraded-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={upgraded-node-1}{gX7MA9FtT-aKFk9CRr-lcQ}{h2_Hsc68QEKyeI2q_Kv7pA}{127.0.0.1}{127.0.0.1:57280}{upgraded=true, ml.machine_memory=34359738368, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
|    [2019-01-15T15:28:49,570][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:28:49,572][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-15T15:28:49,894][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:29:19,896][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:29:19,898][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-15T15:29:20,400][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:29:25,379][INFO ][o.e.d.z.ZenDiscovery     ] [upgraded-node-1] failed to send join request to master [{node-2}{WQQi8QhsStKrOLR3HlfSPw}{bx0glClkRDOTHGZGKqQ8fg}{127.0.0.1}{127.0.0.1:56798}{testattr=test, ml.max_open_jobs=10, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
|    [2019-01-15T15:29:50,405][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:29:50,406][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-15T15:29:50,912][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:30:20,916][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:30:20,918][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|-----------------------------------------

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':x-pack:qa:rolling-upgrade:without-system-key:v5.6.15#twoThirdsUpgradedTestCluster#wait'.
> Elasticsearch cluster failed to pass wait condition
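For reference, the wait condition that fails here is essentially the same health check that shows up in the log lines above; roughly (port is again a placeholder):

# The wait task polls cluster health until all 3 nodes have joined and the
# status is at least yellow; with no elected master this times out.
curl 'http://127.0.0.1:9200/_cluster/health?wait_for_status=yellow&wait_for_nodes=3'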

jkakavas commented

👀

matriv added a commit that referenced this issue Jan 15, 2019

matriv commented Jan 15, 2019

The tests are muted in master and 6.x as they fail quite often.


matriv commented Jan 15, 2019

I reverted the mute on master, as it seems to fail only in 6.x:
6129e9d

jkakavas commented

> Happened again today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1002/console
> and reproduced locally on 6.x:

The local reproduction doesn't have to do with the original error (failed to refresh token) that we encounter on CI (see the explanation above).

I think the way to try and reproduce these locally is:

$ cd x-pack/qa/rolling-upgrade/without-system-key
$ ../../../../gradlew check \
  -Dtests.seed=2AFD25BC2B1F75AB \
  -Dtests.class=org.elasticsearch.upgrades.TokenBackwardsCompatibilityIT \
  -Dtests.security.manager=true \
  -Dtests.locale=fr-FR \
  -Dtests.timezone=IST \
  -Dcompiler.java=11 \
  -Druntime.java=8

which doesn't reproduce for me.

Looking into the rest of the failures, I came to realize that this might fail more consistently than we think. The mixedClusterTest is only run in the twoThirdsUpgraded stage, and only if the master is on the new version. That means that if the original master is the last node to be upgraded, the mixedClusterTest will not run. In all the successful CI builds I've seen, the test is actually skipped:

IGNOR/A 0.08s | TokenBackwardsCompatibilityIT.testMixedCluster
   > Assumption #1: the master must be on the latest version before we can write

Now this makes this failure even stranger, as:
a) it never reproduces locally, and
b) the test itself creates the token and refresh token just before it tries to use the refresh token, so I can't really figure out how it can already have been refreshed.

I'll keep looking
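A quick way to check whether that skip assumption holds on a live cluster, i.e. whether the elected master is on the new version, is (placeholder port again):

# Shows each node's version and which one is the elected master.
curl 'http://127.0.0.1:9200/_cat/nodes?v&h=name,version,master'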

jkakavas commented

I unmuted the test in 05b2c80, hoping to get some additional debugging information, as this never fails locally. I will keep a close eye on 6.x and will mute this test again so that it doesn't add too much noise.

jkakavas commented Feb 8, 2019

This was muted again right after 05b2c80 so as not to cause any more CI issues. I will unmute it with proper logging in the next few days (once CI has stabilized) to see if it fails again and to gather enough information to debug it.
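For reference, a sketch of one way to get that extra logging without a code change, assuming the usual logger cluster settings (the exact logger name here is an educated guess):

curl -X PUT -H 'Content-Type: application/json' \
  'http://127.0.0.1:9200/_cluster/settings' \
  -d '{"transient": {"logger.org.elasticsearch.xpack.security.authc.TokenService": "trace"}}'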

jkakavas commented

We unfortunately never got to the bottom of this. It has never failed in master and keeps on not failing, so I will close this for now and reopen it if such a failure is ever encountered again.

jkakavas added a commit to jkakavas/elasticsearch that referenced this issue Feb 21, 2019
This change

- Fixes TokenBackwardsCompatibilityIT: the existing tests seemed to make
  the assumption that in the oneThirdUpgraded stage the master node
  will be on the old version and that in the twoThirdsUpgraded stage the
  master node will be one of the upgraded ones. However, there is no
  guarantee that the master node in any of the stages will or will
  not be one of the upgraded ones.
  This class now tests:
  - That we can generate and consume tokens before we start the
    rolling upgrade.
  - That we can consume tokens generated in the old cluster during
    all the stages of the rolling upgrade.
  - That while on a mixed cluster, when/if the master node is
    upgraded, we can generate, consume and refresh a token.
  - That after the rolling upgrade, we can consume a token
    generated in an old cluster and can invalidate it so that it
    can't be used any more.
- Ensures that during the rolling upgrade, the upgraded nodes have
  the same configuration as the old nodes. Specifically, the
  file realm we use is explicitly named `file1`. This is needed
  because while attempting to refresh a token in a mixed cluster,
  we might create a token hitting an old node and attempt to refresh
  it hitting a new node. If the file realm name is not the same, the
  refresh will be seen as being made by a "different" client and
  will thus fail.
- Renames the Authentication variable we check while refreshing a
  token to clientAuth in order to make the code more readable.

Some of the above were possibly causing the flakiness of elastic#37379
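For illustration, the realm-naming part of the fix boils down to every node, old or upgraded, declaring the file realm under the same name. A minimal sketch of the relevant settings (assuming the 6.x `realms.<name>.type` syntax; not the actual test config):

# In each node's elasticsearch.yml, old and upgraded alike:
xpack.security.authc.realms.file1.type: file
xpack.security.authc.realms.file1.order: 0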
jkakavas added a commit that referenced this issue Feb 22, 2019
jkakavas added a commit to jkakavas/elasticsearch that referenced this issue Feb 22, 2019
weizijun pushed a commit to weizijun/elasticsearch that referenced this issue Feb 22, 2019
weizijun pushed a commit to weizijun/elasticsearch that referenced this issue Feb 22, 2019
jkakavas added a commit that referenced this issue Feb 26, 2019
This change is a backport of #39252; the commit message is the same as above.