
[CI] TokenBackwardsCompatibilityIT failure in 6.x #37379

Closed

gwbrown opened this issue Jan 11, 2019 · 12 comments
Labels
:Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) >test-failure Triaged test failures from CI

gwbrown commented Jan 11, 2019

This failure appears to be pretty rare, but happened twice in 6.x intake this morning:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/960/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/966/console (maybe? This one might be caused by something else) [edit: Don't think this one is connected, see below]

Reproduce line (I'm having trouble with my local environment at the moment, so I can't confirm whether this reproduces, but I don't think it does):

./gradlew :x-pack:qa:rolling-upgrade:without-system-key:v5.6.15#twoThirdsUpgradedTestRunner \
  -Dtests.seed=9E7FA2C9D621808E \
  -Dtests.class=org.elasticsearch.upgrades.TokenBackwardsCompatibilityIT \
  -Dtests.method="testMixedCluster" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-BH \
  -Dtests.timezone=CET \
  -Dcompiler.java=11 \
  -Druntime.java=8

Stack trace:

org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:45085], URI [_xpack/security/oauth2/token], status line [HTTP/1.1 400 Bad Request]
{"error":"invalid_grant","error_description":"token has already been refreshed"}
	at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:929)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229)
	at org.elasticsearch.upgrades.TokenBackwardsCompatibilityIT.testMixedCluster(TokenBackwardsCompatibilityIT.java:114)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:45085], URI [_xpack/security/oauth2/token], status line [HTTP/1.1 400 Bad Request]
{"error":"invalid_grant","error_description":"token has already been refreshed"}
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:548)
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:533)
	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
	... 1 more
	Suppressed: org.apache.http.ConnectionClosedException: Connection closed
		at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:344)
		at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
		... 10 more

This also seems to cause some trouble in the cluster, which in turn causes other tests to fail.
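For context, the failing request is the refresh-token grant against the OAuth2 token API. A rough sketch of the sequence the test performs (credentials are placeholders; endpoint and port taken from the stack trace above):

# 1. Create an access token plus refresh token (user/password are hypothetical).
curl -u test_user:x-pack-test-password -H 'Content-Type: application/json' \
  'http://127.0.0.1:45085/_xpack/security/oauth2/token' \
  -d '{"grant_type": "password", "username": "test_user", "password": "x-pack-test-password"}'

# 2. Refresh it. Refreshing the same refresh token a second time is what yields
#    {"error":"invalid_grant","error_description":"token has already been refreshed"}.
curl -u test_user:x-pack-test-password -H 'Content-Type: application/json' \
  'http://127.0.0.1:45085/_xpack/security/oauth2/token' \
  -d '{"grant_type": "refresh_token", "refresh_token": "<refresh token from step 1>"}'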

Possibly related to #33197 and/or #31195

gwbrown added the >test-failure and :Security/Authentication labels on Jan 11, 2019
elasticmachine commented

Pinging @elastic/es-security


gwbrown commented Jan 11, 2019

After looking more at these, especially https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/966/console, I'm inclined to say that there might be another problem here. I'm going to leave this open, as this test failure may be valid, but I'm not sure it's the only problem: it looks like at least one of the nodes simply died in the middle of some of these tests.

jkakavas commented

These two reproduce locally, but the failure doesn't seem to be related to the token refresh; rather, no master can be elected in the mixed cluster:

|    [2019-01-14T09:38:53,098][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-14T09:38:53,612][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-14T09:38:58,228][INFO ][o.e.d.z.ZenDiscovery     ] [upgraded-node-1] failed to send join request to master [{node-2}{aIvm3YvcRsGaZ6uRbbI7Eg}{1glrAnPPTEKyoYgGfe1PFw}{127.0.0.1}{127.0.0.1:33229}{testattr=test, ml.max_open_jobs=10, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
|    [2019-01-14T09:39:23,612][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-14T09:39:23,614][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-14T09:39:24,123][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-14T09:39:54,124][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-14T09:39:54,124][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]

The token refresh error is seen in the relevant logs (i.e. 335), but the suppressed exception doesn't make sense to me:

> Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:38358], URI [_xpack/security/oauth2/token], status line [HTTP/1.1 400 Bad Request]
> {"error":"invalid_grant","error_description":"token has already been refreshed"}
> 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:548)
> 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:533)
> 	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
> 	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
> 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
> 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
> 	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
> 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
> 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
> 	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
> 	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
> 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
> 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
> 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
> 	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
> 	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
> 	... 1 more
> 	Suppressed: org.apache.http.ConnectionClosedException: Connection closed
> 		at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:344)
> 		at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
> 		... 10 more
> Throwable #2: java.io.IOException: listener timeout after waiting for [30000] ms
> 	at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:908)
> 	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229)
> 	at org.elasticsearch.test.rest.ESRestTestCase.wipeSnapshots(ESRestTestCase.java:497)
> 	at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:469)
> 	at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:263)
> 	at java.lang.Thread.run(Thread.java:748)

Locally, these tests fail before the test even tries to refresh the token, and I see similar MasterNotDiscoveredException messages in the CI logs.

I'll keep looking, but it could be that the refresh token error is a red herring.
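For anyone else reproducing: a quick sanity check for whether a master was ever elected is to ask any node of the test cluster directly while the test hangs (the port below is a placeholder; the test framework assigns one per run):

curl 'http://127.0.0.1:9200/_cat/master?v'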


matriv commented Jan 15, 2019

Happened again today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1002/console
and reproduced locally on 6.x:

|    [2019-01-15T15:28:19,561][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:28:22,279][WARN ][o.e.d.z.ZenDiscovery     ] [upgraded-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={upgraded-node-1}{gX7MA9FtT-aKFk9CRr-lcQ}{h2_Hsc68QEKyeI2q_Kv7pA}{127.0.0.1}{127.0.0.1:57280}{upgraded=true, ml.machine_memory=34359738368, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
|    [2019-01-15T15:28:49,570][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:28:49,572][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-15T15:28:49,894][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:29:19,896][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:29:19,898][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-15T15:29:20,400][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:29:25,379][INFO ][o.e.d.z.ZenDiscovery     ] [upgraded-node-1] failed to send join request to master [{node-2}{WQQi8QhsStKrOLR3HlfSPw}{bx0glClkRDOTHGZGKqQ8fg}{127.0.0.1}{127.0.0.1:56798}{testattr=test, ml.max_open_jobs=10, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
|    [2019-01-15T15:29:50,405][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:29:50,406][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-15T15:29:50,912][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:30:20,916][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:30:20,918][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|-----------------------------------------

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':x-pack:qa:rolling-upgrade:without-system-key:v5.6.15#twoThirdsUpgradedTestCluster#wait'.
> Elasticsearch cluster failed to pass wait condition
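For reference, the wait condition that fails here is essentially the same health check that shows up in the log lines above; roughly (port is again a placeholder):

# The wait task polls cluster health until all 3 nodes have joined and the
# status is at least yellow; with no elected master this times out.
curl 'http://127.0.0.1:9200/_cluster/health?wait_for_status=yellow&wait_for_nodes=3'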

jkakavas commented

👀

matriv added a commit that referenced this issue Jan 15, 2019

matriv commented Jan 15, 2019

The tests are muted in master and 6.x as they fail quite often.


matriv commented Jan 15, 2019

I reverted the mute on master, as it seems to fail only in 6.x:
6129e9d

jkakavas commented

> Happened again today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1002/console
> and reproduced locally on 6.x:

The local reproduction doesn't have to do with the original error (failed to refresh token) that we encounter on CI (see the explanation above).

I think the way to try and reproduce these locally is:

$ cd x-pack/qa/rolling-upgrade/without-system-key
$ ../../../../gradlew check \
  -Dtests.seed=2AFD25BC2B1F75AB \
  -Dtests.class=org.elasticsearch.upgrades.TokenBackwardsCompatibilityIT \
  -Dtests.security.manager=true \
  -Dtests.locale=fr-FR \
  -Dtests.timezone=IST \
  -Dcompiler.java=11 \
  -Druntime.java=8

which doesn't reproduce for me.

Looking into the rest of the failures, I came to realize that this might fail more consistently than we think. The mixedClusterTest is only run in the twoThirdsUpgraded stage, and only if the master is on the new version. That means that if the original master is the last node to be upgraded, the mixedClusterTest will not run. In all the successful CI builds I've seen, the test is actually skipped:

IGNOR/A 0.08s | TokenBackwardsCompatibilityIT.testMixedCluster
   > Assumption #1: the master must be on the latest version before we can write

Now this makes this failure even stranger, as:
a) it never reproduces locally, and
b) the test itself creates the token and refresh token just before it tries to use the refresh token, so I can't really figure out how it can already have been refreshed.

I'll keep looking
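A quick way to check whether that skip assumption holds on a live cluster, i.e. whether the elected master is on the new version, is (placeholder port again):

# Shows each node's version and which one is the elected master.
curl 'http://127.0.0.1:9200/_cat/nodes?v&h=name,version,master'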

jkakavas commented

I unmuted the test in 05b2c80, hoping to get some additional debugging information, as this never fails locally. I will keep a close eye on 6.x and will mute this test again so that it doesn't add too much noise.

jkakavas commented Feb 8, 2019

This was muted again right after 05b2c80 so as not to cause any more CI issues. I will unmute it with proper logging in the next few days (once CI has stabilized) to see if it fails again and to gather enough information to debug it.
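For reference, a sketch of one way to get that extra logging without a code change, assuming the usual logger cluster settings (the exact logger name here is an educated guess):

curl -X PUT -H 'Content-Type: application/json' \
  'http://127.0.0.1:9200/_cluster/settings' \
  -d '{"transient": {"logger.org.elasticsearch.xpack.security.authc.TokenService": "trace"}}'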

jkakavas commented

We unfortunately never got to the bottom of this. It has never failed in master and keeps on not failing, so I will close this for now and reopen it if such a failure is ever encountered again.

jkakavas added a commit to jkakavas/elasticsearch that referenced this issue Feb 21, 2019
This change

- Fixes TokenBackwardsCompatibilityIT: the existing tests seemed to make
  the assumption that in the oneThirdUpgraded stage the master node
  will be on the old version and that in the twoThirdsUpgraded stage the
  master node will be one of the upgraded ones. However, there is no
  guarantee that the master node in any of the stages will or will
  not be one of the upgraded ones.
  This class now tests:
  - That we can generate and consume tokens before we start the
    rolling upgrade.
  - That we can consume tokens generated in the old cluster during
    all the stages of the rolling upgrade.
  - That while on a mixed cluster, when/if the master node is
    upgraded, we can generate, consume and refresh a token.
  - That after the rolling upgrade, we can consume a token
    generated in an old cluster and can invalidate it so that it
    can't be used any more.
- Ensures that during the rolling upgrade, the upgraded nodes have
  the same configuration as the old nodes. Specifically, the
  file realm we use is explicitly named `file1`. This is needed
  because while attempting to refresh a token in a mixed cluster,
  we might create a token hitting an old node and attempt to refresh
  it hitting a new node. If the file realm name is not the same, the
  refresh will be seen as being made by a "different" client and
  will thus fail.
- Renames the Authentication variable we check while refreshing a
  token to clientAuth in order to make the code more readable.

Some of the above were possibly causing the flakiness of elastic#37379
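For illustration, the realm-naming part of the fix boils down to every node, old or upgraded, declaring the file realm under the same name. A minimal sketch of the relevant settings (assuming the 6.x `realms.<name>.type` syntax; not the actual test config):

# In each node's elasticsearch.yml, old and upgraded alike:
xpack.security.authc.realms.file1.type: file
xpack.security.authc.realms.file1.order: 0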
jkakavas added a commit that referenced this issue Feb 22, 2019
jkakavas added a commit to jkakavas/elasticsearch that referenced this issue Feb 22, 2019
weizijun pushed a commit to weizijun/elasticsearch that referenced this issue Feb 22, 2019
weizijun pushed a commit to weizijun/elasticsearch that referenced this issue Feb 22, 2019
jkakavas added a commit that referenced this issue Feb 26, 2019
This change is a backport of #39252; the commit message is the same as above.