Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] UpgradeClusterClientYamlTestSuiteIT test {p0=mixed_cluster/90_ml_data_frame_analytics_crud/Put an outlier_detection job on the mixed cluster} failing #83567

Closed
tvernum opened this issue Feb 7, 2022 · 17 comments · Fixed by #83913
Assignees
Labels
:ml Machine learning :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) Team:ML Meta label for the ML team Team:Security Meta label for security team >test-failure Triaged test failures from CI

Comments

@tvernum
Copy link
Contributor

tvernum commented Feb 7, 2022

Build scan:
https://gradle-enterprise.elastic.co/s/oyckzy4coyvbm/tests/:x-pack:qa:rolling-upgrade:v8.0.0%23twoThirdsUpgradedTest/org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT/test%20%7Bp0=mixed_cluster%2F90_ml_data_frame_analytics_crud%2FPut%20an%20outlier_detection%20job%20on%20the%20mixed%20cluster%7D

Reproduction line:
./gradlew ':x-pack:qa:rolling-upgrade:v8.0.0#twoThirdsUpgradedTest' -Dtests.class="org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT" -Dtests.method="test {p0=mixed_cluster/90_ml_data_frame_analytics_crud/Put an outlier_detection job on the mixed cluster}" -Dtests.seed=1981B74F2FB431B8 -Dtests.bwc=true -Dtests.locale=es-ES -Dtests.timezone=Australia/Eucla -Druntime.java=17

Applicable branches:
master

Reproduces locally?:
Yes

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT&tests.test=test%20%7Bp0%3Dmixed_cluster/90_ml_data_frame_analytics_crud/Put%20an%20outlier_detection%20job%20on%20the%20mixed%20cluster%7D

Failure excerpt:

java.lang.AssertionError: Failure at [mixed_cluster/90_ml_data_frame_analytics_crud:158]: expected [2xx] status code but api [ml.stop_data_frame_analytics] returned [500 Internal Server Error] [{"error":{"root_cause":[{"type":"illegal_state_exception","reason":"Timed out when waiting for persistent tasks after 30s","stack_trace":"org.elasticsearch.ElasticsearchException$1: Timed out when waiting for persistent tasks after 30s\n\tat org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:638)\n\tat org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:566)\n\tat org.elasticsearch.rest.BytesRestResponse.build(BytesRestResponse.java:140)\n\tat org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:101)\n\tat org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:81)\n\tat org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:55)\n\tat org.elasticsearch.client.node.NodeClient.lambda$executeLocally$1(NodeClient.java:114)\n\tat org.elasticsearch.tasks.TaskManager$1.onFailure(TaskManager.java:185)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38)\n\tat org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66)\n\tat org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48)\n\tat org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1343)\n\tat org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:366)\n\tat org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:221)\n\tat org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:364)\n\tat org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:356)\n\tat org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:132)\n\tat org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:88)\n\tat org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:728)\n\tat org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:148)\n\tat org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:120)\n\tat org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:85)\n\tat org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:70)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1374)\n\tat io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1237)\n\tat io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1286)\n\tat io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)\n\tat io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\nCaused by: java.lang.IllegalStateException: Timed out when waiting for persistent tasks after 30s\n\tat org.elasticsearch.persistent.PersistentTasksService$2.onTimeout(PersistentTasksService.java:224)\n\tat org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345)\n\tat org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263)\n\tat org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:651)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.lang.Thread.run(Thread.java:833)\n"}],"type":"illegal_state_exception","reason":"Timed out when waiting for persistent tasks after 30s","stack_trace":"java.lang.IllegalStateException: Timed out when waiting for persistent tasks after 30s\n\tat org.elasticsearch.persistent.PersistentTasksService$2.onTimeout(PersistentTasksService.java:224)\n\tat org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345)\n\tat org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263)\n\tat org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:651)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.lang.Thread.run(Thread.java:833)\n"},"status":500}]

  at __randomizedtesting.SeedInfo.seed([1981B74F2FB431B8:91D5889581485C40]:0)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:489)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:462)
  at jdk.internal.reflect.GeneratedMethodAccessor22.invoke(null:-1)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:833)

  Caused by: java.lang.AssertionError: expected [2xx] status code but api [ml.stop_data_frame_analytics] returned [500 Internal Server Error] [{"error":{"root_cause":[{"type":"illegal_state_exception","reason":"Timed out when waiting for persistent tasks after 30s","stack_trace":"org.elasticsearch.ElasticsearchException$1: Timed out when waiting for persistent tasks after 30s\n\tat org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:638)\n\tat org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:566)\n\tat org.elasticsearch.rest.BytesRestResponse.build(BytesRestResponse.java:140)\n\tat org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:101)\n\tat org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:81)\n\tat org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:55)\n\tat org.elasticsearch.client.node.NodeClient.lambda$executeLocally$1(NodeClient.java:114)\n\tat org.elasticsearch.tasks.TaskManager$1.onFailure(TaskManager.java:185)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38)\n\tat org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66)\n\tat org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48)\n\tat org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1343)\n\tat org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:366)\n\tat org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:221)\n\tat org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:364)\n\tat org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:356)\n\tat org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:132)\n\tat org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:88)\n\tat org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:728)\n\tat org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:148)\n\tat org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:120)\n\tat org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:85)\n\tat org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:70)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1374)\n\tat io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1237)\n\tat io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1286)\n\tat io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)\n\tat io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\nCaused by: java.lang.IllegalStateException: Timed out when waiting for persistent tasks after 30s\n\tat org.elasticsearch.persistent.PersistentTasksService$2.onTimeout(PersistentTasksService.java:224)\n\tat org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345)\n\tat org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263)\n\tat org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:651)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.lang.Thread.run(Thread.java:833)\n"}],"type":"illegal_state_exception","reason":"Timed out when waiting for persistent tasks after 30s","stack_trace":"java.lang.IllegalStateException: Timed out when waiting for persistent tasks after 30s\n\tat org.elasticsearch.persistent.PersistentTasksService$2.onTimeout(PersistentTasksService.java:224)\n\tat org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345)\n\tat org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263)\n\tat org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:651)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.lang.Thread.run(Thread.java:833)\n"},"status":500}]

    at org.junit.Assert.fail(Assert.java:88)
    at org.elasticsearch.test.rest.yaml.section.DoSection.execute(DoSection.java:374)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:478)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:462)
    at jdk.internal.reflect.GeneratedMethodAccessor22.invoke(null:-1)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:568)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
    at java.lang.Thread.run(Thread.java:833)

@tvernum tvernum added :ml Machine learning >test-failure Triaged test failures from CI labels Feb 7, 2022
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Feb 7, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@tvernum
Copy link
Contributor Author

tvernum commented Feb 7, 2022

I've been trying with this reproduction:

./gradlew ':x-pack:qa:rolling-upgrade:v8.0.0#bwcTest' -Dtests.class="org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT" -Dtests.method="test {p0=mixed_cluster/90_ml_data_frame_analytics_crud/Put an outlier_detection job on the mixed cluster}

It fails all the way backto b610aee, so I suspect the triggering commit is actually on 8.0 not master. Actually given potential serialization changes in that timeframe, the fact that it fails doesn't imply much.

@tvernum
Copy link
Contributor Author

tvernum commented Feb 7, 2022

Actually, this reproduction seems to pass:

./gradlew ':x-pack:qa:rolling-upgrade:v8.0.0#bwcTest' -Dtests.class="org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT"

so it may just be that the particular test method can't be run in isolation (hence the reproduction failures) and failed for an unrelated reason in CI

@droberts195
Copy link
Contributor

the particular test method can't be run in isolation (hence the reproduction failures) and failed for an unrelated reason in CI

Yes, it's definitely the case that the mixed cluster tests in the rolling upgrade tests won't work unless the old cluster tests previously succeeded. This has been a fatal flaw in the repro lines for rolling upgrade tests since the dawn of time.

We'll need to look at the server-side logs for the CI run where it failed as part of the full run.

@droberts195
Copy link
Contributor

droberts195 commented Feb 7, 2022

The server-side error is this:

[2022-02-07T02:10:54,843][ERROR][o.e.x.s.a.AuthenticatorChain] [v8.0.0-2] caught exception while trying to read authentication from request [transport request action [indices:admin/get]]
java.io.EOFException: tried to read: 10 bytes but only 1 remaining
        at org.elasticsearch.common.io.stream.ByteBufferStreamInput.ensureCanReadBytes(ByteBufferStreamInput.java:117) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.StreamInput.readArraySize(StreamInput.java:1242) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:421) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.StreamInput.readGenericValue(StreamInput.java:694) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.StreamInput.readMap(StreamInput.java:663) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.xpack.core.security.authc.Authentication.<init>(Authentication.java:81) ~[x-pack-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.xpack.core.security.authc.support.AuthenticationContextSerializer.decode(AuthenticationContextSerializer.java:64) ~[x-pack-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.xpack.core.security.authc.support.AuthenticationContextSerializer.deserializeHeaderAndPutInContext(AuthenticationContextSerializer.java:54) ~[x-pack-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.xpack.core.security.authc.support.AuthenticationContextSerializer.readFromContext(AuthenticationContextSerializer.java:48) ~[x-pack-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authc.AuthenticatorChain.lookForExistingAuthentication(AuthenticatorChain.java:245) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authc.AuthenticatorChain.authenticateAsync(AuthenticatorChain.java:86) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:171) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.applyInternal(SecurityActionFilter.java:159) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:118) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:77) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:54) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.tasks.TaskManager.registerAndExecute(TaskManager.java:170) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:105) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:83) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:380) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:57) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.client.ParentTaskAssigningClient.doExecute(ParentTaskAssigningClient.java:55) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:380) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]

It suggests there is a problem with the BWC compatibility of the new functionality added in #82639.

@albertzaharovits please could you have a look. The PR that changed this was merged on Friday. The problem is with serializing Authentication objects in a mixed 8.0.0/8.2.0 cluster. An 8.2.0 node is sending more data than an 8.0.0 node expects.

@droberts195 droberts195 added the :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) label Feb 7, 2022
@elasticmachine elasticmachine added the Team:Security Meta label for security team label Feb 7, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@valeriy42
Copy link
Contributor

Another failure here https://gradle-enterprise.elastic.co/s/pegxn2ph5lkwe

@ywangd
Copy link
Member

ywangd commented Feb 7, 2022

(I tried to reproduce but my internet is being funny right now and building ES is too much of a challenge)
Without looking into the details, I think this happened because the Authorization header was serialised as a task header in a newer node (8.2) and then a older node (8.0) tried to deserialised it and failed due to unknown fields.

It's a classical BWC issue. But it is trickier for Authentication object because it not only transfer across wire but also write/read via index. The write/read via index is more of an issue because it does not have an explicit output version (unlike wire transfer). We used to have a roadmap item for it but I cannot find it anymore (something along the line of BWC for serialised authentication header).

The root problem existed way before #82639. We were lucky it didn't manifest itself because we never change anything* to be version specific till #82639. I will work with Albert for a solution.

*We did change API key descriptor format in the authentication metadata. I suspect it could also cause similar issue. But we probably don't have test coverage for it.

@ywangd ywangd self-assigned this Feb 7, 2022
@joegallo
Copy link
Contributor

joegallo commented Feb 8, 2022

@danhermann
Copy link
Contributor

@mark-vieira
Copy link
Contributor

mark-vieira commented Feb 14, 2022

Should we consider muting if we don't have a fix lined up? this is failing at least a handful of times a day. I hit this today in back to back PR checks.

@ywangd
Copy link
Member

ywangd commented Feb 14, 2022

@mark-vieira I just raised a draft PR #83913 that should fix this failure. I'll get it ready for review today and hopefully it can be merged in a reasonable timeframe.

Btw, I am having trouble to reproduce this locally because the reproducation line is for twoThirdsUpgradedTest which cannot run on its own because it expects some setup from the oldCluster. But the oldClusterTest is skipped using the reproduction line (my build scan). Could you please advise on how to get it run properly locally?

@mark-vieira
Copy link
Contributor

Just omit the -Dtests.method system property. The issue here is that some of these test suites ahve inter dependencies between individual test methods so the safest way to reproduce is to run the full suite.

@davidkyle
Copy link
Member

Btw, I am having trouble to reproduce this locally because the reproducation line is for twoThirdsUpgradedTest

Swap twoThirdsUpgradedTest for bwcTest

./gradlew ':x-pack:qa:rolling-upgrade:v8.0.0#bwcTest'

@mark-vieira
Copy link
Contributor

Swap twoThirdsUpgradedTest for bwcTest

This isn't actually the problem. Running twoThirdsUpgradedTest will run the old cluster first. That works as intended. The problem is that some of these yaml tests cannot be run in complete isolation. Other tests in the same suite do setup that is required for subsequent tests.

@pugnascotia
Copy link
Contributor

ywangd added a commit that referenced this issue Feb 17, 2022
Authentication headers are persisted as part of a task definition including ML
jobs, CCR following etc. The persistence process store them into either an
index or the cluster state. In both cases, the headers are retrieved from
ThreadContext as a string which is the serialised form of the Authentication
object. This string is always serialised with the node's version.

The problem is: In a mixed cluster, the task can be created in a newer node and
persisted into an index but then needs to be loaded by a older node. The older
node does not understand the newer format of the serialised Authentication
object and hence error out on reading it.

This PR adds additional logic in places where the headers are persisted. It
compares the Authentication version with minNodeVersion and rewrites it if the
minNodeVersion is older. Since we already filter security headers in places
where headers are persisted, the new logic is hooked into the same places and
essentially another enhancement on how to handle security headers for persisted
tasks.

Resolves: #83567
probakowski pushed a commit to probakowski/elasticsearch that referenced this issue Feb 23, 2022
Authentication headers are persisted as part of a task definition including ML
jobs, CCR following etc. The persistence process store them into either an
index or the cluster state. In both cases, the headers are retrieved from
ThreadContext as a string which is the serialised form of the Authentication
object. This string is always serialised with the node's version.

The problem is: In a mixed cluster, the task can be created in a newer node and
persisted into an index but then needs to be loaded by a older node. The older
node does not understand the newer format of the serialised Authentication
object and hence error out on reading it.

This PR adds additional logic in places where the headers are persisted. It
compares the Authentication version with minNodeVersion and rewrites it if the
minNodeVersion is older. Since we already filter security headers in places
where headers are persisted, the new logic is hooked into the same places and
essentially another enhancement on how to handle security headers for persisted
tasks.

Resolves: elastic#83567
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) Team:ML Meta label for the ML team Team:Security Meta label for security team >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.