-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI] UpgradeClusterClientYamlTestSuiteIT test {p0=mixed_cluster/90_ml_data_frame_analytics_crud/Put an outlier_detection job on the mixed cluster} failing #83567
Comments
Pinging @elastic/ml-core (Team:ML) |
I've been trying with this reproduction:
|
Actually, this reproduction seems to pass:
so it may just be that the particular test method can't be run in isolation (hence the reproduction failures) and failed for an unrelated reason in CI |
Yes, it's definitely the case that the mixed cluster tests in the rolling upgrade tests won't work unless the old cluster tests previously succeeded. This has been a fatal flaw in the repro lines for rolling upgrade tests since the dawn of time. We'll need to look at the server-side logs for the CI run where it failed as part of the full run. |
The server-side error is this:
It suggests there is a problem with the BWC compatibility of the new functionality added in #82639. @albertzaharovits please could you have a look. The PR that changed this was merged on Friday. The problem is with serializing |
Pinging @elastic/es-security (Team:Security) |
Another failure here https://gradle-enterprise.elastic.co/s/pegxn2ph5lkwe |
(I tried to reproduce but my internet is being funny right now and building ES is too much of a challenge) It's a classical BWC issue. But it is trickier for The root problem existed way before #82639. We were lucky it didn't manifest itself because we never change anything* to be version specific till #82639. I will work with Albert for a solution. *We did change API key descriptor format in the authentication metadata. I suspect it could also cause similar issue. But we probably don't have test coverage for it. |
Just bit me on https://gradle-enterprise.elastic.co/s/whsnjwho26jhm |
Should we consider muting if we don't have a fix lined up? this is failing at least a handful of times a day. I hit this today in back to back PR checks. |
@mark-vieira I just raised a draft PR #83913 that should fix this failure. I'll get it ready for review today and hopefully it can be merged in a reasonable timeframe. Btw, I am having trouble to reproduce this locally because the reproducation line is for |
Just omit the |
Swap
|
This isn't actually the problem. Running |
Another failure: https://gradle-enterprise.elastic.co/s/7rom26ojihdos |
Authentication headers are persisted as part of a task definition including ML jobs, CCR following etc. The persistence process store them into either an index or the cluster state. In both cases, the headers are retrieved from ThreadContext as a string which is the serialised form of the Authentication object. This string is always serialised with the node's version. The problem is: In a mixed cluster, the task can be created in a newer node and persisted into an index but then needs to be loaded by a older node. The older node does not understand the newer format of the serialised Authentication object and hence error out on reading it. This PR adds additional logic in places where the headers are persisted. It compares the Authentication version with minNodeVersion and rewrites it if the minNodeVersion is older. Since we already filter security headers in places where headers are persisted, the new logic is hooked into the same places and essentially another enhancement on how to handle security headers for persisted tasks. Resolves: #83567
Authentication headers are persisted as part of a task definition including ML jobs, CCR following etc. The persistence process store them into either an index or the cluster state. In both cases, the headers are retrieved from ThreadContext as a string which is the serialised form of the Authentication object. This string is always serialised with the node's version. The problem is: In a mixed cluster, the task can be created in a newer node and persisted into an index but then needs to be loaded by a older node. The older node does not understand the newer format of the serialised Authentication object and hence error out on reading it. This PR adds additional logic in places where the headers are persisted. It compares the Authentication version with minNodeVersion and rewrites it if the minNodeVersion is older. Since we already filter security headers in places where headers are persisted, the new logic is hooked into the same places and essentially another enhancement on how to handle security headers for persisted tasks. Resolves: elastic#83567
Build scan:
https://gradle-enterprise.elastic.co/s/oyckzy4coyvbm/tests/:x-pack:qa:rolling-upgrade:v8.0.0%23twoThirdsUpgradedTest/org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT/test%20%7Bp0=mixed_cluster%2F90_ml_data_frame_analytics_crud%2FPut%20an%20outlier_detection%20job%20on%20the%20mixed%20cluster%7D
Reproduction line:
./gradlew ':x-pack:qa:rolling-upgrade:v8.0.0#twoThirdsUpgradedTest' -Dtests.class="org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT" -Dtests.method="test {p0=mixed_cluster/90_ml_data_frame_analytics_crud/Put an outlier_detection job on the mixed cluster}" -Dtests.seed=1981B74F2FB431B8 -Dtests.bwc=true -Dtests.locale=es-ES -Dtests.timezone=Australia/Eucla -Druntime.java=17
Applicable branches:
master
Reproduces locally?:
Yes
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT&tests.test=test%20%7Bp0%3Dmixed_cluster/90_ml_data_frame_analytics_crud/Put%20an%20outlier_detection%20job%20on%20the%20mixed%20cluster%7D
Failure excerpt:
The text was updated successfully, but these errors were encountered: