[CI] MlJobIT.testDeleteJob and testDeleteJobAsync fail with index_not_found_exception #45652
Pinging @elastic/ml-core
The same test failure happened again on master today. Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/241/console
A change was merged recently that significantly altered the relevant internal logic.
A similar-looking issue occurred today (although in a slightly different place).
@tlrx I am not sure if this indicates a larger issue or not. It does not make sense to me that `_cat/indices` can fail with an `index_not_found_exception`.
@benwtrent Can you please provide a reproducing scenario with cat/indices that returns a 404?
@tlrx the scenario is these tests; they have failed with this error.
The best I can figure out is this:
I am not sure how to reliably re-create an internal race condition.
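The suspected race can be sketched in miniature. This is a hypothetical simulation, not Elasticsearch code (all class and function names here are invented): a wildcard is expanded against one snapshot of the cluster state, and the resulting concrete names are validated against a newer snapshot in which one index has since been deleted.

```python
# Hypothetical sketch of a time-of-check/time-of-use race between wildcard
# expansion and index deletion. Names are invented for illustration.

class ClusterState:
    def __init__(self, indices):
        self.indices = set(indices)

def expand_wildcard(state, expression):
    # "_all" is resolved to concrete names against a snapshot of the state.
    if expression == "_all":
        return sorted(state.indices)
    return [expression]

def execute_cat_indices(state, concrete_indices):
    # The action later validates the concrete names against a *newer* state.
    for index in concrete_indices:
        if index not in state.indices:
            raise LookupError(f"index_not_found_exception: {index}")
    return concrete_indices

# The state at expansion time includes the notifications index...
state_at_expansion = ClusterState({".ml-anomalies-shared", ".ml-notifications"})
concrete = expand_wildcard(state_at_expansion, "_all")

# ...but it is deleted before the forwarded request executes.
state_at_execution = ClusterState({".ml-anomalies-shared"})
try:
    execute_cat_indices(state_at_execution, concrete)
except LookupError as e:
    print(e)  # index_not_found_exception: .ml-notifications
```

Because the deletion can happen at any point between the two steps, a test that merely runs the same requests repeatedly rarely hits the window, which is consistent with the difficulty of reproducing this.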
We've had a few instances of a similar-looking failure; for example: https://gradle-enterprise.elastic.co/s/i4jrzz5jxishg/tests/zcquf3hc3eoda-negkewcdisf2a
This logging is to help debug elastic#45652 and will be removed once the cause is known.
This failure is happening a lot now. It seems to me that the only way to make progress is to find out exactly where the exception is being thrown.
@droberts195 I am not sure the logging worked. This test failed due to an `index_not_found_exception` again, but the extra debug did not appear.
Interesting. This shows that the exception is thrown at the very first step. I think the problem is then almost certainly due to the way the security plugin replaces wildcard expressions with concrete index names. So when the core code of the action runs, one of those concrete indices may no longer exist.
Relates elastic#45652 The most recent failure suggests the exception is thrown earlier than previously assumed.
@droberts195 In principle that could happen. However, the code you're hinting at is not faulty IMO. It computes the superset of authorized indices that will be used to compute the wildcard expansion further on, but there is no race there because it uses the same cluster state throughout.
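The point being made is that within the authz step itself the authorized superset and the wildcard expansion come from the same snapshot, so they cannot disagree with each other. A minimal sketch (invented names, not the real Security code):

```python
# Sketch: both the authorized-indices superset and the wildcard expansion are
# computed from the SAME snapshot, so there is no race *within* this step.
import fnmatch

def authorize_and_expand(snapshot, expression, authorized_patterns):
    # Superset of authorized indices, computed from the snapshot.
    authorized = {name for name in snapshot
                  if any(fnmatch.fnmatch(name, p) for p in authorized_patterns)}
    # Wildcard expansion, from the SAME snapshot, filtered by that superset.
    names = snapshot if expression == "_all" else [expression]
    return sorted(n for n in names if n in authorized)

snapshot = [".ml-anomalies-shared", ".ml-notifications", "other"]
print(authorize_and_expand(snapshot, "_all", [".ml-*"]))
# → ['.ml-anomalies-shared', '.ml-notifications']
```

The race, if there is one, must therefore be between this whole step and something that happens later against a different state.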
As there are daily failures of this, I've muted the test on master, 7.x, 7.4 and 7.3.
This also affects testDeleteJob (not only testDeleteJobAsync). I've updated the issue title to that effect and will be muting that test as well.
This debug was never logged, so although elastic#45652 is not yet fixed there is no point keeping it. The strange IndexNotFoundException comes entirely from the authz layer. Relates elastic#46739 Relates elastic#45652
The previous debug approach did not capture anything. Instead I added logging directly into the authz layer.
One more important thing about the stack trace in the previous comment is that it came from the master node, which was not the node that performed the expansion. So it looks like @albertzaharovits was correct with the idea that the wildcard expansion happens on a different node from where the exception is thrown.
I think the authz layer on the coordinating node is working from a different cluster state. I will try to get it to fail again with logging in that area.
I got more debug, but it actually suggests that the index that is not found had not been recently deleted but recently created:
Note:
It suggests to me that whatever source of information the authz layer is using to expand wildcards is out of date.
Yes, the state used by the action comes from:
Line 123 in 96b4f3d
and the state used by authz comes from:
Line 243 in 5761b0a
I imagine this is going to be non-trivial to fix. It's not an ML problem even though ML tests are the ones running into it. Is there a way to allow the authorization code to see the same cluster state that the action it's authorizing will see? /cc @elastic/es-security
Nice investigation @droberts195. However, when authorization happens on a different node, it appears that the cluster publication algorithm makes the new cluster state first available on the other nodes before revealing it on the master. I think this is the unlikelier sibling to the problem that the wildcard expansion happens on an outdated cluster state. I can't think of a fix given the way Security works atm. I will raise an issue, but the fix must be a pluggable wildcard resolution in core to which Security hooks in.
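The proposed direction, a pluggable wildcard resolution in core, could look roughly like this. This is a hypothetical sketch (the function, parameters, and filter are all invented, not the eventual API): core owns the resolution and applies a security-supplied filter at the point where the action's own index list is known, instead of Security replacing the expression earlier on a coordinating node.

```python
# Hypothetical sketch of pluggable wildcard resolution: core resolves the
# expression against the index list the action will actually use, and a
# security plugin contributes only a filter callback.
from typing import Callable, Iterable, List

def resolve(expression: str,
            current_indices: Iterable[str],
            authz_filter: Callable[[str], bool]) -> List[str]:
    names = list(current_indices) if expression == "_all" else [expression]
    # The filter runs at resolution time, against the same index list the
    # action sees, so expansion and execution cannot observe different states.
    return [n for n in names if authz_filter(n)]

print(resolve("_all",
              [".ml-anomalies-shared", "secret-index"],
              lambda n: not n.startswith("secret")))
# → ['.ml-anomalies-shared']
```

The key property is that unauthorized names are filtered out rather than the wildcard being rewritten into a concrete list ahead of time.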
When the ML native multi-node tests use _cat/indices/_all and the request goes to a non-master node, _all is translated to a list of concrete indices by the authz layer on the coordinating node before the request is forwarded to the master node. It is then possible for the master node to return an index_not_found_exception if one of the concrete indices that was expanded on the coordinating node has been deleted in the meantime. (#47159 has been opened to track the underlying problem.) It has been observed that the index that gets deleted when the problem affects the ML native multi-node tests is always the ML notifications index. The tests that fail are only interested in the presence or absence of ML results indices. Therefore the workaround is to only _cat indices that match the ML results index pattern. Fixes #45652
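The effect of the workaround can be sketched with simple pattern matching (the index names below are examples from this thread; the matching logic is an illustration, not the server's): requesting only indices that match the results pattern means the notifications index is never part of the expanded list, so its deletion cannot cause a 404.

```python
# Sketch: restricting the _cat request to a results-index pattern keeps the
# deletable notifications index out of the expansion entirely.
import fnmatch

cluster_indices = [".ml-anomalies-shared", ".ml-anomalies-custom-foo",
                   ".ml-notifications", "other-index"]

def cat_indices(pattern):
    # Stand-in for expanding an index expression server-side.
    return [i for i in cluster_indices if fnmatch.fnmatch(i, pattern)]

print(cat_indices(".ml-anomalies-*"))
# → ['.ml-anomalies-shared', '.ml-anomalies-custom-foo']
```

This sidesteps the race for these tests without fixing the underlying authz-layer problem, which is why #47159 remains open.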
Thanks @albertzaharovits. I've added more debug now and you are correct that the request is definitely being expanded to concrete indices on the coordinating node. I'll add a link to my custom debug branch to your new issue #47159. In a failure case that I observed today with the latest debug, the expansion clearly happened on the coordinating node while the "missing" index had only just been created on the master.
This is interesting because it shows that the master node created the "missing" index 0.2 seconds before the coordinating node expanded `_all`. In the meantime I think the underlying problem can be sidestepped in the ML tests that have been failing. I did this in #47160. (My repro branch with custom debug deliberately backs out this workaround to allow these tests to reveal the problem.)
This one comes from an intake job. Cannot reproduce it. Full log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob+fast+part2/793/consoleText