Multi-index expressions expand to missing indices with Security #47159

Open
albertzaharovits opened this issue Sep 26, 2019 · 3 comments
Labels
>bug :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC Team:Security Meta label for security team

Comments

@albertzaharovits
Contributor

The ES Security plugin performs wildcard expansion in IndicesAndAliasesResolver on the request's coordinating node. In doing so, it rewrites the request so that it contains only concrete indices and no wildcards. There are a few known problems with this approach; for example, see #45171 (comment).

This issue acknowledges another limitation of this approach: the wildcard expansion and the actual handling of the request for the expanded concrete indices can happen on different cluster state versions. For example, a wildcard expression may be expanded to an index that is then removed before the request is actually handled. This produces an index missing exception (assuming ignore_unavailable=false). The same result is not possible with Security turned off, because wildcard expansion and request handling then work on the same cluster state version.
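
The race described above can be illustrated with a toy model. This is only a sketch, not Elasticsearch code: `expand`, `handle`, `IndexNotFound`, and the index names are hypothetical stand-ins, and a cluster state version is modelled as nothing more than a frozen set of index names.

```python
import fnmatch


class IndexNotFound(Exception):
    """Hypothetical stand-in for index_not_found_exception."""


def expand(expression, cluster_state):
    """Expand a wildcard expression against one cluster state snapshot."""
    return sorted(n for n in cluster_state if fnmatch.fnmatchcase(n, expression))


def handle(concrete_indices, cluster_state, ignore_unavailable=False):
    """Handle a request for concrete index names against one snapshot."""
    found = []
    for name in concrete_indices:
        if name in cluster_state:
            found.append(name)
        elif not ignore_unavailable:
            raise IndexNotFound(name)
    return found


# Two cluster state versions; "logs-2" is deleted between them.
state_v1 = frozenset({"logs-1", "logs-2", "metrics-1"})
state_v2 = state_v1 - {"logs-2"}

# Security on (as modelled here): expansion sees v1, handling sees v2.
rewritten = expand("logs-*", state_v1)
try:
    handle(rewritten, state_v2)
    outcome = "ok"
except IndexNotFound as exc:
    outcome = f"index_not_found: {exc}"

# Security off (as modelled here): expansion and handling share one version.
same_version = handle(expand("logs-*", state_v2), state_v2)

# With ignore_unavailable=True the stale name is skipped instead of failing.
lenient = handle(rewritten, state_v2, ignore_unavailable=True)

print(outcome, same_version, lenient)
```

In this model the failure only appears when the two steps read different snapshots, which matches the observation that it cannot occur when expansion and handling use the same cluster state version.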

Causes #45652

@albertzaharovits albertzaharovits added >bug :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC labels Sep 26, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-security

droberts195 added a commit that referenced this issue Sep 26, 2019
When the ML native multi-node tests use _cat/indices/_all
and the request goes to a non-master node, _all is
translated to a list of concrete indices by the authz layer
on the coordinating node before the request is forwarded
to the master node. Then it is possible for the master
node to return an index_not_found_exception if one of
the concrete indices that was expanded on the
coordinating node has been deleted in the meantime.
(#47159 has been opened to track the underlying problem.)

It has been observed that the index that gets deleted when
the problem affects the ML native multi-node tests is
always the ML notifications index. The tests that fail are
only interested in the presence or absence of ML results
indices. Therefore the workaround is to only _cat indices
that match the ML results index pattern.

Fixes #45652
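
As a rough illustration of why the workaround in the commit message helps, the sketch below compares expanding `_all` with expanding a narrower pattern when one index disappears between expansion and handling. The index names and the `.ml-anomalies-*` pattern are assumptions chosen for illustration, and `expand` and `missing` are hypothetical helpers, not part of any Elasticsearch API.

```python
import fnmatch

# Assumed, illustrative names for the ML indices involved.
NOTIFICATIONS = ".ml-notifications"
RESULTS_PATTERN = ".ml-anomalies-*"


def expand(expression, index_names):
    """Expand _all or a wildcard expression to concrete index names."""
    if expression == "_all":
        return sorted(index_names)
    return sorted(n for n in index_names if fnmatch.fnmatchcase(n, expression))


# Index names seen at expansion time (coordinating node) and at handling time
# (master node); the notifications index is deleted in between.
at_expansion = {".ml-anomalies-shared", NOTIFICATIONS, ".ml-state"}
at_handling = at_expansion - {NOTIFICATIONS}


def missing(expression):
    """Names produced by the expansion that no longer exist at handling time."""
    return [n for n in expand(expression, at_expansion) if n not in at_handling]


missing_for_all = missing("_all")
missing_for_results = missing(RESULTS_PATTERN)

print(missing_for_all, missing_for_results)
```

Because the narrower pattern never expands to the index that gets deleted, the stale expansion cannot name a missing index in this scenario. The underlying race is still present for any index the pattern does match.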
@droberts195
Contributor

The code I used to investigate this problem is on commit droberts195@3fe500d of the cat_indices_404_repro branch in my fork.

Run ./runner.sh >& runner.log from the top level of the repo. Eventually the problem will occur, and you can then open x-pack/plugin/ml/qa/native-multi-node-tests/build/testclusters/*/logs/integTest.log to examine the server logs.

(You probably think I'm crazy to not just run ./gradlew :x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner --tests org.elasticsearch.xpack.ml.integration.MlJobIT -Dtests.iters=100. I did try this, and never managed to reproduce the problem with it. This leads me to think that once hotspot has optimised the code the problem doesn't occur. I also tried a while loop in the shell script and that didn't reproduce the problem either, hence the many duplicated lines. I cannot think of a good explanation for why copy and paste was more successful than a while loop at the shell script level. I guess the problem is just extremely intermittent and sensitive to timing.)

@rjernst rjernst added the Team:Security Meta label for security team label May 4, 2020
@droberts195
Contributor

#81901 contains another way to reproduce this problem.
