Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Term query on _index field does not take aliases into account #23306

Closed
uschindler opened this issue Feb 22, 2017 · 27 comments
Closed

Term query on _index field does not take aliases into account #23306

uschindler opened this issue Feb 22, 2017 · 27 comments
Assignees
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories

Comments

@uschindler
Copy link
Contributor

While trying to remove indicesQuery (deprecated by #17710) in my own code base, I noticed the following problem: The indicesQuery supports to pass any name of an index to the query, so also aliases work (of course it does not work with filtered aliases, doesn't it? - but that does not matter for the problem).

In my case I needed a query that returns results of a query for a specific index only (only alias name known, not its real name) and a matchAll query for the other indexes. Setup was:

  • alias named "allMyStuff" with 3 indexes. Each of the three indexes have an internal, timestamped name, but there also exists an alias pointing to the most recent index. The client code only knows the alias, not the real index name.
  • query executes on "allMyStaff" alias and filters the indices to query using indicesQuery.

This works with indicesQuery, specifiying the alias to only run the specific query on and return matchAll for all the others. With a term query on "_index" field, you cannot use the alias anymore. It only return documents if the index matches completely, so termQuery("_index", "alias") does not match anything.

So I don't agree with deprecating and removing indicesQuery unless this can be fixed.

@clintongormley
Copy link
Contributor

Good point @uschindler - thanks

@uschindler
Copy link
Contributor Author

I think the fix would be quite easy by using the same matching algorithm like indicesQuery does on the doToQuery() method:

@Override
protected Query doToQuery(QueryShardContext context) throws IOException {
    if (context.matchesIndices(indices)) {
        return innerQuery.toQuery(context);
    }
    return noMatchQuery.toQuery(context);
}

I think, IndexFieldType should do something similar, here the example for termQuery method (similar in the other methods like termsQuery):

@Override
public Query termQuery(Object value, @Nullable QueryShardContext context) {
    if (context.matchesIndices(new String[] { value.toString() /*or some other conversion of the term*/))) {
        return Queries.newMatchAllQuery();
    } else {
        return Queries.newMatchNoDocsQuery("Index didn't match. Index queried: " + context.index().getName() + " vs. " + value);
    }
}

I can try to provide a PR about this!

@uschindler
Copy link
Contributor Author

uschindler commented Feb 22, 2017

There is still the question what to do with "filtered" aliases. Those may already broken with indicesQuery (not verified - if the filter of alias is applied afterwards, it would work). The "correct" way would be to apply the filter instead of matchAllDocs.

@uschindler
Copy link
Contributor Author

...not so easy to fix because context.matchesIndices disappeared in master branch.

@javanna
Copy link
Member

javanna commented Mar 20, 2017

hey @uschindler you raise a good point which we have overlooked. The indices query bugged us for quite a while and we weren't sure how many people were using it. We need to fix the alias problem, but that doesn't require restoring the indices query. The fact that you'd need the matchesIndices method back for your fix means that part of the code removed should be restored, see 103984a . The method was removed as it was left unused after the indices query removal.

Let us discuss first though whether we come up with different ideas to address this.

@uschindler
Copy link
Contributor Author

What was the problem with the indices query? Quite simple and straight-forward!

@sjmc7
Copy link

sjmc7 commented May 18, 2017

We were also using the indices query with aliases so would be good to see this fixed.

@carrino
Copy link

carrino commented Oct 20, 2017

+1 Was using indices query with aliases, also. Ran into this when migrating to 6.0.0-rc1

@carrino
Copy link

carrino commented Oct 20, 2017

In addition to the alias problem even if you pass all the indexes the alias could resolve to, I found another issue.

Before, the IndicesQuery would fully encapsulate the inner query and it would not get parsed on an index of the wrong type. This means you could do a query one index with some fields and another index with another set of fields wrapping each in an IndicesQuery.

If you convert this to a boolean with must clauses, all the boolean query is parsed even if the must index clause would fail. This parsing can lead to number_format_exceptions.

@uschindler
Copy link
Contributor Author

Hi @carrino: I agree with you. I had the same problem. I don't fully understand why the Elasticsearch team deprecated the indices query. It is very useful and a "term query on the index field" is by far no replacement!!!

I'd suggest to unddo the deprecation and keep the query as is. Otherwise I have to write a plugin that restores the query, which is way stupid. The whole class is only a few lines of code and brings so much phantastic usage opportunities. And there is no reason to remove it from the perspective of the backend.

@carrino
Copy link

carrino commented Oct 25, 2017

I was able to work around this limitation by changing bool query to short circuit if one of it's filter or must clauses finds a MatchNoDocsQuery. I don't see any other query types that can be used to encapsulate a query on an index besides boolean being modified in this way. Any other ideas?

diff --git a/core/src/main/java/org/elasticsearch/index/query/BoolQueryBuilder.java b/core/src/main/java/org/elasticsearch/index/qu
index ac57c2a..11c714e 100644
--- a/core/src/main/java/org/elasticsearch/index/query/BoolQueryBuilder.java
+++ b/core/src/main/java/org/elasticsearch/index/query/BoolQueryBuilder.java
@@ -23,6 +23,7 @@ import org.apache.lucene.search.BooleanClause;
 import org.apache.lucene.search.BooleanClause.Occur;
 import org.apache.lucene.search.BooleanQuery;
 import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.search.MatchNoDocsQuery;
 import org.apache.lucene.search.Query;
 import org.elasticsearch.Version;
 import org.elasticsearch.common.ParseField;
@@ -376,10 +377,16 @@ public class BoolQueryBuilder extends AbstractQueryBuilder<BoolQueryBuilder> {
     @Override
     protected Query doToQuery(QueryShardContext context) throws IOException {
         BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder();
-        addBooleanClauses(context, booleanQueryBuilder, mustClauses, BooleanClause.Occur.MUST);
+        if (!addBooleanClauses(context, booleanQueryBuilder, filterClauses, BooleanClause.Occur.FILTER)) {
+            return new MatchNoDocsQuery();
+        }
+
+        if (!addBooleanClauses(context, booleanQueryBuilder, mustClauses, BooleanClause.Occur.MUST)) {
+            return new MatchNoDocsQuery();
+        }
+
         addBooleanClauses(context, booleanQueryBuilder, mustNotClauses, BooleanClause.Occur.MUST_NOT);
         addBooleanClauses(context, booleanQueryBuilder, shouldClauses, BooleanClause.Occur.SHOULD);
-        addBooleanClauses(context, booleanQueryBuilder, filterClauses, BooleanClause.Occur.FILTER);
         BooleanQuery booleanQuery = booleanQueryBuilder.build();
         if (booleanQuery.clauses().isEmpty()) {
             return new MatchAllDocsQuery();
@@ -395,7 +402,7 @@ public class BoolQueryBuilder extends AbstractQueryBuilder<BoolQueryBuilder> {
         return adjustPureNegative ? fixNegativeQueryIfNeeded(query) : query;
     }
 
-    private static void addBooleanClauses(QueryShardContext context, BooleanQuery.Builder booleanQueryBuilder,
+    private static boolean addBooleanClauses(QueryShardContext context, BooleanQuery.Builder booleanQueryBuilder,
                                           List<QueryBuilder> clauses, Occur occurs) throws IOException {
         for (QueryBuilder query : clauses) {
             Query luceneQuery = null;
@@ -409,8 +416,12 @@ public class BoolQueryBuilder extends AbstractQueryBuilder<BoolQueryBuilder> {
                     luceneQuery = query.toFilter(context);
                     break;
             }
+            if ((occurs == Occur.FILTER || occurs == Occur.MUST) && luceneQuery.equals(new MatchNoDocsQuery())) {
+                return false;
+            }
             booleanQueryBuilder.add(new BooleanClause(luceneQuery, occurs));
         }
+        return true;
     }
 
     @Override

@javanna
Copy link
Member

javanna commented Oct 27, 2017

We finally discussed this, sorry @uschindler it took us a while to get to this! We want to fix this, but instead of restoring the removed indices query, we would prefer to try and make term query against _index work with aliases.

@carrino
Copy link

carrino commented Oct 27, 2017

Did you talk about the query parse encapsulation aspect at all? Moving from indicesQuery to bool query with termQuery can result in parse errors. Do we want to make the booleanQueryBuilder change above to prevent this regression?

@javanna
Copy link
Member

javanna commented Oct 27, 2017

We have yet to look at implementation details, we need to try it out in practice and see what comes out code-wise. Stay tuned.

@uschindler
Copy link
Contributor Author

I agree with @carrino - the good thing with the indices query was the fact that the query part that was not used for the current shard/index was not parsed at all. It was therefor possible to query 2 completely different indexes with incompatible schemas and have a separate inner query on both sides of the indices query. With the current "bool" approach, the full query is parsed for all shards and those shards that are from an incompatible index wil fail if you hit a query with field name checking (which is the default for most queries now).

In fact the indices query was something like an if/then/else statement in the query where you were able to pass a different query to an index based on the name. With a bool query that's not possible, as the parsing stage is done before execution.

@acarstoiu
Copy link

@uschindler, @carrino: As a workaround, perform the index filtering in the URL path and apply the index-specific queries ANDed with a term filter on _type. Be aware that has_parent/has_child predicates need to be run with ignore_unmapped set.
I know that this won't work in all cases, but you have a good chance. This solves also the matter with filtered aliases.

@ developers: Thank you for keeping our minds sharp, but people use Elasticsearch precisely because they want to get rid of the workarounds and limitations they had when searching in traditional databases.

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
@carrino
Copy link

carrino commented Feb 26, 2018

Can we add the "regression" label to this also as it is a regression with es6 and is blocking upgrading?

@talevy talevy removed the :Search/Search Search-related issues that do not fall into other categories label Mar 26, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@talevy talevy added discuss :Search/Search Search-related issues that do not fall into other categories labels Mar 26, 2018
@tomcallahan
Copy link
Contributor

@talevy can you elaborate why this was labeled discuss?

@talevy
Copy link
Contributor

talevy commented Jul 9, 2018

@tomcallahan for sure. this latest comment #23306 (comment) didn't make it clear that we have a known plan on how to implement this, and so during our issue cleanup I thought it was worth bringing back up to make sure there is a plan. No other reason. If it is clear that we want to do this, and there is no discussion necessary, and only implementation details, then I'm happy to remove the label

@tomcallahan
Copy link
Contributor

OK, let's take off the label until we know we have something specific to examine

@barrucadu
Copy link

barrucadu commented Jun 4, 2019

Is there any plan to fix this?

I've just hit this issue migrating to ES6 and have come up with the solution of, when constructing a query, first making a request to ES to get the real name of an alias. An additional round-trip per query isn't great.

Our index names don't change often, so in principle we could cache them. But when they do we don't have a great way of signalling to the process that it needs to clear its cache, short of restarting it.

barrucadu added a commit to alphagov/search-api that referenced this issue Jun 4, 2019
The format migrator prevents users from getting duplicate results when
formats are migrated into the "govuk" index from the "government" or
"detailed" ones: it ensures that only non-migrated formats are
returned from the "government"/"detailed" indices and only migrated
formats are returned from the "govuk" index.

Previously it did this with an `indices` query.  They look like
this:

    { indices:
      { indices: ["list", "of", "indices"]
      , query: { query to perform against docs from the indices }
      , no_match_query: { query to perform against other docs }
      }
    }

However, the `indices` query has been deprecated in ES5 and removed in
ES6, in favour of just searching over the `_index` field in docs
directly.

This PR changes the `indices` query into a `should` with two clauses
where one clause matches documents in the "govuk" index and the other
clause matches documents in the other indices.  It's a bit more
cumbersome because the `terms` query I'm using doesn't handle index
aliases[1], and involves another round-trip to elasticsearch when
constructing the query to find out the real name of the "govuk" alias.

Index real names only change very infrequently (whenever we reindex
the cluster), so in principle they could be cached.  But we don't
currently have a good way for the reindexing rake task to signal to
the web processes that they need to purge their cache.

[1] elastic/elasticsearch#23306
@jtibshirani jtibshirani self-assigned this Aug 17, 2019
@jtibshirani
Copy link
Contributor

I have been looking into a fix for this. It is slightly more tricky to support now that fully-qualified index names such as cluster:index can be provided in cross cluster search. As a simple example, a wildcard query like a*b*c becomes difficult to interpret -- we could try to resolve a*b as the cluster name and c as an index alias, or a as the cluster name, b*c as the alias.

We support wildcards in qualified index names in the _search URL -- to me it seems we should follow the same strategy for handling wildcards, first proposed in #23893. When specifying a remote index name, we would require that the term contain a : to allow for distinguishing the cluster and index components. Another option would be to disallow issuing wildcard queries against the _index field. To me this doesn't seem like a good choice because wildcard support can be quite useful and was added in response to a customer request (#25722). Tagging @javanna and @jimczi for their thoughts on the above.

Finally, I haven’t yet dug into the query parsing issue @carrino raised, but will make sure to consider it while developing a fix.

@javanna
Copy link
Member

javanna commented Aug 22, 2019

@jtibshirani thanks for looking into this. I don't think CCS comes into play here, or maybe I don't follow why it may. We have prohibited using : in index and aliases names, hence if an index expression contains : it must be because it's referring to remote clusters. On the expansion of a*b*c I think that should only be expanded to local indices., while if you get something like *:a*b*c it refers to remote clusters matching the cluster expression at the left of :. Does this make sense to you?

@jtibshirani
Copy link
Contributor

jtibshirani commented Sep 11, 2019

@javanna we are in agreement, my explanation above was just confusing :) I opened #46640 to add support for aliases, and included a couple open questions in the PR description.

I am not sure of the best way forward in terms of the other discrepancy (where the indices query avoided parsing its subqueries unless they matched the current index). I'll mark this for discussion internally and follow-up with another comment or PR.

@jtibshirani
Copy link
Contributor

I looked into the parsing issue and do not see a good way to address it with the current API. It has been quite some time since we deprecated and removed the indices query, so we are not in a good position to reverse that decision.

It would be very helpful to collect up-to-date information about the remaining problems that were cited (query parsing and filtered aliases). If anyone is still running into problems related to the indices query removal, please open a new issue -- this will let us dig into those problems directly and get a sense of their priority.

I am going to close the issue, since we merged #46640 to address the original problem ("Term query on _index field does not take aliases into account"), but will keep a lookout for new related issues!

@uschindler
Copy link
Contributor Author

Hi @jtibshirani,

many thanks for fixing this! I agree for the discussion of other problems (like not evaluating the query trees that are not part of the currently seen index) to have separate issues.

I think with the current approach, filtered aliases should work (I have not tested them yet), as their filter is always applied as a filter on the whole query, so you would not see documents filtered away. With indices query it was more problematic regarding parsing of query parts.

laurentqro pushed a commit to alphagov/search-api that referenced this issue Dec 19, 2019
The format migrator prevents users from getting duplicate results when
formats are migrated into the "govuk" index from the "government" or
"detailed" ones: it ensures that only non-migrated formats are
returned from the "government"/"detailed" indices and only migrated
formats are returned from the "govuk" index.

Previously it did this with an `indices` query.  They look like
this:

    { indices:
      { indices: ["list", "of", "indices"]
      , query: { query to perform against docs from the indices }
      , no_match_query: { query to perform against other docs }
      }
    }

However, the `indices` query has been deprecated in ES5 and removed in
ES6, in favour of just searching over the `_index` field in docs
directly.

This PR changes the `indices` query into a `should` with two clauses
where one clause matches documents in the "govuk" index and the other
clause matches documents in the other indices.  It's a bit more
cumbersome because the `terms` query I'm using doesn't handle index
aliases[1], and involves another round-trip to elasticsearch when
constructing the query to find out the real name of the "govuk" alias.

Index real names only change very infrequently (whenever we reindex
the cluster), so in principle they could be cached.  But we don't
currently have a good way for the reindexing rake task to signal to
the web processes that they need to purge their cache.

[1] elastic/elasticsearch#23306
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories
Projects
None yet
Development

No branches or pull requests