Term query on _index field does not take aliases into account #23306
Comments
Good point @uschindler - thanks |
I think the fix would be quite easy: use the same matching algorithm that indicesQuery uses on the

    @Override
    protected Query doToQuery(QueryShardContext context) throws IOException {
        if (context.matchesIndices(indices)) {
            return innerQuery.toQuery(context);
        }
        return noMatchQuery.toQuery(context);
    }

I think,

    @Override
    public Query termQuery(Object value, @Nullable QueryShardContext context) {
        // value.toString(), or some other conversion of the term
        if (context.matchesIndices(new String[] { value.toString() })) {
            return Queries.newMatchAllQuery();
        } else {
            return Queries.newMatchNoDocsQuery("Index didn't match. Index queried: " + context.index().getName() + " vs. " + value);
        }
    }

I can try to provide a PR about this! |
There is still the question of what to do with "filtered" aliases. Those may already be broken with indicesQuery (not verified; if the alias filter is applied afterwards, it would work). The "correct" way would be to apply the filter instead of matchAllDocs. |
...not so easy to fix because |
hey @uschindler you raise a good point which we have overlooked. The indices query bugged us for quite a while and we weren't sure how many people were using it. We need to fix the alias problem, but that doesn't require restoring the indices query. The fact that you'd need the Let us discuss first though whether we come up with different ideas to address this. |
What was the problem with the indices query? Quite simple and straight-forward! |
We were also using the indices query with aliases so would be good to see this fixed. |
+1 Was using indices query with aliases, also. Ran into this when migrating to 6.0.0-rc1 |
Besides the alias problem, I found another issue that remains even if you pass all the indexes the alias could resolve to. Before, the IndicesQuery would fully encapsulate the inner query, and it would not get parsed on an index of the wrong type. This means you could query one index with some fields and another index with another set of fields, wrapping each in an IndicesQuery. If you convert this to a boolean query with must clauses, the whole boolean query is parsed even if the must clause on the index would fail. This parsing can lead to number_format_exceptions. |
Hi @carrino: I agree with you. I had the same problem. I don't fully understand why the Elasticsearch team deprecated the indices query. It is very useful, and a "term query on the index field" is by no means a replacement!!! I'd suggest undoing the deprecation and keeping the query as is. Otherwise I have to write a plugin that restores the query, which is way stupid. The whole class is only a few lines of code and opens up so many fantastic usage opportunities. And there is no reason to remove it from the perspective of the backend. |
I was able to work around this limitation by changing the bool query to short-circuit if one of its filter or must clauses finds a MatchNoDocsQuery. Besides boolean, I don't see any other query type that could be modified in this way to encapsulate a query on an index. Any other ideas?
|
We finally discussed this, sorry @uschindler it took us a while to get to this! We want to fix this, but instead of restoring the removed indices query, we would prefer to try and make term query against |
Did you talk about the query parse encapsulation aspect at all? Moving from indicesQuery to bool query with termQuery can result in parse errors. Do we want to make the booleanQueryBuilder change above to prevent this regression? |
We have yet to look at implementation details, we need to try it out in practice and see what comes out code-wise. Stay tuned. |
I agree with @carrino - the good thing about the indices query was that the query part that was not used for the current shard/index was not parsed at all. It was therefore possible to query 2 completely different indexes with incompatible schemas and have a separate inner query on both sides of the indices query. With the current "bool" approach, the full query is parsed for all shards, and those shards that are from an incompatible index will fail if you hit a query with field name checking (which is the default for most queries now). In fact, the indices query was something like an if/then/else statement in the query, where you were able to pass a different query to an index based on its name. With a bool query that's not possible, as the parsing stage is done before execution. |
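The if/then/else behaviour described above can be illustrated with a small, self-contained Java sketch. It is only a toy model and not Elasticsearch code: `PerIndexQuerySelector`, `select`, and the use of `Supplier` to stand in for "parse this query branch" are all assumptions made for illustration. The point is that only the branch chosen for the current index is ever built, so a branch that would fail to parse on an incompatible index is never touched.

```java
import java.util.List;
import java.util.function.Supplier;

/** Toy model (hypothetical names) of the if/then/else semantics of the removed indices query. */
final class PerIndexQuerySelector {

    private PerIndexQuerySelector() {}

    /**
     * Picks the query for the current index. Each Supplier stands in for
     * "parse and build this branch"; only the selected one is invoked,
     * so the other branch is never parsed for this index.
     */
    static <Q> Q select(String currentIndex,
                        List<String> indices,
                        Supplier<Q> innerQuery,
                        Supplier<Q> noMatchQuery) {
        if (indices.contains(currentIndex)) {
            return innerQuery.get();
        }
        return noMatchQuery.get();
    }
}
```

A bool query with a term clause on `_index` behaves differently in this model: both suppliers would be invoked up front, which is where the parse errors mentioned in this thread would surface.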
@uschindler, @carrino: As a workaround, perform the index filtering in the URL path and apply the index-specific queries ANDed with a @ developers: Thank you for keeping our minds sharp, but people use Elasticsearch precisely because they want to get rid of the workarounds and limitations they had when searching in traditional databases. |
Can we add the "regression" label to this also as it is a regression with es6 and is blocking upgrading? |
Pinging @elastic/es-search-aggs |
@talevy can you elaborate why this was labeled discuss? |
@tomcallahan for sure. this latest comment #23306 (comment) didn't make it clear that we have a known plan on how to implement this, and so during our issue cleanup I thought it was worth bringing back up to make sure there is a plan. No other reason. If it is clear that we want to do this, and there is no discussion necessary, and only implementation details, then I'm happy to remove the label |
OK, let's take off the label until we know we have something specific to examine |
Is there any plan to fix this? I've just hit this issue migrating to ES6 and have come up with the solution of, when constructing a query, first making a request to ES to get the real name of an alias. An additional round-trip per query isn't great. Our index names don't change often, so in principle we could cache them. But when they do we don't have a great way of signalling to the process that it needs to clear its cache, short of restarting it. |
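One way to bound the staleness of such a cache without an explicit invalidation signal is a time-to-live. The following Java sketch is an assumption-laden illustration, not something proposed in this thread: `AliasCache`, the `resolver` function (standing in for the round-trip that asks the cluster for an alias's concrete index names), and the injected clock are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache for alias -> concrete index names. */
final class AliasCache {

    private static final class Entry {
        final List<String> indices;
        final long loadedAtMillis;

        Entry(List<String> indices, long loadedAtMillis) {
            this.indices = indices;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    // Assumed to perform one round-trip to the cluster per call.
    private final Function<String, List<String>> resolver;
    private final long ttlMillis;
    private final LongSupplier clock;
    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    AliasCache(Function<String, List<String>> resolver, long ttlMillis, LongSupplier clock) {
        this.resolver = resolver;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached names, re-resolving once the entry is ttlMillis old or older. */
    List<String> resolve(String alias) {
        long now = clock.getAsLong();
        Entry entry = entries.get(alias);
        if (entry == null || now - entry.loadedAtMillis >= ttlMillis) {
            entry = new Entry(resolver.apply(alias), now);
            entries.put(alias, entry);
        }
        return entry.indices;
    }
}
```

In production the clock would typically be `System::currentTimeMillis`. The trade-off is that queries may target an outdated index name for up to one TTL after a reindex.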
The format migrator prevents users from getting duplicate results when formats are migrated into the "govuk" index from the "government" or "detailed" ones: it ensures that only non-migrated formats are returned from the "government"/"detailed" indices and only migrated formats are returned from the "govuk" index.

Previously it did this with an `indices` query. They look like this:

    { indices:
      { indices: ["list", "of", "indices"]
      , query: { query to perform against docs from the indices }
      , no_match_query: { query to perform against other docs }
      }
    }

However, the `indices` query has been deprecated in ES5 and removed in ES6, in favour of just searching over the `_index` field in docs directly.

This PR changes the `indices` query into a `should` with two clauses where one clause matches documents in the "govuk" index and the other clause matches documents in the other indices.

It's a bit more cumbersome because the `terms` query I'm using doesn't handle index aliases[1], and involves another round-trip to elasticsearch when constructing the query to find out the real name of the "govuk" alias. Index real names only change very infrequently (whenever we reindex the cluster), so in principle they could be cached. But we don't currently have a good way for the reindexing rake task to signal to the web processes that they need to purge their cache.

[1] elastic/elasticsearch#23306
I have been looking into a fix for this. It is slightly more tricky to support now that fully-qualified index names such as We support wildcards in qualified index names in the Finally, I haven’t yet dug into the query parsing issue @carrino raised, but will make sure to consider it while developing a fix. |
@jtibshirani thanks for looking into this. I don't think CCS comes into play here, or maybe I don't follow why it may. We have prohibited using |
@javanna we are in agreement, my explanation above was just confusing :) I opened #46640 to add support for aliases, and included a couple open questions in the PR description. I am not sure of the best way forward in terms of the other discrepancy (where the |
I looked into the parsing issue and do not see a good way to address it with the current API. It has been quite some time since we deprecated and removed the It would be very helpful to collect up-to-date information about the remaining problems that were cited (query parsing and filtered aliases). If anyone is still running into problems related to the I am going to close the issue, since we merged #46640 to address the original problem ("Term query on _index field does not take aliases into account"), but will keep a lookout for new related issues! |
Hi @jtibshirani, many thanks for fixing this! I agree that the other problems (like not evaluating the query trees that are not part of the currently seen index) should be discussed in separate issues. I think with the current approach, filtered aliases should work (I have not tested them yet), as their filter is always applied as a filter on the whole query, so you would not see documents that were filtered away. With the indices query it was more problematic regarding parsing of query parts. |
While trying to remove indicesQuery (deprecated by #17710) from my own code base, I noticed the following problem: indicesQuery accepts any index name, so aliases work as well (it presumably does not work with filtered aliases, does it? But that does not matter for this problem).
In my case I needed a query that returns results of a query for a specific index only (only alias name known, not its real name) and a matchAll query for the other indexes. Setup was:
This works with indicesQuery: you specify the alias, the specific query runs only on that alias, and matchAll is returned for all other indexes. With a term query on the "_index" field, you cannot use the alias anymore. It only returns documents if the index name matches exactly, so termQuery("_index", "alias") does not match anything.
So I don't agree with deprecating and removing indicesQuery unless this can be fixed.
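To make the difference concrete, here is a minimal Java sketch that models the two matching behaviours. It is a toy model under stated assumptions, not the Elasticsearch implementation: `IndexNameMatcher`, its alias table, and both method names are hypothetical, and filtered aliases are ignored entirely.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical names) of matching a term on "_index" against a shard's index. */
final class IndexNameMatcher {

    // Hypothetical alias table: alias name -> concrete index names.
    private final Map<String, Set<String>> aliases;

    IndexNameMatcher(Map<String, Set<String>> aliases) {
        this.aliases = aliases;
    }

    /** Exact comparison only, as described for termQuery("_index", ...) in this issue. */
    boolean matchesExact(String term, String concreteIndex) {
        return term.equals(concreteIndex);
    }

    /** Alias-aware comparison: the term may be the concrete name or an alias pointing to it. */
    boolean matchesWithAliases(String term, String concreteIndex) {
        if (term.equals(concreteIndex)) {
            return true;
        }
        return aliases.getOrDefault(term, Collections.emptySet()).contains(concreteIndex);
    }
}
```

With an alias table mapping `my-alias` to `my-index-v1` (both names made up), `matchesExact("my-alias", "my-index-v1")` is false while `matchesWithAliases("my-alias", "my-index-v1")` is true, which is the gap this issue is about.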