Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery #12156
Conversation
It's fine to just change the test case here. We're only checking that the debug functionality on the monitor reports that a given query has matched; the specific string output isn't important.
OK thanks. I've removed the BQ rewrite logic and updated this test. I'm not convinced we need to actually rewrite to a BQ during the query rewrite, but that's the one big difference with this implementation.
@gsmiller this is a great simplification. Thank you. I'm going to share it with our team.
I've rebased this work on top of
Looks good to me! I am not fully sure what default rewrite method is best here.
Thanks @uschindler.
The nice thing is it's easy to control now (bitset rewrite, boolean scoring, doc values post-filtering, etc.). Based on the benchmark wins in #12055 for other multi-term queries, I thought it would be good to use the same rewrite by default, at least for now. We can change the default easily if we learn more.
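To make the "easy to control" point concrete, here is a minimal, self-contained sketch of the idea. It is not code from this PR and does not use Lucene's API: plain Java collections stand in for the index, and the names RewriteMethodSketch, Segment, RewriteMethod, POSTINGS, DOC_VALUES, and TermSetQuery are all hypothetical. It only illustrates how one term-set query can delegate to interchangeable execution strategies that return the same matches.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these types are Lucene classes.
final class RewriteMethodSketch {

  // Hypothetical segment: an inverted index (term -> doc IDs) plus one value per document.
  static final class Segment {
    final Map<String, Set<Integer>> postings;
    final String[] docValues;

    Segment(Map<String, Set<Integer>> postings, String[] docValues) {
      this.postings = postings;
      this.docValues = docValues;
    }
  }

  // Hypothetical stand-in for a pluggable rewrite method.
  interface RewriteMethod {
    Set<Integer> matchingDocs(Segment segment, Set<String> queryTerms);
  }

  // Postings-style execution: union the doc IDs of every query term found in the index.
  static final RewriteMethod POSTINGS = (segment, queryTerms) -> {
    Set<Integer> docs = new TreeSet<>();
    for (String term : queryTerms) {
      Set<Integer> termDocs = segment.postings.get(term);
      if (termDocs != null) {
        docs.addAll(termDocs);
      }
    }
    return docs;
  };

  // Doc-values-style execution: test each document's value against the query term set.
  static final RewriteMethod DOC_VALUES = (segment, queryTerms) -> {
    Set<Integer> docs = new TreeSet<>();
    for (int doc = 0; doc < segment.docValues.length; doc++) {
      String value = segment.docValues[doc];
      if (value != null && queryTerms.contains(value)) {
        docs.add(doc);
      }
    }
    return docs;
  };

  // The query only holds its terms; how it executes is decided by the rewrite method.
  static final class TermSetQuery {
    final Set<String> terms;
    final RewriteMethod rewriteMethod;

    TermSetQuery(Set<String> terms, RewriteMethod rewriteMethod) {
      this.terms = terms;
      this.rewriteMethod = rewriteMethod;
    }

    Set<Integer> search(Segment segment) {
      return rewriteMethod.matchingDocs(segment, terms);
    }
  }
}
```

Both strategies return the same documents for the same terms; only the access pattern differs, which is why swapping the default later is cheap.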
MultiTermQuery returns null for ScorerSupplier if no terms in the index match the query terms. With the introduction of PR apache#12156 we saw a performance degradation in bool queries where one of the mandatory clauses is a TermInSetQuery whose query terms are not present in the field. Previously, in such cases TermInSetQuery returned null for ScorerSupplier, which would short-circuit the whole bool query. This PR adds the ability for MultiTermQuery to return null for ScorerSupplier if a field doesn't contain any of the query terms. Relates to PR apache#12156
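The short-circuit described above can be sketched in a few lines. This is a toy model under stated assumptions, not Lucene code: Clause, termInSet, and conjunction are hypothetical names, a Map of term to doc IDs stands in for a segment, and a java.util.function.Supplier stands in for the scorer supplier. The point is only that a mandatory clause returning null lets the conjunction give up before any other clause does real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

// Toy model only: none of these types are Lucene classes.
final class NullSupplierSketch {

  // Hypothetical per-segment clause. Returns null when it is known to match nothing.
  interface Clause {
    Supplier<Set<Integer>> supplier(Map<String, Set<Integer>> postings);
  }

  // Term-set clause: null if none of the query terms exist in the segment.
  static Clause termInSet(Set<String> queryTerms) {
    return postings -> {
      boolean anyPresent = false;
      for (String term : queryTerms) {
        if (postings.containsKey(term)) {
          anyPresent = true;
          break;
        }
      }
      if (!anyPresent) {
        return null; // cheap signal: nothing can match in this segment
      }
      // The union of postings is built lazily, only when get() is called.
      return () -> {
        Set<Integer> docs = new TreeSet<>();
        for (String term : queryTerms) {
          Set<Integer> termDocs = postings.get(term);
          if (termDocs != null) {
            docs.addAll(termDocs);
          }
        }
        return docs;
      };
    };
  }

  // Conjunction of mandatory clauses: a single null supplier short-circuits everything.
  static Supplier<Set<Integer>> conjunction(
      List<Clause> mandatory, Map<String, Set<Integer>> postings) {
    List<Supplier<Set<Integer>>> suppliers = new ArrayList<>();
    for (Clause clause : mandatory) {
      Supplier<Set<Integer>> supplier = clause.supplier(postings);
      if (supplier == null) {
        return null; // no clause is ever asked to produce documents
      }
      suppliers.add(supplier);
    }
    return () -> {
      Set<Integer> result = null;
      for (Supplier<Set<Integer>> supplier : suppliers) {
        Set<Integer> docs = supplier.get();
        if (result == null) {
          result = new TreeSet<>(docs);
        } else {
          result.retainAll(docs);
        }
      }
      return result == null ? new TreeSet<>() : result;
    };
  }
}
```

If the term-set clause instead returned a non-null supplier that only discovers the empty result when evaluated, the other mandatory clauses would already have been set up, which is the kind of overhead the regression report describes.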
Description
TermInSetQuery's implementation is more or less an exact copy of MultiTermQuery + MultiTermQueryConstantScoreWrapper. This PR removes the custom TermInSetQuery implementation in favor of extending MultiTermQuery, with MultiTermQueryConstantScoreWrapper as the default rewrite behavior.

One nice benefit of this (beyond code cleanup) is that different rewrite methods can be provided for different behavior. Specifically, we can leverage DocValuesRewriteMethod to completely replace SortedSetDocValuesSetQuery (I would propose doing that in a follow-up PR once we're happy with this change).

A couple of notes about the change:
- I gave MultiTermQueryConstantScoreWrapper a ScorerSupplier implementation so TermInSetQuery can still be used efficiently with IndexOrDocValuesQuery. This also required a new (optional) public API in MultiTermQuery where queries can expose their number of terms (where applicable).
- Retaining the existing TermInSetQuery#rewrite behavior was slightly non-trivial. I don't really like how I've done this in the PR, but I also can't think of a better solution. I'm not actually convinced we need to retain this, but I'm not sure if client code might rely on the existing behavior in places. Even if that's the case, it might not be a good enough argument to keep it, but I did find a breaking unit test if I didn't keep it (TestPresearcherMatchCollector.testMatchCollectorShowMatches assumes the query will get rewritten to a BQ). I'm not familiar enough with the monitor package to assess if it would be reasonable to just change the unit test, or if there's something more important going on here. If someone is more familiar and/or has opinions on the need to retain the existing "rewrite" behavior, I'd love some feedback.

Performance
When this has been discussed in the past, there's been an open question around performance since the term intersection happens a bit differently (seekCeil vs. seekExact). I ran some benchmarks using a one-off tool (similar to #12151) and found no noticeable regressions/issues. Here's the output of my tool (which you can check out here: TiSBench.java.txt):
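For readers unfamiliar with the seekCeil vs. seekExact distinction, the following is a rough, self-contained illustration and not the code path of either implementation. It assumes a java.util.TreeSet as a stand-in for a segment's sorted term dictionary, with contains playing the role of seekExact and ceiling playing the role of seekCeil. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only: a TreeSet<String> stands in for a sorted term dictionary.
final class TermIntersectionSketch {

  // seekExact-style: probe the dictionary once per query term.
  static List<String> intersectBySeekExact(
      TreeSet<String> dictionary, List<String> sortedQueryTerms) {
    List<String> matches = new ArrayList<>();
    for (String term : sortedQueryTerms) {
      if (dictionary.contains(term)) {
        matches.add(term);
      }
    }
    return matches;
  }

  // seekCeil-style: jump to the smallest dictionary term >= the current query term,
  // then skip every query term that sorts before the term we landed on.
  static List<String> intersectBySeekCeil(
      TreeSet<String> dictionary, List<String> sortedQueryTerms) {
    List<String> matches = new ArrayList<>();
    int i = 0;
    while (i < sortedQueryTerms.size()) {
      String queryTerm = sortedQueryTerms.get(i);
      String ceil = dictionary.ceiling(queryTerm);
      if (ceil == null) {
        break; // dictionary exhausted: no later query term can match
      }
      if (ceil.equals(queryTerm)) {
        matches.add(queryTerm);
        i++;
      } else {
        // ceil sorts after queryTerm, so this loop always advances at least once.
        while (i < sortedQueryTerms.size()
            && sortedQueryTerms.get(i).compareTo(ceil) < 0) {
          i++;
        }
      }
    }
    return matches;
  }
}
```

Both methods return the same matches; they differ in how many dictionary probes they issue and in whether a miss can be used to skip ahead, which is the sort of difference the benchmarks were meant to check.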
All Country Code Filter Terms
Medium Cardinality + High Cost Country Code Filter Terms
Low Cardinality + High Cost Country Code Filter Terms
Medium Cardinality + Low Cost Country Code Filter Terms
Low Cardinality + Low Cost Country Code Filter Terms
High Cardinality PK Filter Terms
Medium Cardinality PK Filter Terms
Low Cardinality PK Filter Terms