-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce combined_fields
query
#71213
Introduce combined_fields
query
#71213
Conversation
combined_fields
query for searching multiple text fieldscombined_fields
query
Some restrictions on the query (these are explained in more detail in the docs):
The PR is large because it also includes a refactor around |
8e23cd8
to
86e09d2
Compare
* Factor out query creation to this class. * Rename to ZeroTermsQueryOption to clarify it isn't a query.
86e09d2
to
acbba64
Compare
Pinging @elastic/es-search (Team:Search) |
accepts text fields that do not share the same analyzer. | ||
|
||
`multi_match` takes a field-centric view of the query by default. In contrast, | ||
`combined_fields` is term-centric: `operator` and `minimum_should_match` are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It per term application of operator
and minimum_should_match
is how combined_fields
is superior to cross_fields
which is also considered term-centric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, cross_fields
is also able to apply operator
and minimum_should_match
in a term-centric way. The main benefit of combined_fields
is the robust and understandable scoring algorithm.
I tried to update this section to make it clearer:
- When mentioning field-centric scoring, make it clear that we're referring to
best_fields
andmost_fields
- Add a sentence comparing to
cross_fields
and mentioning the benefit
rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search/360_combined_fields.yml
Outdated
Show resolved
Hide resolved
rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search/360_combined_fields.yml
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/CombinedFieldsQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/CombinedFieldsQueryBuilder.java
Outdated
Show resolved
Hide resolved
...lClusterTest/java/org/elasticsearch/search/fetch/subphase/highlight/HighlighterSearchIT.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/index/query/CombinedFieldsQueryParsingTests.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani Thanks, this a great addition of a new query combined_fields
. I left a couple of comments with a main comment to make a little bit clear in documentation what is an advantage of combined_fields
query over cross_fields and other multi_match
.
But overall this PR LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall. I wonder if we need to implement auto_generate_synonyms_phrase_query
and zero_terms_query
for the new query? Auto-generate in particular feels like a leftover from when we didn't really handle query-time token graphs and I think we should consider deprecating it elsewhere.
@@ -0,0 +1,466 @@ | |||
/* @notice |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you talking about the @notice
annotation? This is required by our build for non-Elastic licensed code so that the notices are copied: #57017. I based this off other Lucene classes that we've copied like XMoreLikeThis
and IndexableBinaryStringTools
.
@@ -0,0 +1,161 @@ | |||
/* @notice |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here?
if (terms.isEmpty()) { | ||
extractWeightedTerms(terms, query, 1F); | ||
} | ||
extractWeightedTerms(terms, query, 1F); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this has changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tricky, I meant to call it out explicitly! Without this fix, the plain
highlighter does not work on combined_field
queries. Specifically, it will only highlight the first term, and fail to highlight subsequent terms because of the terms.isEmpty()
check.
My understanding is that this was added in #15516 to work around a Lucene issue. The Lucene issue was later fixed: https://issues.apache.org/jira/browse/LUCENE-7112. All the tests added in #15516 still pass with this change.
* Avoid using rest_total_hits_as_int * Streamline check in HighlighterSearchIT
Thank you @mayya-sharipova and @romseygeek for the reviews! I pushed commits addressing your comments.
I added some improvements to the docs section "Comparison to
To me |
I caught up with @jimczi and now have a better understanding. @romseygeek I think you're suggesting just using the default of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me, thanks @jtibshirani
@jtibshirani Thanks for explanations and new changes, they all LGTM. |
This PR introduces a new query called `combined_fields` for searching multiple text fields. It takes a term-centric view, first analyzing the query string into individual terms, then searching for each term any of the fields as though they were one combined field. It is based on Lucene's `CombinedFieldQuery`, which takes a principled approach to scoring based on the BM25F formula. This query provides an alternative to the `cross_fields` `multi_match` mode. It has simpler behavior and a more robust approach to scoring. Addresses elastic#41106.
This PR introduces a new query called `combined_fields` for searching multiple text fields. It takes a term-centric view, first analyzing the query string into individual terms, then searching for each term any of the fields as though they were one combined field. It is based on Lucene's `CombinedFieldQuery`, which takes a principled approach to scoring based on the BM25F formula. This query provides an alternative to the `cross_fields` `multi_match` mode. It has simpler behavior and a more robust approach to scoring. Addresses #41106.
This PR introduces a new query called
combined_fields
for searching multipletext fields. It takes a term-centric view, first analyzing the query string
into individual terms, then searching for each term any of the fields as though
they were one combined field. It is based on Lucene's
CombinedFieldQuery
,which takes a principled approach to scoring based on the BM25F formula.
This query provides an alternative to the
cross_fields
multi_match
mode. Ithas simpler behavior and a more robust approach to scoring.
Addresses #41106.