Introduce combined_fields query #71213

Merged: jtibshirani merged 10 commits into elastic:master from combined-fields-query on Apr 14, 2021

@jtibshirani jtibshirani commented Apr 2, 2021

This PR introduces a new query called combined_fields for searching multiple
text fields. It takes a term-centric view, first analyzing the query string
into individual terms, then searching for each term in any of the fields as though
they were one combined field. It is based on Lucene's CombinedFieldQuery,
which takes a principled approach to scoring based on the BM25F formula.

This query provides an alternative to the cross_fields multi_match mode. It
has simpler behavior and a more robust approach to scoring.

Addresses #41106.
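To illustrate the idea of scoring "as though they were one combined field", here is a minimal Python sketch of a simplified BM25F-style score for a single term. It is an illustration under stated assumptions, not Lucene's `CombinedFieldQuery` implementation: the helper names (`bm25f_term_score`, `score_corpus`), the toy documents, the field weights, and the default `k1`/`b` values are all chosen here for the example.

```python
import math

def bm25f_term_score(doc, term, weights, avg_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score one term against one document, treating its fields as one combined field."""
    # Combined term frequency: per-field counts scaled by the field weight.
    tf = sum(w * doc[field].count(term) for field, w in weights.items())
    if tf == 0:
        return 0.0
    # Combined length: per-field token counts scaled by the same weights.
    length = sum(w * len(doc[field]) for field, w in weights.items())
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * length / avg_len)
    return idf * tf / (tf + norm)

def score_corpus(docs, term, weights):
    """Hypothetical helper: derive corpus statistics, then score every document."""
    lengths = [sum(w * len(d[f]) for f, w in weights.items()) for d in docs]
    avg_len = sum(lengths) / len(docs)
    # A document counts toward doc_freq if the term occurs in any of the fields.
    doc_freq = sum(1 for d in docs if any(term in d[f] for f in weights))
    return [bm25f_term_score(d, term, weights, avg_len, len(docs), doc_freq) for d in docs]

# Toy, pre-tokenized documents (assumed data for the example).
docs = [
    {"title": ["quick", "fox"], "body": ["the", "fox", "jumps"]},
    {"title": ["lazy", "dog"], "body": ["a", "quick", "dog", "sleeps"]},
]
weights = {"title": 2.0, "body": 1.0}
scores = score_corpus(docs, "quick", weights)
```

In this sketch the first document scores higher for `quick` because the match is in the more heavily weighted `title` field, while both documents are still scored with one shared set of statistics rather than one score per field.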

@jtibshirani jtibshirani changed the title from "Introduce combined_fields query for searching multiple text fields" to "Introduce combined_fields query" Apr 2, 2021
@jtibshirani

Some restrictions on the query (these are explained in more detail in the docs):

  • All fields must be in the text type family. Additionally, all fields must have the same search analyzer.
  • Currently only BM25 similarity is supported. Other default similarities or per-field similarities are not allowed.
  • It has a simpler API than multi_match and omits several parameters, including analyzer, fuzziness and all related options, lenient, slop, and tie_breaker.
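As a rough illustration of the smaller API surface, the following Python sketch builds a `combined_fields` request body as a plain dictionary. The helper name `combined_fields_body` and the field names are hypothetical; the sketch only uses the `query`, `fields`, `operator`, and `minimum_should_match` keys discussed in this PR, and it does no checking of field types or analyzers.

```python
import json

def combined_fields_body(query, fields, operator="or", minimum_should_match=None):
    """Build a search request body for a combined_fields query (illustrative helper)."""
    if not fields:
        raise ValueError("at least one field is required")
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    clause = {"query": query, "fields": list(fields), "operator": operator}
    if minimum_should_match is not None:
        clause["minimum_should_match"] = minimum_should_match
    # Parameters such as analyzer, fuzziness, lenient, slop, and tie_breaker
    # are deliberately absent: per the PR they are not part of this query.
    return {"query": {"combined_fields": clause}}

# Field names below are assumptions made for the example.
body = combined_fields_body("database systems", ["title", "abstract", "body"], operator="and")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the body of a search request; all listed fields would need to satisfy the restrictions above.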

The PR is large because it also includes a refactor around ZeroTermsQuery. This is broken out into its own commit to make it easier to see what changed.

@jtibshirani jtibshirani force-pushed the combined-fields-query branch 2 times, most recently from 8e23cd8 to 86e09d2 Compare April 2, 2021 01:29
@jtibshirani jtibshirani force-pushed the combined-fields-query branch from 86e09d2 to acbba64 Compare April 2, 2021 04:30
@jtibshirani jtibshirani added the :Search/Search, >enhancement, v7.13.0, and v8.0.0 labels Apr 2, 2021
@jtibshirani jtibshirani marked this pull request as ready for review April 2, 2021 05:02
@elasticmachine elasticmachine added the Team:Search label Apr 2, 2021
@elasticmachine

Pinging @elastic/es-search (Team:Search)

accepts text fields that do not share the same analyzer.

`multi_match` takes a field-centric view of the query by default. In contrast,
`combined_fields` is term-centric: `operator` and `minimum_should_match` are

@mayya-sharipova mayya-sharipova Apr 7, 2021

Is the per-term application of operator and minimum_should_match how combined_fields is superior to cross_fields, which is also considered term-centric?

Contributor Author

No, cross_fields is also able to apply operator and minimum_should_match in a term-centric way. The main benefit of combined_fields is the robust and understandable scoring algorithm.

I tried to update this section to make it clearer:

  • When mentioning field-centric scoring, make it clear that we're referring to best_fields and most_fields
  • Add a sentence comparing to cross_fields and mentioning the benefit
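The difference between field-centric and term-centric matching can be sketched with a toy model in Python. This is not how Elasticsearch evaluates either query; the function names, the sample document, and the field names are assumptions chosen to show how an `and` operator behaves under each view.

```python
def field_centric_and(doc, terms):
    """Field-centric view: all terms must be found together in a single field."""
    return any(all(t in tokens for t in terms) for tokens in doc.values())

def term_centric_and(doc, terms):
    """Term-centric view: each term must be found, but in any of the fields."""
    return all(any(t in tokens for tokens in doc.values()) for t in terms)

# Toy document with pre-tokenized fields (assumed data for the example).
doc = {"first_name": ["will"], "last_name": ["smith"]}
terms = ["will", "smith"]

field_match = field_centric_and(doc, terms)  # no single field holds both terms
term_match = term_centric_and(doc, terms)    # each term appears in some field
```

Under the field-centric view the document does not match, because neither field contains both terms; under the term-centric view it does, which is the behavior both `cross_fields` and `combined_fields` aim for.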

@mayya-sharipova mayya-sharipova left a comment

@jtibshirani Thanks, this is a great addition of a new query, combined_fields. I left a couple of comments, with a main comment to make it a little clearer in the documentation what the advantage of the combined_fields query is over cross_fields and other multi_match modes.
But overall this PR LGTM!

@jtibshirani jtibshirani requested a review from romseygeek April 12, 2021 05:54

@romseygeek romseygeek left a comment

Looks good to me overall. I wonder if we need to implement auto_generate_synonyms_phrase_query and zero_terms_query for the new query? Auto-generate in particular feels like a leftover from when we didn't really handle query-time token graphs and I think we should consider deprecating it elsewhere.

@@ -0,0 +1,466 @@
/* @notice

I don't think this should be here?

Contributor Author

Are you talking about the @notice annotation? This is required by our build for non-Elastic licensed code so that the notices are copied: #57017. I based this off other Lucene classes that we've copied like XMoreLikeThis and IndexableBinaryStringTools.

@@ -0,0 +1,161 @@
/* @notice

And here?

-    if (terms.isEmpty()) {
-        extractWeightedTerms(terms, query, 1F);
-    }
+    extractWeightedTerms(terms, query, 1F);

Can you explain why this has changed?

Contributor Author

This is tricky, I meant to call it out explicitly! Without this fix, the plain highlighter does not work on combined_fields queries. Specifically, it will only highlight the first term, and fail to highlight subsequent terms because of the terms.isEmpty() check.

My understanding is that this was added in #15516 to work around a Lucene issue. The Lucene issue was later fixed: https://issues.apache.org/jira/browse/LUCENE-7112. All the tests added in #15516 still pass with this change.
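The effect of the guard can be modeled with a small Python sketch. This is only an assumed model of the behavior described above: it supposes that term extraction is invoked once per sub-query against a shared map, which the conversation does not spell out, and every name in it is hypothetical.

```python
def extract_guarded(terms, sub_query_terms):
    """Old behavior (modeled): extract only while nothing has been collected yet."""
    if not terms:
        terms.update(sub_query_terms)

def extract_always(terms, sub_query_terms):
    """New behavior (modeled): always extract the sub-query's terms."""
    terms.update(sub_query_terms)

def collect_terms(sub_queries, extractor):
    """Run the extractor over each sub-query, sharing one term -> weight map."""
    terms = {}
    for sub_query_terms in sub_queries:
        extractor(terms, sub_query_terms)
    return set(terms)

# Assumption: a two-term query is represented as one sub-query per term.
sub_queries = [{"quick": 1.0}, {"fox": 1.0}]

guarded = collect_terms(sub_queries, extract_guarded)    # only the first term survives
unguarded = collect_terms(sub_queries, extract_always)   # both terms are collected
```

In this model the guarded version stops collecting after the first sub-query, which matches the symptom of only the first term being highlighted.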

* Avoid using rest_total_hits_as_int
* Streamline check in HighlighterSearchIT
@jtibshirani

Thank you @mayya-sharipova and @romseygeek for the reviews! I pushed commits addressing your comments.

> with a main comment to make it a little clearer in the documentation what the advantage of the combined_fields query is over cross_fields and other multi_match modes.

I added some improvements to the docs section "Comparison to multi_match query". Let me know if this helps or if you think it's still unclear. For context, once this query is merged and users have tried it out, I plan to have a follow-up discussion about what to do with cross_fields. We may be able to deprecate and remove it, or maybe we'll identify some exact cases where it's helpful. After that discussion we can revise the docs to be even clearer.

> I wonder if we need to implement auto_generate_synonyms_phrase_query and zero_terms_query for the new query?

To me zero_terms_query seems helpful; we've seen users request it on query types where it was missing, like match_phrase and match_phrase_prefix. I don't understand the details yet, but I'd be happy to remove auto_generate_synonyms_phrase_query if we don't think it's helpful anymore.
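For readers unfamiliar with zero_terms_query, the following Python sketch shows the idea: when analysis removes every token from the query string, the option decides whether the query matches nothing or everything. The stop-word list, the whitespace "analyzer", and the `plan_query` helper are assumptions made for the example and do not reflect the actual implementation.

```python
# Assumed stop-word list, standing in for an analyzer with a stop filter.
STOP_WORDS = {"the", "a", "to", "be", "or", "not"}

def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def plan_query(text, zero_terms_query="none"):
    """Decide what to run once the query string has been analyzed."""
    terms = analyze(text)
    if terms:
        return {"terms": terms}
    # No terms survived analysis: fall back according to zero_terms_query.
    if zero_terms_query == "all":
        return {"match_all": {}}
    if zero_terms_query == "none":
        return {"match_none": {}}
    raise ValueError("zero_terms_query must be 'none' or 'all'")

default_plan = plan_query("to be or not to be")
all_plan = plan_query("to be or not to be", zero_terms_query="all")
```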

@jtibshirani

> I don't understand the details yet, but I'd be happy to remove auto_generate_synonyms_phrase_query if we don't think it's helpful anymore.

I caught up with @jimczi and now have a better understanding. @romseygeek I think you're suggesting just using the default of true and removing the possibility to disable it? I'd prefer to have a flag for now for consistency with other query types, which makes the API easier for users to understand. If we end up deprecating + removing it, we can just do it across all relevant query types, including combined_fields.

@romseygeek romseygeek left a comment

Works for me, thanks @jtibshirani

@mayya-sharipova

@jtibshirani Thanks for the explanations and new changes, they all LGTM.

@jtibshirani jtibshirani merged commit 318bf14 into elastic:master Apr 14, 2021
@jtibshirani jtibshirani deleted the combined-fields-query branch April 14, 2021 20:33
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Apr 14, 2021
mark-vieira pushed a commit that referenced this pull request Apr 15, 2021
@jrodewig jrodewig mentioned this pull request Apr 27, 2021
Labels: >feature, release highlight, :Search/Search, Team:Search, v7.13.0, v8.0.0-alpha1
5 participants