-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add match_only_text
, a space-efficient variant of text
.
#66172
Add match_only_text
, a space-efficient variant of text
.
#66172
Conversation
This adds a new `match_only_text` field, which indexes the same data as a `text` field that has `index_options: docs` and `norms: false` and uses the `_source` for positional queries like `match_phrase`. Unlike `text`, this field doesn't support scoring. An alternative to this new field could have been to make the `text` field still able to run positional queries when positions are not indexed, but I like this new field better because it avoids questions around how scoring should perform.
Pinging @elastic/es-search (Team:Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very cool! I have some suggestions around making the confirmation step a bit more efficient, although I guess there will be tradeoffs between the cost of building the index from source and how much work has to be done during analysis.
} | ||
|
||
@Override | ||
protected void doXContentBody(XContentBuilder builder, boolean includeDefaults, Params params) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to copy this bit, because there will be no BWC issue with a new mapper
...text/src/main/java/org/elasticsearch/xpack/matchonlytext/query/SourceConfirmedTextQuery.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
public void testSpanNear() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work, given the TODO in SourceConfirmedTextQuery implementation above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the query just uses a MatchAllDocsQuery as an approximation, but the query "works".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully we'll create better approximations for spans in a follow-up.
server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java
Show resolved
Hide resolved
@@ -670,7 +670,7 @@ private void merge(FieldMapper toMerge, Conflicts conflicts) { | |||
} | |||
} | |||
|
|||
protected void toXContent(XContentBuilder builder, boolean includeDefaults) throws IOException { | |||
public void toXContent(XContentBuilder builder, boolean includeDefaults) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can avoid changing this, the concerns about BWC only apply to existing mappings and won't be a problem for entirely new field mappers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @jpountz
...ext/src/main/java/org/elasticsearch/xpack/matchonlytext/mapper/MatchOnlyTextFieldMapper.java
Outdated
Show resolved
Hide resolved
@@ -69,6 +69,7 @@ values. | |||
==== Text search types | |||
|
|||
<<text,`text`>>:: Analyzed, unstructured text. | |||
<<match-only-text,`match_only_text`>>:: A more space-efficient variant of `text`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided to make a single page for the keyword
type family, so that users could easily compare the trade-offs between each type. It'd be nice to do the same for the new 'text' family'.
Thanks for the helpful reviews @romseygeek and @jtibshirani. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple last comments. One thing I wanted to double-check -- I guess some users could want scoring for most queries (as it can help surface relevant content) but confirm positional queries using _source? Maybe we're keeping things simple at first and assume this use case very often sorts on timestamp.
return SourceValueFetcher.toString(name(), context, format); | ||
} | ||
|
||
private Query toQuery(Query query, QueryShardContext queryShardContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a user disables _source on the field, we still accept positional queries but will just return no results. Maybe we should throw an error instead, as we do when positions are disabled. We could at least just check QueryShardContext#isSourceEnabled
to catch cases where source is turned off entirely.
} else if (query instanceof MultiPhraseQuery) { | ||
return approximate((MultiPhraseQuery) query); | ||
} else { | ||
// TODO: spans and intervals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we file an issue to track this? We just try not to have field types/ queries that scan all documents (unless they're runtime fields of course :))
I have long hesitated about this sort of things. I have considered the following options besides this PR:
This field still provides some scoring in the sense that the score of a document will be equal with the number of matching terms from the query with this field. So if your query is a match query for |
I had initially thought I could make this field support span and interval queries, but this is more challenging than I thought because these query builders to not consult the field mappers to be constructed and just assume the field has been indexed with positions. We could fix this by adding new query builders in |
I finally took some time to come back to this PR. I moved the code from a new plugin in x-pack to |
|
||
[horizontal] | ||
|
||
<<analyzer,`analyzer`>>:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thought I had since reviewing: if this is just targeted at log lines, would it make sense to cut down on analysis config options? For example, not allowing for a different search analyzer or search quote analyzers. Or even removing the option configuring the analyzer, just using a default that targets log lines. This could make it simpler to maintain long BWC for the field type. (This is a rough idea, and I am not sure it makes sense... maybe many users in fact tweak analyzers for log lines.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think it's a good call, e.g. as far as I know, ECS doesn't configure analyzers. It would be much easier to add it in the future if it proves needed than to remove it when we want to ease backward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IntervalsSource
parts look good to me.
return doc; | ||
} | ||
|
||
private boolean setIterator(int doc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can throw IOException?
public int hashCode() { | ||
// Not using matchesProvider and valueFetcherProvider, which don't identify this source but are only used to avoid scanning linearly | ||
// through all documents | ||
return Objects.hash(in, indexAnalyzer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index analyzer should be immutable for an open index, I think? So I'm not sure that it needs to be included here or in equals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I'm including it because I want to avoid making too many assumptions about how this class is used by Elasticsearch. Is it fine with you?
This field now fully supports intervals thanks to #71429. |
…66172) This adds a new `match_only_text` field, which indexes the same data as a `text` field that has `index_options: docs` and `norms: false` and uses the `_source` for positional queries like `match_phrase`. Unlike `text`, this field doesn't support scoring.
Adds release highlights for match_only_text (elastic#66172) and more memory-efficient composite aggregations (elastic#74559).
This adds a new
match_only_text
field, which indexes the same data as atext
field that has
index_options: docs
andnorms: false
and uses the_source
for positional queries like
match_phrase
.Unlike
text
, this field doesn't support scoring and span queries.This new field is part of the
text
family, so it is returned as atext
field in the_field_caps
output.Closes #64467