-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
token_count type : add enable_position_increments option (fix #23227) #24175
Conversation
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, thanks @fbaligand
I left one comment regarding the meaning of the option. The title of the PR is misleading I think, enable_position_increments
should not take 0-increment token into account so it's still the number of positions that we are after. We simply don't want to count positions where all tokens have been filtered.
Can you also add a description of the option in the docs:
https://github.com/elastic/elasticsearch/blob/master/docs/reference/mapping/types/token-count.asciidoc
* @param fieldValue field value to pass to analyzer | ||
* @return number of tokens in a token stream | ||
* @throws IOException if tokenStream throws it | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enablePositionIncrement
should not take 0-increment into account.
I think you can just add a param to countPositions
and instead of count += position.getPositionIncrement();
you can do: count += enablePosIncr ? Math.min(1, pos.getPosIncr()) : pos.getPosIncr();
?
return new TokenStreamComponents(new MockTokenizer(), tokenStream); | ||
} | ||
}; | ||
assertThat(TokenCountFieldMapper.countTokens(analyzer, "", ""), equalTo(3)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't count token with 0-increment this should be equal to 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jimczi
First, thanks for this first review.
Then, concerning your comment : the feature global goal is to have an option to count the number of tokens after analysis, whatever their positions. In that unit test, there are 3 tokens in the stream, so that's why I expect 3 as a result.
Given your comment, maybe the option "enable_position_increments" is misnamed for this feature goal ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the naming is fine. This feature as described in the issue is about not counting tokens filtered from the token stream. This is what enable_position_increments=false
does and I think it's all we need to do here. If your analyzer adds alternative tokens to each position they should not alter the final count since we're looking for the number of tokens in the original text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I get what you mean.
That makes sense, especially when you use synonym
filter.
=> I will fix all your requested changes and push them soon.
Hi @jimczi I fixed your requested changes. By the way, I fixed also some compilation warnings on TokenCount classes. |
jenkins please test this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ready I think. I left one comment regarding the docs that could clarify the usage.
This PR is actually tested by our internal CI so I'll merge it if it passes the test.
Thank you very much for your contribution @fbaligand !
counting tokens. This means that even if the analyzer filters out stop | ||
words they are included in the count. | ||
words they are included in the count. | ||
If you don't want to count stop words, set option `enable_position_increments` to `false` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is redundant with the definition below and it's a little bit confusing since it impacts not only the stop filter ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
I see 2 options :
- remove the note.
- only keep this part to just explain the mechanism :
Technically, the
token_count
type sums position increments rather than counting tokens.
Which do you prefer ?
Moreover, I can add this precision to enable_position_increments
definition :
Set to
true
if you want to count all tokens computed by analyzer tokenizer.
What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just remove the note. The definition of the param below is descriptive enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, for enable_position_increments
definition, you mean to not add "Set to true if you want to count all tokens computed by analyzer tokenizer" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep the current definition is ok.
Add option "enable_position_increments" with default value true. If option is set to false, indexed value is the number of tokens (not position increments count)
OK. I removed the note, rebased my code on master branch, and pushed it ! |
Great ! |
* master: (61 commits) Build: Move plugin cli and tests to distribution tool (elastic#24220) Peer Recovery: remove maxUnsafeAutoIdTimestamp hand off (elastic#24243) Adds version 5.3.2 and backwards compatibility indices for 5.3.1 Add utility method to parse named XContent objects with typed prefix (elastic#24240) MultiBucketsAggregation.Bucket should not extend Writeable (elastic#24216) Don't expose cleaned-up tasks as pending in PrioritizedEsThreadPoolExecutor (elastic#24237) Adds declareNamedObjects methods to ConstructingObjectParser (elastic#24219) ESIntegTestCase.indexRandom should not introduce types. (elastic#24202) Tests: Extend InternalStatsTests (elastic#24212) IndicesQueryCache should delegate the scorerSupplier method. (elastic#24209) Speed up parsing of large `terms` queries. (elastic#24210) [TEST] make sure that the random query_string query generator defines a default_field or a list of fields token_count type : add an option to count tokens (fix elastic#23227) (elastic#24175) Query string default field (elastic#24214) Make Aggregations an abstract class rather than an interface (elastic#24184) [TEST] ensure expected sequence no and version are set when index/delete engine operation has a document failure Extract batch executor out of cluster service (elastic#24102) Add 5.3.1 to bwc versions Added "release-state" support to plugin docs Added examples to cross cluster search of using cluster settings ...
Add option "enable_position_increments" with default value true.
If option is set to false, indexed value is the number of tokens
(not position increments count)
Fix #23227