
tokenizer-api: reduce Tokenizer overhead #2062

Merged
merged 4 commits into main from tokenizer_api on Jun 8, 2023

Conversation

PSeitz
Contributor

@PSeitz PSeitz commented May 30, 2023

Previously, a new `Token` was created for each text encountered, each containing a `String::with_capacity(200)` allocation.
In the new API, `token_stream` gets mutable access to the tokenizer, which allows state to be shared (in this PR, the `Token` is shared).
Ideally the allocation for the `BoxTokenStream` would also be removed, but that may require some lifetime tricks.

#1654
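
For illustration, here is a minimal sketch of the pattern, with simplified hypothetical types (not tantivy's actual `Tokenizer` trait): the tokenizer owns a reusable `Token`, and taking `&mut self` in `token_stream` lets each call clear and reuse that buffer instead of allocating a fresh one per text.

```rust
// A minimal sketch of the idea; `Token`, `SimpleTokenizer`, and
// `SimpleTokenStream` are simplified stand-ins, not tantivy's real types.
struct Token {
    text: String,
    offset_from: usize,
    offset_to: usize,
}

struct SimpleTokenizer {
    // Allocated once and reused across calls to `token_stream`,
    // instead of a fresh `String::with_capacity(200)` per text.
    token: Token,
}

struct SimpleTokenStream<'a> {
    chars: std::str::CharIndices<'a>,
    token: &'a mut Token, // borrows the tokenizer's shared buffer
}

impl SimpleTokenizer {
    fn token_stream<'a>(&'a mut self, text: &'a str) -> SimpleTokenStream<'a> {
        self.token.text.clear(); // reuse the existing allocation
        SimpleTokenStream {
            chars: text.char_indices(),
            token: &mut self.token,
        }
    }
}
```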

@PSeitz PSeitz requested a review from fulmicoton May 30, 2023 13:32
@PSeitz PSeitz force-pushed the tokenizer_api branch 2 times, most recently from 8a59b38 to 9c95c3d on May 31, 2023 02:48
```diff
 /// Creates a token stream for a given `str`.
-fn token_stream<'a>(&self, text: &'a str) -> Self::TokenStream<'a>;
+fn token_stream<'a, 'b>(&'b mut self, text: &'a str) -> Self::TokenStream<'a, 'b>;
```
Collaborator

Is it really helpful to have two lifetimes here instead of binding both to the presumably shorter one, e.g.

```rust
fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a>;
```

and let the general case be handled via covariance?

Contributor Author

Good point, one lifetime is enough
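
A hedged illustration of the covariance point, reusing the hypothetical types from the sketch above: because `&str` is covariant in its lifetime, a text that outlives the tokenizer coerces to the shorter borrow, so the single-lifetime signature still covers the general case.

```rust
fn demo() {
    let mut tokenizer = SimpleTokenizer {
        token: Token {
            text: String::with_capacity(200),
            offset_from: 0,
            offset_to: 0,
        },
    };
    // `text` is `&'static str`, far outliving the `&mut tokenizer` borrow.
    let text: &'static str = "hello world";
    // Covariance shortens `&'static str` to the tokenizer's borrow, so
    // `fn token_stream<'a>(&'a mut self, text: &'a str)` handles this case.
    let _stream = tokenizer.token_stream(text);
}
```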

@PSeitz PSeitz force-pushed the tokenizer_api branch 2 times, most recently from 24a18f0 to 69c9277 on June 2, 2023 04:55
@codecov-commenter

Codecov Report

Merging #2062 (90b170b) into main (3942fc6) will decrease coverage by 0.01%.
The diff coverage is 90.24%.


```
@@            Coverage Diff             @@
##             main    #2062      +/-   ##
==========================================
- Coverage   94.38%   94.38%   -0.01%
==========================================
  Files         319      319
  Lines       60040    60091      +51
==========================================
+ Hits        56670    56714      +44
- Misses       3370     3377       +7
```
| Impacted Files | Coverage Δ |
| --- | --- |
| src/query/more_like_this/more_like_this.rs | 65.67% <33.33%> (-2.65%) ⬇️ |
| src/tokenizer/ngram_tokenizer.rs | 98.95% <85.71%> (+<0.01%) ⬆️ |
| src/tokenizer/whitespace_tokenizer.rs | 92.59% <85.71%> (+0.13%) ⬆️ |
| src/tokenizer/facet_tokenizer.rs | 94.80% <88.88%> (+1.13%) ⬆️ |
| src/tokenizer/regex_tokenizer.rs | 95.23% <90.90%> (+0.18%) ⬆️ |
| src/core/json_utils.rs | 99.58% <100.00%> (ø) |
| src/fastfield/mod.rs | 100.00% <100.00%> (ø) |
| src/fastfield/writer.rs | 94.85% <100.00%> (ø) |
| src/indexer/segment_writer.rs | 97.79% <100.00%> (+<0.01%) ⬆️ |
| src/postings/mod.rs | 98.57% <100.00%> (ø) |
| ... and 16 more | |

... and 1 file with indirect coverage changes

@PSeitz PSeitz merged commit fdecb79 into main Jun 8, 2023
@PSeitz PSeitz deleted the tokenizer_api branch June 8, 2023 10:38