tokenizer-api: reduce Tokenizer overhead #2062
8a59b38 to 9c95c3d
tokenizer-api/src/lib.rs
Outdated
/// Creates a token stream for a given `str`.
- fn token_stream<'a>(&self, text: &'a str) -> Self::TokenStream<'a>;
+ fn token_stream<'a, 'b>(&'b mut self, text: &'a str) -> Self::TokenStream<'a, 'b>;
Is it really helpful to have two lifetimes here instead of binding both to the presumably shorter one, e.g.
fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a>;
and let the general case be handled via covariance?
Good point, one lifetime is enough
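To illustrate why a single lifetime can suffice, here is a minimal sketch. The trait below is a simplified stand-in for the one under review, and `CountingTokenizer`, `CountingStream`, and `demo` are hypothetical names invented for this example. Because `&'a str` is covariant in `'a`, a longer-lived text (even `&'static str`) is shortened at the call site to the lifetime of the mutable borrow of the tokenizer.

```rust
use std::str::SplitWhitespace;

/// Simplified stand-in for the trait under review (not the real tantivy API).
trait Tokenizer {
    type TokenStream<'a>
    where
        Self: 'a;

    /// One lifetime binds both the tokenizer borrow and the text.
    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a>;
}

/// Hypothetical tokenizer whose only state is a running token count.
struct CountingTokenizer {
    emitted: usize,
}

/// The stream borrows the tokenizer state and the text for the same `'a`.
struct CountingStream<'a> {
    emitted: &'a mut usize,
    words: SplitWhitespace<'a>,
}

impl<'a> Iterator for CountingStream<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let word = self.words.next()?;
        *self.emitted += 1;
        Some(word)
    }
}

impl Tokenizer for CountingTokenizer {
    type TokenStream<'a> = CountingStream<'a> where Self: 'a;

    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
        CountingStream {
            emitted: &mut self.emitted,
            words: text.split_whitespace(),
        }
    }
}

/// Tokenizes a short-lived `String` and a `&'static str` with the same tokenizer.
fn demo() -> usize {
    let mut tokenizer = CountingTokenizer { emitted: 0 };
    {
        let short_lived = String::from("short lived text");
        // Here `'a` is limited by `short_lived`.
        tokenizer.token_stream(&short_lived).count();
    }
    let text: &'static str = "one lifetime is enough";
    // Here `'static` is shortened to the borrow of `tokenizer`.
    tokenizer.token_stream(text).count();
    tokenizer.emitted
}
```

One trade-off of this design: tokens yielded by the stream carry the shortened lifetime, so a caller cannot recover the original, longer lifetime of the text from them.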
Previously, a new `Token` was created for each text encountered, and each `Token` contains a `String::with_capacity(200)`. In the new API, `token_stream` gets mutable access to the tokenizer, which allows state to be shared (in this PR, the `Token` is shared). Ideally the allocation for the `BoxTokenStream` would also be removed, but this may require some lifetime tricks.
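The sharing described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: `Token` is reduced to its text field, and `LowercaseWords`, `LowercaseWordsStream`, and `collect_lowercased` are hypothetical names. The tokenizer allocates the token buffer once, and every stream borrows it mutably instead of allocating its own.

```rust
use std::str::SplitWhitespace;

/// Simplified stand-in for `Token`; only the text buffer is modeled.
struct Token {
    text: String,
}

/// Hypothetical tokenizer that owns a single `Token` and lends it to each stream.
struct LowercaseWords {
    token: Token,
}

/// Stream that writes every token into the tokenizer's shared buffer.
struct LowercaseWordsStream<'a> {
    token: &'a mut Token,
    words: SplitWhitespace<'a>,
}

impl LowercaseWords {
    fn new() -> Self {
        // The 200-byte buffer is allocated once, when the tokenizer is built.
        LowercaseWords {
            token: Token {
                text: String::with_capacity(200),
            },
        }
    }

    /// `&mut self` lets the stream reuse `self.token` instead of allocating.
    fn token_stream<'a>(&'a mut self, text: &'a str) -> LowercaseWordsStream<'a> {
        LowercaseWordsStream {
            token: &mut self.token,
            words: text.split_whitespace(),
        }
    }
}

impl<'a> LowercaseWordsStream<'a> {
    /// Moves to the next token; returns `false` when the text is exhausted.
    fn advance(&mut self) -> bool {
        match self.words.next() {
            Some(word) => {
                // `clear` keeps the capacity, so short tokens never reallocate.
                self.token.text.clear();
                self.token
                    .text
                    .extend(word.chars().flat_map(char::to_lowercase));
                true
            }
            None => false,
        }
    }

    fn token(&self) -> &Token {
        &*self.token
    }
}

/// Usage: several texts can be tokenized with the same buffer.
fn collect_lowercased(tokenizer: &mut LowercaseWords, text: &str) -> Vec<String> {
    let mut stream = tokenizer.token_stream(text);
    let mut out = Vec::new();
    while stream.advance() {
        out.push(stream.token().text.clone());
    }
    out
}
```

The cost of this approach is that only one stream per tokenizer can be alive at a time, since each stream holds the mutable borrow.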
24a18f0 to 69c9277
Codecov Report
@@            Coverage Diff             @@
##             main    #2062      +/-   ##
==========================================
- Coverage   94.38%   94.38%   -0.01%
==========================================
  Files         319      319
  Lines       60040    60091      +51
==========================================
+ Hits        56670    56714      +44
- Misses       3370     3377       +7
#1654