Tokenizer API allocations #2052

Open
PSeitz opened this issue May 19, 2023 · 3 comments

Comments

@PSeitz
Contributor

PSeitz commented May 19, 2023

Currently the tokenizer API generates a lot of allocations.

For every text value encountered, TextAnalyzer::token_stream() is called, which returns a freshly boxed token stream:

impl TextAnalyzer {
    /// Creates a token stream for a given `str`.
    pub fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
        self.tokenizer.box_token_stream(text)
    }
}

Each boxed token stream typically creates a new Token, and Token::default() allocates a String with a capacity of 200 bytes:

impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            text: String::with_capacity(200),
            position_length: 1,
        }
    }
}
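
To make the cost concrete, here is a minimal sketch of how a single Token could be reused across texts so that the String buffer is allocated only once. The Token struct is a simplified stand-in for the one quoted above; the reset helper and the count_tokens function are hypothetical names invented for this illustration and are not part of tantivy's API.

```rust
/// Simplified stand-in for the Token struct quoted above.
#[derive(Debug, Clone)]
pub struct Token {
    pub offset_from: usize,
    pub offset_to: usize,
    pub position: usize,
    pub text: String,
    pub position_length: usize,
}

impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            // This is the heap allocation paid on every Token::default().
            text: String::with_capacity(200),
            position_length: 1,
        }
    }
}

impl Token {
    /// Hypothetical helper: restore the default field values
    /// while keeping the already allocated text buffer.
    pub fn reset(&mut self) {
        self.offset_from = 0;
        self.offset_to = 0;
        self.position = usize::MAX;
        self.text.clear(); // clear() keeps the capacity
        self.position_length = 1;
    }
}

/// Hypothetical whitespace tokenization that writes every token into a
/// caller-provided Token instead of creating a new one per text.
/// Offsets are left untouched to keep the sketch short.
pub fn count_tokens(text: &str, token: &mut Token) -> usize {
    token.reset();
    let mut count = 0;
    for word in text.split_whitespace() {
        token.text.clear();
        token.text.push_str(word);
        // usize::MAX wraps to 0 for the first token.
        token.position = token.position.wrapping_add(1);
        count += 1;
    }
    count
}
```

With this shape, the caller creates one Token up front and passes it to count_tokens for each text; as long as no token exceeds the initial capacity, no further String allocation is needed.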
@PSeitz
Contributor Author

PSeitz commented Jun 9, 2023

PR #2062 mostly fixes this.
The only remaining allocation is the BoxTokenStream per text, which could be avoided with some lifetime hacks (and probably unsafe).

@fulmicoton
Collaborator

Can we close this?

@PSeitz
Contributor Author

PSeitz commented Jul 10, 2023

It would be nice to remove the per-text BoxTokenStream allocation and use the Tokenizer directly, e.g. call set_text on the Tokenizer and then pull the tokens from the Tokenizer itself.
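
A rough sketch of what such an API might look like, assuming a whitespace tokenizer that owns its state. SimpleTokenizer, set_text, advance and token are hypothetical names for this illustration and not tantivy's actual API. To sidestep the lifetime problem mentioned earlier, this sketch copies the text into a reusable internal buffer, trading a memcpy per text for the removed Box allocation; that is only one possible design.

```rust
/// Simplified token for this sketch (not tantivy's Token).
#[derive(Debug, Default)]
pub struct Token {
    pub offset_from: usize,
    pub offset_to: usize,
    pub position: usize,
    pub text: String,
}

/// Hypothetical tokenizer that is reused across texts.
#[derive(Debug, Default)]
pub struct SimpleTokenizer {
    buffer: String, // copy of the current text; capacity is reused
    cursor: usize,  // byte offset of the next unread character
    token: Token,   // single token, overwritten on each advance()
}

impl SimpleTokenizer {
    pub fn new() -> Self {
        Self::default()
    }

    /// Point the tokenizer at a new text without creating a new stream object.
    pub fn set_text(&mut self, text: &str) {
        self.buffer.clear();
        self.buffer.push_str(text);
        self.cursor = 0;
        self.token.position = usize::MAX; // wraps to 0 on the first token
    }

    /// Move to the next whitespace-separated token; returns false at the end.
    pub fn advance(&mut self) -> bool {
        let rest = &self.buffer[self.cursor..];
        let trimmed = rest.trim_start();
        if trimmed.is_empty() {
            return false;
        }
        let start = self.cursor + (rest.len() - trimmed.len());
        let len = trimmed.find(char::is_whitespace).unwrap_or(trimmed.len());
        self.token.text.clear();
        self.token.text.push_str(&trimmed[..len]);
        self.token.offset_from = start;
        self.token.offset_to = start + len;
        self.token.position = self.token.position.wrapping_add(1);
        self.cursor = start + len;
        true
    }

    /// Access the current token.
    pub fn token(&self) -> &Token {
        &self.token
    }
}
```

The caller would create one SimpleTokenizer, then for each text call set_text followed by a loop over advance() and token(). After the internal buffers have grown to fit the largest text and token, no per-text heap allocation remains in this sketch.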
