Tokenizer API allocations #2052

PSeitz · 2023-05-19T09:51:58Z

Currently, the tokenizer API generates a lot of allocations.

For every text encountered, TextAnalyzer::token_stream() is called:

impl TextAnalyzer {
    /// Creates a token stream for a given `str`.
    pub fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
        self.tokenizer.box_token_stream(text)
    }
}

A boxed token stream typically creates a new Token, whose default value allocates a String with a capacity of 200 bytes:

impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            text: String::with_capacity(200),
            position_length: 1,
        }
    }
}
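To illustrate why creating a fresh Token per text is costly, here is a minimal sketch of the alternative: keeping one String buffer alive and refilling it. The helper name `fill_token_text` is hypothetical and not part of tantivy; only standard library calls are used.

```rust
/// Hypothetical helper: overwrite a reusable token text buffer.
/// `clear` drops the contents but keeps the existing allocation,
/// so no new allocation happens while the text fits in the capacity.
fn fill_token_text(buffer: &mut String, text: &str) {
    buffer.clear();
    buffer.push_str(text);
}

/// Hypothetical helper: run several texts through one buffer and report
/// whether the buffer still points at its original allocation.
fn reuses_allocation(texts: &[&str]) -> bool {
    // Same capacity as in the Default impl shown above.
    let mut buffer = String::with_capacity(200);
    let original_ptr = buffer.as_ptr();
    for text in texts {
        fill_token_text(&mut buffer, text);
    }
    // An unchanged pointer means the buffer was never reallocated.
    buffer.as_ptr() == original_ptr
}
```

As long as each token text fits in the existing capacity, the buffer is allocated once instead of once per text.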


PSeitz · 2023-06-09T04:40:51Z

PR #2062 mostly fixes this.
The only remaining allocation is the BoxTokenStream created per text, which could be avoided with some lifetime hacks (and probably unsafe).

fulmicoton · 2023-07-10T00:59:03Z

Can we close this?

PSeitz · 2023-07-10T01:31:34Z

It would be nice to remove the per-text BoxTokenStream allocation and use the Tokenizer directly, e.g. call set_text on the Tokenizer and then get the tokens from the Tokenizer directly.
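One possible shape for this idea, as a rough sketch only: the tokenizer owns a single Token and hands out a concrete, unboxed stream that borrows it. This is not a literal set_text, which would require the tokenizer to store a borrowed text. All types here (`WhitespaceTokenizer`, `WhitespaceStream`, `collect_texts`) are hypothetical stand-ins and not tantivy's API, and the whitespace splitting is just a placeholder for real tokenization.

```rust
/// Hypothetical token, mirroring some of the fields shown in this issue.
struct Token {
    offset_from: usize,
    offset_to: usize,
    position: usize,
    text: String,
}

/// Hypothetical tokenizer that owns one reusable Token.
struct WhitespaceTokenizer {
    token: Token,
}

/// Concrete stream type borrowing the tokenizer's Token: no Box involved.
struct WhitespaceStream<'a> {
    text: &'a str,
    cursor: usize,
    token: &'a mut Token,
}

impl WhitespaceTokenizer {
    fn new() -> Self {
        let token = Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            text: String::with_capacity(200),
        };
        WhitespaceTokenizer { token }
    }

    /// Resets the shared Token and starts tokenizing `text`.
    fn token_stream<'a>(&'a mut self, text: &'a str) -> WhitespaceStream<'a> {
        self.token.offset_from = 0;
        self.token.offset_to = 0;
        self.token.position = usize::MAX;
        self.token.text.clear(); // keeps the allocated capacity
        WhitespaceStream { text, cursor: 0, token: &mut self.token }
    }
}

impl<'a> WhitespaceStream<'a> {
    /// Moves to the next token; returns false when the text is exhausted.
    fn advance(&mut self) -> bool {
        let rest = &self.text[self.cursor..];
        // Skip leading whitespace to find the start of the next token.
        let start = match rest.find(|c: char| !c.is_whitespace()) {
            Some(i) => self.cursor + i,
            None => return false,
        };
        // The token ends at the next whitespace or at the end of the text.
        let end = self.text[start..]
            .find(char::is_whitespace)
            .map(|i| start + i)
            .unwrap_or(self.text.len());
        self.token.offset_from = start;
        self.token.offset_to = end;
        // usize::MAX wraps to 0 for the first token.
        self.token.position = self.token.position.wrapping_add(1);
        self.token.text.clear();
        self.token.text.push_str(&self.text[start..end]);
        self.cursor = end;
        true
    }

    fn token(&self) -> &Token {
        &*self.token
    }
}

/// Hypothetical usage helper: collect token texts, reusing the tokenizer.
fn collect_texts(tokenizer: &mut WhitespaceTokenizer, text: &str) -> Vec<String> {
    let mut stream = tokenizer.token_stream(text);
    let mut out = Vec::new();
    while stream.advance() {
        out.push(stream.token().text.clone());
    }
    out
}
```

Because the stream is a concrete type that mutably borrows the tokenizer, the same tokenizer can be reused for the next text once the stream is dropped, without unsafe code. The trade-off is that a trait-based design would need an associated stream type rather than a boxed trait object.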
