
Memory leak in HF tokenizer when using truncation and optWithOverflowingTokens(false) #3316

Closed
lesters opened this issue Jul 10, 2024 · 1 comment · Fixed by #3317
Labels: bug

lesters commented Jul 10, 2024

We have observed a native memory leak when using ai.djl.huggingface.tokenizers.HuggingFaceTokenizer. With the default (false) setting of optWithOverflowingTokens, memory usage grows significantly over time when long strings are truncated to shorter token sequences; the effect is most pronounced when very long strings are truncated to very short sequences. When we set optWithOverflowingTokens to true, the memory growth does not occur.

Bisecting across releases, this behaviour first appears in version 0.27.0, and tracing that to the release notes, this PR looks like the likely culprit: #2957.

Specifically, these lines: https://github.com/deepjavalibrary/djl/pull/2957/files#diff-62d10f278a5a7644ce30deff638cf6ead21457bca60b9cc7430d115dd2fa2b38R533-R537

It seems that calling TokenizersLibrary.LIB.getOverflowing(encoding) creates clones of the native encoding, and those clones are only cleaned up when withOverflowingTokens is true: in that case toEncoding is called recursively on each overflowing handle, which eventually calls TokenizersLibrary.LIB.deleteEncoding(encoding); on the copy.

So when withOverflowingTokens is false, getOverflowing is still called, but this cleanup never happens and every truncated encode leaks its overflow clones. A sketch of a possible fix follows.
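
A minimal sketch of one possible fix, assuming the code shape in the diff linked above (the surrounding toEncoding body is elided, and the assumption that getOverflowing returns an array of native handles is based on the recursive toEncoding call described there): only ask the native side for the overflowing handles when they are actually wanted.

    // Hypothetical sketch only -- not the actual DJL patch.
    Encoding[] overflowing = null;
    if (withOverflowingTokens) {
        long[] overflowingHandles = TokenizersLibrary.LIB.getOverflowing(encoding);
        overflowing = new Encoding[overflowingHandles.length];
        for (int i = 0; i < overflowingHandles.length; i++) {
            // each recursive call frees its handle via deleteEncoding, as described above
            overflowing[i] = toEncoding(overflowingHandles[i], true);
        }
    }
    // With this guard the clones are never created when withOverflowingTokens
    // is false, so there is nothing left to leak.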

lesters commented Jul 10, 2024

Code to reproduce the problem:

import java.nio.file.Path;

// A long input that will be truncated down to max length 5, producing many
// overflowing encodings on the native side.
var input = "this will become a long string".repeat(256);

var tokenizer = ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.builder()
        .optTokenizerPath(Path.of("src/test/models/huggingface/bert-base-uncased.json"))
        .optMaxLength(5)
        .optTruncation(true)
        .optWithOverflowingTokens(false)
        .build();

// Native (off-heap) memory grows on every call, even though the returned
// Encoding objects are unreachable and the JVM heap stays flat.
while (true) {
    tokenizer.encode(input);
}

Memory usage will increase very rapidly here.
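
Since the leak is in native memory allocated by the tokenizers library, JVM heap tools will not show it; on Linux you can watch the process's resident set size instead. A small illustrative helper (our addition, not part of the original report; the name rssKb is hypothetical):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Linux-only: read the VmRSS line from /proc/self/status, in kB.
    static long rssKb() throws IOException {
        for (String line : Files.readAllLines(Path.of("/proc/self/status"))) {
            if (line.startsWith("VmRSS:")) {
                return Long.parseLong(line.replaceAll("\\D+", ""));
            }
        }
        return -1;
    }

Printing rssKb() every few thousand iterations of the loop above makes the growth obvious while the JVM heap stays flat.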
