Tokenizer.json compatibility with JNI Rust tokenizers - data did not match any variant of untagged enum #3141

Closed
jobergum opened this issue Apr 30, 2024 · 2 comments · Fixed by #3143
Labels
bug Something isn't working

Comments

@jobergum

Description

The upstream Python/Rust transformers tokenizer save_pretrained function adds a new model-level key, model.byte_fallback, to the tokenizer.json configuration. This causes an exception when calling the native createTokenizerFromString. It may be related to using an older Rust version of the tokenizers library.

Expected Behavior

Able to load the tokenizer from a tokenizer.json file.

Error Message

Caused by: java.lang.RuntimeException: 
data did not match any variant of untagged enum PreTokenizerWrapper at line 73 column 3
	at ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.createTokenizerFromString(Native Method)

How to Reproduce?

  1. Install a recent version of the transformers library.
  2. Save the tokenizer from Python:
     from transformers import AutoTokenizer
     tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-small')
     tokenizer.save_pretrained("saved")
  3. Attempt to load the saved tokenizer.json file with 0.27.0 using HuggingFaceTokenizer.newInstance.
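
To narrow down which part of the saved file the native loader rejects, it can help to look at the two sections this report touches: the model section (where byte_fallback was noticed) and the pre_tokenizer section (named in the error message). The following is only a rough diagnostic sketch using the Python standard library. The helper name summarize_tokenizer_config is hypothetical, and it assumes the usual tokenizer.json layout with top-level "model" and "pre_tokenizer" objects.

```python
import json
from pathlib import Path


def summarize_tokenizer_config(path):
    """Summarize the tokenizer.json sections relevant to this issue.

    Hypothetical helper for illustration; it only reads the file and
    does not validate it against any tokenizers release.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))

    # Either section may be missing or null in some tokenizer files.
    model = config.get("model") or {}
    pre_tokenizer = config.get("pre_tokenizer") or {}

    return {
        "model_type": model.get("type"),
        # True when the model section carries the byte_fallback key,
        # regardless of whether its value is true or false.
        "has_byte_fallback": "byte_fallback" in model,
        "pre_tokenizer_type": pre_tokenizer.get("type"),
        # Field names inside pre_tokenizer; an older parser may not know all of them.
        "pre_tokenizer_keys": sorted(pre_tokenizer.keys()),
    }
```

Calling summarize_tokenizer_config("saved/tokenizer.json") after the save step and comparing the result with the same summary for the original tokenizer.json should show which keys were introduced by save_pretrained.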

@frankfliu
Contributor

@jobergum

I can confirm the issue. I will try to upgrade to 0.19.1.
In the meantime, you can use:

HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("intfloat/multilingual-e5-small");

@jobergum
Author

Thank you for the swift reply! Yes, using pre-existing tokenizer files works great, but if people make any kind of change and save the tokenizer file, loading breaks.
