Loading tokenizer.model with Rust API #1518

Comments
You cannot load a `tokenizer.model`; you need to write a converter. https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L544 is the simplest way to understand the process!
Ok, I understand. Do you know of a way or a library to do this in Rust without reaching for the Python transformers converter?
A library, no, but we should be able to come up with some small Rust code to do this 😉
@ArthurZucker are there any specifications or example loaders which I can look at to implement this?
I also have the same question, for llava reasons 😉
Yes! Actually the best way to do this is to use the converters from `transformers`. In Rust we would need to read and parse the `tokenizer.model` file (a serialized sentencepiece protobuf) ourselves.
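To make the "read and parse" step concrete, here is a minimal sketch of pulling the vocabulary and scores out of a `tokenizer.model` file. It assumes an `sp_proto` module generated at build time (e.g. with `prost-build`) from sentencepiece's `sentencepiece_model.proto`; the module name and the build setup are illustrative, not something provided by this thread or the tokenizers crate.

```rust
use prost::Message; // prost + prost-build in Cargo.toml / build.rs (assumed setup)
use std::fs;

// Assumed: generated from sentencepiece_model.proto via prost-build in build.rs.
// `ModelProto` / `SentencePiece` mirror the messages defined in that proto file.
pub mod sp_proto {
    include!(concat!(env!("OUT_DIR"), "/sentencepiece.rs"));
}

fn read_sentencepiece_vocab(path: &str) -> Result<Vec<(String, f64)>, Box<dyn std::error::Error>> {
    // The whole tokenizer.model file is one serialized ModelProto message.
    let bytes = fs::read(path)?;
    let model = sp_proto::ModelProto::decode(bytes.as_slice())?;

    // Each SentencePiece entry carries the piece string and its unigram log-probability.
    let vocab = model
        .pieces
        .iter()
        .map(|p| (p.piece().to_string(), p.score() as f64))
        .collect();
    Ok(vocab)
}
```

The resulting `(piece, score)` pairs are exactly what a Unigram model needs, which is what the conversion boils down to.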
Ok. Could I use this crate? One other question: I am implementing GGUF-to-HF tokenizer conversion. This is what I currently do: https://github.com/EricLBuehler/mistral.rs/blob/d66e5aff1e7faf208469c5bef3c70d45ffda5401/mistralrs-core/src/pipeline/gguf_tokenizer.rs#L116-L142; I would appreciate it if you could take a quick look and see if there is anything obviously wrong!
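For reference, a stripped-down sketch of the approach in the linked snippet: take the `tokenizer.ggml.tokens` and `tokenizer.ggml.scores` arrays from the GGUF metadata and feed them into a `Unigram` model. The `gguf_vocab` helper is hypothetical (a stand-in for however you read the GGUF metadata), and `Unigram::from` is assumed to have the three-argument `(vocab, unk_id, byte_fallback)` form from recent `tokenizers` releases.

```rust
use tokenizers::models::unigram::Unigram;
use tokenizers::Tokenizer;

/// Hypothetical helper: however you read the GGUF file, it should yield the
/// `tokenizer.ggml.tokens` and `tokenizer.ggml.scores` metadata arrays.
fn gguf_vocab() -> (Vec<String>, Vec<f32>) {
    unimplemented!("read tokenizer.ggml.tokens / tokenizer.ggml.scores from the GGUF metadata")
}

fn tokenizer_from_gguf() -> tokenizers::Result<Tokenizer> {
    let (tokens, scores) = gguf_vocab();
    let vocab: Vec<(String, f64)> = tokens
        .into_iter()
        .zip(scores.into_iter().map(|s| s as f64))
        .collect();

    // unk_id = Some(0) follows the common llama.cpp convention of <unk> at index 0;
    // byte_fallback = true because llama-style sentencepiece models use byte pieces.
    // (Constructor signature assumed from recent `tokenizers` releases.)
    let unigram = Unigram::from(vocab, Some(0), true)?;

    // In practice you would also attach the Metaspace pre-tokenizer/decoder and
    // register the special tokens, as the linked mistral.rs code does.
    Ok(Tokenizer::new(unigram))
}
```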
Oh, I also have an interest in reading sentencepiece tokenizers, in order to invoke the SigLIP text transformer in Rust! EDIT: using the library mentioned by Eric above, I was able to load https://huggingface.co/google/siglip-so400m-patch14-384/blob/main/spiece.model and it seemingly tokenized my input!
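The thread doesn't say which library was used here, but for a quick sanity check one option is the `sentencepiece` crate (Rust bindings to the C++ library). A rough sketch, with the file name as a placeholder and the method names worth double-checking against the crate docs:

```rust
use sentencepiece::SentencePieceProcessor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the raw sentencepiece model, e.g. SigLIP's spiece.model.
    let spp = SentencePieceProcessor::open("spiece.model")?;

    // Encode a sample string; each piece comes back with its string form and id.
    let pieces = spp.encode("a photo of a cat")?;
    for p in &pieces {
        println!("{} -> {}", p.piece, p.id);
    }
    Ok(())
}
```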
@EricLBuehler we actually shipped this. I'll think about potentially converting sentencepiece.model to Rust automatically, but the big problem is that I don't want to have to support sentencepiece + tiktoken, so it might just be example gists / snippets of how to do this!
Thank you, @ArthurZucker, for the link! I was actually able to get the GPT2 conversion to work now!
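For anyone landing here for the GPT2 case: once you have the `vocab.json` and `merges.txt` that GPT2-style checkpoints ship with, the BPE model can be built directly with the Rust crate. A sketch under those assumptions (the file names are the conventional ones, not something this thread specifies, and the exact `with_*` setter signatures have shifted a little between tokenizers releases):

```rust
use tokenizers::models::bpe::BPE;
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::{DecoderWrapper, PreTokenizerWrapper, Tokenizer};

fn main() -> tokenizers::Result<()> {
    // Build a byte-level BPE model from the standard GPT2 vocabulary files.
    let bpe = BPE::from_file("vocab.json", "merges.txt").build()?;

    let mut tokenizer = Tokenizer::new(bpe);
    // GPT2 uses the byte-level pre-tokenizer and the matching decoder.
    // (Adjust the setter calls to your tokenizers version if needed.)
    tokenizer.with_pre_tokenizer(PreTokenizerWrapper::ByteLevel(ByteLevel::default()));
    tokenizer.with_decoder(DecoderWrapper::ByteLevel(ByteLevel::default()));

    let encoding = tokenizer.encode("Hello, world!", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}
```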
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Hello all,

Thank you for your excellent work here. I am trying to load a `tokenizer.model` file in my Rust application. However, it seems that the `Tokenizer::from_file` function only supports loading from a `tokenizer.json` file. This causes problems, as using a small script to save the `tokenizer.json` is error-prone and hard for users to discover. Is there a way to load a `tokenizer.model` file?
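For completeness, the path that is supported today: once a `tokenizer.json` has been produced (via the conversion discussed above), loading it from Rust is straightforward. The file name and sample text below are placeholders.

```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // from_file expects the converted tokenizer.json, not the raw tokenizer.model.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    let encoding = tokenizer.encode("Hello, tokenizers!", true)?;
    println!("tokens: {:?}", encoding.get_tokens());
    println!("ids:    {:?}", encoding.get_ids());
    Ok(())
}
```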