convert fast tokenizers to slow #21289
Comments
I don't think it's possible to get the sentencepiece model from the fast tokenizer's files.
Hey @Narsil, can you please give some insight on this?
You could try and create inverse scripts for the conversion you found, but it's not going to be trivial. You need to create the protobuf sentencepiece expects. Not sure I can provide much more guidance. Why do you want slow tokenizers, if I may ask?
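To give a sense of the protobuf involved: the `sentencepiece` pip package ships the generated `sentencepiece_model_pb2` module, so an existing `.model` file can be parsed and inspected. A minimal sketch, assuming such a file is at hand (the file name is a placeholder):

```python
# Sketch: parse an existing sentencepiece .model file into the ModelProto
# message that sentencepiece expects; "some.model" is a placeholder path.
from sentencepiece import sentencepiece_model_pb2 as sp_model

proto = sp_model.ModelProto()
with open("some.model", "rb") as f:
    proto.ParseFromString(f.read())

# Each vocabulary entry is a SentencePiece message: the token string, its
# score (log-probability for unigram models), and a type (normal, unknown,
# control, user-defined, byte).
for piece in proto.pieces[:5]:
    print(piece.piece, piece.score, piece.type)

# The normalizer spec holds the precompiled_charsmap bytes that also appear
# (base64-encoded) in a fast tokenizer's tokenizer.json.
print(len(proto.normalizer_spec.precompiled_charsmap))
```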
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey @Narsil, thanks for the reply, but I found a fix for my issue :)
Awesome. Do you mind explaining a little more, or giving links for potential readers who would want to do the same?
For sure! I noticed that you have code for converting an SPM model (a slow tokenizer) to a tokenizer.json (fast tokenizer). I also noticed that for some models the SPM file was never uploaded, even though the tokenizer was SPM-based. To get the SPM model from the uploaded tokenizer.json, I had to figure out how to manually create an SPM model containing the same information stored in the tokenizer.json. Concretely, I copied the vocabulary, the precompiled_charsmap, and the special tokens into a blank SPM file and edited it by hand (it already had the correct architecture and some dummy data, which I removed while editing). Once all the information was copied over, the SPM file worked as expected. Here is a notebook demonstrating the process: https://colab.research.google.com/drive/1kfC_iEuU0upVQ5Y3rnnl5VSngSPuiSQI?usp=sharing
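A rough sketch of the surgery described above, assuming a Unigram-based tokenizer.json and a donor `.model` file whose architecture matches the target. All file names are placeholders, and the exact JSON layout varies by tokenizer (e.g. the Precompiled normalizer may be nested inside a Sequence):

```python
import base64
import json

from sentencepiece import sentencepiece_model_pb2 as sp_model

# Load the fast tokenizer's JSON; for Unigram models, model.vocab is a
# list of [token, score] pairs.
with open("tokenizer.json", encoding="utf-8") as f:
    fast = json.load(f)

# Parse the donor .model file (the "blank SPM file" mentioned above).
proto = sp_model.ModelProto()
with open("donor.model", "rb") as f:
    proto.ParseFromString(f.read())

# Replace the donor vocabulary with the pieces from tokenizer.json.
# Special pieces (unk/bos/eos/pad) also need the right `type` set; that
# part of the manual editing is omitted here for brevity.
del proto.pieces[:]
for token, score in fast["model"]["vocab"]:
    piece = proto.pieces.add()
    piece.piece = token
    piece.score = score

# Copy the precompiled_charsmap back. In tokenizer.json it is stored
# base64-encoded on the Precompiled normalizer (if it sits inside a
# Sequence normalizer, locate it there first).
normalizer = fast["normalizer"]
if normalizer.get("type") == "Precompiled":
    raw = base64.b64decode(normalizer["precompiled_charsmap"])
    proto.normalizer_spec.precompiled_charsmap = raw

with open("reconstructed.model", "wb") as f:
    f.write(proto.SerializeToString())
```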
@ahmedlone127 @Narsil, could you please link all the resources on how I could do this?
Everything you need is here: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py There is no simple tutorial, and there are many tokenizer configurations handled in that file.
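For reference, the forward (slow to fast) direction implemented by that script can also be driven programmatically. A minimal sketch, assuming `sentencepiece` is installed; the checkpoint name is only an example of a sentencepiece-based slow tokenizer:

```python
# Sketch of the slow -> fast direction from convert_slow_tokenizer.py;
# "albert-base-v2" is just one example of a sentencepiece-based model.
from transformers import AlbertTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = AlbertTokenizer.from_pretrained("albert-base-v2")
fast_backend = convert_slow_tokenizer(slow)  # returns a tokenizers.Tokenizer
fast_backend.save("tokenizer.json")          # the file this thread reverses
```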
Hello @ahmedlone127, I have the exact same need: to get the original SentencePiece tokenizer.model back from tokenizer.json.
Feature request
I recently noticed that the models being uploaded now ship only their fast tokenizers, and the sentencepiece model (included with the slow version) is missing. I need the sentencepiece model of some tokenizers for a personal project and wanted to know the best way to go about getting it. Looking through the current code in the repository, I saw many methods for handling conversion from slow to fast tokenization, so I think the other direction should be possible too. After a bit of research, the only quick-and-dirty way I could think of was a utility script that converts the fast tokenizer's JSON files to the SPM model format of a slow tokenizer; the information in both should be the same, so the mechanics should be similar too.
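If such a script existed, a quick way to sanity-check its output would be to encode the same text with both tokenizers and compare the pieces. A sketch, reusing the placeholder file names from the comments above (outputs can still differ on special tokens or prefix handling):

```python
import sentencepiece as spm
from tokenizers import Tokenizer

sp = spm.SentencePieceProcessor(model_file="reconstructed.model")
fast = Tokenizer.from_file("tokenizer.json")

text = "Converting fast tokenizers back to slow ones."
print(sp.encode(text, out_type=str))  # pieces from the rebuilt SPM model
print(fast.encode(text).tokens)       # pieces from the fast tokenizer
```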
Motivation
I looked through the tokenizers and saw that most of the ones being uploaded don't ship a slow tokenizer.
Your contribution
If there is any way I can help, I would love to know; I just need some guidance on how to implement this!