convert fast tokenizers to slow #21289

Closed
ahmedlone127 opened this issue Jan 24, 2023 · 10 comments

@ahmedlone127

Feature request

I recently noticed that many models are now being uploaded with only their fast tokenizers, and the sentencepiece model (included with the slow version) is missing. I need the sentencepiece model of some tokenizers for a personal project and wanted to know the best way to go about that. Looking through the current code in the repository, I saw a lot of methods for handling conversion from slow to fast tokenizers, so I think the other direction should be possible too. After a bit of research, the only quick-and-dirty approach I could think of was a utility script that converts the fast tokenizer's JSON files into the spm model format of a slow tokenizer; the information in both should be the same, so the mechanics should be similar as well.

Motivation

I looked through recently uploaded models and saw that most of them don't include slow tokenizers.

Your contribution

If there is any way I can help I would love to know; I just need some guidance on how to implement this!

@sgugger

sgugger commented Jan 24, 2023

I don't think it's possible to get the sentencepiece model from the tokenizer.json file, but maybe @Narsil knows a way.

@ahmedlone127

Hey @Narsil, can you please give some insight on this?

@Narsil

Narsil commented Jan 30, 2023

You could try to create inverse scripts for the conversion code you found, but it's not going to be trivial.

You would need to create the protobuf that sentencepiece expects.
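
The protobuf in question is sentencepiece's ModelProto. Below is a minimal sketch of loading and inspecting one, assuming the protobuf bindings shipped with recent versions of the sentencepiece pip package; the fields shown are the ones a reverse converter would have to populate from tokenizer.json:

```python
# Sketch: inspect the ModelProto that sentencepiece serializes into a .model file.
# Assumes the bindings bundled with the sentencepiece package (sentencepiece_model_pb2).
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

proto = sp_pb2.ModelProto()
with open("spiece.model", "rb") as f:  # any existing sentencepiece model file
    proto.ParseFromString(f.read())

print(len(proto.pieces))                    # vocabulary: one (piece, score) entry per token
print(proto.trainer_spec.model_type)        # model type enum (UNIGRAM, BPE, ...)
print(len(proto.normalizer_spec.precompiled_charsmap))  # normalization data
```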

Not sure I can provide much more guidance.

Why do you want slow tokenizers if I may ask?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ahmedlone127

Hey @Narsil, thanks for the reply, but I found a fix for my issue :)

@Narsil

Narsil commented Feb 24, 2023

Awesome. Do you mind explaining a little more or giving links for potential readers who would want to do the same?

@ahmedlone127

For sure!

I noticed that you have code for converting an spm model (a slow tokenizer) into a tokenizer.json (fast tokenizer). I also noticed that for some models the SPM model was not uploaded even though the tokenizer was SPM-based. To get the SPM model from the uploaded tokenizer.json, I had to figure out how to manually create an SPM model containing information identical to what is stored in the tokenizer.json.

For example, I had to copy the vocabulary, precompiled_charsmap, and the special tokens, and manually edit a blank SPM file (it already had the correct structure plus some dummy data that I removed while editing). Once all the information was copied over to the SPM file, it worked as expected.

Here is a notebook demonstrating the process:

https://colab.research.google.com/drive/1kfC_iEuU0upVQ5Y3rnnl5VSngSPuiSQI?usp=sharing
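
In rough outline, the editing step described above looks something like the sketch below. This is not the notebook itself, just an illustration under a few assumptions: a donor `.model` file used as a template, the protobuf bindings shipped with the sentencepiece package, and the Unigram layout of tokenizer.json (other model types store the vocabulary differently):

```python
import base64
import json

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Data from the fast tokenizer.
with open("tokenizer.json", encoding="utf-8") as f:
    fast = json.load(f)

# Start from an existing/blank SPM file so the overall structure is already valid.
proto = sp_pb2.ModelProto()
with open("template.model", "rb") as f:  # hypothetical donor model file
    proto.ParseFromString(f.read())

# Replace the vocabulary: Unigram tokenizer.json stores it as [piece, score] pairs.
del proto.pieces[:]
for piece, score in fast["model"]["vocab"]:
    entry = proto.pieces.add()
    entry.piece = piece
    entry.score = score

# Copy the precompiled normalization map, if the normalizer carries one
# (stored base64-encoded in tokenizer.json).
normalizer = fast.get("normalizer") or {}
charsmap = normalizer.get("precompiled_charsmap")
if charsmap:
    proto.normalizer_spec.precompiled_charsmap = base64.b64decode(charsmap)

with open("reconstructed.model", "wb") as f:
    f.write(proto.SerializeToString())
```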

@StephennFernandes

@ahmedlone127 @Narsil
Hey guys, I've been training my tokenizers using spm, but I'm stuck because I can't figure out how to convert my sentencepiece.model to a Hugging Face tokenizer (preferably a fast tokenizer).

Could you please link me any resources on how I could do this?

@Narsil

Narsil commented Nov 27, 2023

Everything you need is here: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py

There is no simple tutorial; there are many configurations in tokenizers that could achieve what you want, with various trade-offs.
What I recommend is running a diverse set of UTF-8 strings, plus all the special-token combinations that might be useful, through your test suite to verify that the IDs match.
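
For the common case of a sentencepiece model that matches one of the slow tokenizer classes already in transformers, a minimal sketch follows. T5Tokenizer and the file paths are only examples; use whichever slow class matches your model's preprocessing, and treat the final loop as a starting point for the kind of ID-matching test suite suggested above:

```python
from transformers import PreTrainedTokenizerFast, T5Tokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Wrap the trained sentencepiece model in a matching slow tokenizer class.
slow = T5Tokenizer("my_sentencepiece.model")  # hypothetical path to your .model file

# Convert it to a `tokenizers` backend and wrap that as a fast tokenizer.
fast = PreTrainedTokenizerFast(tokenizer_object=convert_slow_tokenizer(slow))
fast.save_pretrained("my_fast_tokenizer")

# Verify that both produce identical IDs on a diverse set of UTF-8 strings
# (extend this with the special-token combinations you care about).
for text in ["Hello world", "héllo wörld", "Numbers 12345 and punctuation!"]:
    assert slow.encode(text) == fast.encode(text), text
```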

@Derekglk

Derekglk commented Mar 8, 2024

Hello @ahmedlone127, I have the exact same need: getting the original SentencePiece tokenizer.model from tokenizer.json.
Would you mind resharing your notebook, please? The file no longer exists under this link.
Much appreciated. Thanks!
