Number of tokens (151646) does not match number of vectors (151643) #108
Comments
Hey @su-park, thanks for reporting! This shouldn't be happening; I'll take a look as soon as possible. Stéphan
Hey, I figured it out. Somehow, the number of tokens reported by the backend tokenizer and the number of tokens reported by the HF tokenizer don't match:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct")
tokenizer.vocab_size
# 151643
tokenizer.backend_tokenizer.get_vocab_size()
# 151646
len(tokenizer.get_vocab())
# 151646
```

I checked, and there are definitely 151646 tokens in the vocabulary, not 151643. So I think the …
I accidentally closed this because I merged the PR. Sorry about that. If you want, you can pull main and retry. I think it should work.
Thank you for your swift support. It worked!
Hello.
I tried testing model2vec in the following environment and encountered the error below. Is there a way to resolve this?

Environment

Code

Error