Number of tokens (151646) does not match number of vectors (151643) #108

Closed
su-park opened this issue Oct 23, 2024 · 5 comments · Fixed by #109
Labels: bug (Something isn't working)

Comments


su-park commented Oct 23, 2024

Hello.

I tried testing model2vec in the following environment and encountered the error below. Is there a way to resolve this?

Environment

model2vec==0.3.0
tokenizers==0.19.1

Code

from model2vec.distill.distillation import distill

# Choose a Sentence Transformer model
model_id = "Alibaba-NLP/gte-Qwen2-7B-instruct"

# Distill an output model with the chosen dimensions
model = distill(model_name=model_id, device="cpu", pca_dims=256)

Error

ValueError: Number of tokens (151646) does not match number of vectors (151643)
@stephantul stephantul self-assigned this Oct 23, 2024
@stephantul stephantul added the bug Something isn't working label Oct 23, 2024
@stephantul
Collaborator

Hey @su-park ,

Thanks for reporting! This shouldn't be happening, I'll take a look as soon as possible.

Stéphan

@stephantul
Collaborator

stephantul commented Oct 23, 2024

Hey, I figured it out. Somehow, the number of tokens in the backend tokenizer and the number of tokens in the HF tokenizer don't match.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct")
tokenizer.vocab_size
# 151643
tokenizer.backend_tokenizer.get_vocab_size()
# 151646
len(tokenizer.get_vocab())
# 151646

I checked, and there are definitely 151646 tokens in the vocabulary, not 151643. So I think the vocab_size being off by three is actually a bug in Transformers, not a bug in model2vec. In the meantime, we can rely on the length of the vocabulary instead. I'll open a PR to change this.
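
For reference, a minimal sketch of that workaround (an illustration only, not the actual change in the PR): derive the token count from the full vocabulary so added special tokens are included.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct")

# get_vocab() includes added special tokens, unlike tokenizer.vocab_size
vocab = tokenizer.get_vocab()           # mapping: token -> id
tokens = sorted(vocab, key=vocab.get)   # tokens ordered by their ids
len(tokens)
# 151646, matching the backend tokenizer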

@stephantul
Collaborator

The PR is here: #109.

I'll see if we can push this together with #107 and put out a bugfix release soon.

@stephantul
Collaborator

I accidentally closed this because I merged the PR. Sorry about that. If you want, you can pull main and retry. I think it should work.
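
A rough sketch of what retrying against main might look like (the install command and repository URL are assumptions, not taken from this thread):

# pip install git+https://github.com/MinishLab/model2vec.git  (repo URL assumed)
from model2vec.distill.distillation import distill

# re-running the original reproduction should now complete without the ValueError
model = distill(model_name="Alibaba-NLP/gte-Qwen2-7B-instruct", device="cpu", pca_dims=256)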

@su-park
Author

su-park commented Oct 24, 2024

Thank you for your swift support.

It worked!

@su-park su-park closed this as completed Oct 24, 2024