Number of tokens (151646) does not match number of vectors (151643) #108

Closed
su-park opened this issue Oct 23, 2024 · 5 comments · Fixed by #109
Labels: bug (Something isn't working)

Comments


su-park commented Oct 23, 2024

Hello.

I tried testing model2vec in the following environment and encountered the error below. Is there a way to resolve this?

Environment

model2vec==0.3.0
tokenizers==0.19.1

Code

from model2vec.distill.distillation import distill

# Choose a Sentence Transformer model
model_id = "Alibaba-NLP/gte-Qwen2-7B-instruct"

# Distill an output model with the chosen dimensions
model = distill(model_name=model_id, device="cpu", pca_dims=256)

Error

ValueError: Number of tokens (151646) does not match number of vectors (151643)
@stephantul stephantul self-assigned this Oct 23, 2024
@stephantul stephantul added the bug Something isn't working label Oct 23, 2024
@stephantul
Collaborator

Hey @su-park ,

Thanks for reporting! This shouldn't be happening, I'll take a look as soon as possible.

Stéphan

@stephantul
Collaborator

stephantul commented Oct 23, 2024

Hey, I figured it out. Somehow, the number of tokens in the backend tokenizer and the number of tokens in the HF tokenizer don't match.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct")
tokenizer.vocab_size
# 151643
tokenizer.backend_tokenizer.get_vocab_size()
# 151646
len(tokenizer.get_vocab())
# 151646

I checked, and there are definitely 151646 tokens in the vocabulary, not 151643. So I think the vocab_size being off by three is actually a bug in Transformers, not a bug in model2vec. In the meantime, we can rely on the length of the vocabulary instead. I'll open a PR to change this.
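
For reference, a minimal sketch of that workaround (an illustration only, not the actual change in the PR): derive the token count from the full vocabulary so added special tokens are included.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct")

# get_vocab() includes added special tokens, unlike tokenizer.vocab_size
vocab = tokenizer.get_vocab()           # mapping: token -> id
tokens = sorted(vocab, key=vocab.get)   # tokens ordered by their ids
len(tokens)
# 151646, matching the backend tokenizer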

@stephantul
Collaborator

The PR is here: #109.

I'll see if we can push this together with #107 and put out a bugfix release soon.

@stephantul
Collaborator

I accidentally closed this because I merged the PR. Sorry about that. If you want, you can pull main and retry. I think it should work.
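
A rough sketch of what retrying against main might look like (the install command and repository URL are assumptions, not taken from this thread):

# pip install git+https://github.com/MinishLab/model2vec.git  (repo URL assumed)
from model2vec.distill.distillation import distill

# re-running the original reproduction should now complete without the ValueError
model = distill(model_name="Alibaba-NLP/gte-Qwen2-7B-instruct", device="cpu", pca_dims=256)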

@su-park
Author

su-park commented Oct 24, 2024

Thank you for your swift support.

It worked!

@su-park su-park closed this as completed Oct 24, 2024