Strange warnings with tokenizer for some models #1528
Fixed by this gist: https://gist.github.com/jneuff/682d47b786329f19291d166957b3274a. Seems to be an issue with the `tokenizer.json` file.
Which files on the hub are you using? And which tokenizers version?
@ArthurZucker, I am using tokenizers version 0.19.1 (`tokenizers = "0.19.1"`) and this tokenizer file:

Edit: this reproduces the warnings:

```rust
pub(crate) fn get_tokenizer<P: AsRef<Path> + Clone>(p: P) -> Result<Tokenizer> {
    Tokenizer::from_file(p).map_err(anyhow::Error::msg)
}
```

But this fixes it:

```rust
pub(crate) fn get_tokenizer<P: AsRef<Path> + Clone>(p: P) -> Result<Tokenizer> {
    let fixed_path = format!("{}_mistralrs_fixed", p.as_ref().display());
    let fixed_path = Path::new(&fixed_path);
    if !fixed_path.exists() {
        let raw = std::fs::read(p.clone()).map_err(anyhow::Error::msg)?;
        let mut tokenizer: Value = serde_json::from_slice(&raw).unwrap();
        let added_tokens: Vec<AddedToken> =
            serde_json::from_value(tokenizer["added_tokens"].clone()).unwrap();
        let vocab: HashMap<String, usize> =
            serde_json::from_value(tokenizer["model"]["vocab"].clone()).unwrap();
        for token in added_tokens {
            if !vocab.contains_key(&token.content) {
                // Copy the missing added token into the model vocab.
                // `insert` returns None for an absent key, so `ok_or(()).unwrap_err()`
                // asserts we never overwrite an existing entry.
                tokenizer["model"]["vocab"]
                    .as_object_mut()
                    .unwrap()
                    .insert(token.content, token.id.into())
                    .ok_or(())
                    .unwrap_err();
            }
        }
        let raw_fixed = serde_json::to_vec_pretty(&tokenizer).unwrap();
        std::fs::write(fixed_path, raw_fixed).unwrap();
    }
    Tokenizer::from_file(fixed_path).map_err(anyhow::Error::msg)
}
```
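The core of the fix above is language-agnostic: any entry in the top-level `added_tokens` array whose `content` is missing from `model.vocab` gets copied into the vocab. A minimal sketch of that same patch in Python (stdlib only; the token `<|end|>` and id `32000` are a made-up example, and the JSON layout is assumed to follow the `tokenizer.json` schema used above):

```python
import json

def patch_tokenizer(raw: str) -> str:
    """Copy added_tokens missing from model.vocab into the vocab,
    mirroring the Rust fix above."""
    tok = json.loads(raw)
    vocab = tok["model"]["vocab"]
    for t in tok["added_tokens"]:
        if t["content"] not in vocab:
            vocab[t["content"]] = t["id"]
    return json.dumps(tok, indent=2)

# Hypothetical example: "<|end|>" is declared as an added token but
# absent from the vocab, which is the mismatch that triggers warnings.
raw = json.dumps({
    "added_tokens": [{"id": 32000, "content": "<|end|>"}],
    "model": {"vocab": {"hello": 0}},
})
fixed = json.loads(patch_tokenizer(raw))
print(fixed["model"]["vocab"]["<|end|>"])  # 32000
```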
Yeah, but adding tokens to the
I could not reproduce this with
Hello all,

Thank you for your excellent work here! We are using `Tokenizer::from_file` to load the `tokenizer.json` file from the HF hub. However, it produces many warnings when loading the Phi3 tokenizer. I have also noticed this for Phi2 and Llama3, although I see no tokenization errors in the encoded or decoded output.

Is there a way to disable this warning, or am I misconfiguring something? Thank you!