
Strange warnings with tokenizer for some models #1528

Closed

EricLBuehler opened this issue May 9, 2024 · 5 comments · Fixed by EricLBuehler/mistral.rs#314

Comments

@EricLBuehler (Member)

Hello all,

Thank you for your excellent work here! We are using Tokenizer::from_file to load the tokenizer.json file from the HF Hub. However, it produces many warnings when loading the Phi-3 tokenizer:

2024-05-09T12:11:56.647710Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|endoftext|>' was expected to have ID '32000' but was given ID 'None'    
2024-05-09T12:11:56.647734Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|assistant|>' was expected to have ID '32001' but was given ID 'None'    
2024-05-09T12:11:56.647737Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder1|>' was expected to have ID '32002' but was given ID 'None'    
2024-05-09T12:11:56.647739Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder2|>' was expected to have ID '32003' but was given ID 'None'    
2024-05-09T12:11:56.647742Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder3|>' was expected to have ID '32004' but was given ID 'None'    
2024-05-09T12:11:56.647744Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder4|>' was expected to have ID '32005' but was given ID 'None'    
2024-05-09T12:11:56.647746Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|system|>' was expected to have ID '32006' but was given ID 'None'    
2024-05-09T12:11:56.647748Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|end|>' was expected to have ID '32007' but was given ID 'None'    
2024-05-09T12:11:56.647750Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder5|>' was expected to have ID '32008' but was given ID 'None'    
2024-05-09T12:11:56.647752Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder6|>' was expected to have ID '32009' but was given ID 'None'    
2024-05-09T12:11:56.647760Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|user|>' was expected to have ID '32010' but was given ID 'None'    

I have also noticed this for Phi-2 and Llama 3, although I see no tokenization errors in the encoded or decoded output.

Is there a way to disable this warning, or am I misconfiguring something? Thank you!
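
For reference, a minimal standalone reproduction looks like this (a sketch; the path is hypothetical and assumes tokenizer.json was already downloaded from the Hub):

use tokenizers::Tokenizer;

fn main() {
    // Loading the stock Phi-3 tokenizer.json emits the WARN lines shown above.
    let tokenizer = Tokenizer::from_file("phi3/tokenizer.json").expect("failed to load tokenizer");
    println!("vocab size: {}", tokenizer.get_vocab_size(true));
}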

@EricLBuehler (Member, Author)

Fixed by this gist: https://gist.github.com/jneuff/682d47b786329f19291d166957b3274a

It seems to be an issue with the tokenizer.json file: the added tokens are not present in the model vocabulary.
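
For context, the mismatch looks roughly like this (an abridged, hypothetical sketch of a tokenizer.json, not the actual file): an entry under added_tokens whose content is missing from model.vocab.

{
  "added_tokens": [
    { "id": 32000, "content": "<|endoftext|>", "special": true }
  ],
  "model": {
    "vocab": {
      "<s>": 1
    }
  }
}

The gist repairs this by inserting each such token into model.vocab under the id it declares in added_tokens.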

@ArthurZucker (Collaborator)

ArthurZucker commented May 17, 2024

Which files on the hub are you using? And which tokenizers version?
It's a bit weird and should not be happening.

@EricLBuehler (Member, Author)

EricLBuehler commented May 19, 2024

@ArthurZucker, I am using tokenizers version 0.19.1 and this tokenizer file:

tokenizers = "0.19.1"

Edit:
Loading with this function demonstrates the issue:

pub(crate) fn get_tokenizer<P: AsRef<Path> + Clone>(p: P) -> Result<Tokenizer> {
    Tokenizer::from_file(p).map_err(anyhow::Error::msg)
}

But this fixes it:

pub(crate) fn get_tokenizer<P: AsRef<Path> + Clone>(p: P) -> Result<Tokenizer> {
    // Write the patched tokenizer next to the original file.
    let fixed_path = format!("{}_mistralrs_fixed", p.as_ref().display());
    let fixed_path = Path::new(&fixed_path);

    if !fixed_path.exists() {
        let raw = std::fs::read(p.clone()).map_err(anyhow::Error::msg)?;
        let mut tokenizer: Value = serde_json::from_slice(&raw).unwrap();
        let added_tokens: Vec<AddedToken> =
            serde_json::from_value(tokenizer["added_tokens"].clone()).unwrap();
        let vocab: HashMap<String, usize> =
            serde_json::from_value(tokenizer["model"]["vocab"].clone()).unwrap();
        // Insert every added token that is missing from the model vocabulary,
        // using the id it declares in `added_tokens`.
        for token in added_tokens {
            if !vocab.contains_key(&token.content) {
                tokenizer["model"]["vocab"]
                    .as_object_mut()
                    .unwrap()
                    .insert(token.content, token.id.into())
                    // `insert` returns the previous value; assert there was none.
                    .ok_or(())
                    .unwrap_err();
            }
        }
        let raw_fixed = serde_json::to_vec_pretty(&tokenizer).unwrap();
        std::fs::write(fixed_path, raw_fixed).unwrap();
    }

    Tokenizer::from_file(fixed_path).map_err(anyhow::Error::msg)
}

@ArthurZucker (Collaborator)

Yeah, but adding tokens to the Tokenizer is specifically designed to keep the tokens that come from training separate from the tokens that were added in the added_tokens_map.
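
For illustration, a minimal sketch of that intended path with the tokenizers crate (the file path and token are placeholders): special tokens are registered through the added-vocabulary API instead of being written into model.vocab.

use tokenizers::{AddedToken, Tokenizer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut tokenizer = Tokenizer::from_file("tokenizer.json")?;
    // Added/special tokens live in the added vocabulary, separate from the trained vocab.
    tokenizer.add_special_tokens(&[AddedToken::from("<|end|>", true)]);
    println!("{:?}", tokenizer.token_to_id("<|end|>"));
    Ok(())
}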

@ArthurZucker (Collaborator)

I could not reproduce this with wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/raw/main/tokenizer.json, nor with the file you shared, using the latest tokenizers version.
