[Tokenizer Serialization] Fix the broken serialisation #27099

Merged: 8 commits, Dec 13, 2023

Collaborator

@ArthurZucker ArthurZucker commented Oct 27, 2023

What does this PR do?

Should fix some serialization issues, mostly in save_pretrained when all the init kwargs are saved, and in from_pretrained when special tokens are stored as dicts.
fixes #26732

With main:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/Llemma_7b", use_fast=False)
File ~/Work/transformers/src/transformers/tokenization_utils_base.py:2253, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   2251     if added_tokens_map != {} and init_kwargs[key] is not None:
   2252         if key != "additional_special_tokens":
-> 2253             init_kwargs[key] = added_tokens_map.get(init_kwargs[key], init_kwargs[key])
   2255 init_kwargs["added_tokens_decoder"] = added_tokens_decoder
   2256 # convert {'__type': 'AddedToken', 'content': '<ent>', 'lstrip': False, 'normalized': True, ...} to AddedTokens

TypeError: unhashable type: 'dict'

This is because the tokenizer had special tokens saved as dicts, and the call to convert_added_tokens is made only after this lookup, so a dict ends up being used as a dictionary key.
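The failure mode can be reproduced without transformers. The sketch below is only an illustration, not the actual diff of this PR: `Token`, `to_token`, `lookup_then_convert` and `convert_then_lookup` are hypothetical names, and `Token` is a simplified stand-in for AddedToken. It shows that looking up a serialized (dict) token in a map raises the same TypeError, while converting the dict first avoids it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical, simplified stand-in for transformers' AddedToken."""
    content: str
    lstrip: bool = False
    rstrip: bool = False


def to_token(value):
    """Turn a serialized dict into a Token; pass any other value through."""
    if isinstance(value, dict) and value.get("__type") == "AddedToken":
        fields = {k: v for k, v in value.items() if k != "__type"}
        return Token(**fields)
    return value


def lookup_then_convert(init_kwargs, added_tokens_map):
    """Order described for main: the map lookup happens before conversion."""
    out = {}
    for key, value in init_kwargs.items():
        # A dict value is used as a dictionary key here, which is unhashable.
        out[key] = to_token(added_tokens_map.get(value, value))
    return out


def convert_then_lookup(init_kwargs, added_tokens_map):
    """Assumed fix order: convert dicts first, then look up by content."""
    out = {}
    for key, value in init_kwargs.items():
        value = to_token(value)
        content = value.content if isinstance(value, Token) else value
        out[key] = added_tokens_map.get(content, value)
    return out


# Illustrative data: one special token saved as a dict.
added_tokens_map = {"<ent>": Token("<ent>")}
init_kwargs = {
    "eos_token": {"__type": "AddedToken", "content": "<ent>",
                  "lstrip": False, "rstrip": False},
}

try:
    lookup_then_convert(init_kwargs, added_tokens_map)
except TypeError as err:
    print("lookup first fails:", err)

print("convert first works:", convert_then_lookup(init_kwargs, added_tokens_map))
```

The point of the sketch is only the ordering: any value that may have been serialized as a dict has to be turned back into a hashable object (or reduced to its content string) before it is used as a key.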

@ArthurZucker ArthurZucker marked this pull request as ready for review November 23, 2023 13:37
Member

@LysandreJik LysandreJik left a comment

Thank you @ArthurZucker

@ArthurZucker
Collaborator Author

Pegasus is the only slow failure I witnessed so checking this now before merging!

@ArthurZucker
Collaborator Author

Ok, the issue is that when we force the added tokens encoder in the slow tokenizer, the fast tokenizer of course can't do the same. So the eos token gets replaced at index 0 in slow but not in fast.
Will update to force the default vocab to the default tokens.

@ArthurZucker ArthurZucker merged commit 230ac35 into huggingface:main Dec 13, 2023
3 checks passed
ArthurZucker added a commit that referenced this pull request Dec 14, 2023
* nits

* nits

* actual fix

* style

* ze fix

* fix fix fix style
iantbutler01 pushed a commit to BismuthCloud/transformers that referenced this pull request Dec 16, 2023
Successfully merging this pull request may close these issues.

Error while saving checkpoint during training