Enable `conversational` pipeline for `GPTSw3Tokenizer` #24648
Conversation
This allows the GPT-SW3 models (and other GPT-2 based models) to be 4-bit quantised using `load_in_4bit` with `bitsandbytes`.
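For reference, a minimal sketch of what 4-bit loading looks like on the user side (the checkpoint name is an assumption for illustration, not taken from this PR; `bitsandbytes` and `accelerate` need to be installed):

```python
# Minimal sketch of 4-bit loading via bitsandbytes.
# The checkpoint name below is an assumption, used only for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AI-Sweden-Models/gpt-sw3-126m"  # hypothetical example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",     # requires accelerate
    load_in_4bit=True,     # quantisation handled by bitsandbytes under the hood
)
```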
The documentation is not available anymore as the PR was closed or merged.
Hi @saattrupdan, thanks for this contribution and opening this PR. As it stands, this isn't a change that we'd accept to be merged in. A few notes on why:
Thanks for contributing! +1 on all of @amyeroberts' previous comments.
A small nit is that we do indeed have a few models with the `_build_conversation_input_ids` method, so I left comments to properly integrate this.
If you want to add a test it should be a pipeline test; in that case you should make sure the model is in `MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES` or `MODEL_FOR_CAUSAL_LM_MAPPING_NAMES`. It will automatically run a test for them.
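As a quick sanity check, those mappings can be inspected directly. A small sketch (GPT-SW3 checkpoints reuse the GPT-2 model classes, so the relevant key shown here is `gpt2`; whether a separate `gpt_sw3` entry exists is not assumed):

```python
# Sketch: inspect the auto-model mappings that drive the generated pipeline tests.
from transformers.models.auto.modeling_auto import (
    MODEL_FOR_CAUSAL_LM_MAPPING_NAMES,
    MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES,
)

print("gpt2" in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES)             # True (decoder-only LM)
print("gpt2" in MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES)  # False
```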
Fix Conversation type hints Co-authored-by: Arthur <[email protected]>
Thanks for your review @amyeroberts!
I've fixed this now, via @ArthurZucker's suggestion.
I'm a bit confused by this, as this method already exists for 9-10 tokenizers in the package (such as GPT2, Bloom, GPT-neox and more), and is also required by the conversational pipeline here.
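Roughly, the conversational pipeline's preprocessing step calls this hook on the tokenizer. A simplified sketch of that flow (an illustration, not the actual pipeline source in transformers):

```python
# Simplified sketch of how a conversational pipeline can consume the tokenizer hook.
# This is illustrative only, not the real ConversationalPipeline implementation.
from transformers import Conversation

def preprocess(tokenizer, conversation: Conversation):
    # The tokenizer is responsible for turning the Conversation into the
    # chat-formatted input ids the model was trained on.
    input_ids = tokenizer._build_conversation_input_ids(conversation)
    return tokenizer.prepare_for_model(
        input_ids, add_special_tokens=False, return_tensors="pt"
    )
```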
That's fair enough if that's a design goal, I've removed it now. I just liked the idea of being able to instantiate a pipeline without having to load in the model first 🙂
Ah right, I just thought it was a mistake that an 88-character line limit wasn't enforced. I've reverted the changes now, I think!
Title changed: "Enable `conversational` pipeline for `GPTSw3Tokenizer`, and add `load_in_4bit` argument to pipelines" → "Enable `conversational` pipeline for `GPTSw3Tokenizer`"
@saattrupdan @ArthurZucker OK, my bad, I hadn't noticed the …
Mostly looks good, but you should revert the formatting changes, as they are not required and don't fit our usual code format!
@ArthurZucker All formatting changes have been reverted now too 🙂
Clean! Thanks a lot
Nice - thanks for adding and iterating!
Just a small nit on the TYPE_CHECKING line.
Really nice! A quick comment from one of the developers of GPT-SW3, and the one responsible for the tokenization pipeline. Since there's a mismatch between the huggingface tokenizer and the sentencepiece tokenizer used during training, and how they treat special tokens, I'm a bit wary of this PR as it stands right now. To better match the training procedure, each turn should be tokenized in isolation by the underlying sp_model, and joined with `<s>` tokens. This might result in the same thing, but I'm not 100% sure 😅
Regarding the special token issue, do you have a small reproducer? I can have a look if needed! Currently working on our sentencepiece compatibility issues.
I just did some experiments to check this. The underlying sentencepiece model cannot deal with the special tokens, since these are dealt with by the tokenizer's tokens trie:

```python
>>> tokenizer.tokens_trie.data
{'<': {'p': {'a': {'d': {'>': {'': 1}}}}, 's': {'>': {'': 1}}, 'u': {'n': {'k': {'>': {'': 1}}}}, '|': {'e': {'n': {'d': {'o': {'f': {'t': {'e': {'x': {'t': {'|': {'>': {'': 1}}}}}}}}}}}}}}
```

We see that it correctly deals with `<pad>`, `<s>`, `<unk>` and `<|endoftext|>`.

Note that, in the `GPTSw3Tokenizer`, the `_tokenize` method is defined as:

```python
def _tokenize(self, text: str, **kwargs) -> List[str]:
    text = self.preprocess_text(text)
    return self.sp_model.encode(text, out_type=str)
```

If I instead call the underlying `sp_model` directly on text containing special tokens, they just get split into ordinary pieces:

```python
>>> tokenizer.sp_model.encode('<s>Hej med dig<|endoftext|>', out_type=str)
['▁<', 's', '>', 'Hej', '▁med', '▁dig', '<', '|', 'end', 'of', 'text', '|', '>']
```

If I'm completely missing the point here, @Apsod, then please let me know 🙂
This is an edge case where the semantic discrepancy between sentencepiece and huggingface tokenization leads to different results. If we encounter a special token such as `<s>` inside a text, the huggingface tokenizer splits it out via its trie, whereas the raw sentencepiece model just treats it as ordinary text. I think there are also differences in how sentencepiece treats the initial token after a special token (due to whitespace-prefix handling), which leads to a general mismatch between the tokenizers.

EDIT: A simpler example of weird interactions between whitespace and special tokens results in the same kind of mismatch.
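A hedged sketch of that kind of whitespace check (the checkpoint name is an assumption, and the outputs are printed rather than asserted, since the exact splits depend on the model's vocabulary):

```python
# Hedged sketch: compare how the HF tokenizer and the raw sentencepiece model
# handle whitespace around special tokens. The checkpoint name is an assumption.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("AI-Sweden-Models/gpt-sw3-126m")

for text in ["Hej<s>du", "Hej <s> du", "<s> du"]:
    # HF tokenizer: splits out <s> via the tokens trie before sentencepiece runs.
    print(text, "->", tok.tokenize(text))
    # Raw sentencepiece: treats '<s>' as ordinary characters.
    print(text, "->", tok.sp_model.encode(tok.preprocess_text(text), out_type=str))
```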
@Apsod Thanks for the clarification. Just tried inspecting how the current prompt,

```python
prompt = "<|endoftext|><s>\nUser:\nJag tycker träd är fina\n<s>\nBot:\n"
```

is being tokenised.

I have been chatting to Amaru from the AI Sweden team (which might be you @Apsod? User names are always confusing!), and he said that they actually used multiple different prompts, sampled stochastically during training.

With this flexibility in mind, I propose that we change the above prompt to the following:

```python
prompt = "<|endoftext|><s>User: Jag tycker träd är fina<s>Bot: "
```

I compared the encodings produced by the underlying `sp_model` and by the huggingface tokenizer for this format, and they agree:

```python
all_responses_encoded = [self.sp_model.encode(response) for response in all_responses]
sp_encoded_prompt = [self.eos_token_id, self.bos_token_id]
for response in all_responses_encoded:
    sp_encoded_prompt += response + [self.bos_token_id]
sp_encoded_prompt += self.sp_model.encode("Bot: ")

prompt = (
    f"{self.eos_token}{self.bos_token}"
    + f"{self.bos_token}".join(all_responses)
    + f"{self.bos_token}Bot: "
)
hf_encoded_prompt = self.encode(text=prompt)

assert sp_encoded_prompt == hf_encoded_prompt
```

Another thing: I looked into the mysterious extra whitespace added during decoding, and found that it's all due to a couple of lines in the tokenizer's decoding code. Is there any reason for this, or should it just be removed to ensure that `decode(encode(doc)) == doc`?
Looks good to me!
@Apsod Great. I've changed the prompt now. I also added a TODO comment to clarify whether those two lines are needed, as they break the `decode(encode(doc)) == doc` consistency. But that can be dealt with in another PR, if needed.
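For context, a hedged sketch of what the resulting tokenizer hook could look like, following the prompt format discussed above (not copied from the PR diff; details such as trailing whitespace were still being tuned in later commits):

```python
# Hedged sketch of a _build_conversation_input_ids hook for GPTSw3Tokenizer,
# following the "<|endoftext|><s>User: ...<s>Bot:" format discussed above.
# Not copied from the merged PR; details may differ.
from typing import List, TYPE_CHECKING

if TYPE_CHECKING:
    from transformers.pipelines.conversational import Conversation


def _build_conversation_input_ids(self, conversation: "Conversation") -> List[int]:
    # Prefix each turn with the speaker and join turns with the BOS token.
    all_responses = [
        f"User: {text}" if is_user else f"Bot: {text}"
        for is_user, text in conversation.iter_texts()
    ]
    prompt = (
        f"{self.eos_token}{self.bos_token}"
        + f"{self.bos_token}".join(all_responses)
        + f"{self.bos_token}Bot:"
    )
    return self.encode(text=prompt)
```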
@amyeroberts @ArthurZucker I cannot seem to merge in this PR - do any of you need to re-approve it first?
Thanks for iterating!
@saattrupdan Yes, the branch is protected so that only certain people can merge. It also needs an approval from a core maintainer (me in this case :) ) Merging for you now. Thanks again for this contribution!
Also, regarding why spaces before / after special tokens are eaten in the slow version of transformers:
Enable `conversational` pipeline for `GPTSw3Tokenizer` (#24648)

* feat: Add `_build_conversation_input_ids` to GPT-SW3 tokenizer, adjust line length
* feat: Merge in PR huggingface#24504. This allows the GPT-SW3 models (and other GPT-2 based models) to be 4-bit quantised using `load_in_4bit` with `bitsandbytes`.
* fix: F-string
* fix: F-string
* fix: Remove EOS token from all responses
* fix: Remove redundant newlines
* feat: Add `load_in_4bit` to `Pipeline`
* fix: Separate turns with `\n<s>\n` rather than `<s>`
* fix: Add missing newline in prompt
* tests: Add unit tests for the new `_build_conversation_input_ids` method
* style: Automatic style correction
* tests: Compare encodings rather than decodings
* fix: Remove `load_in_4bit` from pipeline arguments
* docs: Add description and references of the GPT-SW3 chat format
* style: Line breaks
* Apply suggestions from code review: Fix Conversation type hints (Co-authored-by: Arthur <[email protected]>)
* fix: Import TYPE_CHECKING
* style: Run automatic fixes
* tests: Remove `_build_conversation_input_ids` unit tests
* tests: Remove import of `Conversation` in GPT-SW3 unit test
* style: Revert formatting
* style: Move TYPE_CHECKING line after all imports
* style: Imports order
* fix: Change prompt to ensure that `sp_model.encode` and `encode` yields same result
* docs: Add TODO comment related to the addition of whitespace during decoding
* style: Automatic style checks
* fix: Remove final whitespace in prompt, as prefix whitespace is used by sentencepiece

Co-authored-by: Arthur <[email protected]>
What does this PR do?
The `ConversationalPipeline` is great for easily running dialogue models, and also enables smooth interfaces in the associated Hugging Face Hub widget. These seem to require a `_build_conversation_input_ids` method on the associated tokenizer, however, which takes a `Conversation` object and encodes it into the chat format that the model was trained on.

With this change, we can now easily use the GPT-SW3 models. Here's an example of asking a single question:
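A hedged sketch of such usage (the checkpoint name is an assumption, not taken from this PR):

```python
# Hedged sketch of single-turn usage with the conversational pipeline.
# The checkpoint name is an assumption, used only for illustration.
from transformers import Conversation, pipeline

chatbot = pipeline("conversational", model="AI-Sweden-Models/gpt-sw3-126m-instruct")

conversation = Conversation("Vad är huvudstaden i Sverige?")
conversation = chatbot(conversation)
print(conversation.generated_responses[-1])
```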
And here is an example with a never-ending multi-turn dialogue session:
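A hedged sketch of a simple chat loop for that (same assumed checkpoint):

```python
# Hedged sketch of a never-ending multi-turn chat loop (model name assumed).
from transformers import Conversation, pipeline

chatbot = pipeline("conversational", model="AI-Sweden-Models/gpt-sw3-126m-instruct")
conversation = Conversation()

while True:
    conversation.add_user_input(input("User: "))
    conversation = chatbot(conversation)
    print("Bot:", conversation.generated_responses[-1])
```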
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@Narsil @ArthurZucker @YouJiacheng @ekgren