
Enable conversational pipeline for GPTSw3Tokenizer #24648

Merged
merged 28 commits into huggingface:main from saattrupdan:feat/set-up-gpt-sw3-for-conversational-pipeline on Jul 7, 2023

Conversation

saattrupdan
Contributor

@saattrupdan commented on Jul 4, 2023

What does this PR do?

The ConversationalPipeline is great for easily running dialogue models, and it also enables smooth interfaces in the associated Hugging Face Hub widget. However, both seem to require a _build_conversation_input_ids method on the associated tokenizer, which takes a Conversation object and encodes it into the chat format that the model was trained on.
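
For reference, here is a minimal sketch of roughly what such a method can look like for GPT-SW3, based on the User:/Bot: chat format discussed later in this thread (the exact merged implementation may differ slightly):

from typing import List

def _build_conversation_input_ids(self, conversation) -> List[int]:
    # Prefix each turn with the speaker name, following the GPT-SW3 chat format
    all_responses = [
        f"User: {text}" if is_user else f"Bot: {text}"
        for is_user, text in conversation.iter_texts()
    ]
    # <eos><bos>turn<bos>turn...<bos>Bot: - then encode with the tokenizer itself
    prompt = (
        f"{self.eos_token}{self.bos_token}"
        + f"{self.bos_token}".join(all_responses)
        + f"{self.bos_token}Bot:"
    )
    return self.encode(text=prompt)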

With this change, we can now easily use the GPT-SW3 models. Here's an example of asking a single question:

from transformers import pipeline, Conversation
chatbot = pipeline(model="AI-Sweden-Models/gpt-sw3-20b-instruct")
conversation = chatbot(Conversation("Hvad hedder du?"))  # Danish: "What is your name?"
output = conversation.generated_responses[-1]
print(output)

And here is an example with a never-ending multi-turn dialogue session:

from transformers import pipeline, Conversation
chatbot = pipeline(model="AI-Sweden-Models/gpt-sw3-20b-instruct")
conversation = Conversation()
while True:
    user_input = input('> ')
    conversation.add_user_input(user_input)
    conversation = chatbot(conversation)
    output = conversation.generated_responses[-1]
    print(output)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil @ArthurZucker @YouJiacheng @ekgren

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jul 4, 2023

The documentation is not available anymore as the PR was closed or merged.

@amyeroberts
Collaborator

Hi @saattrupdan, thanks for this contribution and opening this PR.

As it stands, this isn't a change we'd accept for merging. A few notes on why:

  • The pipelines are a higher level abstraction than the tokenizers, and so shouldn't be imported into a tokenizer's module.
  • The job of the tokenizer is to prepare raw text inputs for the model and decode its predicted tokens. _build_conversation_input_ids is higher level logic that belongs outside the class in e.g. a custom script.
  • It's not necessary to add load_in_4bit to the pipeline - the model can be instantiated with ModelClass.from_pretrained(checkpoint, load_in_4bit=True) and then passed into the pipeline (see the sketch after this list). We try to keep the number of arguments in our public APIs as small as possible.
  • I think there might be a conflicting configuration, auto formatting from an IDE or different package version, but the line split changes in the PR shouldn't be there.
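
A minimal sketch of that pattern (an illustrative example, assuming bitsandbytes and accelerate are installed; the checkpoint name is just the one used in the examples above):

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

checkpoint = "AI-Sweden-Models/gpt-sw3-20b-instruct"
# Load the model in 4-bit first, then hand the instantiated model to the pipeline
model = AutoModelForCausalLM.from_pretrained(checkpoint, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
chatbot = pipeline("conversational", model=model, tokenizer=tokenizer)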

@ArthurZucker (Collaborator) left a comment

Thanks for contributing! +1 on all of @amyeroberts' previous comment.
A small nit: we do indeed have a few models with the _build_conversation_input_ids method, so I've left comments on how to properly integrate this.
If you want to add a test it should be a pipeline test; in that case you should make sure the model is in MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES or MODEL_FOR_CAUSAL_LM_MAPPING_NAMES, which will automatically run a test for it.

Review comments were left on the following files (since resolved):
  • tests/models/gpt_sw3/test_tokenization_gpt_sw3.py
  • src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py (3 comments)
@saattrupdan
Contributor Author

saattrupdan commented Jul 5, 2023

Hi @saattrupdan, thanks for this contribution and opening this PR.

Thanks for your review @amyeroberts!

  • The pipelines are a higher level abstraction than the tokenizers, and so shouldn't be imported into a tokenizer's module.

I've fixed this now, via @ArthurZucker's suggestion.

  • The job of the tokenizer is to prepare raw text inputs for the model and decode its predicted tokens. _build_conversation_input_ids is higher level logic that belongs outside the class in e.g. a custom script.

I'm a bit confused by this, as this method already exists for 9-10 tokenizers in the package (such as GPT-2, BLOOM, GPT-NeoX and more), and is also required by the conversational pipeline here.

  • It's not necessary to add load_in_4bit to the pipeline - the model can be instantiated with ModelClass.from_pretrained(checkpoint, load_in_4bit=True) and then passed into the pipeline. We try to keep the number of arguments in our public APIs as small as possible.

That's fair enough if that's a design goal, I've removed it now. I just liked the idea of being able to instantiate a pipeline without having to load in the model first 🙂

  • I think there might be a conflicting configuration, auto formatting from an IDE or different package version, but the line split changes in the PR shouldn't be there.

Ah right, I just thought it was a mistake that the 88-character line limit wasn't enforced - I've reverted those changes now, I think!

@saattrupdan changed the title from "Enable conversational pipeline for GPTSw3Tokenizer, and add load_in_4bit argument to pipelines" to "Enable conversational pipeline for GPTSw3Tokenizer" on Jul 5, 2023
@amyeroberts
Collaborator

@saattrupdan @ArthurZucker OK, my bad, I hadn't noticed the _build_conversation_input_ids before - happy for that to be added then :)

@ArthurZucker (Collaborator) left a comment

Mostly looks good, but you should revert the formatting changes, as they are not required and don't fit our usual code format!

@saattrupdan
Contributor Author

@ArthurZucker All formatting changes have been reverted now too 🙂

@ArthurZucker (Collaborator) left a comment

Clean! Thanks a lot

@ArthurZucker requested a review from amyeroberts on July 6, 2023 at 09:16
@amyeroberts (Collaborator) left a comment

Nice - thanks for adding and iterating!

Just a small nit on the TYPE_CHECKING line.

A review comment was left on src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py (since resolved).
@Apsod
Contributor

Apsod commented Jul 6, 2023

Really nice!

A quick comment from one of the developers of GPT-SW3, and the one responsible for the tokenization pipeline.

Since there's a mismatch between the Hugging Face tokenizer and the sentencepiece tokenizer used during training, in how they treat special tokens, I'm a bit wary of this PR as it stands right now. To better match the training procedure, each turn should be tokenized in isolation by the underlying sp_model, and joined with <s> tokens. This might result in the same thing, but I'm not 100% sure 😅

@ArthurZucker
Collaborator

Regarding the special token issue, do you have a small reproducer? I can have a look if needed! I'm currently working on our sentencepiece compatibility issues.

@saattrupdan
Contributor Author

saattrupdan commented Jul 6, 2023

@Apsod Since there's a mismatch between the Hugging Face tokenizer and the sentencepiece tokenizer used during training, in how they treat special tokens, I'm a bit wary of this PR as it stands right now. To better match the training procedure, each turn should be tokenized in isolation by the underlying sp_model, and joined with <s> tokens. This might result in the same thing, but I'm not 100% sure 😅

I just did some experiments to check this. The underlying sentencepiece model cannot deal with the special tokens, since these are dealt with by the tokens_trie, which is used in the tokenize method. Here's a sanity check:

>>> tokenizer.tokens_trie.data
{'<': {'p': {'a': {'d': {'>': {'': 1}}}}, 's': {'>': {'': 1}}, 'u': {'n': {'k': {'>': {'': 1}}}}, '|': {'e': {'n': {'d': {'o': {'f': {'t': {'e': {'x': {'t': {'|': {'>': {'': 1}}}}}}}}}}}}}}

We see that it correctly deals with <pad>, <s>, <unk> and <|endoftext|> special tokens. The encode method uses the encode_plus method, which uses the _encode_plus method, which finally uses the tokenize method, so using encode should be fine here, I think.
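
As a quick check of this (assuming the same GPT-SW3 tokenizer as above), the special tokens come through the tokenize method intact:

print(tokenizer.tokenize("<s>Hej med dig<|endoftext|>"))
# '<s>' and '<|endoftext|>' are split off by the tokens_trie and kept as single
# tokens, unlike the raw sp_model.encode call shown further down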

Note that, in the tokenize method, after the special tokens have been removed using the tokens_trie, the underlying _tokenize method is used to do the actual tokenization, which is implemented in the GPTSw3Tokenizer as

def _tokenize(self, text: str, **kwargs) -> List[str]:
    text = self.preprocess_text(text)
    return self.sp_model.encode(text, out_type=str)

If I replace the self.encode with self.sp_model.encode in the new function that's being added in this PR, then I end up with an incompatible tokenization:

>>> tokenizer.sp_model.encode('<s>Hej med dig<|endoftext|>', out_type=str)
['▁<', 's', '>', 'Hej', '▁med', '▁dig', '<', '|', 'end', 'of', 'text', '|', '>']

If I'm completely missing the point here, @Apsod, then please let me know 🙂

@Apsod
Copy link
Contributor

Apsod commented Jul 6, 2023

If I replace the self.encode with self.sp_model.encode in the new function that's being added in this PR, then I end up with an incompatible tokenization:

>>> tokenizer.sp_model.encode('<s>Hej med dig<|endoftext|>', out_type=str)
['▁<', 's', '>', 'Hej', '▁med', '▁dig', '<', '|', 'end', 'of', 'text', '|', '>']

This is an edge-case where the semantic discrepancy between sentencepiece and huggingface tokenization leads to different results.

If we encounter <|endoftext|> in text and tokenize it using sentencepiece (as was done during training), it would be tokenized as <, |, end, of, text, |, > and not as the special eos token, since in sentencepiece, special tokens are not textual and can never be produced by tokenizing text.

I think there are also differences in how sentencepiece treats the initial token after a special token (due to whitespace-prefix handling), which leads to a general mismatch between the tokenizers:

TEXT = """
<|endoftext|><s>
Hej
<s>
Hoj
""".strip()
print(tokenizer.decode(tokenizer.encode(TEXT)))
# will print out the following:
#  <|endoftext|><s> Hej<s>Hoj

EDIT:

A simpler example of weird interactions between whitespace and special tokens:

TEXT = """    Hej   <s>"""

print('"', TEXT, '"', sep='')
print('"', tokenizer.decode(tokenizer.encode(TEXT)), '"', sep='')

Results in:

"    Hej   <s>"
"     Hej<s>"

@saattrupdan
Contributor Author

@Apsod Thanks for the clarification. I just tried inspecting the result of using the encode method, and it removes some of the newline characters. More specifically,

prompt = "<|endoftext|><s>\nUser:\nJag tycker träd är fina\n<s>\nBot:\n"

is being tokenised as [3, 2, 15088, 63458, 18, 3947, 1886, 7590, 377, 6173, 2, 22493, 63458, 18], which translates token-by-token to "<|endoftext|><s>User:\nJag tycker träd är fina<s>Bot:\n". Notably, all newlines adjacent to a BOS token have been removed when encoded with this method.
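
For reference, a minimal reproducer of this observation (assuming the same instruct checkpoint as in the examples above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/gpt-sw3-20b-instruct")
prompt = "<|endoftext|><s>\nUser:\nJag tycker träd är fina\n<s>\nBot:\n"
ids = tokenizer.encode(prompt)
print(ids)                    # [3, 2, 15088, ...] as listed above
print(tokenizer.decode(ids))  # the newlines adjacent to the BOS tokens are gone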

I have been chatting to Amaru from the AI Sweden team (which might be you @Apsod? User names are always confusing!), and he said that they actually used multiple different prompts, sampled stochastically during training:

<eos><bos>{A}User:{B}{Query}{C}<bos>{A}Bot:{B}{Response}{C}...
A ~ ["\n", ""]
B ~ ["\n", " "]
C ~ ["\n", ""]
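
As an illustration, here is a hypothetical helper (not part of the PR) showing how a single-turn prompt could be instantiated from that template with sampled separators:

import random

def sample_prompt(query: str) -> str:
    # A, B and C are each sampled from the small sets listed above
    A = random.choice(["\n", ""])
    B = random.choice(["\n", " "])
    C = random.choice(["\n", ""])
    return f"<|endoftext|><s>{A}User:{B}{query}{C}<s>{A}Bot:{B}"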

With this flexibility in mind, I propose that we change the above prompt to the following:

prompt = "<|endoftext|><s>User: Jag tycker träd är fina<s>Bot: "

I compared the encodings of the encode and sp_model.encode methods, and they now yield equivalent tokens. Here's the code that I ran to check:

# `all_responses` is the list of "User: ..."/"Bot: ..." turns in the conversation.
# First, build the prompt ids directly with the underlying sentencepiece model:
all_responses_encoded = [self.sp_model.encode(response) for response in all_responses]
sp_encoded_prompt = [self.eos_token_id, self.bos_token_id]
for response in all_responses_encoded:
    sp_encoded_prompt += response + [self.bos_token_id]
sp_encoded_prompt += self.sp_model.encode("Bot: ")

# Then build the same prompt as text and encode it with the Hugging Face tokenizer:
prompt = (
    f"{self.eos_token}{self.bos_token}"
    + f"{self.bos_token}".join(all_responses)
    + f"{self.bos_token}Bot: "
)
hf_encoded_prompt = self.encode(text=prompt)

assert sp_encoded_prompt == hf_encoded_prompt

Another thing: I looked into the mysterious extra whitespace added during decoding, and found that it's all due to these two lines in the GPTSw3Tokenizer.convert_tokens_to_string method (link):

if not prev_is_special:
    out_string += " "

Is there any reason for this, or should it just be removed to ensure that tokenizer.decode(tokenizer.encode(doc)) == doc?

@Apsod
Contributor

Apsod commented Jul 6, 2023

Looks good to me!
The only outstanding issue then is special-token-injection, but I guess that is a more general HF-issue?

@saattrupdan
Contributor Author

Looks good to me! The only outstanding issue then is special-token-injection, but I guess that is a more general HF-issue?

@Apsod Great. I've changed the prompt now. I also added a TODO comment to clarify whether these two lines are needed, as they break the decode(encode(doc)) == doc consistency. But that can be dealt with in another PR, if needed.

@saattrupdan
Contributor Author

@amyeroberts @ArthurZucker I cannot seem to merge in this PR - do any of you need to re-approve it first?

@amyeroberts (Collaborator) left a comment

Thanks for iterating!

@amyeroberts
Collaborator

@saattrupdan Yes, the branch is protected so that only certain people can merge. It also needs an approval from a core maintainer (me in this case :) )

Merging for you now. Thanks again for this contribution!

@amyeroberts merged commit abaca9f into huggingface:main on Jul 7, 2023
@saattrupdan deleted the feat/set-up-gpt-sw3-for-conversational-pipeline branch on July 7, 2023 at 18:55
@ArthurZucker
Collaborator

Also, regarding why spaces before/after special tokens are eaten in the slow version of transformers:

blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
…4648)

* feat: Add `_build_conversation_input_ids` to GPT-SW3 tokenizer, adjust line length

* feat: Merge in PR huggingface#24504.

This allows the GPT-SW3 models (and other GPT-2 based models) to be 4-bit quantised
using `load_in_4bit` with `bitsandbytes`.

* fix: F-string

* fix: F-string

* fix: Remove EOS token from all responses

* fix: Remove redundant newlines

* feat: Add `load_in_4bit` to `Pipeline`

* fix: Separate turns with `\n<s>\n` rather than `<s>`

* fix: Add missing newline in prompt

* tests: Add unit tests for the new `_build_conversation_input_ids` method

* style: Automatic style correction

* tests: Compare encodings rather than decodings

* fix: Remove `load_in_4bit` from pipeline arguments

* docs: Add description and references of the GPT-SW3 chat format

* style: Line breaks

* Apply suggestions from code review

Fix Conversation type hints

Co-authored-by: Arthur <[email protected]>

* fix: Import TYPE_CHECKING

* style: Run automatic fixes

* tests: Remove `_build_conversation_input_ids` unit tests

* tests: Remove import of `Conversation` in GPT-SW3 unit test

* style: Revert formatting

* style: Move TYPE_CHECKING line after all imports

* style: Imports order

* fix: Change prompt to ensure that `sp_model.encode` and `encode` yields same result

* docs: Add TODO comment related to the addition of whitespace during decoding

* style: Automatic style checks

* fix: Remove final whitespace in prompt, as prefix whitespace is used by sentencepiece

---------

Co-authored-by: Arthur <[email protected]>