
Enable conversational pipeline for GPTSw3Tokenizer #24648

Merged
merged 28 commits into huggingface:main from saattrupdan:feat/set-up-gpt-sw3-for-conversational-pipeline on Jul 7, 2023

Conversation

saattrupdan
Contributor

@saattrupdan commented on Jul 4, 2023

What does this PR do?

The ConversationalPipeline is great for easily running dialogue models, and it also enables smooth interfaces in the associated Hugging Face Hub widget. However, both seem to require a _build_conversation_input_ids method on the associated tokenizer, which takes a Conversation object and encodes it into the chat format that the model was trained on.
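
For reference, here is a minimal sketch of roughly what such a method can look like for GPT-SW3, based on the User:/Bot: chat format discussed later in this thread (the exact merged implementation may differ slightly):

from typing import List

def _build_conversation_input_ids(self, conversation) -> List[int]:
    # Prefix each turn with the speaker name, following the GPT-SW3 chat format
    all_responses = [
        f"User: {text}" if is_user else f"Bot: {text}"
        for is_user, text in conversation.iter_texts()
    ]
    # <eos><bos>turn<bos>turn...<bos>Bot: - then encode with the tokenizer itself
    prompt = (
        f"{self.eos_token}{self.bos_token}"
        + f"{self.bos_token}".join(all_responses)
        + f"{self.bos_token}Bot:"
    )
    return self.encode(text=prompt)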

With this change, we can now easily use the GPT-SW3 models. Here's an example of asking a single question:

from transformers import pipeline, Conversation
chatbot = pipeline(model="AI-Sweden-Models/gpt-sw3-20b-instruct")
conversation = chatbot(Conversation("Hvad hedder du?"))  # Danish: "What is your name?"
output = conversation.generated_responses[-1]
print(output)

And here is an example with a never-ending multi-turn dialogue session:

from transformers import pipeline, Conversation
chatbot = pipeline(model="AI-Sweden-Models/gpt-sw3-20b-instruct")
conversation = Conversation()
while True:
    user_input = input('> ')
    conversation.add_user_input(user_input)
    conversation = chatbot(conversation)
    output = conversation.generated_responses[-1]
    print(output)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil @ArthurZucker @YouJiacheng @ekgren

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jul 4, 2023

The documentation is not available anymore as the PR was closed or merged.

@amyeroberts
Collaborator

Hi @saattrupdan, thanks for this contribution and opening this PR.

As it stands, this isn't a change we'd accept for merging. A few notes on why:

  • The pipelines are a higher level abstraction than the tokenizers, and so shouldn't be imported into a tokenizer's module.
  • The job of the tokenizer is to prepare raw text inputs for the model and decode its predicted tokens. _build_conversation_input_ids is higher level logic that belongs outside the class in e.g. a custom script.
  • It's not necessary to add load_in_4bit to the pipeline - the model can be instantiated with ModelClass.from_pretrained(checkpoint, load_in_4bit=True) and then passed into the pipeline (see the sketch after this list). We try to keep the number of arguments in our public APIs as small as possible.
  • I think there might be a conflicting configuration, auto formatting from an IDE or different package version, but the line split changes in the PR shouldn't be there.
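
A minimal sketch of that pattern (an illustrative example, assuming bitsandbytes and accelerate are installed; the checkpoint name is just the one used in the examples above):

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

checkpoint = "AI-Sweden-Models/gpt-sw3-20b-instruct"
# Load the model in 4-bit first, then hand the instantiated model to the pipeline
model = AutoModelForCausalLM.from_pretrained(checkpoint, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
chatbot = pipeline("conversational", model=model, tokenizer=tokenizer)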

@ArthurZucker (Collaborator) left a comment

Thanks for contributing! +1 on all of @amyeroberts' previous comment.
A small nit: we do indeed have a few models with the _build_conversation_input_ids method, so I've left comments on how to properly integrate this.
If you want to add a test it should be a pipeline test; in that case you should make sure the model is in MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES or MODEL_FOR_CAUSAL_LM_MAPPING_NAMES, which will automatically run a test for it.

Review comments were left on the following files (since resolved):
  • tests/models/gpt_sw3/test_tokenization_gpt_sw3.py
  • src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py (3 comments)
@saattrupdan
Contributor Author

saattrupdan commented Jul 5, 2023

Hi @saattrupdan, thanks for this contribution and opening this PR.

Thanks for your review @amyeroberts!

  • The pipelines are a higher level abstraction than the tokenizers, and so shouldn't be imported into a tokenizer's module.

I've fixed this now, via @ArthurZucker's suggestion.

  • The job of the tokenizer is to prepare raw text inputs for the model and decode its predicted tokens. _build_conversation_input_ids is higher level logic that belongs outside the class in e.g. a custom script.

I'm a bit confused by this, as this method already exists for 9-10 tokenizers in the package (such as GPT-2, BLOOM, GPT-NeoX and more), and is also required by the conversational pipeline here.

  • It's not necessary to add load_in_4bit to the pipeline - the model can be instantiated with ModelClass.from_pretrained(checkpoint, load_in_4bit=True) and then passed into the pipeline. We try to keep the number of arguments in our public APIs as small as possible.

That's fair enough if that's a design goal, I've removed it now. I just liked the idea of being able to instantiate a pipeline without having to load in the model first 🙂

  • I think there might be a conflicting configuration, auto formatting from an IDE or different package version, but the line split changes in the PR shouldn't be there.

Ah right, I just thought it was a mistake that the 88-character line limit wasn't enforced - I've reverted those changes now, I think!

@saattrupdan changed the title from "Enable conversational pipeline for GPTSw3Tokenizer, and add load_in_4bit argument to pipelines" to "Enable conversational pipeline for GPTSw3Tokenizer" on Jul 5, 2023
@amyeroberts
Collaborator

@saattrupdan @ArthurZucker OK, my bad, I hadn't noticed the _build_conversation_input_ids before - happy for that to be added then :)

@ArthurZucker (Collaborator) left a comment

Mostly looks good, but you should revert the formatting changes, as they are not required and don't fit our usual code format!

@saattrupdan
Contributor Author

@ArthurZucker All formatting changes have been reverted now too 🙂

@ArthurZucker (Collaborator) left a comment

Clean! Thanks a lot

@ArthurZucker requested a review from amyeroberts on July 6, 2023 at 09:16
@amyeroberts (Collaborator) left a comment

Nice - thanks for adding and iterating!

Just a small nit on the TYPE_CHECKING line.

A review comment was left on src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py (since resolved).
@Apsod
Contributor

Apsod commented Jul 6, 2023

Really nice!

A quick comment from one of the developers of GPT-SW3, and the one responsible for the tokenization pipeline.

Since there's a mismatch between the Hugging Face tokenizer and the sentencepiece tokenizer used during training, in how they treat special tokens, I'm a bit wary of this PR as it stands right now. To better match the training procedure, each turn should be tokenized in isolation by the underlying sp_model, and joined with <s> tokens. This might result in the same thing, but I'm not 100% sure 😅

@ArthurZucker
Collaborator

Regarding the special token issue, do you have a small reproducer? I can have a look if needed! I'm currently working on our sentencepiece compatibility issues.

@saattrupdan
Contributor Author

saattrupdan commented Jul 6, 2023

@Apsod Since there's a mismatch between the Hugging Face tokenizer and the sentencepiece tokenizer used during training, in how they treat special tokens, I'm a bit wary of this PR as it stands right now. To better match the training procedure, each turn should be tokenized in isolation by the underlying sp_model, and joined with <s> tokens. This might result in the same thing, but I'm not 100% sure 😅

I just did some experiments to check this. The underlying sentencepiece model cannot deal with the special tokens, since these are dealt with by the tokens_trie, which is used in the tokenize method. Here's a sanity check:

>>> tokenizer.tokens_trie.data
{'<': {'p': {'a': {'d': {'>': {'': 1}}}}, 's': {'>': {'': 1}}, 'u': {'n': {'k': {'>': {'': 1}}}}, '|': {'e': {'n': {'d': {'o': {'f': {'t': {'e': {'x': {'t': {'|': {'>': {'': 1}}}}}}}}}}}}}}

We see that it correctly deals with <pad>, <s>, <unk> and <|endoftext|> special tokens. The encode method uses the encode_plus method, which uses the _encode_plus method, which finally uses the tokenize method, so using encode should be fine here, I think.
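
As a quick check of this (assuming the same GPT-SW3 tokenizer as above), the special tokens come through the tokenize method intact:

print(tokenizer.tokenize("<s>Hej med dig<|endoftext|>"))
# '<s>' and '<|endoftext|>' are split off by the tokens_trie and kept as single
# tokens, unlike the raw sp_model.encode call shown further down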

Note that, in the tokenize method, after the special tokens have been removed using the tokens_trie, the underlying _tokenize method is used to do the actual tokenization, which is implemented in the GPTSw3Tokenizer as

def _tokenize(self, text: str, **kwargs) -> List[str]:
    text = self.preprocess_text(text)
    return self.sp_model.encode(text, out_type=str)

If I replace the self.encode with self.sp_model.encode in the new function that's being added in this PR, then I end up with an incompatible tokenization:

>>> tokenizer.sp_model.encode('<s>Hej med dig<|endoftext|>', out_type=str)
['▁<', 's', '>', 'Hej', '▁med', '▁dig', '<', '|', 'end', 'of', 'text', '|', '>']

If I'm completely missing the point here, @Apsod, then please let me know 🙂

@Apsod
Copy link
Contributor

Apsod commented Jul 6, 2023

If I replace the self.encode with self.sp_model.encode in the new function that's being added in this PR, then I end up with an incompatible tokenization:

>>> tokenizer.sp_model.encode('<s>Hej med dig<|endoftext|>', out_type=str)
['▁<', 's', '>', 'Hej', '▁med', '▁dig', '<', '|', 'end', 'of', 'text', '|', '>']

This is an edge-case where the semantic discrepancy between sentencepiece and huggingface tokenization leads to different results.

If we encounter <|endoftext|> in text and tokenize it using sentencepiece (as was done during training), it would be tokenized as <, |, end, of, text, |, > and not as the special eos token, since in sentencepiece, special tokens are not textual and can never be produced by tokenizing text.

I think there are also differences in how sentencepiece treats the initial token after a special token (due to whitespace-prefix handling), which leads to a general mismatch between the tokenizers:

TEXT = """
<|endoftext|><s>
Hej
<s>
Hoj
""".strip()
print(tokenizer.decode(tokenizer.encode(TEXT)))
# will print out the following:
#  <|endoftext|><s> Hej<s>Hoj

EDIT:

A simpler example of weird interactions between whitespace and special tokens:

TEXT = """    Hej   <s>"""

print('"', TEXT, '"', sep='')
print('"', tokenizer.decode(tokenizer.encode(TEXT)), '"', sep='')

Results in:

"    Hej   <s>"
"     Hej<s>"

@saattrupdan
Contributor Author

@Apsod Thanks for the clarification. I just tried inspecting the result of using the encode method, and it removes some of the newline characters. More specifically,

prompt = "<|endoftext|><s>\nUser:\nJag tycker träd är fina\n<s>\nBot:\n"

is being tokenised as [3, 2, 15088, 63458, 18, 3947, 1886, 7590, 377, 6173, 2, 22493, 63458, 18], which translates token-by-token to "<|endoftext|><s>User:\nJag tycker träd är fina<s>Bot:\n". Notably, all newlines adjacent to a BOS token have been removed when encoded with this method.
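
For reference, a minimal reproducer of this observation (assuming the same instruct checkpoint as in the examples above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/gpt-sw3-20b-instruct")
prompt = "<|endoftext|><s>\nUser:\nJag tycker träd är fina\n<s>\nBot:\n"
ids = tokenizer.encode(prompt)
print(ids)                    # [3, 2, 15088, ...] as listed above
print(tokenizer.decode(ids))  # the newlines adjacent to the BOS tokens are gone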

I have been chatting to Amaru from the AI Sweden team (which might be you @Apsod? User names are always confusing!), and he said that they actually used multiple different prompts, sampled stochastically during training:

<eos><bos>{A}User:{B}{Query}{C}<bos>{A}Bot:{B}{Response}{C}...
A ~ ["\n", ""]
B ~ ["\n", " "]
C ~ ["\n", ""]
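
As an illustration, here is a hypothetical helper (not part of the PR) showing how a single-turn prompt could be instantiated from that template with sampled separators:

import random

def sample_prompt(query: str) -> str:
    # A, B and C are each sampled from the small sets listed above
    A = random.choice(["\n", ""])
    B = random.choice(["\n", " "])
    C = random.choice(["\n", ""])
    return f"<|endoftext|><s>{A}User:{B}{query}{C}<s>{A}Bot:{B}"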

With this flexibility in mind, I propose that we change the above prompt to the following:

prompt = "<|endoftext|><s>User: Jag tycker träd är fina<s>Bot: "

I compared the encodings of the encode and sp_model.encode methods, and they now yield equivalent tokens. Here's the code that I ran to check:

# `all_responses` is the list of "User: ..."/"Bot: ..." turns in the conversation.
# First, build the prompt ids directly with the underlying sentencepiece model:
all_responses_encoded = [self.sp_model.encode(response) for response in all_responses]
sp_encoded_prompt = [self.eos_token_id, self.bos_token_id]
for response in all_responses_encoded:
    sp_encoded_prompt += response + [self.bos_token_id]
sp_encoded_prompt += self.sp_model.encode("Bot: ")

# Then build the same prompt as text and encode it with the Hugging Face tokenizer:
prompt = (
    f"{self.eos_token}{self.bos_token}"
    + f"{self.bos_token}".join(all_responses)
    + f"{self.bos_token}Bot: "
)
hf_encoded_prompt = self.encode(text=prompt)

assert sp_encoded_prompt == hf_encoded_prompt

Another thing: I looked into the mysterious extra whitespace added during decoding, and found that it's all due to these two lines in the GPTSw3Tokenizer.convert_tokens_to_string method (link):

if not prev_is_special:
    out_string += " "

Is there any reason for this, or should it just be removed to ensure that tokenizer.decode(tokenizer.encode(doc)) == doc?

@Apsod
Contributor

Apsod commented Jul 6, 2023

Looks good to me!
The only outstanding issue then is special-token-injection, but I guess that is a more general HF-issue?

@saattrupdan
Contributor Author

Looks good to me! The only outstanding issue then is special-token-injection, but I guess that is a more general HF-issue?

@Apsod Great. I've changed the prompt now. I also added a TODO comment to clarify whether these two lines are needed, as they break the decode(encode(doc)) == doc consistency. But that can be dealt with in another PR, if needed.

@saattrupdan
Contributor Author

@amyeroberts @ArthurZucker I cannot seem to merge in this PR - do any of you need to re-approve it first?

@amyeroberts (Collaborator) left a comment

Thanks for iterating!

@amyeroberts
Collaborator

@saattrupdan Yes, the branch is protected so that only certain people can merge. It also needs an approval from a core maintainer (me in this case :) )

Merging for you now. Thanks again for this contribution!

@amyeroberts merged commit abaca9f into huggingface:main on Jul 7, 2023
@saattrupdan deleted the feat/set-up-gpt-sw3-for-conversational-pipeline branch on July 7, 2023 at 18:55
@ArthurZucker
Collaborator

Also, regarding why spaces before/after special tokens are eaten in the slow version of transformers:

blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
…4648)

* feat: Add `_build_conversation_input_ids` to GPT-SW3 tokenizer, adjust line length

* feat: Merge in PR huggingface#24504.

This allows the GPT-SW3 models (and other GPT-2 based models) to be 4-bit quantised
using `load_in_4bit` with `bitsandbytes`.

* fix: F-string

* fix: F-string

* fix: Remove EOS token from all responses

* fix: Remove redundant newlines

* feat: Add `load_in_4bit` to `Pipeline`

* fix: Separate turns with `\n<s>\n` rather than `<s>`

* fix: Add missing newline in prompt

* tests: Add unit tests for the new `_build_conversation_input_ids` method

* style: Automatic style correction

* tests: Compare encodings rather than decodings

* fix: Remove `load_in_4bit` from pipeline arguments

* docs: Add description and references of the GPT-SW3 chat format

* style: Line breaks

* Apply suggestions from code review

Fix Conversation type hints

Co-authored-by: Arthur <[email protected]>

* fix: Import TYPE_CHECKING

* style: Run automatic fixes

* tests: Remove `_build_conversation_input_ids` unit tests

* tests: Remove import of `Conversation` in GPT-SW3 unit test

* style: Revert formatting

* style: Move TYPE_CHECKING line after all imports

* style: Imports order

* fix: Change prompt to ensure that `sp_model.encode` and `encode` yields same result

* docs: Add TODO comment related to the addition of whitespace during decoding

* style: Automatic style checks

* fix: Remove final whitespace in prompt, as prefix whitespace is used by sentencepiece

---------

Co-authored-by: Arthur <[email protected]>