-
Notifications
You must be signed in to change notification settings - Fork 829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Inconsistent removal of leading and trailing whitespace for Metaspace pretokenizers #1250
Comments
@ArthurZucker Maybe ? I don't know which version is "correct". |
possible causes for this issue is because the fast version is based on rust while the "slow" one is based on python. |
@chris-ha458 Thanks, I know this is the difference. There are a lot of discrepancies between the 2 versions. Usually they are more subtle though. This is really bad to get difference encodings since it changes how the end model will behave. |
@Narsil ah, obviously, as one of the most important developers in the project you would know :) I was not trying to "mansplain" this project to you. Coming back to this specific issue, is this something that you'd like a volunteer to investigate in detail? |
Your help is most definitely appreciated, and helps me out ton since I don't have to answer every issue. If you want to dig into this, you have my blessing. It's most likely linked to number of spaces and how both do the preprocessing. the "encode" method most importantly. You can check out the test battery within One important thing to note is that since both slow and fast are now being used, "fixing" one might break existing things which is why that area has been moving quite slow. |
I was operating on the assumption that the rust binding was somewhat of a "canonical" version, but it seems both python and rust versions are of similar standing. it would be difficult to compare and contrast each divergences especially without knowing which is "correct". I think my approach would be to acquire more information especially with my own tokenizers and corpuses so as to be able to speak with more confidence on what the encountered and expected behaviors might be. Hopefully, some of the divergence might come from outright bugs from either implementations that could be fixed. As for the |
Indeed. |
I got a even more weried result: encode 1 sentece take part 2 parts and concat the ids, result not same as encode sentence along: |
Also already mentioned in #16334 |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
#1357 fixed it I believe |
Description
There is a mismatch between the fast and slow implementations of Metaspace-based tokenizers (e.g,. t5, t5-v1_1, flan-t5, ...) when dealing with inputs that contain leading and trailing whitespace.
Code to reproduce:
Expected:
I'm unsure which is correct, but the fast and slow tokenizers should match. If I had to make a guess, the slow tokenizer's output makes more sense since it's strange to keep a "prefix token" (id=3) by itself:
["space", "_", "</s>"]
. The normal t5 tokenizers do not have this issue since they explicitly split on whitespace (they use a sequence: whitespace split followed by metaspace).The text was updated successfully, but these errors were encountered: