The regular expressions break all scripts with combining marks in the middle of the syllable #73

Open
ajaykg opened this issue May 12, 2024 · 3 comments

ajaykg commented May 12, 2024

>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""" )
>>> str = r"""हहिन्दी विकिपीडिया"""
>>> print (re.findall(gpt2pat, str ))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', 'ीड', 'िय', 'ा']

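The text above is split at every vowel combining mark, because Devanagari vowel signs and the virama belong to the Unicode Mark category (\p{M}), not the Letter category (\p{L}).
It can be fixed by including \p{M} wherever \p{L} appears in the regular expression: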

>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+[\p{L}\p{M}]+|\p{N}{1,3}| ?[^\s\p{L}\p{M}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""" )
>>> print (re.findall(gpt2pat, str ))
['हहिन्दी', ' विकिपीडिया']

With this change the text is correctly split at word boundaries.
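
To see why \p{L} alone is not enough, here is a small sketch that uses only the standard-library `unicodedata` module instead of `regex`. The helper names `major_category` and `letter_mark_runs` are made up for illustration and are not part of this repo or of tiktoken; `letter_mark_runs` is only a rough stand-in for the `[\p{L}\p{M}]+` part of the pattern, not the full tokenizer regex.

```python
import unicodedata


def major_category(ch: str) -> str:
    # First letter of the Unicode general category: "L" for letters, "M" for marks, etc.
    return unicodedata.category(ch)[0]


def letter_mark_runs(text: str) -> list[str]:
    # Hypothetical helper: collect maximal runs of letters and combining marks,
    # roughly what [\p{L}\p{M}]+ would match. Everything else acts as a separator.
    runs, current = [], ""
    for ch in text:
        if major_category(ch) in ("L", "M"):
            current += ch
        elif current:
            runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs


word = "हिन्दी"
categories = [(ch, unicodedata.category(ch)) for ch in word]
# The vowel signs are Mc and the virama is Mn, so \p{L}+ stops at each of them.
print([cat for _, cat in categories])  # ['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Mc']

# Treating marks as part of the word keeps each word in one piece.
print(letter_mark_runs("हहिन्दी विकिपीडिया"))
```

Only three of the six code points in "हिन्दी" are letters, which is why the original pattern cuts the word into fragments.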

ajaykg commented May 12, 2024

#71

ajaykg commented May 22, 2024

tiktoken has acknowledged that they got this wrong: openai/tiktoken#292

ajaykg commented May 22, 2024

@karpathy can you please review?
