The regular expressions break all scripts with combining marks in the middle of the syllable #73

ajaykg · 2024-05-12T06:50:21Z

>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""" )
>>> str = r"""हहिन्दी विकिपीडिया"""
>>> print (re.findall(gpt2pat, str ))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', 'ीड', 'िय', 'ा']

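The output above is split at every vowel combining mark, because those marks belong to the Unicode category M (Mark), not L (Letter), so \p{L}+ stops matching at each one.
This can be fixed by including \p{M} wherever \p{L} appears in the regular expression: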

>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+[\p{L}\p{M}]+|\p{N}{1,3}| ?[^\s\p{L}\p{M}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""" )
>>> print (re.findall(gpt2pat, str ))
['हहिन्दी', ' विकिपीडिया']

The output above is now split correctly at word boundaries.
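To see why the original pattern splits where it does, it helps to look at the Unicode general category of each code point in a Devanagari word. The sketch below uses only the standard library. The helper names `categories` and `runs` are made up for illustration, and `runs` is a simplified stand-in for `\p{L}+` versus `[\p{L}\p{M}]+`, not the full tokenizer pattern.

```python
import unicodedata

# "Hindi" in Devanagari, spelled out as code points so the content is unambiguous:
# HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

def categories(text):
    """Return (character, general category) pairs for each code point."""
    return [(ch, unicodedata.category(ch)) for ch in text]

def runs(text, major_classes):
    """Group consecutive code points whose major category (first letter of the
    general category, e.g. 'L' or 'M') is in major_classes.
    Simplified stand-in for a character-class-plus quantifier."""
    out, current = [], ""
    for ch in text:
        if unicodedata.category(ch)[0] in major_classes:
            current += ch
        elif current:
            out.append(current)
            current = ""
    if current:
        out.append(current)
    return out

print(categories(word))   # consonants are Lo; vowel signs and virama are Mc / Mn
print(runs(word, "L"))    # letters only: the word falls apart at every mark
print(runs(word, "LM"))   # letters plus marks: the word stays in one piece
```

The same check applies to any script that attaches vowel signs or other marks to a base letter, which is why the issue affects more than Hindi.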


ajaykg · 2024-05-12T07:00:41Z

#71

ajaykg · 2024-05-22T05:13:49Z

Acknowledgement from tiktoken that they got it wrong: openai/tiktoken#292

ajaykg · 2024-05-22T07:04:09Z

@karpathy can you please review?

ajaykg mentioned this issue May 12, 2024

Update regex.py to correctly parse scripts with combining marks #71

Open
