Update regex.py to correctly parse scripts with combining marks #71

ajaykg · 2024-05-05T23:27:57Z

This fixes a problem that all tokenizers have with combining marks: diacritics, Indic matras (vowel signs that follow consonants), the Indic halant, Arabic and Hebrew vowel marks, etc. The split pattern breaks a word at every such mark, which was probably breaking most languages except English and CJK. Verified for Indic languages.
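To illustrate why a pattern built only on `\p{L}` misses these characters, here is a minimal sketch using Python's standard `unicodedata` module. The helper names `mark_report` and `is_combining_mark` are hypothetical and not part of this PR; the sketch only shows that matras and the halant fall in the Mark categories (`Mc`, `Mn`) rather than the Letter categories.

```python
import unicodedata


def mark_report(text):
    """Return (character, general category) pairs for each code point."""
    return [(ch, unicodedata.category(ch)) for ch in text]


def is_combining_mark(ch):
    """True if the code point is in a Mark category (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


# "Hindi" in Devanagari, spelled with explicit code points:
# HA, VOWEL SIGN I, NA, VIRAMA (halant), DA, VOWEL SIGN II
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

for ch, cat in mark_report(word):
    kind = "mark" if is_combining_mark(ch) else "letter"
    print(f"U+{ord(ch):04X} {cat} {kind} {unicodedata.name(ch)}")
```

Every second code point in this word is a mark, so a letters-only class ends its match after each consonant.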


ajaykg · 2024-05-05T23:29:07Z

>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> text = "हहिन्दी विकिपीडिया"
>>> print(re.findall(gpt2pat, text))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', 'ीड', 'िय', 'ा']
>>> # The above is broken at every vowel combining mark.
>>> # It can be fixed by including \p{M} wherever there is \p{L}:
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+[\p{L}\p{M}]+|\p{N}{1,3}| ?[^\s\p{L}\p{M}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> print(re.findall(gpt2pat, text))
['हहिन्दी', ' विकिपीडिया']
>>> # The above keeps each word intact and breaks correctly at word boundaries.


ajaykg · 2024-05-12T06:58:56Z

#73

ajaykg · 2024-05-22T05:10:53Z

bump.

dustinwloring1988 · 2024-06-15T16:39:18Z

Does this merge negatively affect anything?

ajaykg · 2024-06-15T23:36:05Z

It should not. We are telling the regular expression not to split a word between a character and the combining mark that follows it. By definition, combining marks in any script should not exist independently. At least for South Asian languages, all the vowels following a consonant are combining marks, so this should significantly improve tokenization, which currently makes the model act almost like a character-level model.
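As a rough way to check that claim on any split output, the sketch below flags tokens that begin with a combining mark, meaning the mark was separated from its base character. It uses only the standard library. The function name `orphaned_marks` is hypothetical, and the two token lists are copied from the interpreter session earlier in this thread rather than recomputed.

```python
import unicodedata


def orphaned_marks(tokens):
    """Return tokens whose first non-space code point is a combining mark."""
    orphans = []
    for tok in tokens:
        stripped = tok.lstrip()
        # A leading Mn/Mc/Me code point means the mark lost its base character.
        if stripped and unicodedata.category(stripped[0]).startswith("M"):
            orphans.append(tok)
    return orphans


# Split output quoted from the session above (original pattern).
before = ['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', 'ीड', 'िय', 'ा']
# Split output quoted from the session above (pattern with \p{M} added).
after = ['हहिन्दी', ' विकिपीडिया']

print(len(orphaned_marks(before)), "of", len(before), "tokens start with a mark")
print(len(orphaned_marks(after)), "of", len(after), "tokens start with a mark")
```

A check like this could be run over sample text in other scripts (Arabic, Hebrew, accented Latin in decomposed form) to look for regressions, assuming representative samples are available.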

Update regex.py

f7697cf

Fixing the problem that all tokenizers have with combining marks such as diacritics, Indic matras (vowels after consonants), the Indic halant, Arabic vowels, Hebrew vowels, etc. This was breaking most languages except English and CJK.

ajaykg added 2 commits May 7, 2024 21:52

Create utf8.py

d44ee6c

This is the utf8 tokenizer that skips the successive utf codepage byte in the word and the chunk to increase the token density.

Delete minbpe/utf8.py

45e9113

This was a botched approach. Let's skip.

ajaykg changed the title ~~Update regex.py~~ Update regex.py to correctly parse scripts with combining marks May 12, 2024

ajaykg mentioned this pull request May 12, 2024

The regular expressions break all scripts with combining marks in the middle of the syllable #73

Open
