
Replace regex parser with context free grammar parser #80

Open
maweki opened this issue May 19, 2020 · 4 comments


@maweki
Collaborator

maweki commented May 19, 2020

If you've looked at the offending code behind #79, you'll see that the amalgamation of regexes is an unmaintainable mess. The string replacement inside regex strings and the other ad-hoc combination mechanisms in particular seem dangerous.

Now, Python regexes are not composable through the usual regular-language operations, and I don't know of a performant regex engine that allows that.

A maintainable alternative could be a tokenizer/parser combination from the realm of context-free grammars. In my experience, these can be combined far more easily and are therefore more maintainable. The result is a tree instead of a list, which is no problem: some tree nodes would be replaced by data constructors (e.g. creating decimals from a comma separator and two number nodes), and other parts of the tree would be flattened into a list, basically as it is now.
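To make the idea concrete, here is a minimal hand-rolled sketch (all names are hypothetical, and a real implementation would use a parsing library): simple per-token regexes feed a tiny parser that builds values such as fractions instead of returning a flat match list.

```python
import re
from fractions import Fraction

# Hypothetical sketch: simple per-token regexes feed a small hand-written
# parser. Token names and the grammar are illustrative only.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("SLASH",  r"/"),
    ("WS",     r"\s+"),
]
TOKENIZER = re.compile("|".join("(?P<%s>%s)" % (n, p) for n, p in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, text) pairs, skipping whitespace."""
    for m in TOKENIZER.finditer(text):
        if m.lastgroup != "WS":
            yield m.lastgroup, m.group()

def parse_amount(text):
    """amount := INT | INT '/' INT | INT INT '/' INT  (mixed number)"""
    toks = list(tokenize(text))
    kinds = [k for k, _ in toks]
    if kinds == ["NUMBER"]:                               # "2"
        return Fraction(int(toks[0][1]))
    if kinds == ["NUMBER", "SLASH", "NUMBER"]:            # "3/4"
        return Fraction(int(toks[0][1]), int(toks[2][1]))
    if kinds == ["NUMBER", "NUMBER", "SLASH", "NUMBER"]:  # "1 1/2"
        whole = Fraction(int(toks[0][1]))
        return whole + Fraction(int(toks[1][1]), int(toks[3][1]))
    raise ValueError("cannot parse amount: %r" % text)
```

Each token regex stays trivially readable, and all the structural complexity lives in the parser, where it can be read as a grammar.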

I also have more confidence in us writing and reading simple regexes for tokens plus a context-free grammar than whole regexes for complex expressions that span multiple lines and are later modified through string-processing functions.

In the longer run we could make Chomsky proud and even supply general CFGs for the supported languages, identifying verbs, adjectives, and nouns through the grammar, which would give us more confidence in extracting recipe steps.

I know it's not a priority, but is this something we would put on the agenda?

@kirienko
Owner

I cannot claim that CFGs are more maintainable than regexes (the latter may indeed look messy, but intrinsically they are finite automata and therefore unambiguous). But being a theoretical physicist, I of course support all kinds of mathematical exercises and perversions 💪

But could you give an example of a really ugly regex? Maybe we can fix it without going into higher dimensions.

@maweki
Collaborator Author

maweki commented May 30, 2020

but intrinsically they are finite automata and therefore unambiguous

I think we value regular languages for their composability (closure under (almost?) all operations). While this holds for regular languages, composability breaks once backreferences and unnamed capture groups enter the picture. This is why essentially no regex library I have seen lets you combine compiled regexes.

First of all, I want you to take note of this Stack Overflow question, "How to combine multiple regex into single one in python?": https://stackoverflow.com/questions/42136040/how-to-combine-multiple-regex-into-single-one-in-python

The answer boils down to re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4)).findall(sentence), which is barely legible.
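To illustrate, here is a minimal reproduction of that approach (the example patterns and strings are my own), together with how an unnamed capture group inside one sub-pattern silently breaks it:

```python
import re

# Combination only works at the *string* level,
# never on compiled pattern objects.
re1, re2 = r"\d+", r"[a-z]+"
combined = re.compile("(%s|%s)" % (re1, re2))
print(combined.findall("take 2 eggs"))        # ['take', '2', 'eggs']

# Fragility: an unnamed capture group inside one sub-pattern changes
# the shape of every findall() result from strings to tuples.
re3 = r"(\d+) eggs"
combined2 = re.compile("(%s|%s)" % (re3, re2))
print(combined2.findall("take 2 eggs"))       # [('take', ''), ('2 eggs', '2')]
```

So the composed pattern's behavior depends on the internals of each sub-pattern, which is exactly the opposite of composability.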

But could you give an example of a really ugly regex?

Again, I am more focused on the composability of a grammar that yields a parse tree, which we can then analyze more easily.

Example:

gourmet/gourmet/convert.py

Lines 748 to 765 in 5d88f0a

NUMBER_REGEXP = "[\d"
#for k in UNICODE_INTEGERS.keys(): NUMBER_REGEXP+=k # COVERED by re.UNICODE
for k in list(UNICODE_FRACTIONS.keys()): NUMBER_REGEXP+=k
NUMBER_START_REGEXP = NUMBER_REGEXP + ']'
NUMBER_END_NO_RANGE_REGEXP = NUMBER_START_REGEXP # Is this right -- quick fix here.
NUMBER_MID_REGEXP = NUMBER_REGEXP + ',.' + SLASH
NUMBER_MID_NO_RANGE_REGEXP = NUMBER_MID_REGEXP + " /]"
NUMBER_MID_REGEXP += " /-"
NUMBER_MID_REGEXP += "]"
NUMBER_END_REGEXP = NUMBER_START_REGEXP
NUMBER_REGEXP = "("+NUMBER_START_REGEXP+"*"+NUMBER_MID_REGEXP+"*"+NUMBER_END_REGEXP
if NUMBER_WORD_REGEXP:
    NUMBER_REGEXP = NUMBER_REGEXP + '|' + NUMBER_WORD_REGEXP + ')'
    NUMBER_NO_RANGE_REGEXP = '(' + NUMBER_START_REGEXP + '+|' + NUMBER_WORD_REGEXP + ')'
else:
    NUMBER_REGEXP = NUMBER_REGEXP + ")"
    NUMBER_NO_RANGE_REGEXP = NUMBER_START_REGEXP + '+'
NUMBER_MATCHER = re.compile("^%s$"%NUMBER_REGEXP,re.UNICODE)


I would accept it if it were only concatenation (+) of strings (even though that also breaks in the aforementioned cases), but the control characters are unnerving. As an alternative we could wrap the strings in some regex-compose DSL, but having something like NUMBER_START_REGEXP + '+|' + NUMBER_WORD_REGEXP, with the + '+|' +, is surely not what we want.
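As a sketch of what such a regex-compose DSL might look like (the helper names and the stand-in character sets are hypothetical; UNICODE_FRACTIONS and NUMBER_WORD_REGEXP are mocked with small literals):

```python
import re

# Hypothetical mini-DSL: composition becomes function calls instead of
# string surgery with stray '+|' fragments.
def char_class(*parts):  return "[" + "".join(parts) + "]"
def alt(*branches):      return "(" + "|".join(branches) + ")"
def some(pattern):       return pattern + "+"

DIGIT = r"\d"
FRACTION_CHARS = "½¼¾"                  # stand-in for UNICODE_FRACTIONS.keys()
NUMBER_WORD = "one|two|three"           # stand-in for NUMBER_WORD_REGEXP
NUMBER_START = char_class(DIGIT, FRACTION_CHARS)
NUMBER_NO_RANGE = alt(some(NUMBER_START), NUMBER_WORD)
NUMBER_MATCHER = re.compile("^%s$" % NUMBER_NO_RANGE, re.UNICODE)
```

The intent of each pattern is visible in the call structure, but the underlying composability problems (capture groups, backreferences) would remain, which is why a grammar still seems preferable.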

@kirienko
Owner

kirienko commented Jun 1, 2020

I have to confess that I underestimated the severity of the problem. I agree that the current state of the code is far from maintainable. Maybe we can do better with a CFG; I have close to zero practical experience with that and cannot judge. But I can also imagine that there is a ready-to-use Python solution for this somewhere, which might let us eliminate the complexity from our code. We don't need to do everything ourselves.

@maweki
Collaborator Author

maweki commented Jun 1, 2020

I am not advocating writing our own parser. But converting this regex kerfuffle into a context-free grammar should be straightforward with the proper library.
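For instance, with an EBNF-style library such as Lark, the number grammar might be written declaratively along these lines (an untested, illustrative sketch; the rule names and the choice of Lark are my own):

```lark
// Hypothetical Lark-style grammar for recipe amounts -- illustrative only
?amount: mixed | fraction | INT | NUMBER_WORD
mixed: INT fraction
fraction: INT "/" INT
NUMBER_WORD: "one" | "two" | "three"

%import common.INT
%ignore " "
```

The whole structure that the string-concatenated regexes encode implicitly becomes a handful of named rules.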
