Ability to query minimum/maximum length of regex #513

Open
MegaIng opened this issue Nov 24, 2023 · 4 comments

@MegaIng

MegaIng commented Nov 24, 2023

For the lark parsing library, we use the (sadly private) stdlib re._parser module to query the minimum and maximum length of a regex:

https://github.com/lark-parser/lark/blob/942366b49247e996e387cb901ed96c7d861382a0/lark/utils.py#L132-L156

As can be seen from the snippet, since we also support using regex instead of re, we need to take special care when encountering regex-specific syntax, such as nested sets and category patterns. The only value that needs to be correct is whether the minimum length is 0 or greater than 0, since we depend on regular expressions being non-empty in a few places.
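For readers who do not want to follow the link, here is a minimal sketch of the stdlib approach described above. It is not the lark code itself; the helper name `regex_width` is made up for illustration, and it relies on the private `re._parser` module (named `sre_parse` before Python 3.11), which may change without notice.

```python
from typing import Tuple

try:
    # Python 3.11+: the parser lives in the private module re._parser.
    import re._parser as sre_parse
except ImportError:
    # Older Pythons expose the same parser as the top-level sre_parse module.
    import sre_parse


def regex_width(pattern: str) -> Tuple[int, int]:
    """Return (min_width, max_width) for a pattern in stdlib `re` syntax.

    Hypothetical helper: it only understands syntax the stdlib parser
    accepts, so `regex`-only features will raise an error here.
    """
    lo, hi = sre_parse.parse(pattern).getwidth()
    # The bounds can be int subclasses (e.g. MAXREPEAT), so normalise them.
    return int(lo), int(hi)


if __name__ == "__main__":
    for pat in ["a{2,5}", "ab|c", "a*", "a+"]:
        print(pat, regex_width(pat))
```

For unbounded patterns such as `a*`, the upper bound is a large sentinel value whose exact magnitude depends on the Python version, so only the lower bound should be relied on there.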

It would be nice if there were a way to query the minimum and maximum match size from a compiled regex object. The stdlib re module is lower priority, since there is at least a way to access this information reliably there, but I am probably also going to make a request there.

@mrabarnett
Owner

Do you expect the minimum and maximum to be accurate? That would be difficult if there were references to capture groups or calls.

@MegaIng
Author

MegaIng commented Nov 25, 2023

Accurate in the sense that all possible matches will fall into this range, even if no match with that length actually exists, yes. And preferably, of course, the 0/1+ distinction of the minimum should be fully accurate.

For our use case it would actually be fully fine if this is only correctly supported for purely regular syntax, i.e. no backreferences or non-regular extensions like nested calls.
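As a rough illustration of the 0/1+ distinction, one fallback that works without any width API is to probe the compiled pattern with the empty string. The sketch below uses the stdlib `re` module so it runs as-is; the helper name `can_match_empty` and its `compile_fn` parameter are hypothetical, and passing `regex.compile` instead is an assumption about how one might adapt it. It is only a heuristic: a zero-width pattern such as `(?=a)` does not match the empty string even though its minimum width is 0.

```python
import re


def can_match_empty(pattern: str, compile_fn=re.compile) -> bool:
    """Return True if the compiled pattern matches the empty string.

    Hypothetical helper: `compile_fn` is any callable returning an object
    with a `match` method, e.g. `re.compile`.
    """
    return compile_fn(pattern).match("") is not None


def min_width_is_zero(pattern: str) -> bool:
    # Heuristic only: treats "matches the empty string" as "minimum width 0".
    return can_match_empty(pattern)


if __name__ == "__main__":
    for pat in ["a*", "a+", r"\d{0,3}"]:
        print(pat, min_width_is_zero(pat))
```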

@mrabarnett
Owner

The minimum length is already available internally, except that it doesn't include references or calls (it assumes that they have zero length). The maximum length is more of a problem...

@rrthomas

I too would love this functionality in a dependable way, in my case for the rpl text search/replace batch utility, where knowing the maximum possible length of a regex match allows me to improve worst-case performance hugely.
