Flags / regex-engine (PCRE?) -- transitioning patterns from regex101 to Python regex #454

Open
p-i- opened this issue Feb 10, 2022 · 3 comments

Comments

@p-i-

p-i- commented Feb 10, 2022

Please might you consider updating the README.md with some preliminary orientation?

It would be helpful to have a section that allows someone using a third-party tool such as regex101.com to test patterns and transition them into Python regex with confidence.

Currently code that is working on regex101 is failing on regex, and it's hard to figure out why.

I've tracked it down now.
https://regex101.com/r/HQuvtj/1 works (PCRE2 is set) but regex misses the second value.

Setting PCRE https://regex101.com/r/HQuvtj/2 ... now the regex101 output aligns with the regex output.

Fixing it for PCRE https://regex101.com/r/HQuvtj/3, now it works in Python

import regex

# https://regex101.com/r/HQuvtj/1 -- PCRE2: works on regex101 but fails here
# https://regex101.com/r/HQuvtj/2 -- PCRE: same (failing) output as here
pattern_string_fail =  r'''
        (?&N)
        ^ \W*? ENTRY              \W* (?P<entries>    (?&Range)    )     (?&N)

        (?(DEFINE)
             (?P<Decimal>
                 [ ]*? \d+ (?:[.,] \d+)? [ ]*?
             )
             (?P<Range>
                 (?&Decimal) - (?&Decimal) | (?&Decimal)
                 #(?&d) (?: - (?&d))?
             )
             (?P<N>
                 [\s\S]*?
             )
        )
    '''

# https://regex101.com/r/HQuvtj/3 -- works in both
pattern_string_ok =  r'''
        (?&N)
        ^ \W* ENTRY              \W* (?P<entries>    (?&Range)    )     (?&N)

        (?(DEFINE)
             (?P<Decimal>
                 [ ]* \d+ (?:[.,] \d+)? [ ]*
             )
             (?P<Range>
                 (?&Decimal) - (?&Decimal) | (?&Decimal)
                 #(?&d) (?: - (?&d))?
             )
             (?P<N>
                 [\s\S]*?
             )
        )
'''

flags = regex.MULTILINE | regex.VERBOSE # | regex.DOTALL  | regex.V1 #| regex.IGNORECASE | regex.UNICODE

s = 'ENTRY: 0.0975 - 0.101'
print(regex.compile(pattern_string_fail, flags=flags).match(s).groupdict())
print(regex.compile(pattern_string_ok, flags=flags).match(s).groupdict())

... gives:

{'entries': '0.0975', 'Decimal': None, 'Range': None, 'N': None}
{'entries': '0.0975 - 0.101', 'Decimal': None, 'Range': None, 'N': None}
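
One plausible reading of the two outputs above, not confirmed anywhere in this thread: the lazy `[ ]*?` at the end of `Decimal` first matches zero spaces, and an engine that will not backtrack into a finished group call (an "atomic" call) can never extend it to reach the `-`. The sketch below mimics that effect with the standard `re` module only, using the `(?=(...))\1` idiom to emulate an atomic group. The `DECIMAL_*` patterns and the `as_atomic` helper are simplified stand-ins invented for illustration, not the patterns from the issue.

```python
import re

s = "0.0975 - 0.101"

# Simplified stand-ins for the Decimal group (hypothetical, for illustration).
DECIMAL_LAZY = r"\d+(?:[.,]\d+)? *?"    # trailing spaces matched lazily
DECIMAL_GREEDY = r"\d+(?:[.,]\d+)? *"   # trailing spaces matched greedily


def as_atomic(body: str) -> str:
    """Emulate an atomic group: a lookahead matches once, a backreference replays it."""
    return rf"(?=({body}))\1"


# Ordinary group: the engine can backtrack into the lazy quantifier to reach "-".
normal = re.match(rf"(?:{DECIMAL_LAZY})-", s)
# Atomic + lazy: the space is never consumed, so "-" cannot match.
atomic_lazy = re.match(as_atomic(DECIMAL_LAZY) + "-", s)
# Atomic + greedy: the space is consumed up front, so "-" matches.
atomic_greedy = re.match(as_atomic(DECIMAL_GREEDY) + "-", s)

print(normal, atomic_lazy, atomic_greedy, sep="\n")
```

If this reading is right, it would explain why switching `[ ]*?` to `[ ]*` makes the pattern portable: the greedy form does not rely on backtracking into the group after it has returned.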

Problems I was bumping into:

🔸 VERSION0/1 PCRE/PCRE2/Python
regex101 lets you select a flavour. Flavours include Python, PCRE and PCRE2.

regex doesn't specify whether it is using PCRE, PCRE2 or something different. I think this should be right at the top of the README.md.

regex.DEFAULT_VERSION, regex.VERSION0, regex.V0, regex.VERSION1, regex.V1 gives (8192, 8192, 8192, 256, 256). I think it should be documented that DEFAULT_VERSION is V0. Also what do V0 and V1 correspond to?

Is V0 PCRE and V1 PCRE2?

Does Python's re use PCRE?

The author of this repo probably has context that the reader lacks.

So that's the first problem: to understand the situation regarding engines (PCRE/PCRE2/?) with respect to re and regex, and to explain WHAT EXACTLY setting V0/V1 does.

🔸 FLAGS
Problem is, I don't know how to configure regex to ensure it's using the same engine as regex101 or the same flags.

regex101 offers:

g - global -- don't return after first match
m - multiline -- ^ and $ match start/end of line
i - insensitive -- case insensitive match
x - extended -- ignore whitespace
s - single line -- Dot matches newline
u - unicode -- match with full unicode
U - ungreedy -- make quantifiers lazy
A - anchored -- anchor to start of pattern, or at the end of the most recent match
J - changed -- allow duplicate subpattern names
D - dollar end only -- $ matches only end of pattern

Are these part of a regex standard?

regex offers:

The scoped flags are: FULLCASE, IGNORECASE, MULTILINE, DOTALL, VERBOSE, WORD.

The global flags are: ASCII, BESTMATCH, ENHANCEMATCH, LOCALE, POSIX, REVERSE, UNICODE, VERSION0, VERSION1.

If those one-letter flag names match these flags, it would be useful to have a table. Some are obvious, but not all.
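
A rough mapping, as a sketch: five of the regex101 letters have direct counterparts in the standard `re` module, and the regex flag lists quoted above contain flags of the same names (MULTILINE, IGNORECASE, VERBOSE, DOTALL, UNICODE). The helper name `flags_from_regex101` is hypothetical. `g`, `U`, `A`, `J` and `D` have no same-named flag in those lists, so the helper rejects them rather than guess.

```python
import re

# regex101 letter -> standard-library flag with the same meaning.
REGEX101_TO_RE = {
    "m": re.MULTILINE,   # ^ and $ match at line boundaries
    "i": re.IGNORECASE,  # case-insensitive matching
    "x": re.VERBOSE,     # ignore whitespace and allow comments
    "s": re.DOTALL,      # dot also matches newline
    "u": re.UNICODE,     # Unicode matching (already the default for str patterns)
}


def flags_from_regex101(letters: str) -> re.RegexFlag:
    """Combine regex101 flag letters into one re flag value (hypothetical helper)."""
    combined = re.RegexFlag(0)
    for letter in letters:
        if letter not in REGEX101_TO_RE:
            raise ValueError(f"no direct re flag for regex101 flag {letter!r}")
        combined |= REGEX101_TO_RE[letter]
    return combined


print(flags_from_regex101("mx"))
```

Raising on unknown letters is a deliberate choice here: silently dropping `U` or `D` would change what the pattern matches without any warning.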

Also, I should be able to write (?gmi) at the start of my regex to set flags from within the pattern -- this would resolve ambiguity nicely. But g does not work.
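
A small check of this with the standard `re` module (used here because it needs no install; the sample `text` is made up): `(?mi)` is accepted as an inline flag group, while adding `g` makes compilation fail.

```python
import re

text = "entry: 1\nENTRY: 2"

# (?mi) switches on MULTILINE and IGNORECASE from inside the pattern.
inline_hits = re.findall(r"(?mi)^entry: (\d)", text)
print(inline_hits)

# "g" is not an inline flag in re, so compiling the pattern fails.
try:
    re.compile(r"(?gmi)^entry")
    g_accepted = True
except re.error as exc:
    g_accepted = False
    print("rejected:", exc)
```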

🔸 🔸 🔸

I'm submitting this in the hope that some flounder-time can be saved with some preamble in the README.md, explaining the mess we're in regarding regex standards.

I think it would help drive adoption of your library.

@mrabarnett
Owner

The README explains that this module was written to be compatible with the re module and provide a superset of what re provides. It also explains the purpose of V0 and V1.
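
A minimal sketch of what the version flag changes, assuming the third-party `regex` package is installed (the guard lets the snippet run without it). It checks the `DEFAULT_VERSION` value reported earlier in this issue and uses one VERSION1-only feature, set difference inside a character class, via the inline `(?V1)` flag.

```python
import importlib.util

# regex is a third-party package; skip quietly if it is not installed.
HAVE_REGEX = importlib.util.find_spec("regex") is not None

if HAVE_REGEX:
    import regex

    # The issue reports DEFAULT_VERSION == VERSION0 on a stock install.
    default_is_v0 = regex.DEFAULT_VERSION == regex.VERSION0

    # VERSION1 enables nested sets and set operations, e.g. set difference.
    consonants = regex.findall(r"(?V1)[[a-z]--[aeiou]]", "regex")
    print(default_is_v0, consonants)
```

Neither version selects PCRE or PCRE2; both are behaviours of the same engine.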

Neither the re module nor the regex module are based on PCRE, and I hadn't heard of PCRE2. PCRE was intended to be Perl Compatible Regular Expressions, but Perl has changed some of its regex behaviour since then, so PCRE isn't strictly compatible with Perl any more!

When regex101.com says "Python" it means Python with the standard re module.

As for the g flag, it's how regex101.com lets you specify whether you want all of the matches or only the first, but the details of how that's done will depend on the particular API of each regex engine. In Perl, for example, it would be a suffix after the regex and would be used only in the condition of a while loop. I don't know of any regex engine that has it as an inline flag.
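
In Python the equivalent of `g` is a choice of function rather than a flag. A short sketch with the standard `re` module (the sample `text` is made up for illustration):

```python
import re

text = "ENTRY: 1\nENTRY: 2\nENTRY: 3"
pattern = re.compile(r"^ENTRY: (\d+)", re.MULTILINE)

# Without "g": stop at the first match.
first = pattern.search(text)
print(first.group(1))

# With "g": iterate over every non-overlapping match.
all_values = [m.group(1) for m in pattern.finditer(text)]
print(all_values)
```

Note that the script in the issue uses `match()`, which additionally only tries the pattern at the start of the string.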

The A flag appears to do what \K does in a pattern (in those engines that support it). Does any engine have an A flag?

In fact, in general, those uppercase flags look like they're for controlling features specific to certain engines.

@rootsmusic

rootsmusic commented Dec 13, 2023

> Neither the re module nor the regex module are based on PCRE
> When regex101.com says "Python" it means Python with the standard re module.

No @MRBarnett, the FAQ says: "For Python, regex101 implements it on top of PCRE library, by suppressing features not available in Python." There's an issue requesting your module be added as a Python flavor. It's labeled for discussion because regex101's developer believes that Python users (like this issue's author) can choose its pcre2 flavor (pcre is at end of life) instead, which should be like your module.

@KubaO

KubaO commented Nov 27, 2024

Why should this regex implementation worry about some third-party website being broken? regex101 should be using the actual Python re module, with the VM compiled to WebAssembly, to parse and execute Python regular expressions, bug-for-bug. They should do the same for this regex module, since it's in fairly wide use as well. This is really a no-brainer if someone is serious about that stuff.

Asking mrab-regex to adapt to regex101's brokenness is IMHO an absurd way to go.

> It would be helpful to have a section that allows someone using a third-party tool such as regex101.com to test patterns and transition them into Python regex with confidence.

Yes. That's on the third-party tool. And if their developer doesn't think it's their job - oh well.

Regex101 can't really be used in most professional closed-source applications anyway because it's a server-based solution, and leaks your code to a third party by design.

> Currently code that is working on regex101 is failing on regex, and it's hard to figure out why.

Yes, but that is squarely on regex101. When they say "Python regex", they are not being frank with you. They "fake" it; they don't actually use a Python re engine as they should.

This one should be closed IMHO.
