Unicode named groups for regular expressions #35459

Closed
BenjaminGalliot opened this issue Apr 13, 2020 · 3 comments · Fixed by #35607
Labels
help wanted, strings

Comments

@BenjaminGalliot
Contributor

Hello,

This issue is closely related to a message I posted more than a year ago here.

The request is to allow Unicode named groups in regular expressions. The restriction seems to have come only from a limitation in PCRE, which was updated last year to lift it.

I remember trying the latest PCRE2 near the end of last year, and it worked. I just tried again on the latest Julia 1.4 without success, so I reread that topic, and indeed I had been asked to open an issue here. Done!

Sincerely.

@StefanKarpinski added the help wanted and strings labels Apr 13, 2020
@Micket
Contributor

Micket commented Apr 13, 2020

Related #35322 ?

@BenjaminGalliot
Contributor Author

It is related, yes (I had not seen this recent issue; at least I am not alone!). PCRE was updated for this last year by Philip Hazel. I am copying here the emails he sent me in January 2019, in case they help.

Thank you for raising this issue. When names were first invented for
PCRE, before Perl had them, it seemed sensible to restrict them to
letters, underscore, and digits. Perl eventually introduced named
groups, but I didn't make any changes to PCRE, and I haven't thought
about non-ASCII characters in names. I have just had a quick look at
Perl's current documentation. It says two things:

  1. "name must not begin with a number, nor contain hyphens", which isn't
    very helpful.

  2. "Currently NAME is restricted to simple identifiers only. In other
    words, it must match "/^[_A-Za-z][_A-Za-z0-9]*\z/" or its Unicode
    extension (see utf8), though it isn't extended by the locale (see
    perllocale)."

I have yet to find where in the copious Perl documentation it has a
definition for this extension. However, a quick test indicates that it
does allow, for example, ABáC as a name. I will do some more research
and then investigate upgrading PCRE2 in line with Perl. The next PCRE2
release (10.33) is likely to come out in a couple of months' time. I
will let you know when there is something to test.

Regards,
Philip

I have just committed a patch that upgrades PCRE2 to be more like Perl.
If it is running in UTF mode, capture group names are now allowed to
contain Unicode letters and Unicode decimal digits (and underscore). In
other words, names must match this pattern:

^[_\p{L}][_\p{L}\p{Nd}]*\z

The next release (10.33) is likely to happen in a couple of months'
time. In the meantime, you can check out the latest code like this:

svn co svn://vcs.exim.org/pcre2/code/trunk pcre2

Regards,
Philip
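
For readers who want to see how the two naming rules quoted in these emails differ, here is a minimal sketch in Python. It is not Julia code and not PCRE2's actual implementation; the function names are hypothetical, and the Unicode check is approximated with `unicodedata` general categories (`L*` for `\p{L}`, `Nd` for `\p{Nd}`). The underscore is accepted because the email says it is allowed.

```python
import re
import unicodedata

# ASCII-only rule quoted from the Perl documentation.
# fullmatch() plays the role of the ^ ... \z anchors.
ASCII_NAME = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")


def is_ascii_group_name(name: str) -> bool:
    """True if `name` satisfies the ASCII-only rule."""
    return ASCII_NAME.fullmatch(name) is not None


def _is_name_start(ch: str) -> bool:
    # Underscore, or any character in a Unicode letter category (\p{L}).
    return ch == "_" or unicodedata.category(ch).startswith("L")


def _is_name_continue(ch: str) -> bool:
    # A start character, or a Unicode decimal digit (\p{Nd}).
    return _is_name_start(ch) or unicodedata.category(ch) == "Nd"


def is_unicode_group_name(name: str) -> bool:
    """True if `name` satisfies the UTF-mode rule described in the email."""
    if not name:
        return False
    return _is_name_start(name[0]) and all(_is_name_continue(c) for c in name[1:])


if __name__ == "__main__":
    # Print each candidate with its result under the ASCII and Unicode rules.
    for candidate in ["ABáC", "année", "group_1", "1abc", "a-b"]:
        print(candidate, is_ascii_group_name(candidate), is_unicode_group_name(candidate))
```

A name such as ABáC, the example mentioned in the first email, is rejected by the ASCII rule and accepted by the Unicode one, which is the behaviour this issue asks Julia to expose.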

Micket added a commit to Micket/julia that referenced this issue Apr 27, 2020
Keno added a commit that referenced this issue May 5, 2020
@JeffBezanson reopened this May 5, 2020
StefanKarpinski added a commit that referenced this issue Sep 22, 2020
@KristofferC reopened this Sep 24, 2020
@vtjnash
Member

vtjnash commented Feb 3, 2021

Fixed by #39310
