Support for Unicode property escapes (and /u flag) #116

Closed
modernserf opened this issue Jan 8, 2019 · 4 comments

@modernserf

ES2018 added support for Unicode property escapes. These let you match complex Unicode ranges (e.g. the characters valid in identifiers) much more compactly than with explicit code point ranges. For example, this regex matches all valid JS identifiers:

let re = /[$_\p{ID_Start}][$\p{ID_Continue}]*/u
let foo = re.test("foo")
let $π123 = re.test("$π123")
let ভরা = re.test("ভরা")

Compare with the regex used by acorn:

https://github.com/acornjs/acorn/blob/2ffed00236071aece0a79813b98c36f302ff1f9d/acorn/src/identifier.js#L22-L31

However, this requires the /u flag, which is currently forbidden:

moo/moo.js

Lines 44 to 49 in 13e1157

// TODO: consider /u support
if (obj.ignoreCase) throw new Error('RegExp /i flag not allowed')
if (obj.global) throw new Error('RegExp /g flag is implied')
if (obj.sticky) throw new Error('RegExp /y flag is implied')
if (obj.multiline) throw new Error('RegExp /m flag is implied')
if (obj.unicode) throw new Error('RegExp /u flag is not allowed')
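
To make the dependency on the flag concrete, here is a minimal sketch (not taken from moo) of how the same pattern source behaves with and without /u. It assumes an engine with ES2018 property escape support, such as a recent Node.js or browser; the variable names are only illustrative.

```javascript
// Without the u flag, \p is just an escaped letter "p", so the pattern
// matches the literal text "p{ID_Start}" rather than a Unicode property.
const withoutU = /^\p{ID_Start}$/

// With the u flag, \p{ID_Start} is a Unicode property escape.
const withU = /^\p{ID_Start}$/u

console.log(withoutU.test('π'))           // false
console.log(withoutU.test('p{ID_Start}')) // true
console.log(withU.test('π'))              // true
console.log(withU.test('p{ID_Start}'))    // false
```

So a property escape written without the flag does not fail loudly; it silently matches something else.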

I presume the /u flag was disabled because it added complexity to the implementation but (previously) had no significant advantages; however, I believe that these new property escapes would make proper Unicode support in grammars built with moo dramatically simpler.

Property escapes have pretty good support in current browsers and with Babel. I have no idea what the performance implications of the /u flag are, but I would expect that support could be implemented as purely opt-in.

@tjvr
Collaborator

tjvr commented Jan 10, 2019

Hi! I definitely appreciate why the Unicode flag is useful 😊

Moo builds a single RegExp which combines all of the tokens, so the flags effectively have to be the same for all of your tokens.
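
A rough sketch of why the flags end up shared, assuming the combination works by wrapping each rule in a capture group and joining them with `|`. The helpers `combineRules` and `nextToken` are hypothetical and are not moo's actual implementation; the sketch also assumes the rules contain no capture groups of their own.

```javascript
// Hypothetical helper: joins every rule into one alternation.
// The flags are passed once, so they apply to all rules alike.
function combineRules(rules, flags) {
  const names = Object.keys(rules)
  const parts = names.map((name) => '(' + rules[name].source + ')')
  return { names, regexp: new RegExp(parts.join('|'), flags) }
}

const { names, regexp } = combineRules({ number: /[0-9]+/, word: /[a-z]+/ }, 'y')

// Hypothetical helper: matches one token at the given offset.
function nextToken(input, offset) {
  regexp.lastIndex = offset
  const match = regexp.exec(input)
  if (match === null) return null
  // The first defined capture group identifies the rule that matched.
  const index = match.slice(1).findIndex((group) => group !== undefined)
  return { type: names[index], text: match[0] }
}

console.log(nextToken('abc123', 0)) // { type: 'word', text: 'abc' }
console.log(nextToken('abc123', 3)) // { type: 'number', text: '123' }
```

Only the `source` of each rule survives the combination, which is why a per-rule flag cannot be honoured individually.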

Out of interest, since Babel already supports compiling RegExps with the unicode flag, if you use Babel with Moo I imagine it would "just work"... care to try? :)

@modernserf
Author

> since Babel already supports compiling RegExps with the unicode flag, if you use Babel with Moo I imagine it would "just work"

It depends on what environment you're targeting with Babel. If you're targeting ES5, it works, since it generates a RegExp without the /u flag.

However, if you're targeting environments that support the /u flag (and may or may not support \p properties), it doesn't work.

And, of course, it doesn't work without Babel.

In my branch, I apply the /u flag to the big RegExp if any of the constituent RegExps also use the /u flag. It might be safer to have a more explicit opt-in for unicode, since it affects how every pattern is interpreted, but I'm not sure what the actual implications of that are.
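
One possible implication of the "any rule" approach, sketched under assumptions: the helper `unicodeFlagIfAny` is hypothetical and is not the code from the branch. Under /u only syntax characters and `/` may be backslash-escaped, so a rule written without the flag can stop compiling once the flag is applied to the combined pattern.

```javascript
// Hypothetical helper: returns 'u' when at least one rule uses the u flag.
function unicodeFlagIfAny(regexps) {
  return regexps.some((re) => re.unicode) ? 'u' : ''
}

const idStart = /\p{ID_Start}/u
const colon = /\:/ // valid without u: \: is an identity escape for ":"

const rules = [idStart, colon]
const flag = unicodeFlagIfAny(rules)
const source = rules.map((re) => '(?:' + re.source + ')').join('|')

let error = null
try {
  new RegExp(source, flag)
} catch (e) {
  // Under u, \: is not a permitted escape, so compilation fails.
  error = e
}

console.log(flag)                         // 'u'
console.log(error instanceof SyntaxError) // true
```

A rule that does still compile may instead change meaning quietly, which is the harder case to notice.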

If you think this is worth discussing further, I can submit my branch as a PR, and we can continue the discussion there.

@nathan
Collaborator

nathan commented Jan 12, 2019

I think it makes more sense to enable the u flag if every constituent regex has the u flag and to forbid mixing regexes with different flags. E.g., this would work:

moo.compile({
  id: /[$_\p{ID_Start}][$\p{ID_Continue}]*/u,
  plus: '+',
  ws: /\p{White_Space}+/u,
})

and this would work:

moo.compile({
  id: /[$_a-zA-Z][$\w]*/,
  plus: '+',
  ws: /\s+/,
})

But this would not:

moo.compile({
  id: /[$_\p{ID_Start}][$\p{ID_Continue}]*/u,
  plus: '+',
  mostOfBmp: /./,
})

That seems like a good thing, because adding the u flag to the mostOfBmp regex changes its meaning.
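
A small illustration of that change in meaning, assuming an engine that supports the u flag. Without the flag, `.` consumes one UTF-16 code unit; with it, `.` consumes one whole code point, so the two patterns disagree on characters outside the Basic Multilingual Plane.

```javascript
// U+1F600 lies outside the BMP and is stored as two UTF-16 code units.
const astral = '\u{1F600}'

console.log(astral.length)                // 2
console.log(/^.$/.test(astral))           // false: "." sees only half of the pair
console.log(/^.$/u.test(astral))          // true: "." sees the whole code point
console.log(astral.match(/./)[0].length)  // 1
console.log(astral.match(/./u)[0].length) // 2
```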

Importantly, a string converted to a regular expression does not change its meaning when the u flag is added, so this is a less objectionable feature than adding i if every regular expression has i—since that would also have an effect on tokens expressed as strings.
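
The all-or-nothing rule described above could be sketched roughly as follows. `resolveUnicodeFlag` and `escapeLiteral` are hypothetical helpers with an invented error message, not moo's API; string rules are ignored by the check because an escaped literal matches the same text with or without the flag.

```javascript
// Hypothetical helper: escapes only syntax characters and "/",
// which are valid escapes both with and without the u flag.
function escapeLiteral(text) {
  return text.replace(/[\\^$.*+?()[\]{}|\/]/g, '\\$&')
}

// Hypothetical helper: true if every RegExp rule has u, false if none do,
// and an error if the rules disagree. String rules are not counted.
function resolveUnicodeFlag(rules) {
  const regexps = Object.values(rules).filter((rule) => rule instanceof RegExp)
  const unicodeCount = regexps.filter((re) => re.unicode).length
  if (unicodeCount === 0) return false
  if (unicodeCount === regexps.length) return true
  throw new Error('cannot mix rules with and without the /u flag')
}

const allUnicode = { id: /[$_\p{ID_Start}][$\p{ID_Continue}]*/u, plus: '+', ws: /\p{White_Space}+/u }
const noUnicode = { id: /[$_a-zA-Z][$\w]*/, plus: '+', ws: /\s+/ }
const mixed = { id: /[$_\p{ID_Start}][$\p{ID_Continue}]*/u, plus: '+', mostOfBmp: /./ }

console.log(resolveUnicodeFlag(allUnicode)) // true
console.log(resolveUnicodeFlag(noUnicode))  // false

// The escaped string literal behaves the same under both flag settings.
const plus = escapeLiteral('+')
console.log(new RegExp('^' + plus + '$').test('+'))      // true
console.log(new RegExp('^' + plus + '$', 'u').test('+')) // true
```

Calling `resolveUnicodeFlag(mixed)` throws, which matches the third example in the comment above.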

@tjvr
Collaborator

tjvr commented Jan 12, 2019

I agree with Nathan; I was going to suggest the same thing. If you'd like to PR this, that would be great :)

