Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict identifier grammar #11508

Merged

Conversation

straight-shoota
Copy link
Member

This patch restricts the grammar for identifiers to the suggested class Default Identifiers from UAX #31 (revision 35).

Also adds character data for the Unicode general category Pc and a couple of nodoc methods for querying specific Unicode categories.

Resolves #11216

@straight-shoota straight-shoota added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler:parser topic:lang labels Nov 29, 2021
Copy link
Member

@beta-ziliani beta-ziliani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one little concern, the rest is 👌

src/compiler/crystal/syntax/lexer.cr Outdated Show resolved Hide resolved
@straight-shoota straight-shoota added this to the 1.3.0 milestone Dec 17, 2021
@straight-shoota straight-shoota merged commit 1653cf3 into crystal-lang:master Dec 18, 2021
@straight-shoota straight-shoota deleted the feature/identifier-grammar branch December 18, 2021 21:39
@Blacksmoke16
Copy link
Member

From discord, someone may have a found a regression:

:▲    # => Error: unexpected token: ":"
pp :▲ # => Error: unknown token: '▲'

I'm assuming this is what was meant by #11216, but wanted to at least report it to be sure.

@HertzDevil
Copy link
Contributor

HertzDevil commented Jan 3, 2022

Symbol literals should accept a larger grammar than identifiers, since they already allow things like :+ after all, and + is definitely not a valid identifier.

@straight-shoota
Copy link
Member Author

I suppose, for symbols we can probably use ident_part? even for the first character, because the : prefix is already a proper start designation.

@straight-shoota
Copy link
Member Author

straight-shoota commented Jan 3, 2022

However, U+25B2 () is actually in general category So (Other Symbol), so it wouldn't even be valid for ident_part?.

straight-shoota added a commit that referenced this pull request Jan 3, 2022
@straight-shoota straight-shoota removed this from the 1.3.0 milestone Jan 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler:parser topic:lang
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stricter grammar for Unicode identifier names
4 participants