Emulation for Java Character class in generated lexers #349

Open
BurtHarris opened this issue Dec 28, 2017 · 4 comments

@BurtHarris
Collaborator

It seems like otherwise-portable lexers in the ANTLR community frequently use something like this (from LexBasic.g4):

// covers all characters above 0xFF which are not a surrogate
// and UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
fragment JavaUnicodeChars
	: ~[\u0000-\u00FF\uD800-\uDBFF]		{Character.isJavaIdentifierPart(_input.LA(-1))}?
	|  [\uD800-\uDBFF] [\uDC00-\uDFFF]	{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
	;

Unfortunately this generates code that doesn't work in antlr4ts. There are three problems:

  1. References to _input need to be prefixed with this.
  2. The class Character doesn't exist in ECMAScript/TypeScript.
  3. The cast (char) doesn't work in ECMAScript/TypeScript.

Issue #1 can be fixed simply with a local variable _input in the code emitted for semantic predicates.
Issue #2 can be fixed by defining a Character class with isJavaIdentifierPart in the antlr4ts runtime.
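
A rough sketch of what those two fixes could look like, not taken from the antlr4ts sources: the `Character` class, the `LookaheadStream` interface, and `javaUnicodeCharsPredicate` are hypothetical names. The category list follows the Java documentation for `Character.isJavaIdentifierPart`, but the results depend on the Unicode version of the JavaScript engine, so treat it as an approximation.

```typescript
// Hypothetical emulation of the two java.lang.Character methods used by the
// LexBasic.g4 predicates. A sketch only, not part of the antlr4ts runtime.

// Letters, letter numbers, currency symbols, connector punctuation, digits,
// combining marks, format characters, and the ignorable control ranges.
// Built with the RegExp constructor so the \p{...} escapes are interpreted
// at run time (needs an engine with Unicode property escapes, ES2018+).
const JAVA_IDENTIFIER_PART = new RegExp(
  "^[\\p{L}\\p{Nl}\\p{Sc}\\p{Pc}\\p{Nd}\\p{Mn}\\p{Mc}\\p{Cf}" +
    "\\u0000-\\u0008\\u000E-\\u001B\\u007F-\\u009F]$",
  "u"
);

// Converts a code point to a string without relying on String.fromCodePoint.
function codePointToString(codePoint: number): string {
  if (codePoint <= 0xffff) {
    return String.fromCharCode(codePoint);
  }
  const offset = codePoint - 0x10000;
  return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
}

class Character {
  // Combines a UTF-16 surrogate pair into a single code point.
  static toCodePoint(high: number, low: number): number {
    return ((high - 0xd800) << 10) + (low - 0xdc00) + 0x10000;
  }

  // Approximates Java's Character.isJavaIdentifierPart(int).
  static isJavaIdentifierPart(codePoint: number): boolean {
    if (codePoint < 0 || codePoint > 0x10ffff) {
      return false; // out of range, e.g. a negative end-of-input marker
    }
    return JAVA_IDENTIFIER_PART.test(codePointToString(codePoint));
  }
}

// Hypothetical shape of the lexer's input: LA(offset) returns a number.
interface LookaheadStream {
  LA(offset: number): number;
}

// Hypothetical emitted predicate: the local alias lets the grammar's
// `_input.LA(-1)` text resolve without a `this.` prefix.
function javaUnicodeCharsPredicate(lexer: { _input: LookaheadStream }): boolean {
  const _input = lexer._input;
  return Character.isJavaIdentifierPart(_input.LA(-1));
}
```

With this in place, `Character.toCodePoint(0xD835, 0xDC9C)` evaluates to `0x1D49C`, and the first alternative of `JavaUnicodeChars` would compile unchanged; the second alternative still trips over the `(char)` casts.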

Thus a lowly cast is the remaining issue... Unfortunately, the problem isn't limited to the lack of a char type in ECMAScript; the syntax of casts in TypeScript is also different!
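
For illustration, here is how the cast differs between the two languages. The `_input` object below is a hypothetical stand-in for the lexer's input stream, assumed to return UTF-16 code units as plain numbers.

```typescript
// Hypothetical stand-in for the lexer input, positioned after "a\u03b2":
// LA(-1) is the last consumed code unit, LA(-2) the one before it.
const _input = {
  LA: (offset: number): number => "a\u03b2".charCodeAt(2 + offset),
};

// Java:        (char)_input.LA(-1)     not valid TypeScript
// TypeScript:  <number>_input.LA(-1)   angle-bracket assertion (not allowed in .tsx files)
const viaAs = _input.LA(-1) as number; // `as` assertion
const noCast = _input.LA(-1); // already typed as number, so no assertion is needed

// A TypeScript assertion is erased at compile time. Java's (char) cast also
// truncates to 16 bits, which would need an explicit mask to reproduce.
const truncated = _input.LA(-1) & 0xffff;
```

Since `LA` is already typed as returning `number` here, the cast carries no information in TypeScript; the difficulty is purely that the Java spelling does not parse.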

@BurtHarris
Collaborator Author

BurtHarris commented Dec 28, 2017

One possible solution for #3 might be to have the code generation tool apply some simple transforms to any code it encounters. A simple regex s/\(char\)/<number>/ might do a world of good.
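
A minimal sketch of that transform, assuming the tool has the action text available as a string; `rewriteCharCasts` is a hypothetical helper name, not an existing antlr4ts function.

```typescript
// Hypothetical helper: replaces every Java `(char)` cast in an action or
// predicate body with a TypeScript `<number>` type assertion.
function rewriteCharCasts(action: string): string {
  return action.replace(/\(char\)/g, "<number>");
}

// The surrogate-pair predicate from LexBasic.g4, as written for Java.
const javaPredicate =
  "Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))";

const tsPredicate = rewriteCharCasts(javaPredicate);
// tsPredicate is now:
// Character.isJavaIdentifierPart(Character.toCodePoint(<number>_input.LA(-2), <number>_input.LA(-1)))
```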

@BurtHarris BurtHarris self-assigned this Dec 28, 2017
@ChuckJonas

+1, as this issue threw me off when I first tried to use antlr4ts

@sharwell
Member

> One possible solution for #3 might be to have the code generation tool apply some simple transforms to any code it encounters. A simple regex s/\(char\)/<number>/ might do a world of good.

This should either be implemented as a target-language-agnostic DSL for grammar predicates/actions, or not be implemented at all. The proposed implementation introduces risk of breaking actions written in the correct target language, and may or may not work for any given action (a maintenance and usability nightmare).
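
To make the risk concrete, here is a hypothetical action that is already valid TypeScript and happens to use `char` as a variable name. The names `rewrite`, `tsAction`, and `isIdentifierPart` are invented for this example.

```typescript
// The naive rewrite under discussion, repeated so this sketch is self-contained.
const rewrite = (action: string): string => action.replace(/\(char\)/g, "<number>");

// A hypothetical action written directly in TypeScript: `char` is a local variable.
const tsAction = "const char = this._input.LA(-1); return isIdentifierPart(char);";

const rewritten = rewrite(tsAction);
// rewritten is now:
// const char = this._input.LA(-1); return isIdentifierPart<number>;
// The function call has been turned into something that no longer invokes anything.
```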

The current implementation doesn't support rewriting actions automatically into the target language, but at least the behavior is consistent across all the targets. 😄

@sharwell
Member

📝 See #350 (comment) for my take on the Character class.
