Allow unicode letters in identifiers #716
@michaelficarra Oh, nice! :) Seems like |
I believe that's what the sequence |
@michaelficarra Um… I’m not sure I understand… 😕 Let me rephrase what I meant. So, we have Unicode letters like “ц” and “ţ” that are outside But then we have Unicode letters like MATHEMATICAL ITALIC SMALL A which have a 5-digit-hex code and which we’d encode as Maybe there is another way, maybe we might just use the raw characters and paste them in the source code, I mean instead of a regex like BTW: even here, I got |
I have updated the patch: since we can’t encode 5-digit letters, I’ve decided not to use Unicode escape sequences at all, and just use raw characters. This way we can have all the letters. 👍 One tricky thing that I’ve found is that JS doesn’t allow specifying a RegExp range that starts or ends with a 5-digit letter: you can’t have this range with raw letters: [\u1d44e-\u1d467], you get a “SyntaxError: Invalid regular expression: …: Range out of order in character class”. This is why I had to “inline” those ranges. 😄 |
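A minimal sketch of the two behaviours described in this comment, assuming a regular expression built without the ES2015 `u` flag (which postdates this thread). The variable names are illustrative only and are not taken from the patch.

```javascript
// "\u" consumes exactly four hex digits, so a fifth digit is a separate character.
var fourDigit = "\u1d44e";          // U+1D44 followed by the letter "e"

// A raw 5-digit letter is stored as two UTF-16 code units (a surrogate pair).
var italicA = "\uD835\uDC4E";       // MATHEMATICAL ITALIC SMALL A, U+1D44E
var italicZ = "\uD835\uDC67";       // MATHEMATICAL ITALIC SMALL Z, U+1D467

// Without the "u" flag the class is read one code unit at a time, so it
// contains the range \uDC4E-\uD835, whose endpoints are in descending order.
var rawRangeError = null;
try {
  new RegExp("[" + italicA + "-" + italicZ + "]");
} catch (e) {
  rawRangeError = e;
}
```

This suggests why the escaped form cannot express the letter at all and why the raw range is rejected: in both cases the engine is working with 16-bit units rather than whole code points.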
Err, I'm sure there must be a better way than keeping a block of random unicode characters in the code. |
Good ideas are always welcome. 😉 |
@antonkovalyov, c’mon man let’s work this out: did you have something concrete in mind? Help me out a bit. Let’s get this done. |
perhaps this will help... https://github.com/bestiejs/punycode.js/blob/a6ea357aae89c90733dedd83cc4dac6982a69478/punycode.js#L128-147 (which i came across by reading http://mathiasbynens.be/notes/javascript-escapes) that code allows for something like this ucs2encode([119886]); // -> MATHEMATICAL ITALIC SMALL A a little massaging to make it fit this use case and i think it should be suitable for generating the string of chars you want to use in the regular expression. |
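As a rough illustration of what such an encoder does, here is a minimal sketch that handles one code point at a time. `encodeCodePoint` is a hypothetical name, not the punycode.js API (whose `ucs2encode` takes an array); the arithmetic is the standard UTF-16 surrogate-pair formula.

```javascript
// Hypothetical helper: turn one Unicode code point into a JavaScript string.
function encodeCodePoint(codePoint) {
  // Code points up to U+FFFF fit in a single UTF-16 code unit.
  if (codePoint <= 0xFFFF) {
    return String.fromCharCode(codePoint);
  }
  // Higher code points are split into a high and a low surrogate.
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);
  var low = 0xDC00 + (offset & 0x3FF);
  return String.fromCharCode(high, low);
}

var italicSmallA = encodeCodePoint(119886); // 119886 === 0x1D44E
```

The result is an ordinary two-unit string, so it can be concatenated into a pattern source like any other character.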
@neonstalwart Thank you for jumping into this. 😃 I feel dumb… it seems like I’m on a different wave-length than the rest of the world because I can’t get to the core of other people’s arguments (there were @michaelficarra and @antonkovalyov before) 😳. Let me at least try to understand what you’re proposing. 😳 So we collect all those codes, as an array of codes, including the 5-digit ones, and then we use |
@gurdiga that is almost the idea i was getting at. i was assuming that the characters you're interested in are represented by a continuous range of values - eg
// `encode` stands for a helper that turns one code point into a string
var current = 0x1d44e,
    upper = 0x1d467,
    range = '';
while (current <= upper) {
    range += encode(current++);
}
// then go on to build the regular expression with the range of chars
you might extract this into a function that takes a the key part i was linking to in that other code is the way you can create characters outside of the range that can be expressed by |
OK, so, in that collection of letters we have 173 ranges (60 of which involve 5-digit letters) and 1045 “singles”. Now, the choice we’d have with If I’m not missing anything, this sounds a bit of overkill compared to just putting the raw letters in the code, which, in the end are just… letters… aren’t they? 😄 |
@gurdiga: From http://mathiasbynens.be/notes/javascript-encoding#surrogate-pairs,
See http://mothereff.in/js-escapes#0%F0%9D%8C%86 for an example. |
That is good to know… I’m wondering if we can use this knowledge for our issue here… ❓ |
Seriously? edit: Quoting you:
Use surrogate Pairs. Problem solved. |
@michaelficarra Thank you for the hint. Yes, this helps encoding the 5-hex-digit Unicode letters. 👍 |
Updated the |
Extended test to include capitalized identifiers. |
...what happened to using unicode escapes and character ranges? |
@michaelficarra it seems to me that the benefit is too small for the effort: we’d get ~4x more JS code than with raw letters, part of which we’d have to loop over (the 5-digit ranges) to make them usable, which I’d expect to make the parser slower. And in the end, those are just letters, I mean what’s the point of escaping them? |
Not if you use ranges in character classes. That first link I sent you makes extensive use of ranges. Here's an excerpt:
That's not much longer than enumerating them. @antonkovalyov's comment above makes me think that it's desirable to avoid the characters themselves. |
I don't really care whether it's ~4x more JavaScript code or not because most people use JSHint locally. We will need to move character classes into a separate file, though. You can check out branch I don't like the idea of using actual characters because they might bring all kind of problems in various editors. Also looking at dozens of lines of this: I'm in favor of taking a Unicode table and overall approach from Traceur. |
They use a function to determine if a character is a Unicode letter or not:
function isUnicodeLetter(ch) {
    var cc = ch.charCodeAt(0);
    for (var i = 0; i < unicodeLetterTable.length; i++) {
        // the early exit assumes the table is sorted by range start
        if (cc < unicodeLetterTable[i][0])
            return false;
        if (cc <= unicodeLetterTable[i][1])
            return true;
    }
    return false;
}
and we use regular expressions to match tokens, so I’m not sure how to apply that to our context. And if we stick to regular expressions there is no way to include 5-digit-code letters other than inlining those ranges. If there are no other options I’ll go ahead this way. I got the idea of how to move them out for JS platforms that have a
if you mean this https://github.com/tolmasky/language/blob/53f75464902bae56712f79a30015a0b87336a17c/languages/JavaScript.language#L656-665 then it looks to me like pseudo-code, I mean they just show the codes, that’s it. So far I didn’t find a way to include 5-digit-code letter ranges in a regular expression other than inlining them. I’ve also tried splitting the 5-digit-code into surrogate pairs as they do it here: http://mothereff.in/js-escapes#0%F0%9D%8C%86, so for /[\u1D770-\u1D7C9]/ which doesn’t work, I ended up with /[\uD835\uDFAA-\uD835\uDFC9]/, which still errors:
Am I missing something? 😳 |
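One possible way around this, sketched under the assumption that both ends of the range share the same high surrogate: keep the high surrogate outside the character class and put only the low surrogates in the range. The helper names `toSurrogates`, `hex4` and `astralRangeSource` are hypothetical and do not come from the patch or from JSHint.

```javascript
// Split a supplementary code point into its UTF-16 surrogate halves.
function toSurrogates(codePoint) {
  var offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Format one code unit as a \uXXXX escape for use in a pattern source.
function hex4(unit) {
  return "\\u" + unit.toString(16).toUpperCase();
}

// Build a pattern source for first..last; only valid when both code points
// share one high surrogate, otherwise the range must be split first.
function astralRangeSource(first, last) {
  var a = toSurrogates(first);
  var b = toSurrogates(last);
  if (a[0] !== b[0]) {
    throw new Error("range spans more than one high surrogate");
  }
  return hex4(a[0]) + "[" + hex4(a[1]) + "-" + hex4(b[1]) + "]";
}

var source = astralRangeSource(0x1D7AA, 0x1D7C9);
var astralRange = new RegExp(source);
```

Here `source` is the text `\uD835[\uDFAA-\uDFC9]`: the class now holds a single ascending range of low surrogates, so the "range out of order" check no longer applies.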
@gurdiga You are right about Traceur, sorry. We should go with ranges then. I am actually working on an (experimental) branch that moves both lexer and regular expressions out of We do use |
Picked the hex codes for the Unicode 'Letter, Uppercase' and 'Letter, Lowercase' categories from fileformat.info.
@antonkovalyov I’ve exported the letter codes to |
@antonkovalyov I’m wondering if there is anything more that I can help with to get this in. ❓ Sorry if I’m being impatient, I’m just implementing server-side JS linting and wanted to take advantage of this tweak. 😃 |
Nope. Sorry for being slow, I was working on a lexer/tokenizer rewrite (now merged in) and put all patches on hold. I'll get to your patch within a few days. |
@gurdiga So a couple of weeks ago I rewrote our lexer to make it more robust and get rid of obnoxiously long regular expressions. That also meant that we were able to use Traceur's approach to handling unicode symbols with a table. So I went ahead and implemented that. But thanks for your pull request, it definitely got the ball rolling! |
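For readers unfamiliar with the table approach, here is a minimal sketch of the general idea. It is not JSHint's or Traceur's actual code: the table is a small illustrative subset, `letterRanges` and `isLetterAt` are hypothetical names, and it uses ES2015's `codePointAt` and a binary search so that 5-digit code points are handled too.

```javascript
// Illustrative subset only: sorted, non-overlapping [first, last] code point ranges.
var letterRanges = [
  [0x0041, 0x005A],   // A-Z
  [0x0061, 0x007A],   // a-z
  [0x00C0, 0x00D6],   // Latin-1 letters
  [0x00D8, 0x00F6],   // Latin-1 letters, continued
  [0x03B1, 0x03C9],   // Greek small alpha..omega
  [0x0410, 0x044F],   // Cyrillic capital A..small ya
  [0x1D44E, 0x1D454], // mathematical italic small a..g
  [0x1D456, 0x1D467]  // mathematical italic small i..z
];

// Return true if the code point starting at `index` falls inside any range.
function isLetterAt(text, index) {
  var cp = text.codePointAt(index);
  if (cp === undefined) {
    return false; // index is past the end of the string
  }
  var lo = 0;
  var hi = letterRanges.length - 1;
  while (lo <= hi) {
    var mid = (lo + hi) >> 1;
    if (cp < letterRanges[mid][0]) {
      hi = mid - 1;
    } else if (cp > letterRanges[mid][1]) {
      lo = mid + 1;
    } else {
      return true;
    }
  }
  return false;
}
```

Because the lookup works on numbers rather than on a character class, the surrogate-pair problem from the regular expression approach does not arise.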
This commit adds support for unicode identifiers such as Антон or π. It also supports escaped unicode sequences since stuff like \u1d44 is a valid identifier. Implementation approach and data was mostly borrowed from Google's Traceur compiler and rewritten a little bit to match our lexer's structure. Closes jshintGH-716. Closes jshintGH-301.
Hey Anton!
This patch is re issue #301.
I’m writing a law-specialized app and I’ve chosen to write them in Romanian because it would have been more difficult to translate the domain terms. :o) This is the reason why I need jshint to validate my code with Unicode letters in identifiers.
So here is what I did:
Picked the hex codes for the Unicode 'Letter, Uppercase' and 'Letter, Lowercase' categories from fileformat.info. However, because I could not find a way to encode 5-digit Unicode escape sequences (ECMA allows only 4 digits), I had to leave out 1016 out of 3192 Unicode letters.
Please take a look and let me know what you think.