Unicode code point escapes #4248

Closed
ukoloff opened this issue Apr 11, 2016 · 13 comments · Fixed by #4498

Comments

@ukoloff

ukoloff commented Apr 11, 2016

The new ECMAScript 6 Unicode code point escapes (e.g. "\u{1F4A9}") are rejected as invalid by CoffeeScript v1.10, but they work in Node.

They should be passed from CS to JS as is, I believe.

@gfung

gfung commented Jun 28, 2016

A temporary fix is to edit lib/lexer.js and remove the |(u(?![\da-fA-F]{4}).{0,4}) alternative (the last part of the pattern) from the INVALID_ESCAPE variable.

FROM:
INVALID_ESCAPE = /((?:^|[^\\])(?:\\\\)*)\\(?:(0[0-7]|[1-7])|(x(?![\da-fA-F]{2}).{0,2})|(u(?![\da-fA-F]{4}).{0,4}))/;
TO:
INVALID_ESCAPE = /((?:^|[^\\])(?:\\\\)*)\\(?:(0[0-7]|[1-7])|(x(?![\da-fA-F]{2}).{0,2}))/;
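
To see why the original pattern rejects code point escapes, here is a minimal JavaScript sketch that applies both patterns to the raw source text of a string literal. The names `ORIGINAL`, `WORKAROUND` and `isFlagged` are illustrative assumptions and do not come from the CoffeeScript code base.

```javascript
// The pattern quoted from lib/lexer.js in this thread.
const ORIGINAL = /((?:^|[^\\])(?:\\\\)*)\\(?:(0[0-7]|[1-7])|(x(?![\da-fA-F]{2}).{0,2})|(u(?![\da-fA-F]{4}).{0,4}))/;
// The workaround: the same pattern with the \u alternative dropped.
const WORKAROUND = /((?:^|[^\\])(?:\\\\)*)\\(?:(0[0-7]|[1-7])|(x(?![\da-fA-F]{2}).{0,2}))/;

// Hypothetical helper: true when the pattern reports an invalid escape in the raw text.
const isFlagged = (pattern, rawText) => pattern.test(rawText);

// Doubled backslashes so each string holds the characters a lexer would see.
const codePoint = '\\u{1F4A9}'; // ES6 code point escape
const classic = '\\u2603';      // four-digit escape
const truncated = '\\u26';      // genuinely malformed

console.log(isFlagged(ORIGINAL, codePoint));   // true: "{1F4" is not four hex digits
console.log(isFlagged(ORIGINAL, classic));     // false
console.log(isFlagged(ORIGINAL, truncated));   // true
console.log(isFlagged(WORKAROUND, codePoint)); // false
console.log(isFlagged(WORKAROUND, truncated)); // false: malformed \u escapes now slip through
```

The last line shows the cost of the workaround: with the `\u` alternative gone, truncated escapes are no longer caught either.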

@lydell
Collaborator

lydell commented Jun 29, 2016

A better thing to do is to fix that regex instead. Here's my suggestion:

INVALID_ESCAPE      = ///
  ( (?:^|[^\\]) (?:\\\\)* )        # make sure the escape isn’t escaped
  \\ (
     ?: (0[0-7]|[1-7])             # octal escape
      | (x(?![\da-fA-F]{2}).{0,2}) # hex escape
      | (u\{(?![\da-fA-F]{1,6}\}).{0,7}) # unicode code point escape
      | (u(?!\{|[\da-fA-F]{4}).{0,4}) # unicode escape
  )
///
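
As a rough check of the suggestion above, the sketch below collapses the heregex into a plain JavaScript regex literal (whitespace and comments removed) and runs it over a few raw strings. The sample list and the `samples` name are assumptions made for illustration.

```javascript
// The suggested pattern, written as a single-line JavaScript regex.
const INVALID_ESCAPE = /((?:^|[^\\])(?:\\\\)*)\\(?:(0[0-7]|[1-7])|(x(?![\da-fA-F]{2}).{0,2})|(u\{(?![\da-fA-F]{1,6}\}).{0,7})|(u(?!\{|[\da-fA-F]{4}).{0,4}))/;

// Raw source text (backslashes doubled so they survive this file's own parsing).
const samples = [
  '\\u{1F4A9}',   // valid code point escape
  '\\u2603',      // valid four-digit escape
  '\\u{}',        // no digits
  '\\u{1F4A9',    // missing closing brace
  '\\u{1234567}', // seven digits
  '\\u26',        // truncated four-digit escape
  '\\\\u{}',      // escaped backslash followed by plain text "u{}"
  '\\u{110000}',  // well-formed shape, but above U+10FFFF
];

for (const raw of samples) {
  console.log(raw, INVALID_ESCAPE.test(raw) ? 'flagged' : 'accepted');
}
```

Under these assumptions the first two samples and the escaped-backslash sample are accepted, and the four malformed ones are flagged. The last sample is also accepted: the pattern checks only the shape of the escape, so an out-of-range value such as `\u{110000}` would still be rejected later by the JavaScript engine.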

@loveencounterflow

Celebrating the first anniversary of this (rather trivial, IMHO) bug.

Come to think of it, it is not immediately clear to me why CS has to parse the escape sequence at all. Couldn't the compiler just pass everything through to JS and let the runtime worry about it? The same goes for RegExp flags, where /x/y surfaced another long-standing bug in CS, if memory serves.

@GeoffreyBooth
Collaborator

@helixbass this looks very similar to #4489. Care to take a look?

@loveencounterflow

Well, yes, in a way. Thing is that, at least for specific and non-trivial sub-syntaxes such as (single-slashed) RegExps, CS should, I think, just keep out and only do the bare minimum (e.g. find the closing slash and hop over the trailing flag letters); it can then pass the entire construct through to the JS source it generates.

Likewise, string literals are instances of a (comparatively minimal) embedded syntax (this point was made very clear by Larry Wall speaking about Perl 5 and 6 a few years ago, and I think his point of view is a justified and valuable one in this case). Arguably, CS should do nothing to those parts of the source beyond cherry-picking whatever enhancements it implements (such as string interpolation). Because, hey, it's "just JavaScript", right?

This does have the drawback that a given source may compile without error, and then fail with a nasty SyntaxError only upon getting loaded by whatever JS runtime it is to be consumed by. On the bright side, it also means one less point of failure for the parser; what's more, users can then just use new features they know are supported by their targeted engines and do not have to wait for CS to catch up.
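
The late failure described above can be sketched in plain JavaScript. The helper `loadsCleanly` is hypothetical; it uses `new Function` only to make the engine parse a snippet of generated code without running it, which stands in for "the runtime loading the compiled file".

```javascript
// Hypothetical helper: does the engine accept this generated JavaScript?
function loadsCleanly(jsSource) {
  try {
    new Function(jsSource); // parses the source; the body is never called
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

// Escapes passed through untouched, as the proposal suggests.
console.log(loadsCleanly('return "\\u{1F4A9}";'));  // true on ES2015+ engines
console.log(loadsCleanly('return "\\u{110000}";')); // false: above U+10FFFF
console.log(loadsCleanly('return "\\u26";'));       // false: truncated escape
```

In a pass-through design the last two cases would compile silently and only surface as a SyntaxError at this point.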

The occasional late syntax error that may creep into the process might make some devs averse to this proposal; OTOH, we already have that feature / problem in the guise of backtick-quoted JS literals where you can put anything at all and eschew any CS checks.

@loveencounterflow

loveencounterflow commented Apr 12, 2017

Update: Come to think of it, while it is noble and notable that CS supports arbitrary CS within variable interpolations (e.g.

coffee> "hello #{ "foo" } world"
'hello foo world'
coffee> "hello #{ "foo #{ 42 ** 3 } wat" } world"
'hello foo 74088 wat world'
coffee> "hello #{ "foo #{ 42 ** "-#{ 1 + 1 + 1 }" } wat" } world"
'hello foo 0.000013497462477054314 wat world'

all work) it is also questionable whether this amount of syntactical recursivity is ever needed in ecologically responsible and cat-friendly source. I have much more of an urgent need to nest my block comments (which CS does not support) than to deeply nest my string interpolations. Granted, the line between 'sensible, if rare' and 'mad hatter meets march hare' is hard to draw.

@jashkenas
Owner

I have much more of an urgent need to nest my block comments (which CS does not support) than to deeply nest my string interpolations.

Oh? Do tell.

helixbass added a commit to helixbass/copheescript that referenced this issue Apr 13, 2017
@helixbass
Collaborator

@GeoffreyBooth I think @lydell has the right idea above, submitted a pull request based on it

@loveencounterflow

loveencounterflow commented Apr 13, 2017

@jashkenas it is often that you want to temporarily comment out entire swathes of source code, to mark it for imminent removal or just make sure it doesn't interfere with whatever you're experimenting with. One way to do that is to mark the affected lines and hit a shortcut key to make all of them line comments; another one is to put them inside block comments.

The first method always works, but it has the disadvantage of changing a lot of single lines, something you have to undo later; you also want to have a suitable editor to do that.

The second method is going to fail in case there already are block comments in the portions to be hidden. In CS, there seems to be no easy fix for that, since both ends of block comments use the same markup. In other languages / syntaxes, block comments also seem to be regularly non-nesting only. In JS, nested /*.../*...*/...*/ comments are misunderstood as malformed regexen (NodeJS at least). In HTML with its insane comment definition taken over from SGML, this is especially annoying b/c HTML has no line comments, so there's often no way to just disable part of a page other than to delete those portions.

Of course, one might argue that if block comments could be nested in a given language—they would still only work if the out-commented parts did not contain stray end-of-comment marks, so that reduces the utility of nested block comments. OTOH, MIME boundaries in emails, Perl here docs and PostgreSQL dollar-quoted string constants are constructs that do allow you to quote / comment anything regardless of content provided you can come up with a proper unique string of characters, which you always can.

Sorry for the lengthy text. Just wanting to say that CoffeeScript's syntax is awesome because it allows me to have matryoshka code with nested interpolations inside nested interpolations. It is also too complex at this particular point (although ideally it could be a by-product of a general recursive rule, and thus not burden the compiler) unless it can be demonstrated that nested interpolations have some useful property (which to me they have not; I even replace simple interpolations with explicit concatenation where I think it is clearer).

@loveencounterflow

Update: Is this useful or wat?

"hello #{ "foo #{ 42 ** "-#{ 1 + 1 + 1 }" } wat" } world"

"""hello #{ "foo #{ 42 ** "-#{
sum = 0
for x in [ 1 .. 10 ]
  console.log "oops ##{x}"
  sum += x
x
}" } wat" } world"""

var sum, x;

"hello " + ("foo " + (Math.pow(42, "-" + (1 + 1 + 1))) + " wat") + " world";

"hello " + ("foo " + (Math.pow(42, "-" + ((function() {
  var i;
  sum = 0;
  for (x = i = 1; i <= 10; x = ++i) {
    console.log("oops #" + x);
    sum += x;
  }
  return x;
})()))) + " wat") + " world";

@GeoffreyBooth
Collaborator

GeoffreyBooth commented Apr 13, 2017

@loveencounterflow One option is to use triple backticks and /* … */. Assuming you don’t have any /* … */ style comments or triple backticks within the block you’re trying to comment out, you could comment out huge swaths this way, including CoffeeScript block comments:

foo()
``` /*
###
  A CoffeeScript block comment
###
bar()
*/ ```
baz()

@jashkenas
Owner

@loveencounterflow

Even with different delimiters for the start and end of block comments — your approach #2 still fails, as you note, because the end delimiter of the interior block comment closes out the outer comment prematurely.

Highlighting the lines and using the hotkey to line-comment all of them is the right way to do this.

@loveencounterflow

@GeoffreyBooth neat, didn't think about that.

@jashkenas you're probably right and this is how I do it all the time.

GeoffreyBooth pushed a commit that referenced this issue Apr 22, 2017
* Fix #4248: Unicode code point escapes

* rewrite unicode code point escapes as unicode escapes

* smarter defaults

* and resimplify

* correct surrogate pairs

* fixes from code review

* handle adjacent code point escapes

* smarter regex

* fix from code review

* refactor toJS() to shared test helper
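
The commit message mentions rewriting code point escapes as plain Unicode escapes, with surrogate pairs and adjacent escapes handled. The following JavaScript sketch shows one plausible way to do that; it is not the code from the pull request, and the names `toUnicodeEscape` and `rewriteCodePointEscapes` are hypothetical.

```javascript
// Hypothetical helper: format one code point as \uXXXX, or as a surrogate pair.
function toUnicodeEscape(codePoint) {
  if (codePoint > 0x10FFFF) {
    throw new RangeError('code point out of range: ' + codePoint.toString(16));
  }
  const hex4 = (n) => '\\u' + n.toString(16).toUpperCase().padStart(4, '0');
  if (codePoint <= 0xFFFF) return hex4(codePoint);
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return hex4(high) + hex4(low);
}

// Hypothetical helper: rewrite every \u{...} in the raw text of a string literal.
// The pattern consumes each escape sequence as a unit, so an escaped backslash
// followed by "u{...}" is left alone and adjacent escapes are all matched.
function rewriteCodePointEscapes(rawText) {
  return rawText.replace(/\\(?:u\{([\da-fA-F]+)\}|[\s\S])/g, (match, hex) =>
    hex === undefined ? match : toUnicodeEscape(parseInt(hex, 16)));
}

console.log(rewriteCodePointEscapes('\\u{1F4A9}'));     // \uD83D\uDCA9
console.log(rewriteCodePointEscapes('\\u{41}\\u{42}')); // \u0041\u0042
console.log(rewriteCodePointEscapes('\\\\u{41}'));      // \\u{41} (unchanged)
```

Emitting four-digit escapes keeps the output loadable by engines that predate ES2015 code point escapes, at the price of the compiler having to understand the escape instead of passing it through.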