Rewrite parser for flexibility, add much better support for escapes #4

Open
wants to merge 4 commits into
base: master

dead-claudia

Fixes #1

I rewrote the parser in LiveScript to make it easier to create a proper parser for character escapes. It would've taken probably twice as many lines and 5 times the work to try to fit it into PEG.js' EBNF grammar.

In doing so, I also enabled escapes to be used in atoms. Although that might not be immediately useful for Eslisp outside of regular expressions, it may be for other consumers.

I have incorporated the changes in 3b4d707 (fixing #2) as well. Returning a list simplified the entry point, but the shebang required some inelegant work.

| 'F', 'f' => 15
| _ => fail "Expected a hex digit, found #that", 'hex digit', that

parseUnicodeEscape = ->
Author

@anko

The small amount of bit twiddling plus all the logic here is basically a necessity. It is what converts the string "ABCD" into the hex number 0xabcd. And in order to stringify astral characters (in the Unicode range U+10000 - U+10FFFF), I had to convert each one into two character codes using the first formula here because of how JavaScript handles them internally (JS strings are sequences of 16-bit UCS-2/UTF-16 code units, but parts of the spec require UTF-16 handling, and the ES6 \u{N} escape yields a Unicode code point, which may take 2 code units to represent).
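For readers who haven't met surrogate pairs before, the astral-character conversion described above can be sketched roughly as follows. This is an illustration in plain JavaScript, not the PR's LiveScript code, and the name toSurrogatePair is a hypothetical helper chosen for the example.

```javascript
// Hypothetical helper (not from the PR): split an astral code point
// (U+10000 to U+10FFFF) into the two UTF-16 code units JavaScript stores.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not an astral code point: " + codePoint);
  }
  const offset = codePoint - 0x10000;     // leaves 20 significant bits
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return String.fromCharCode(high, low);
}

// U+1F600 becomes the code units 0xD83D and 0xDE00.
console.log(toSurrogatePair(0x1F600) === "\u{1F600}"); // prints true
```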

Probably closer to heavy wizardry than magic. And trust me when I say that Acorn is worse. Esprima isn't much better WRT this, even though it uses plain multiplication. I'll push a cosmetic patch to try to better document this whole thing.

Here's a description of the algorithm for parsing Unicode escapes in prose:

GetNextHex()

When this is called, do the following:

  1. Let next be the current character. Advance to the next character.
  2. If next is in the range 0 to 9, return the numeric equivalent of that character.
  3. If next is A or a, return 10.
  4. If next is B or b, return 11.
  5. If next is C or c, return 12.
  6. If next is D or d, return 13.
  7. If next is E or e, return 14.
  8. If next is F or f, return 15.
  9. Otherwise, throw an error, as this isn't a hex digit.
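A minimal JavaScript sketch of the GetNextHex() steps above, for illustration only. The cursor object (a source string plus an index) is an assumption made for this example; the PR's actual parser state is not shown in this thread.

```javascript
// Hypothetical cursor over the source text (an assumption for this sketch).
function makeCursor(source) {
  return { source, index: 0 };
}

// Read one character, advance, and return its value as a hex digit (0 to 15).
function getNextHex(cursor) {
  const next = cursor.source.charAt(cursor.index); // "" when past the end
  cursor.index += 1;
  if (next >= "0" && next <= "9") return next.charCodeAt(0) - 48;
  switch (next) {
    case "A": case "a": return 10;
    case "B": case "b": return 11;
    case "C": case "c": return 12;
    case "D": case "d": return 13;
    case "E": case "e": return 14;
    case "F": case "f": return 15;
    default:
      throw new SyntaxError('Expected a hex digit, found "' + next + '"');
  }
}

// Folding four digits, scaling by 16 each time, turns "ABCD" into 0xabcd.
const demoCursor = makeCursor("ABCD");
let demoCode = 0;
for (let i = 0; i < 4; i++) demoCode = demoCode * 16 + getNextHex(demoCursor);
console.log(demoCode.toString(16)); // prints "abcd"
```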

ParseUnicodeEscape()

When this is called, do the following:

  1. Assert: we've already parsed \u.
  2. If the current character is {, then do the following:
    1. Advance to the next character.
    2. Let code be GetNextHex().
    3. Repeat until either the current character is } or this loop has run 5 times:
      1. Multiply code by 16 (that is, shift it left by 4 bits).
      2. Let hex be GetNextHex().
      3. Add hex to code.
    4. Check that the current character is }. If not, throw an error.
    5. Advance to the next character.
  3. Else, do the following:
    1. Let hex1 be GetNextHex() * 16³.
    2. Let hex2 be GetNextHex() * 16².
    3. Let hex3 be GetNextHex() * 16.
    4. Let hex4 be GetNextHex().
    5. Let code be the sum of hex1, hex2, hex3, and hex4.
  4. If code is greater than the hex value 0x10FFFF, then throw an error, as this is too large of a code point to be represented in UTF-16.
  5. If code is greater than the hex value 0xFFFF, then
    1. Let character be the Unicode code point represented by code
    2. Convert character into a surrogate pair as defined in the Unicode standard.
    3. Return the concatenation of each surrogate resulting from the above step.
  6. Else, return the Unicode code point represented by code
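The ParseUnicodeEscape() steps above could be sketched in JavaScript along these lines. It is a self-contained illustration rather than the PR's LiveScript implementation: the (source, start) signature and the { text, end } return shape are assumptions made for the example, and each hex digit scales the running value by 16.

```javascript
// Sketch only. `start` is the index just after the already-parsed "\u".
// Returns the decoded text and the index just past the escape (assumed shape).
function parseUnicodeEscape(source, start) {
  let index = start;
  const nextHex = () => {
    const ch = source.charAt(index);
    if (!/^[0-9A-Fa-f]$/.test(ch)) {
      throw new SyntaxError("Expected a hex digit at index " + index);
    }
    index += 1;
    return parseInt(ch, 16);
  };

  let code;
  if (source.charAt(index) === "{") {
    index += 1;
    code = nextHex();
    // Up to 5 more digits (6 in total), stopping early at "}".
    for (let i = 0; i < 5 && source.charAt(index) !== "}"; i++) {
      code = code * 16 + nextHex();
    }
    if (source.charAt(index) !== "}") {
      throw new SyntaxError('Expected "}" at index ' + index);
    }
    index += 1;
  } else {
    // Classic form: exactly four hex digits.
    code = nextHex() * 4096 + nextHex() * 256 + nextHex() * 16 + nextHex();
  }

  if (code > 0x10FFFF) {
    throw new SyntaxError("Code point out of range: 0x" + code.toString(16));
  }

  let text;
  if (code > 0xFFFF) {
    // Astral code point: emit a surrogate pair.
    const offset = code - 0x10000;
    text = String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
  } else {
    text = String.fromCharCode(code);
  }
  return { text, end: index };
}

console.log(parseUnicodeEscape("0041", 0).text); // prints "A"
```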

(To be honest, the only way I can document this more thoroughly than the above would be to write an entire essay on it. It requires some low-level Unicode knowledge as well as some CS background.)

Help with code readability and clarity. Point-free style is kinda
pointless... ;-)
@dead-claudia
Author

@anko Is this better? I refactored it to be much clearer and less tersely functional. It also has quite a few more comments.

@dead-claudia
Author

@anko

It's probably also faster, since I'm not creating nearly as many functions. The test suite runs almost instantly on my underpowered computer, and that's checking over 1000 small unit tests (the majority testing that the escapes work for strings and atoms).

And although it's an implementation detail, you could, in theory, escape whitespace characters (and Windows CRLF) in atoms by using foo\ bar, which would parse out to {type: "atom", value: "foo bar"}. If you'd prefer, I can make parseEscape and readStringChar take an argument to prevent this.
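To make the escaped-whitespace behaviour concrete, here is a deliberately simplified JavaScript sketch. It is not the PR's parseEscape or readStringChar: the function readAtom, its allowEscapedWhitespace argument, and the { node, end } return shape are all hypothetical, and the sketch takes every escaped character literally instead of handling \n, \u and friends.

```javascript
// Hypothetical atom reader (not the PR's code). A backslash makes the next
// character, or a whole CRLF pair, part of the atom instead of ending it.
function readAtom(source, start, allowEscapedWhitespace = true) {
  let value = "";
  let index = start;
  while (index < source.length) {
    const ch = source[index];
    // Unescaped whitespace or a paren ends the atom.
    if (/\s/.test(ch) || ch === "(" || ch === ")") break;
    if (ch === "\\") {
      if (index + 1 >= source.length) {
        throw new SyntaxError("Unfinished escape at end of input");
      }
      // Treat an escaped Windows line ending as one unit.
      const escaped = source.startsWith("\r\n", index + 1)
        ? "\r\n"
        : source[index + 1];
      if (!allowEscapedWhitespace && /^\s+$/.test(escaped)) {
        throw new SyntaxError("Escaped whitespace is not allowed in atoms");
      }
      value += escaped;
      index += 1 + escaped.length;
    } else {
      value += ch;
      index += 1;
    }
  }
  return { node: { type: "atom", value }, end: index };
}

// The source text foo\ bar reads as a single atom.
console.log(JSON.stringify(readAtom("foo\\ bar baz", 0).node));
// prints {"type":"atom","value":"foo bar"}
```

Passing false as the third argument shows what the opt-out might look like: the same input is then rejected with a SyntaxError.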

@dead-claudia
Author

As for your suggestion in a private message about maybe implementing reader macros: this format will make that much easier. It would be a matter of changing this block and using a trie as a read table, so that I know how far to backtrack and can easily call functions. It would be nearly trivial.
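One way the trie-as-read-table idea might look, purely as a sketch in JavaScript. Nothing here exists in the PR: makeReadTable, addMacro, matchMacro and the macro names are hypothetical. The point is that a longest-prefix lookup reports exactly how many characters were consumed, so the parser knows where to resume (or how far to back up) before calling the handler.

```javascript
// Hypothetical read table stored as a trie keyed by single characters.
function makeReadTable() {
  return { children: new Map(), handler: null };
}

// Register a handler function under a macro prefix such as "'" or ",@".
function addMacro(table, prefix, handler) {
  let node = table;
  for (let i = 0; i < prefix.length; i++) {
    const ch = prefix[i];
    if (!node.children.has(ch)) {
      node.children.set(ch, { children: new Map(), handler: null });
    }
    node = node.children.get(ch);
  }
  node.handler = handler;
}

// Find the longest registered prefix starting at `start`.
// Returns { handler, end } or null; `end` is where parsing should resume.
function matchMacro(table, source, start) {
  let node = table;
  let best = null;
  for (let index = start; index < source.length; index++) {
    node = node.children.get(source[index]);
    if (!node) break;
    if (node.handler) best = { handler: node.handler, end: index + 1 };
  }
  return best;
}

// Example table with illustrative macro names.
const demoTable = makeReadTable();
addMacro(demoTable, "'", (form) => ["quote", form]);
addMacro(demoTable, ",", (form) => ["unquote", form]);
addMacro(demoTable, ",@", (form) => ["unquote-splicing", form]);

console.log(matchMacro(demoTable, ",@xs", 0).end); // prints 2
```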

Successfully merging this pull request may close these issues.

Strings should have ASCII/Unicode/etc. escapes
2 participants