Rewrite parser for flexibility, add much better support for escapes #4
| 'F', 'f' => 15
| _ => fail "Expected a hex digit, found #that", 'hex digit', that

parseUnicodeEscape = ->
The small amount of bit twiddling and the rest of the logic here are basically a necessity. This is what converts the string "ABCD" into the hex number 0xabcd. To stringify astral characters (in the Unicode range U+10000 - U+10FFFF), I had to convert the code point into two character codes using the first formula here, because of how JavaScript handles them internally (JS strings are made of 16-bit UCS-2/UTF-16 code units, but parts of the spec require full UTF-16 handling, and the ES6 \u{N} escape denotes a whole Unicode code point, which can take two 16-bit code units).
Probably closer to heavy wizardry than magic. And trust me when I say that Acorn is worse. Esprima isn't much better WRT this, even though it uses plain multiplication. I'll push a cosmetic patch to try to better document this whole thing.
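The surrogate-pair conversion described in this comment can be sketched in TypeScript roughly as follows. This is an illustration only, not the code from the patch: `toSurrogatePair` and `astralToString` are hypothetical names, and the arithmetic is the standard UTF-16 encoding of a code point above U+FFFF.

```typescript
// Hypothetical helper (not from the patch): split an astral code point
// (U+10000..U+10FFFF) into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint: number): [number, number] {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not an astral code point: " + codePoint.toString(16));
  }
  const offset = codePoint - 0x10000;     // leaves at most 20 significant bits
  const high = 0xd800 + (offset >> 10);   // top 10 bits go in the high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits go in the low surrogate
  return [high, low];
}

// Build the two-code-unit JavaScript string for an astral code point.
function astralToString(codePoint: number): string {
  const pair = toSurrogatePair(codePoint);
  return String.fromCharCode(pair[0], pair[1]);
}
```

For example, U+1F600 splits into the code units 0xD83D and 0xDE00, so the resulting string has length 2.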
Here's a description of the algorithm for parsing Unicode escapes in prose:
GetNextHex()
When this is called, do the following:
- Let next be the current character. Advance to the next character.
- If next is in the range 0 to 9, return the numeric equivalent of that character.
- If next is A or a, return 10.
- If next is B or b, return 11.
- If next is C or c, return 12.
- If next is D or d, return 13.
- If next is E or e, return 14.
- If next is F or f, return 15.
- Otherwise, throw an error, as this isn't a hex digit.
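A minimal TypeScript sketch of the GetNextHex() steps above, assuming a simple cursor object over the input. The `Cursor` shape and the `parseHexString` helper are assumptions made for illustration; the actual parser is written in LiveScript and is structured differently.

```typescript
// Hypothetical cursor over the source text (an assumption for this sketch).
interface Cursor {
  text: string;
  pos: number;
}

// Read one character, advance, and return its hex value (0..15).
function getNextHex(cur: Cursor): number {
  const next = cur.text.charAt(cur.pos); // "" when past the end of input
  cur.pos += 1;
  const c = next.charCodeAt(0);          // NaN for "", which fails every range check
  if (c >= 0x30 && c <= 0x39) return c - 0x30;      // '0'..'9'
  if (c >= 0x41 && c <= 0x46) return c - 0x41 + 10; // 'A'..'F'
  if (c >= 0x61 && c <= 0x66) return c - 0x61 + 10; // 'a'..'f'
  throw new SyntaxError("Expected a hex digit, found " + JSON.stringify(next));
}

// Fold a whole string of hex digits into a number, one digit at a time.
function parseHexString(text: string): number {
  const cur: Cursor = { text: text, pos: 0 };
  let value = 0;
  while (cur.pos < cur.text.length) {
    value = value * 16 + getNextHex(cur);
  }
  return value;
}
```

Folding the digits of "ABCD" this way gives 0xabcd, which is the conversion the comment above refers to.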
ParseUnicodeEscape()
When this is called, do the following:
- Assert: we've already parsed \u.
- If the current character is {, then do the following:
  - Advance to the next character.
  - Let code be GetNextHex().
  - Repeat until either the current character is } or this loop has run 5 times:
    - Multiply code by 16 (that is, shift it left by 4 bits).
    - Let hex be GetNextHex().
    - Add hex to code.
  - Check that the current character is }. If not, throw an error.
  - Advance to the next character.
- Else, do the following:
  - Let hex1 be GetNextHex() * 16^3.
  - Let hex2 be GetNextHex() * 16^2.
  - Let hex3 be GetNextHex() * 16.
  - Let hex4 be GetNextHex().
  - Let code be the sum of hex1, hex2, hex3, and hex4.
- If code is greater than the hex value 0x10FFFF, then throw an error, as this is too large a code point to be represented in UTF-16.
- If code is greater than the hex value 0xFFFF, then
  - Let character be the Unicode code point represented by code.
  - Convert character into a surrogate pair as defined in the Unicode standard.
  - Return the concatenation of each surrogate resulting from the above step.
- Else, return the character for the Unicode code point represented by code.
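The ParseUnicodeEscape() steps above could look roughly like this in TypeScript. It is a sketch under assumptions rather than the LiveScript code in this PR: the `Cursor` shape, the `HEX_DIGITS` lookup, and the error messages are all invented for illustration, and each digit is weighted by 16 as it is folded in.

```typescript
// Hypothetical cursor over the source text (an assumption for this sketch).
interface Cursor {
  text: string;
  pos: number;
}

const HEX_DIGITS = "0123456789abcdef";

// Read one character, advance, and return its hex value (0..15).
function getNextHex(cur: Cursor): number {
  const next = cur.text.charAt(cur.pos);
  cur.pos += 1;
  // Guard the empty string: "abc".indexOf("") is 0, not -1.
  const value = next === "" ? -1 : HEX_DIGITS.indexOf(next.toLowerCase());
  if (value < 0) {
    throw new SyntaxError("Expected a hex digit, found " + JSON.stringify(next));
  }
  return value;
}

// Assumes the cursor sits just past the "\u" that introduced the escape.
function parseUnicodeEscape(cur: Cursor): string {
  let code: number;
  if (cur.text.charAt(cur.pos) === "{") {
    cur.pos += 1;
    code = getNextHex(cur); // at least one digit is required
    // Up to 5 more digits (6 total), stopping early at the closing brace.
    for (let i = 0; i < 5 && cur.text.charAt(cur.pos) !== "}"; i++) {
      code = code * 16 + getNextHex(cur);
    }
    if (cur.text.charAt(cur.pos) !== "}") {
      throw new SyntaxError("Expected } to close the Unicode escape");
    }
    cur.pos += 1;
  } else {
    // Exactly four digits; equivalent to h1*16^3 + h2*16^2 + h3*16 + h4.
    code = 0;
    for (let i = 0; i < 4; i++) {
      code = code * 16 + getNextHex(cur);
    }
  }
  if (code > 0x10ffff) {
    throw new RangeError("Code point is too large for UTF-16");
  }
  if (code > 0xffff) {
    // Astral code point: emit a surrogate pair.
    const offset = code - 0x10000;
    return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
  }
  return String.fromCharCode(code);
}
```

With the cursor positioned on `0041` this returns "A"; positioned on `{1F600}` it returns a two-code-unit string and leaves the cursor just past the closing brace.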
(To be honest, the only way I can document this more thoroughly than the above would be to write an entire essay on it. It requires some low-level Unicode knowledge as well as some CS background.)
Help with code readability and clarity. Point-free style is kinda pointless... ;-)
@anko Is this better? I refactored it to be much clearer and less tersely functional. It also has quite a few more comments.
It's probably also faster, since I'm not creating nearly as many functions. The test suite runs almost instantly on my underpowered computer, and that's checking over 1000 small unit tests (the majority testing that the escapes work for strings and atoms). And although it's an implementation detail, you could, in theory, escape whitespace characters (and Windows CRLF) in atoms by using
As for your suggestion in a private message about maybe implementing reader macros, this format will make that much easier. It would be a matter of changing this block and using a trie as a read table, so I can know how far to backtrack and can easily call functions. It would be nearly trivial.
Fixes #1
I rewrote the parser in LiveScript to make it easier to create a proper parser for character escapes. It would've taken probably twice as many lines and 5 times the work to try to fit it into PEG.js' EBNF grammar.
In doing so, I also enabled escapes in atoms. Although that might not be immediately useful for Eslisp outside of regular expressions, it may be for other consumers.
I have incorporated the changes in 3b4d707 (fixing #2) as well. Returning a list simplified the entry point, but the shebang required some inelegant work.