Rewrite parser for flexibility, add much better support for escapes #4

dead-claudia · 2015-12-28T15:01:31Z

Fixes #1

I rewrote the parser in LiveScript to make it easier to create a proper parser for character escapes. It would've taken probably twice as many lines and 5 times the work to try to fit it into PEG.js' EBNF grammar.

In doing so, I also enabled escapes in atoms. That may not be immediately useful for Eslisp outside of regular expressions, but it may be for other consumers.

I have incorporated the changes in 3b4d707 (fixing #2) as well. Returning a list simplified the entry point, but the shebang required some inelegant work.

dead-claudia · 2016-01-06T05:56:09Z

parser.ls

+            | 'F', 'f' => 15
+            | _ => fail "Expected a hex digit, found #that", 'hex digit', that
+
+    parseUnicodeEscape = ->


@anko

The small amount of bit twiddling and the rest of the logic here is basically a necessity. It is what converts the string "ABCD" into the hex number 0xabcd. And in order to stringify astral characters (in the Unicode range U+10000 - U+10FFFF), I had to convert each one into two character codes using the first formula here, because of how JavaScript handles them internally (JS strings are sequences of 16-bit UTF-16 code units, while the ES6 \u{N} escape names a full Unicode code point, which may take 2 code units, i.e. a surrogate pair).
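
To make the astral-character step concrete, here is a minimal TypeScript sketch of the standard UTF-16 surrogate-pair calculation. It is not the PR's LiveScript code; `toSurrogatePair` is a hypothetical name chosen for illustration.

```typescript
// Hypothetical helper (not from the PR): split an astral code point
// (U+10000 to U+10FFFF) into its UTF-16 high and low surrogates.
function toSurrogatePair(code: number): [number, number] {
  if (code < 0x10000 || code > 0x10ffff) {
    throw new RangeError("Not an astral code point: 0x" + code.toString(16));
  }
  const offset = code - 0x10000;               // leaves a 20-bit value
  const highUnit = 0xd800 + (offset >> 10);    // top 10 bits
  const lowUnit = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [highUnit, lowUnit];
}

// U+1F600 needs two UTF-16 code units.
const grinning = toSurrogatePair(0x1f600);
console.log(grinning.map((unit) => unit.toString(16)).join(" ")); // prints "d83d de00"
```

Passing both units to `String.fromCharCode` then yields the two-unit JavaScript string for that character.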

Probably closer to heavy wizardry than magic. And trust me when I say that Acorn is worse. Esprima isn't much better WRT this, even though it uses plain multiplication. I'll push a cosmetic patch to try to better document this whole thing.

Here's a description of the algorithm for parsing Unicode escapes in prose:

GetNextHex()

When this is called, do the following:

1. Let next be the current character. Advance to the next character.
2. If next is in the range 0 to 9, return the numeric equivalent of that character.
3. If next is A or a, return 10.
4. If next is B or b, return 11.
5. If next is C or c, return 12.
6. If next is D or d, return 13.
7. If next is E or e, return 14.
8. If next is F or f, return 15.
9. Otherwise, throw an error, as this isn't a hex digit.
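
For reference, the GetNextHex() steps could look roughly like this in TypeScript. This is a sketch, not the LiveScript in parser.ls; the `Cursor` shape is an assumption made up for the example.

```typescript
// Hypothetical cursor type; the PR's parser tracks its position differently.
interface Cursor {
  source: string;
  index: number;
}

// Consume one character and return its value as a hex digit (0 to 15).
function getNextHex(cursor: Cursor): number {
  const next = cursor.source.charAt(cursor.index); // "" at end of input
  const unit = cursor.source.charCodeAt(cursor.index); // NaN at end of input
  cursor.index += 1;
  if (unit >= 48 && unit <= 57) return unit - 48;        // "0" to "9"
  if (unit >= 65 && unit <= 70) return unit - 65 + 10;   // "A" to "F"
  if (unit >= 97 && unit <= 102) return unit - 97 + 10;  // "a" to "f"
  throw new SyntaxError("Expected a hex digit, found " + JSON.stringify(next));
}

const hexCursor: Cursor = { source: "aF7", index: 0 };
console.log(getNextHex(hexCursor), getNextHex(hexCursor), getNextHex(hexCursor)); // prints "10 15 7"
```

The range checks collapse the per-letter cases from the prose into arithmetic on the character code, which is behaviorally the same.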

ParseUnicodeEscape()

When this is called, do the following:

1. Assert: we've already parsed \u.
2. If the current character is {, then do the following:
   1. Advance to the next character.
   2. Let code be GetNextHex().
   3. Repeat until either the current character is } or this loop has run 5 times:
      1. Multiply code by 16 (that is, shift it left by 4 bits).
      2. Let hex be GetNextHex().
      3. Add hex to code.
   4. Check that the current character is }. If not, throw an error.
   5. Advance to the next character.
3. Else, do the following:
   1. Let hex1 be GetNextHex() * 16³.
   2. Let hex2 be GetNextHex() * 16².
   3. Let hex3 be GetNextHex() * 16.
   4. Let hex4 be GetNextHex().
   5. Let code be the sum of hex1, hex2, hex3, and hex4.
4. If code is greater than the hex value 0x10FFFF, then throw an error, as this code point is too large to be represented in UTF-16.
5. If code is greater than the hex value 0xFFFF, then do the following:
   1. Let character be the Unicode code point represented by code.
   2. Convert character into a surrogate pair as defined in the Unicode standard.
   3. Return the concatenation of each surrogate resulting from the above step.
6. Else, return the Unicode code point represented by code.
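
Put together, the ParseUnicodeEscape() steps can be sketched in TypeScript as follows. Again, this is an illustration rather than the actual LiveScript: the `Reader` type and the `nextHex` helper are hypothetical names, and the reader is assumed to start just past the \u.

```typescript
// Hypothetical reader state; index counts UTF-16 code units and is assumed
// to point just past the "\u" that was already consumed.
interface Reader {
  source: string;
  index: number;
}

// Compact stand-in for GetNextHex(): consume one hex digit, return 0 to 15.
function nextHex(reader: Reader): number {
  const ch = reader.source.charAt(reader.index);
  reader.index += 1;
  if (!/^[0-9a-fA-F]$/.test(ch)) {
    throw new SyntaxError("Expected a hex digit, found " + JSON.stringify(ch));
  }
  return parseInt(ch, 16);
}

function parseUnicodeEscape(reader: Reader): string {
  let code: number;
  if (reader.source.charAt(reader.index) === "{") {
    // \u{N} form: 1 to 6 hex digits, then a closing brace.
    reader.index += 1;
    code = nextHex(reader);
    for (let i = 0; i < 5 && reader.source.charAt(reader.index) !== "}"; i++) {
      code = code * 16 + nextHex(reader);
    }
    if (reader.source.charAt(reader.index) !== "}") {
      throw new SyntaxError("Expected } to close the \\u{...} escape");
    }
    reader.index += 1;
  } else {
    // \uXXXX form: exactly 4 hex digits.
    code = nextHex(reader) * 4096;
    code += nextHex(reader) * 256;
    code += nextHex(reader) * 16;
    code += nextHex(reader);
  }
  if (code > 0x10ffff) {
    throw new SyntaxError("Code point out of range: 0x" + code.toString(16));
  }
  if (code > 0xffff) {
    // Astral code point: emit a UTF-16 surrogate pair.
    const offset = code - 0x10000;
    return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
  }
  return String.fromCharCode(code);
}

const escapeReader: Reader = { source: "{1F600}!", index: 0 };
const smiley = parseUnicodeEscape(escapeReader);
console.log(smiley.length, escapeReader.index); // prints "2 7"
```

The result has length 2 because U+1F600 is stored as a surrogate pair, and the reader stops on the character after the closing brace.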

(To be honest, the only way I can document this more thoroughly than the above would be to write an entire essay on it. It requires some low-level Unicode knowledge as well as some CS background.)

Help with code readability and clarity. Point-free style is kinda pointless... ;-)

dead-claudia · 2016-01-06T07:35:35Z

@anko Is this better? I refactored it to be much clearer and less tersely functional. It also has quite a few more comments.

dead-claudia · 2016-01-06T07:47:47Z

@anko

It's probably also faster, since I'm not creating nearly as many functions. The test suite runs almost instantly on my underpowered computer, and that's checking over 1000 small unit tests (the majority testing that the escapes work for strings and atoms).

And although it's an implementation detail, you could, in theory, escape whitespace characters (and Windows CRLF) in atoms by using foo\ bar, which would parse out to {type: "atom", value: "foo bar"}. If you'd prefer, I can make parseEscape and readStringChar take an argument to prevent this.
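
To illustrate the atom case, here is a heavily simplified TypeScript sketch. `readAtom` is a hypothetical stand-in, not the PR's parseEscape or readStringChar: it assumes a backslash makes the next single character literal, and it does not model CRLF or any other escape forms.

```typescript
interface Atom {
  type: "atom";
  value: string;
}

// Hypothetical simplified reader: an atom ends at unescaped whitespace or a
// parenthesis, and a backslash makes the following character literal.
function readAtom(source: string, start: number): { node: Atom; end: number } {
  let value = "";
  let i = start;
  while (i < source.length) {
    const ch = source.charAt(i);
    if (ch === "\\") {
      if (i + 1 >= source.length) throw new SyntaxError("Unterminated escape");
      value += source.charAt(i + 1); // keep the escaped character as-is
      i += 2;
    } else if (/\s/.test(ch) || ch === "(" || ch === ")") {
      break; // unescaped delimiter ends the atom
    } else {
      value += ch;
      i += 1;
    }
  }
  return { node: { type: "atom", value }, end: i };
}

// The source text here is: foo\ bar baz
const parsedAtom = readAtom("foo\\ bar baz", 0);
console.log(JSON.stringify(parsedAtom.node)); // prints {"type":"atom","value":"foo bar"}
```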

dead-claudia · 2016-01-06T08:08:45Z

And as for your suggestion in a private message about maybe implementing reader macros: this format will make that much easier. It would be a matter of changing this block and using a trie as a read table, so that I know how far to backtrack and can easily call functions. It would be nearly trivial.
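
As a rough idea of what a trie-based read table might look like, here is a TypeScript sketch. Nothing in it exists in the PR: the node shape, `insert`, `longestMatch`, and the example macro prefixes are all assumptions for illustration.

```typescript
// Hypothetical read-table types; handlers just return a label here.
type Handler = () => string;

interface TrieNode {
  children: { [ch: string]: TrieNode };
  handler?: Handler;
}

function makeNode(): TrieNode {
  return { children: {} };
}

// Register a handler under a macro prefix, one trie level per character.
function insert(root: TrieNode, prefix: string, handler: Handler): void {
  let node = root;
  for (let i = 0; i < prefix.length; i++) {
    const ch = prefix.charAt(i);
    let child = node.children[ch];
    if (child === undefined) {
      child = makeNode();
      node.children[ch] = child;
    }
    node = child;
  }
  node.handler = handler;
}

// Walk as far as the trie allows, remembering the last position that had a
// handler. That position is how far the reader keeps; anything scanned
// beyond it is backtracked over.
function longestMatch(
  root: TrieNode,
  source: string,
  start: number
): { handler: Handler; end: number } | null {
  let node = root;
  let best: { handler: Handler; end: number } | null = null;
  for (let i = start; i < source.length; i++) {
    const child = node.children[source.charAt(i)];
    if (child === undefined) break;
    node = child;
    if (node.handler !== undefined) best = { handler: node.handler, end: i + 1 };
  }
  return best;
}

const readTable = makeNode();
insert(readTable, "'", () => "quote");
insert(readTable, ",", () => "unquote");
insert(readTable, ",@", () => "unquote-splicing");

const hit = longestMatch(readTable, ",@xs", 0);
console.log(hit === null ? "no macro" : hit.handler() + " " + hit.end); // prints "unquote-splicing 2"
```

Because `,` and `,@` share a prefix, the longest-match walk picks `,@` when it is present and falls back to `,` otherwise.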

impinball added 3 commits December 28, 2015 09:01

Rewrite parser, improve string/atom parsing a lot

add3177

Merge branch 'master' of https://github.com/anko/sexpr-plus into rewrite

b8897fc

Incorporate new 7.0 features, fix bad Unicode escape handling, fix README

45b9522

dead-claudia reviewed Jan 6, 2016

Comment parser better, use OO instead of overly terse FP style

6a3997e

This was referenced Jan 6, 2016

The quoting-related operators should be optionally available in their raw form. #3

Closed

Replace transform macros with proper reader macros anko/eslisp#24

Open
