Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor logos-derive from tree to graph #94

Merged
merged 101 commits into from
Mar 30, 2020
Merged

Refactor logos-derive from tree to graph #94

merged 101 commits into from
Mar 30, 2020

Conversation

maciejhirsz
Copy link
Owner

@maciejhirsz maciejhirsz commented Mar 22, 2020

Progress tracking:

  • new Graph primitives
  • parsing #[token] declarations to Graph
  • parsing #[regex] declarations to Graph
    • byte concatenations
    • zero-or-one (?) groups
    • zero-or-more (*) groups
    • one-or-more (+) groups
    • alternations (a|b)
    • group concatenation
    • reimplement full unicode handling with utf8-ranges
  • merging non-looping branches (foo with foobar)
  • merging looping branches ([0-9]+ with [0-9]+\.[0-9]+)
  • merging looping branches with non-looping branches ([a-z]+ with foobar)
  • token disambiguation
  • handle trivia
  • code generation, remove lexicon from the Logos trait
  • code optimization: parse multiple bytes at a time
  • callbacks and extras
  • code optimization: turn complex forks into jump tables

I now believe that most of the issues current implementation has (#87, #81, #80, #79, #78, #70, and probably more) is due to the fact that trying to construct a tree is just not the right way of even trying to solve the problem.

What I think is a solution is a complete rewrite of the logos-derive from tree to a graph that can more adequately represent loops and arbitrary state jumps without the explosive nature of trying to build up all possible permutations in a tree. All the nodes of the graph are going to be stored on a single Vec based struct (called Graph), and will be referenced by their index in that Vec. The nodes are going to be immutable, so any permutations (merging forks) will have to create a new node with a new id.

Here is a current (custom) debug print for what I imagine a simple [a-z]* regex should look like in the graph:

[
    :0 "IDENT",
    :1 [
        [a-z] ⇒ :1,
        ⤷ :0,
    ],
]

The node :0 is a token, the node :1 is a fork with a single arm matching one byte to range [a-z], on success we navigate to node :1 (creating a loop), on miss we navigate to node :0 and return a token. Generating Rust code out of this should be pretty straight forward, we can make every node a function definition, every jump to a function call (loops shouldn't lose performance due to tail call recursion). There is going to be room for optimization in code generation, although LLVM is probably going to do a better job at figuring out how and when to inline stuff than I ever will.

This removes the need of marking forks as Plain | Maybe | Repeat, and it should also remove the need for the fallback on branches, that really was just a hack to make identifiers working alongside named keywords.

Going to leave this draft open for comments (CC #88).

@maciejhirsz
Copy link
Owner Author

maciejhirsz commented Mar 22, 2020

While I'm at it, derive will now report more than one error at once, and will do so spanned to where the problem is:

error: `Token::End` has a discriminant value set. This is not allowed for Tokens.
 --> tests/tests/dev.rs:8:5
  |
8 | /     #[end]
9 | |     End = 1,
  | |___________^

error: Only one #[error] variant can be declared.
  --> tests/tests/dev.rs:14:5
   |
14 | /     #[token = "hello!"]
15 | |     #[error]
16 | |     Hello,
   | |_________^

error: Previously declared #[error]:
 --> tests/tests/dev.rs:5:5
  |
5 | /     #[error]
6 | |     Error,
  | |_________^

error: aborting due to 3 previous errors

@wirelyre
Copy link

wirelyre commented Mar 22, 2020

I also started exploring this in wirelyre/logos@8db611da674deccdf9dbb2df4d37cc4d0f16a6d8.

My design was to not rely on tail recursion by using labeled loops:

let mut token = Error;

'n0: loop { match src.next() {
    'a' => 'n1: loop { match src.next() {
        'a' => { token = A; continue 'n1; }
        'b' => continue 'n0,
    }}
    'c' => 'n2: loop { /* etc. */ }
}

return token;

This works whenever nodes are nested (no jumps from 'n1 to 'n2), which in practice is almost always. In other cases you have to duplicate part of the graph inside the loops.

I have some notes on how well LLVM tends to optimize that; I'm not sure how well it will inline the tail recursive nodes but it's bound to be less predictable.

I found it useful to:

  • Remove Regex and match only one character at a time
  • Replace Pattern with a bitset

Contrary to the code I wrote, I planned to:

  • Construct an NFA from all patterns at once, then generate the graph directly from that NFA
  • Remove the graph library since the algorithms aren't useful

@wirelyre
Copy link

wirelyre commented Mar 22, 2020

Also I was going to use this disambiguation strategy:

  1. Take the longest sequence of bytes that match, and consider all tokens or regexes that match those bytes.
  2. If both a token and regex match, return the token.
  3. If two regexes match, return the first one. (But show a warning at compile time.)

This correctly deals with identifiers and keywords (let vs. letter); and integers and floats (123 vs. 123.45).

@maciejhirsz
Copy link
Owner Author

maciejhirsz commented Mar 23, 2020

  • Remove Regex and match only one character at a time.

There is an optimization in place where for any branch, and any fork with branches that lave length > 1, you can read multiple bytes at once and avoid doing bounds checking. This is especially useful when a branch is a byte sequence of 4 or 8 bytes, you can load [u8; 4] or [u8; 8] and LLVM can optimize those compares to 32/64 bit integer instructions.

Pattern is definitely a mess. In this rewrite I've already opted not to use it. All my sequences are just bytes, while the fork is using a table 256 ids long, so I just splatter the all the ranges onto that and never have to worry about having to find where the sets are overlapping.

@wirelyre
Copy link

wirelyre commented Mar 23, 2020

you can read multiple bytes at once

Yes, sorry, I meant only "matching" single bytes in the graph, not actually in the generated code.

That seemed promising because it simplified the graph structure. And then any optimizations could logically be done on the graph as a complete data structure, rather than making the edges more complex, if that makes sense.

For instance this would seem to leave room for:

// tokens "aaaa", "aabb"; regex "[a-z]+"

match chunk_of_4 {
    b"aaaa" => _,
    b"aabb" => _,
    _ => { if regex(b[0]) && regex(b[1]) /* ... */ }
}

because even though all three patterns share the first three nodes, so they'll be in separate Branches, if we have the whole graph we can recognize the larger pattern.

Although writing this it's now clear that loops are a big tradeoff, because in this case you'd need to duplicate the regex state like eight times.

@maciejhirsz
Copy link
Owner Author

Debugging took a while, but we are all green ✔️. There might still be some edge cases, turns out the graph makes everything simpler and easier, but it's hard to produce a canonical structure for every possible permutation of regex.

Unwinding loops and simple lookup tables for multiple ranges, and numbers are getting there:

test identifiers                       ... bench:         813 ns/iter (+/- 56) = 958 MB/s
test keywords_operators_and_punctators ... bench:       2,563 ns/iter (+/- 105) = 831 MB/s

Next I need to do jump tables for expensive match branches, and more aggressively read multiple bytes at a time when possible, and then I reckon this thing is home.

@maciejhirsz
Copy link
Owner Author

Optimized branching :)

test identifiers                       ... bench:         681 ns/iter (+/- 54) = 1143 MB/s
test keywords_operators_and_punctators ... bench:       2,315 ns/iter (+/- 162) = 920 MB/s

@ratmice
Copy link
Contributor

ratmice commented Mar 30, 2020

So, tried this out on a branch of a project of mine,

I had to fix a bug in my lexer (mystery solved!), but it went green,
after fixing that bug though 0.10-rc2 which was green broke,
and 0.9.7 is still red.

So, this is the only branch/release currently passing, my ci tests.
There are a few minor differences in error messages (relative to pre-mystery state), I'll have to look into.
previously passing
this branch

@maciejhirsz
Copy link
Owner Author

@ratmice great to hear. If you can either submit a PR or point me to whatever edge case regex you have that is breaking 0.10-rc2 or 0.9.7, I'm happy to have it running as a test here to ensure we don't get regressions.

@ratmice
Copy link
Contributor

ratmice commented Mar 30, 2020

@maciejhirsz I don't know exactly whats going on. I had an errant #[token = r"\p{Whitespace}"] which should be a regex rather than token, I haven't dug into it yet to see whats happening yet.

the 0.9.7 cases are instances fixed by #53 multi-byte reads in errors.
I've also tried a second parser and that one worked fine,

@maciejhirsz maciejhirsz mentioned this pull request Mar 30, 2020
@maciejhirsz maciejhirsz marked this pull request as ready for review March 30, 2020 21:33
@maciejhirsz
Copy link
Owner Author

There are still things I know are suboptimal in the generated code, so some more fine tuning is coming. For now though, I think things are not bad at all.

test identifiers                       ... bench:         701 ns/iter (+/- 30) = 1111 MB/s
test keywords_operators_and_punctators ... bench:       2,006 ns/iter (+/- 70) = 1062 MB/s

Going to publish this as 0.10.0-rc3, to see if the regressions people reported are gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants