Refactor logos-derive from tree to graph #94
Conversation
While I'm at it, derive will now report more than one error at once, and will do so spanned to where the problem is.
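A minimal sketch of one way a derive can surface several spanned errors at once using `syn` (this approach and the function name are assumptions for illustration, not necessarily what logos-derive actually does internally):

```rust
use proc_macro2::TokenStream;
use syn::Error;

// Merge every collected `syn::Error` and emit them together;
// each error keeps its own span in the compiler output.
fn report_all(errors: Vec<Error>) -> Option<TokenStream> {
    let mut iter = errors.into_iter();
    let mut combined = iter.next()?;
    for error in iter {
        combined.combine(error);
    }
    Some(combined.to_compile_error())
}
```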
---

I also started exploring this in wirelyre/logos@8db611da674deccdf9dbb2df4d37cc4d0f16a6d8. My design was to not rely on tail recursion by using labeled loops:

```rust
let mut token = Error;
'n0: loop {
    match src.next() {
        Some('a') => 'n1: loop {
            match src.next() {
                Some('a') => { token = A; continue 'n1; }
                Some('b') => continue 'n0,
                _ => break 'n0,
            }
        },
        Some('c') => 'n2: loop { /* etc. */ },
        _ => break 'n0,
    }
}
return token;
```

This works whenever nodes are nested (no jumps from a node to anything other than an ancestor, since a labeled `continue` can only target an enclosing loop). I have some notes on how well LLVM tends to optimize that; I'm not sure how well it will inline the tail-recursive nodes, but it's bound to be less predictable. I found it useful to:
Contrary to the code I wrote, I planned to:
---

Also I was going to use this disambiguation strategy:
This correctly deals with identifiers and keywords.
---

There is an optimization in place where for any branch, and any fork with branches that have length > 1, you can read multiple bytes at once and avoid doing bounds checking on each one. This is especially useful when a branch is a byte sequence of 4 or 8 bytes: you can load a `[u8; 4]` or `[u8; 8]` and LLVM can optimize those comparisons into 32/64-bit integer instructions.
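As a rough illustration of that idea (a minimal sketch with assumed names, not the actual generated code): a fixed-size chunk read performs one bounds check for all four bytes, and the comparison compiles down to a single 32-bit integer compare.

```rust
/// Read a 4-byte chunk starting at `offset`, with a single bounds check.
/// After optimization the copy is one (possibly unaligned) 32-bit load.
fn read_4(bytes: &[u8], offset: usize) -> Option<[u8; 4]> {
    bytes.get(offset..offset + 4)?.try_into().ok()
}

fn is_loop_keyword(input: &[u8], pos: usize) -> bool {
    // Comparing two `[u8; 4]` values is one integer comparison,
    // instead of four byte compares with four bounds checks.
    read_4(input, pos) == Some(*b"loop")
}
```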
`Pattern` is definitely a mess. In this rewrite I've already opted not to use it. All my sequences are just bytes, while the fork is using a table 256 ids long, so I just splatter all the ranges onto it and never have to worry about finding where the sets overlap.
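Something like the following sketch of that table (the names and layout are assumptions for illustration, not taken from the actual logos-derive code): every byte value indexes straight into a branch id, so overlapping ranges simply overwrite slots instead of needing any set-intersection logic.

```rust
const NO_BRANCH: u16 = u16::MAX;

/// Build a 256-entry dispatch table from (range start, range end, branch id)
/// triples. Later ranges overwrite earlier ones, so overlap costs nothing.
fn build_table(ranges: &[(u8, u8, u16)]) -> [u16; 256] {
    let mut table = [NO_BRANCH; 256];
    for &(start, end, branch) in ranges {
        for byte in start..=end {
            table[byte as usize] = branch;
        }
    }
    table
}

fn next_branch(table: &[u16; 256], byte: u8) -> Option<u16> {
    match table[byte as usize] {
        NO_BRANCH => None,      // miss: no arm covers this byte
        branch => Some(branch), // hit: jump to this branch
    }
}
```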
---

Yes, sorry, I meant only "matching" single bytes in the graph, not actually in the generated code. That seemed promising because it simplified the graph structure. And then any optimizations could logically be done on the graph as a complete data structure, rather than by making the edges more complex, if that makes sense. For instance this would seem to leave room for:

```rust
// tokens "aaaa", "aabb"; regex "[a-z]+"
match chunk_of_4 {
    b"aaaa" => _,
    b"aabb" => _,
    _ => { if regex(b[0]) && regex(b[1]) { /* ... */ } }
}
```

because even though all three patterns share the first three nodes, they'll end up in separate branches. Although writing this, it's now clear that loops are a big tradeoff, because in this case you'd need to duplicate the regex state something like eight times.
---

Debugging took a while, but we are all green ✔️. There might still be some edge cases; turns out the graph makes everything simpler and easier, but it's hard to produce a canonical structure for every possible permutation of regex. Loop unwinding and simple lookup tables for multiple ranges are in, and the numbers are getting there.
Next I need to do jump tables for expensive match branches, and more aggressively read multiple bytes at a time when possible, and then I reckon this thing is home.
---

Optimized branching :)
---

So, I tried this out on a branch of a project of mine. I had to fix a bug in my lexer (mystery solved!), but it went green. This is the only branch/release currently passing my CI tests.
---

@ratmice great to hear. If you can either submit a PR or point me to whatever edge-case regex you have that is breaking 0.10-rc2 or 0.9.7, I'm happy to have it running as a test here to ensure we don't get regressions.
---

@maciejhirsz I don't know exactly what's going on. I had an errant … The 0.9.7 cases are instances fixed by #53 (multi-byte reads in errors).
---

There are still things I know are suboptimal in the generated code, so some more fine-tuning is coming. For now though, I think things are not bad at all.
Going to publish this as 0.10.0-rc3, to see if the regressions people reported are gone.
---

Progress tracking:

- `Graph` primitives
- `#[token]` declarations to `Graph`
- `#[regex]` declarations to `Graph`
  - (`?`) groups
  - (`*`) groups
  - (`+`) groups
  - (`a|b`) groups
  - utf8-ranges
- `foo` with `foobar`
- `[0-9]+` with `[0-9]+\.[0-9]+`
- `[a-z]+` with `foobar`
- `lexicon` from the `Logos` trait

I now believe that most of the issues the current implementation has (#87, #81, #80, #79, #78, #70, and probably more) are due to the fact that trying to construct a tree is just not the right way of even trying to solve the problem.
What I think is a solution is a complete rewrite of `logos-derive` from tree to a graph that can more adequately represent loops and arbitrary state jumps, without the explosive nature of trying to build up all possible permutations in a tree. All the nodes of the graph are going to be stored in a single `Vec`-based struct (called `Graph`), and will be referenced by their index in that `Vec`. The nodes are going to be immutable, so any permutations (merging forks) will have to create a new node with a new id.

Here is a current (custom) debug print for what I imagine a simple `[a-z]*` regex should look like in the graph: the node `:0` is a token; the node `:1` is a fork with a single arm matching one byte against the range `[a-z]`. On a hit we navigate back to node `:1` (creating a loop); on a miss we navigate to node `:0` and return a token.
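A minimal sketch of what that storage could look like (the type and field names here are assumptions for illustration, not the actual logos-derive definitions):

```rust
/// Nodes refer to each other by index into `Graph::nodes`,
/// so a loop is just an index pointing back at an earlier node.
type NodeId = usize;

enum Node {
    /// A leaf producing a token variant.
    Token(&'static str),
    /// A fork: byte-range arms to follow on a hit,
    /// and an optional node to fall back to on a miss.
    Fork {
        arms: Vec<(u8, u8, NodeId)>, // (range start, range end, target)
        miss: Option<NodeId>,
    },
}

struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    /// Nodes are immutable once pushed; merging two forks would
    /// push a brand new node and hand back a fresh id.
    fn push(&mut self, node: Node) -> NodeId {
        self.nodes.push(node);
        self.nodes.len() - 1
    }
}

/// The `[a-z]*` example: `:0` is the token, while `:1` loops back
/// to itself on `a..=z` and falls back to `:0` on a miss.
fn lowercase_star() -> Graph {
    let mut graph = Graph { nodes: Vec::new() };
    let token = graph.push(Node::Token("Word")); // node :0
    graph.push(Node::Fork {
        arms: vec![(b'a', b'z', 1)], // hit: loop back to node :1
        miss: Some(token),           // miss: return the token at :0
    });
    graph
}
```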
Generating Rust code out of this should be pretty straightforward: we can make every node a function definition and every jump a function call (loops shouldn't lose performance due to tail call recursion). There is going to be room for optimization in code generation, although LLVM is probably going to do a better job at figuring out how and when to inline stuff than I ever will.
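For the `[a-z]*` graph above, that per-node codegen could look something like this sketch (assumed names and signatures, not the actual output of the derive):

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Word,
}

// Node :0 — return the token and how many bytes it spans.
fn node_0(len: usize) -> (Token, usize) {
    (Token::Word, len)
}

// Node :1 — the fork: loop on `a..=z`, fall back to node :0 on a miss.
// The recursive call is in tail position, so LLVM can turn it into a jump.
fn node_1(src: &[u8], len: usize) -> (Token, usize) {
    match src.get(len).copied() {
        Some(b'a'..=b'z') => node_1(src, len + 1), // hit: stay in the loop
        _ => node_0(len),                          // miss: emit the token
    }
}

fn main() {
    assert_eq!(node_1(b"abc!", 0), (Token::Word, 3));
}
```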
This removes the need of marking forks as `Plain | Maybe | Repeat`, and it should also remove the need for the `fallback` on branches, which really was just a hack to make identifiers work alongside named keywords.

Going to leave this draft open for comments (CC #88).