Rewrite the grammar once again #120
Conversation
In this case, please at least commit the
You can use
Important question: will queries remain compatible?
Ah, didn't know that. How do you generate the parser from that file?
That's how we're building it, but @wenkokke added a patch for
I regenerated the
Some of them will, but it's gonna depend on the result of the survey how contained the breakage will be.
Just
Possibly because the output dir has changed? But that should be fixed upstream.
Yes, the Github release artifacts (only; it shouldn't be committed to the repo).
Thanks for the heads-up. I hope the queries here can serve as a guide for nvim-treesitter.
Wow @tek this sounds incredible, clearly a labour of love, and I'm hugely grateful that you undertook such a huge amount of work for the community. The sheer quantity of listed improvements is astounding. I'll try my best to look over it, but given the large amount of grammar changes, and that I've mostly just dealt with the scanner in the past, you're clearly the one who needs to decide when this is ready :) 🚀 🚀 🚀
-- This should be an error, since a conid followed by two dots parses as the
-- qualified composition operator, but we allow it since there's no way to use
-- an operator here, like it would be in a section.
a = [A..a]
Very nice.
Next step is to fix this in GHC's parser, yeah?
λ> [False..True]
<interactive>:1:2: error:
Not in scope: ‘False..’
ah 😅 GHC makes up for it by permitting .. to be referred to as an operator if you use a qualifying module, like a A... a 😁
The reason is that web-tree-sitter doesn't support all of libc, and there's a file in the web-tree-sitter source that determines exactly what symbols are supported. Please test the WASM build before merging. Tests using only the standard testing infrastructure sadly aren't reliable.
That is an upstream issue, surely? The point of wasm is to be independent of both the source and the consumer. If something isn't working that should be working, tree-sitter should know about it rather than band-aiding it?
@414owen thanks so much! I definitely went a bit overboard with the obsession over the last few weeks 😁
Ah, I'll make sure to bring the wasm build and release up to speed then
FYI @wenkokke it appears that the file with the exports was refactored in tree-sitter/tree-sitter@d290051 and no longer contains standard library symbols. Edit: Ah! It's in here now, and appears to contain
examples/semantic/semantic/test/fixtures/haskell/corpus/type-synonyms.A.hs
examples/semantic/semantic/test/fixtures/haskell/corpus/type-synonyms.B.hs
examples/semantic/semantic/test/fixtures/haskell/corpus/tempate-haskell.A.hs
examples/semantic/semantic/test/fixtures/haskell/corpus/template-haskell.B.hs
🎉 fantastic
"semi", | ||
"cpp-else", | ||
"cpp", | ||
}; |
Remind me to make a PR tying this together with the enum more tightly, using an X macro
src/scanner.c
Outdated
static int32_t peek(uint32_t n) {
  if (state->lookahead->offset + n < state->lookahead->len) return unsafe_peek(n);
  else {
    advance_before(n);
I'm a bit uneasy about advancing inside a function named peek. Is there a good reason for this?
What would you consider a better name?
It's returning the nth lookahead character without advancing over it, if it hasn't been advanced over before.
I think it's possible that I've rearranged the steps in a way that the second part never happens, by now.
So, the peek part applies to the n only, I guess. The functions peek{0,1,2} embody this concept a bit more clearly, maybe.
Well, it looks like unsafe_peek_abs implements a pure peek with n lookahead, and I'm not entirely sure what peek does. It has a few layers, and I don't understand them. They seem to try to advance in relation to a parameter and the lookahead offset?
for the record:
peek(n) := return lookahead character offset + n without advancing over it, but advance over all preceding characters if necessary (may have advanced over offset + n before, in which case it is returned directly from our buffer). Or alternatively: ensure that our buffer contains offset + n, advancing the tree-sitter lookahead as far as necessary if it doesn't, copying characters to our buffer.
peek_unsafe(n) := return lookahead character offset + n without advancing; return 0 if the lookahead buffer is too small.
peek_unsafe_abs(n) := return lookahead character n without advancing; return 0 if the lookahead buffer is too small.
This isn't a WASM issue, but rather an emscripten feature that lets you not ship the entirety of libc and libc++ for your web app. That means that tree-sitter, as the entry point, bundles libc and libc++ and must pick which parts they want. They generally pick only those parts they need for their library itself to run.
Probably, but you've rewritten the C code, so if you depend on anything new we might need the infrastructure again. Though it's more advisable to just remove functions that aren't supported.
Not true; they also expose (more and more) for external scanners to use. Hence my recommendation to raise an upstream issue.
@clason Since I've worked on a lot of the queries in nvim-treesitter, and also maintain neotest-haskell, which also uses tree-sitter queries, I'll be taking part in the survey.
@mrcjkb Thank you, then I don't need to worry :) (To be clear, I didn't want to argue for not changing the queries; I just wanted a heads-up on how much breakage I should expect so I can brace for it.)
src/scanner.c
Outdated
// --------------------------------------------------------------------------------------------------------

static void advance_over(uint32_t n) {
  for (uint32_t i = state->lookahead->len; i <= state->lookahead->offset + n; i++) S_ADVANCE;
I'm not sure why we're advancing with relation to the peek position, and I'm not convinced this for loop is right anyway. Shouldn't the length of the lookahead be a limit, not a starting state?
The docs now explain it, but for the record:
lookahead is the range that's already been advanced over, so this starts at its end.
offset is the reference point for indexing lookahead that can be reset at any point for convenience, so the target index is offset + n.
If that is smaller than lookahead's current size, the target index has already been consumed and the loop immediately returns.
Okay, so there's an internal state, and a tree-sitter state.
advance now pushes the current token to the internal-state lookahead, and advances the tree-sitter character.
peek peeks into an internal lookahead buffer, and may advance the tree-sitter state to fill that buffer, if necessary.
So when parsing a file from scratch, does that mean that the entire file is pushed character-by-character into the lookahead buffer?
By the looks of it, reset_lookahead doesn't reset the lookahead so much as it advances the internal state to the end of the lookahead buffer. Maybe there's a better name for this operation?
For my own interest, do you think it's possible for the scanner to work with only one character of lookahead?
Okay, so there's an internal state, and a tree-sitter state. advance now pushes the current token to the internal-state lookahead, and advances the tree-sitter character. peek peeks into an internal lookahead buffer, and may advance the tree-sitter state to fill that buffer, if necessary. So when parsing a file from scratch, does that mean that the entire file is pushed character-by-character into the lookahead buffer?
If the question is whether the entire file is stored in that buffer entirely at some point, then no – the buffer is discarded after each run.
Right now the array is reused across runs and only overwritten to achieve that, but like the global pointers this isn't thread safe, so I'll have to revert to the initial approach I used and allocate a new one in each run (unlike the state pointers, this would only be unsafe if multiple nodes in the same file would be processed concurrently, so it might still be feasible if we knew that TS doesn't allow that).
It's probably useful to point out that none of this has any impact on performance.
I've never observed any differences in benchmarks when changing implementation details of the scanner on a "technical" level, only when modifying logic so that the number of times the scanner is called varies.
(On that note, moving stuff from the grammar to the scanner usually results in significant perf benefits 😅 e.g. when I moved pragmas in, HLS parse time reduced from 420ms to 360ms)
I've also looked at memory usage but the profiling capabilities are limited with simple tools.
I only measured RSS, which varied between 19MB and 21MB for ghc/compiler before and after the rewrite.
By the looks of it, reset_lookahead doesn't reset the lookahead so much as it advances the internal state to the end of the lookahead buffer. Maybe there's a better name for this operation?
I don't have a clear idea of what "resetting lookahead" canonically means, so if you feel confident about an alternative I'm happy to change it 🙂
My interpretation is that peek0() is the analogue of env->lexer->lookahead, and resetting the offset makes the former return the same as the latter.
For my own interest, do you think it's possible for the scanner to work with only one character of lookahead?
I think it's possible that some things would need to be spread over two runs like newline lookahead, but you could probably implement it in a convoluted way.
The main motivation is to be able to write the individual fragments independently though, without having to pay attention to their order.
@wenkokke I fiddled with the test runner to get it to work for wasm, and it appears that everything is working – when I run
@414owen I'll add comments to the scanner later, so you won't have to keep guessing 😄
Force-pushed from 2396cc5 to 86fcdb1
src/scanner.c
Outdated
State *state = calloc(sizeof(State), 1);
VEC_RESIZE(&state->contexts, 8);
state->lookahead = calloc(sizeof(Lookahead), 1);
VEC_RESIZE(state->lookahead, 8);
Do we need to allocate lookahead separately? Can we not have the Lookahead struct as a (non-pointer) member of the struct State (like contexts), to avoid the triple-dereferences when peeking?
yep that works!
@amaanq can you tell me whether the scanner is ever executed concurrently? I've made the state pointer a global variable that I assign in
test/parse-lib
Outdated
@@ -38,7 +38,7 @@ then
 elif [[ "$mode" == "wasm" ]]
 then
   # Ensure tree-sitter-haskell.wasm was compiled
-  make tree-sitter-haskell.wasm -s
+  tree-sitter build-wasm
it's tree-sitter build --wasm -o ... now.
thanks. 0.22 doesn't appear to be available from npm or nixpkgs yet though
here goes nothing
So this was merged to the
@clason not sure – for now I would leave (in case it's unclear – the
Ah,
I would prefer that, but as far as I could tell there are consumers who want to have a (I've also created artifact release workflows btw.)
Yes, but does it have to be in a git branch? You can just generate the parser.c in-tree and then zip the whole thing up; the output will be identical from cloning the repo (ignoring the
my motivation was that if someone uses the plain github URL in their application to pull the repo, they'll get the generated parser as used to be the case in the past. When I researched this issue I got the impression that some projects depend on this, but if you have better insight into the ecosystem I'd be happy to scrap that workflow.
Well, I understand not wanting to make sudden breaking changes on (Unless I misunderstood and you are pushing on every commit to
no, you understood that fine. I don't like it either, but that's what I learned 🤷🏻 I saw that tree-sitter-swift had multiple requests to include the generated parser, even though they have a separate release branch! so my reasoning was that keeping
Of course you could argue that not playing that game would accelerate adoption...
You should just check in parser.c into master for the time being, and not have a "main" branch + "release" branch. In the future, we will be getting rid of the need for this, but for now, this is the way to go. I also wish you had waited till I checked this out – I'm not too familiar with Haskell, so I wouldn't be particularly useful in commenting on the grammar, but I would have liked to check the changes around everything else in this PR.
I could understand that if people wanted to pull in that branch as a git submodule (in which case 🤢)... I do think the best strategy is "keep parser.c out of repo (unless it's small), and create release artefacts for 1. the full repo tree plus the generated parser.c and 2. the generated wasm". (You certainly shouldn't commit the generated wasm to the repo...) Downstream users who can't handle generating their own parser shouldn't YOLO track master, either.
People will open issues without reading the docs no matter what you do ;) But I am not trying to convince you of anything; I just wanted to know the plan so nvim-treesitter can adapt. In this case, we won't have to do anything for a long while since we track the default branch.
well they certainly won't open any about
I would have expected you to be eager to consume the tags or artifacts instead, irrespective of the existence of a legacy branch.
We are not set up for that yet (because the ecosystem isn't) but that is definitely where we want to be going. (But that will require the release tarballs to have the exact same layout as the repo.)
oh I see. well in any case I'd assume that at some point all repos will be recreated from scratch in the
hrm, somehow rebasing broke the working tree 🤔
Not really, time is finite and the number of parsers is seemingly infinite. The org was just offering a home to a number of parsers looking for help with automated maintenance. But, yes, the upstream guidance is to omit parser.c (but not grammar.json) from the repo in favor of versioned release artifacts with generated files, and we are very much looking forward to being able to rely on the latter more widely. Upstream is actively working on adding tooling to make this easier. For now, my curiosity is satisfied that there will not be a breaking update to the master branch (which we track, and so any breaking changes there force us to act) anytime soon, so I can wait and see how things play out. If anybody jumps the gun and makes a PR switching the branch, I'll now know what's what.
I decided to redesign the scanner and parts of the grammar due to the large number of issues without an obvious fix that have accumulated over the years.
The new implementation provides a significant set of improvements, listed below.
Since the repo has become uncloneable in some situations, I would also prefer to change the workflow so that parser.c is only generated on a release branch, not for every commit, like some other grammars do.
However, since the current situation is already bad, it seems that the only way out would be to reset the history, which
would break consumers like nixpkgs who rely on being able to access older commits.
An alternative could be to use a new Github location, but that would also be awkward.
In any case, it's probably better to handle this later.
I've overhauled the tree structure a bit, mostly for higher-level nodes, since I found some parts to be lacking or badly
named.
For example, I added header/imports/declarations for the top-level structure.
Since I don't have much experience using the grammar via the tree-sitter API directly, I've launched a survey on the Haskell discourse to get some more feedback, so I can use the opportunity of introducing breaking changes to improve the grammar for users.
Not sure about the wasm artifact – is it still necessary to use that patch included in the Makefile?
I'd appreciate opinions and feedback!
Please take a look, @414owen @amaanq @wenkokke
Parses the GHC codebase!
I'm using a trimmed set of the source directories of the compiler and most core libraries in this repo.
This used to break horribly in many files because explicit brace layouts weren't supported very well.
Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken too seriously, using the test codebases in test/libs:
Old:
New:
GHC's compiler directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run test/parse-libs. I also added an interface for running hyperfine, exposed as a Nix app – execute nix run .#bench-libs -- stm mtl transformers with the desired set of libraries in test/libs or test/libs/tsh-test-ghc/libraries.
Smaller size of the shared object.
tree-sitter generate produces a haskell.so with a size of 4.4MB for the old grammar, and 3.0MB for the new one.
Smaller size of the generated parser.c: 24MB -> 14MB.
Significantly faster time to generate, and slightly faster build. On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.
All terminals now have proper text nodes when possible, like the . in modules. Fixes Include . from qualified modules and variables #102, Include ! from strictness annotations #107, Qualified/unqualified module paths colored differently #115 (partially?).
Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes Instance with associated type, following TH top level splice, misparsed as function #89, Components parsed as type when they are not #105, Incorrect parse due to top-level splices #111.
Comments aren't pulled into preceding layouts anymore. Fixes Comments following function included in function pattern #82, Incorrect parse for function with where-clause and comments #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock)
Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes Grammar defines trailing whitespace as part of lambda case statement #74.
Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes exp_section_right not parsed when containing a hash #108.
Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes Typed Template Haskell quotations / splices not handled correctly #116.
Explicit brace layouts are now handled correctly. Fixes Misparse of explicit-braced code #92.
Function application with multiple block arguments is handled correctly.
Example:
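The example itself did not survive extraction; a construct of the kind being described (my own illustration, assuming GHC's BlockArguments-style syntax; seq2 is a made-up helper) is a function applied to more than one do block:

```haskell
{-# LANGUAGE BlockArguments #-}

-- `seq2` receives two `do` blocks as separate arguments; with BlockArguments
-- this application parses as `seq2 (do ...) (do ...)`.
seq2 :: IO () -> IO () -> IO ()
seq2 a b = a >> b

main :: IO ()
main = seq2
  do putStrLn "first"
  do putStrLn "second"
```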
Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection.
Haddock comments have dedicated nodes now.
Use named precedences instead of closely replicating the GHC parser's productions.
Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout.
Fixed a CPP bug where a mid-line #endif would be a false positive.
CPP only matches legal directives now.
Generally more lenient parsing than GHC, and in the presence of errors:
if/then
List comprehensions can have multiple sets of qualifiers (ParallelListComp).
Deriving clauses after GADTs don't require layout anymore.
Newtype instance heads are working properly now.
Escaping newlines in CPP works now.
One remaining issue is that qualified left sections that contain infix ops are broken: (a + a A.+). Edit: Solved this by implementing qualified operator lookahead.
I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.
The repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.