Bump tree-sitter-r 4 - Syntax vs Semantic Diagnostics #523

DavisVaughan · 2024-09-13T15:58:57Z

This PR also contains #529 which was merged into it

Pulls in a whopping 34 more commits from tree-sitter-r r-lib/tree-sitter-r@63ee9b1...99bf614. It's really only 3 main changes. We pull in all 3 at once because they are a bit intermingled with each other.

Remove unmatched_delimiter r-lib/tree-sitter-r#92
- Improves tree-sitter's error recovery by better allowing it to "recover" missing ), }, and ] tokens
Add open and close fields everywhere appropriate r-lib/tree-sitter-r#119
- Helpful for targeting by field name
Remove optional() fields where possible r-lib/tree-sitter-r#132
- Improves tree-sitter's error recovery by removing a massive amount of grammar ambiguity / states

With these 3 changes, I was able to greatly improve our diagnostics engine.

It has been split into two parts - a syntax path, and a semantic path:

diagnostics.rs holds the entrypoint and the semantic path
diagnostics_syntactic.rs holds the syntax path

In a future PR I'll do a mostly pure "rearrangement" PR to clean this structure up a bit more. I haven't done that yet to make it clear what has been moved out of diagnostics.rs.

Syntax diagnostics

Diagnostics based purely on ERROR and MISSING nodes in the tree-sitter AST.

MISSING nodes give us a nice way of detecting missing closing delimiters like } or ) and seems to work nicely
ERROR nodes allow us to report syntax errors in general. We don't get that much info from tree-sitter about exactly what went wrong, so I just report Syntax error most of the time.

A few more improvements in this realm:

If a syntax error spans > 20 lines, it is now truncated to only show a squiggle on the start_point of the range, and it states that this is a Syntax error. Starts here and ends on row {row}. This helps us avoid overwhelming the user in the (very few!) cases where this can still happen. In my experience, mostly when they have an unmatched string delimiter like a stray " that causes everything after it to look like a weird unclosed string.
ERROR nodes are no longer "recursively" reported. i.e. it is very common for an ERROR node in tree-sitter to have children that are also ERROR nodes. Like what you see below. Previously we actually reported both ERRORs! This was particularly problematic because those outer ERROR nodes often span a huge number of lines, while the inner ERROR nodes can be very precise and target the exact problem quite nicely. We now only report what I call "terminal" ERROR nodes that don't have any ERROR children. This greatly improved that "aggressive diagnostic" issue where the whole file would light up in squiggles.

#> ── Text ────────────────────────────────────────────────────────────────────────
#> 1 + }
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (program [(0, 0), (0, 5)]
#>   (float [(0, 0), (0, 1)])
#>   (ERROR [(0, 2), (0, 5)]
#>     "+" [(0, 2), (0, 3)]
#>     (ERROR [(0, 4), (0, 5)])
#>   )
#> )

Semantic diagnostics

Semantic diagnostics are now run in a separate path from syntax diagnostics, this has the following really nice benefit - we only run semantic diagnostics on top level expressions (i.e. children of root) that node.has_error() returns false for. In other words, we only consider running semantic diagnostics down a section of the tree if we know that section of the tree does not contain any syntax errors.

This actually works quite nicely in practice.

Top level expressions with no syntax issues get semantic diagnostics
- And, importantly, all the code in the semantic path can be written knowing there aren't any potential ERROR nodes to be wary of
Top level expressions with syntax issues get syntax diagnostics
- Once the user fixes the syntax issues, they get semantic diagnostics for that fixed chunk too
The whole file still gets a mix of both semantic and syntax diagnostics, depending on what could be parsed

Improvement examples

This shows the "truncation" idea once a syntax error spans >20 rows

Screen.Recording.2024-09-13.at.2.37.24.PM.mov

This shows improvements in the example in posit-dev/positron#2943, which used to light up the whole file

Screen.Recording.2024-09-13.at.2.39.33.PM.mov

This is a tree-sitter-r test file, with many syntax errors. Note that 1) it doesn't light up the whole file and 2) it still shows some semantic issues too (symbol not found errors)

Screen.Recording.2024-09-13.at.2.48.23.PM.mov

Improvements on the example from posit-dev/positron#4177. We often target the missing opening/closing node now. At the very end it shows that } doesn't have a matching opening {, and I do think that is still a technically correct syntax error message, even if we'd like to show the unmatched ( (Positron does at least highlight that unmatched ( in red here, which is nice)

Screen.Recording.2024-09-13.at.2.51.09.PM.mov

crates/ark/src/lsp/snapshots/indent.R

DavisVaughan · 2024-09-13T18:58:49Z

crates/ark/src/lsp/snapshots/indent.R

-else
-  if (condition2) {


Like, this is not a valid top level if statement

DavisVaughan · 2024-09-13T18:59:30Z

crates/ark/src/lsp/snapshots/indent.R

+{
+  if (cond1)


I carefully retained all the existing indentation here, I just bumped it out by 1 ident with the addition of the braces to make it valid R code

crates/ark/src/lsp/diagnostics.rs

DavisVaughan · 2024-09-13T19:01:47Z

crates/ark/src/lsp/diagnostics.rs

+// TODO: This should be a syntax check, as the grammar should not allow
+// two `Argument` nodes side by side
+fn check_call_like_next_sibling(


r-lib/tree-sitter-r#140

DavisVaughan · 2024-09-13T19:03:11Z

crates/ark/src/lsp/diagnostics_syntactic.rs

+// When we hit an `ERROR` node, i.e. a syntax error, it often has its own children
+// which can also be `ERROR`s. The goal is to target the deepest (most precise) `ERROR`
+// nodes and only report syntax errors for those.


Special handling of ERROR nodes that may have ERROR children is here

lionel-

Great work! It's nice to see we can get this far with the TS parser.

My main comment is about potentially excessive tree traversals in the syntactic pass, I think we could do better and avoid some unnecessary passes.

crates/ark/src/treesitter.rs

lionel- · 2024-09-20T07:40:41Z

crates/ark/src/treesitter.rs

+pub(crate) fn node_find_string<'a>(node: &'a Node) -> Option<Node<'a>> {
+    if node.is_string() {
+        // Already on a string
+        return Some(*node);
+    }
+    // If we are on one of the following, we return the string parent:
+    // - Anonymous node inside a string, like `"'"`
+    // - `NodeType::StringContent`
+    // - `NodeType::EscapeSequence`
+    node.find_parent(|parent| parent.is_string())
+}
+


This could just use .ancestors() since the starting node is inclusive.

Do we want .strict_ancestors() for the non-inclusive case? I would expect .ancestors() to be exclusive but we probably should follow rust-analyzer here.

Oh hmm .ancestors() not being exclusive is surprising to me, I was not expecting node to be an "ancestor" of itself.

The unwrapped call makes more sense since it explicitly takes first, making it slightly less surprising when you also get first back on the first iteration

std::iter::successors(Some(*self), |p| p.parent())

crates/ark/src/lsp/completions/sources/unique/string.rs

crates/ark/src/lsp/diagnostics.rs

crates/ark/src/lsp/diagnostics_syntactic.rs

To get a combination of fixes that work nicely together

Can often occur when you have unmatched string delimiters, which there isn't much else we can do about

Otherwise we get a diagnostic about dplyr not being installed

Otherwise example 34 now indents out one level

Bump tree-sitter-r 5 - Leading newline skipping, tree-sitter 0.23.0 update

DavisVaughan marked this pull request as ready for review September 13, 2024 18:56

DavisVaughan commented Sep 13, 2024

View reviewed changes

DavisVaughan changed the title ~~Bump tree-sitter 4 - Syntax vs Semantic Diagnostics~~ Bump tree-sitter-r 4 - Syntax vs Semantic Diagnostics Sep 13, 2024

DavisVaughan force-pushed the feature/bump-tree-sitter-3 branch from d1ba3a4 to fbf2300 Compare September 16, 2024 17:55

DavisVaughan requested a review from lionel- September 18, 2024 18:09

lionel- reviewed Sep 20, 2024

View reviewed changes

DavisVaughan mentioned this pull request Sep 20, 2024

Bump tree-sitter-r 3 - New "open"/"close" fields, can use node.is_missing(), no more ..i diagnostic #517

Merged

DavisVaughan force-pushed the feature/bump-tree-sitter-3 branch from fbf2300 to 8d55790 Compare September 20, 2024 13:24

Base automatically changed from feature/bump-tree-sitter-3 to main September 20, 2024 13:28

DavisVaughan added 14 commits September 20, 2024 09:35

Fix invalid R code in indent.R causing issues

f5a7f57

Embrace { } illegal if/else statements in indent.R

b5fd52e

Bump tree-sitter-r

735f399

Remove UnmatchedDelimiter node

9a8dba2

First pass at diagnostics update

15e99ec

Bump tree-sitter-r again

4689964

To get a combination of fixes that work nicely together

Split syntax and semantic diagnostics

128c052

Tweak some message formatting

fe548d3

Allow semantic diagnostics on subtrees with no errors

8263fc1

Add customized syntax error for unmatched closing token

d3866a4

Don't need a mutable context

0aebbfd

More on not needing a mutable context

261dc81

Add syntax error range truncation to avoid overwhelming the user

e3ad712

Can often occur when you have unmatched string delimiters, which there isn't much else we can do about

Ha, use a package that will always be installed

a1d9544

Otherwise we get a diagnostic about dplyr not being installed

DavisVaughan force-pushed the feature/bump-tree-sitter-4 branch from 90138d5 to a1d9544 Compare September 20, 2024 13:38

DavisVaughan added 6 commits September 20, 2024 09:57

Simplify node_find_string() a little

40317f1

Remove build_ prefix

15c93d8

Tweak message

0a08eb2

Bump to latest tree-sitter-r, includes tree-sitter 0.23.0

130a34b

Adapt to new tree-sitter LANGUAGE conventions

1a09f77

Update to LANGUAGE in tests too

ab6ea88

DavisVaughan and others added 13 commits September 23, 2024 10:31

Add test for previous (0, 0) program node issue

7b2bc6d

Fix more invalid R syntax in indent.R

49fc325

Otherwise example 34 now indents out one level

Custom panic message in statement_range_test()

f09ce90

Bump tree-sitter-r one more time for r-lib/tree-sitter-r#145

c2907d4

Fix tests related to (0, 0) program node issue

ff95d91

Adapt diagnostics tests to tree-sitter-r grammar changes

119e165

Bump tree-sitter-r once again

0952232

Merge pull request #529 from posit-dev/feature/bump-tree-sitter-5

610b409

Bump tree-sitter-r 5 - Leading newline skipping, tree-sitter 0.23.0 update

Rename syntactic -> syntax

6a31dff

Remove TODOs in indent.R

b6d4972

Token -> Delimiter

7258cbc

Switch to snapshot testing for diagnostics

d8ff18c

Use simpler recursion scheme for only reporting terminal ERRORs

1006e79

DavisVaughan merged commit 0223ef4 into main Sep 23, 2024
3 checks passed

DavisVaughan deleted the feature/bump-tree-sitter-4 branch September 23, 2024 17:49

github-actions bot locked and limited conversation to collaborators Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bump tree-sitter-r 4 - Syntax vs Semantic Diagnostics #523

Bump tree-sitter-r 4 - Syntax vs Semantic Diagnostics #523

DavisVaughan commented Sep 13, 2024 •

edited

Loading

DavisVaughan Sep 13, 2024

DavisVaughan Sep 13, 2024

DavisVaughan Sep 13, 2024

DavisVaughan Sep 13, 2024

lionel- left a comment

lionel- Sep 20, 2024

DavisVaughan Sep 20, 2024

Bump tree-sitter-r 4 - Syntax vs Semantic Diagnostics #523

Bump tree-sitter-r 4 - Syntax vs Semantic Diagnostics #523

Conversation

DavisVaughan commented Sep 13, 2024 • edited Loading

Syntax diagnostics

Semantic diagnostics

Improvement examples

DavisVaughan Sep 13, 2024

Choose a reason for hiding this comment

DavisVaughan Sep 13, 2024

Choose a reason for hiding this comment

DavisVaughan Sep 13, 2024

Choose a reason for hiding this comment

DavisVaughan Sep 13, 2024

Choose a reason for hiding this comment

lionel- left a comment

Choose a reason for hiding this comment

lionel- Sep 20, 2024

Choose a reason for hiding this comment

DavisVaughan Sep 20, 2024

Choose a reason for hiding this comment

DavisVaughan commented Sep 13, 2024 •

edited

Loading