Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump tree-sitter-r 4 - Syntax vs Semantic Diagnostics #523

Merged
merged 33 commits into from
Sep 23, 2024

Conversation

DavisVaughan
Copy link
Contributor

@DavisVaughan DavisVaughan commented Sep 13, 2024

Addresses posit-dev/positron#2943

This PR also contains #529 which was merged into it

Pulls in a whopping 34 more commits from tree-sitter-r r-lib/tree-sitter-r@63ee9b1...99bf614. It's really only 3 main changes. We pull in all 3 at once because they are a bit intermingled with each other.

With these 3 changes, I was able to greatly improve our diagnostics engine.

It has been split into two parts - a syntax path, and a semantic path:

  • diagnostics.rs holds the entrypoint and the semantic path
  • diagnostics_syntactic.rs holds the syntax path

In a future PR I'll do a mostly pure "rearrangement" PR to clean this structure up a bit more. I haven't done that yet to make it clear what has been moved out of diagnostics.rs.

Syntax diagnostics

Diagnostics based purely on ERROR and MISSING nodes in the tree-sitter AST.

  • MISSING nodes give us a nice way of detecting missing closing delimiters like } or ) and seems to work nicely

  • ERROR nodes allow us to report syntax errors in general. We don't get that much info from tree-sitter about exactly what went wrong, so I just report Syntax error most of the time.

A few more improvements in this realm:

  • If a syntax error spans > 20 lines, it is now truncated to only show a squiggle on the start_point of the range, and it states that this is a Syntax error. Starts here and ends on row {row}. This helps us avoid overwhelming the user in the (very few!) cases where this can still happen. In my experience, mostly when they have an unmatched string delimiter like a stray " that causes everything after it to look like a weird unclosed string.

  • ERROR nodes are no longer "recursively" reported. i.e. it is very common for an ERROR node in tree-sitter to have children that are also ERROR nodes. Like what you see below. Previously we actually reported both ERRORs! This was particularly problematic because those outer ERROR nodes often span a huge number of lines, while the inner ERROR nodes can be very precise and target the exact problem quite nicely. We now only report what I call "terminal" ERROR nodes that don't have any ERROR children. This greatly improved that "aggressive diagnostic" issue where the whole file would light up in squiggles.

#> ── Text ────────────────────────────────────────────────────────────────────────
#> 1 + }
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (program [(0, 0), (0, 5)]
#>   (float [(0, 0), (0, 1)])
#>   (ERROR [(0, 2), (0, 5)]
#>     "+" [(0, 2), (0, 3)]
#>     (ERROR [(0, 4), (0, 5)])
#>   )
#> )

Semantic diagnostics

Semantic diagnostics are now run in a separate path from syntax diagnostics, this has the following really nice benefit - we only run semantic diagnostics on top level expressions (i.e. children of root) that node.has_error() returns false for. In other words, we only consider running semantic diagnostics down a section of the tree if we know that section of the tree does not contain any syntax errors.

This actually works quite nicely in practice.

  • Top level expressions with no syntax issues get semantic diagnostics
    • And, importantly, all the code in the semantic path can be written knowing there aren't any potential ERROR nodes to be wary of
  • Top level expressions with syntax issues get syntax diagnostics
    • Once the user fixes the syntax issues, they get semantic diagnostics for that fixed chunk too
  • The whole file still gets a mix of both semantic and syntax diagnostics, depending on what could be parsed

Improvement examples

This shows the "truncation" idea once a syntax error spans >20 rows

Screen.Recording.2024-09-13.at.2.37.24.PM.mov

This shows improvements in the example in posit-dev/positron#2943, which used to light up the whole file

Screen.Recording.2024-09-13.at.2.39.33.PM.mov

This is a tree-sitter-r test file, with many syntax errors. Note that 1) it doesn't light up the whole file and 2) it still shows some semantic issues too (symbol not found errors)

Screen.Recording.2024-09-13.at.2.48.23.PM.mov

Improvements on the example from posit-dev/positron#4177. We often target the missing opening/closing node now. At the very end it shows that } doesn't have a matching opening {, and I do think that is still a technically correct syntax error message, even if we'd like to show the unmatched ( (Positron does at least highlight that unmatched ( in red here, which is nice)

Screen.Recording.2024-09-13.at.2.51.09.PM.mov

@DavisVaughan DavisVaughan marked this pull request as ready for review September 13, 2024 18:56
crates/ark/src/lsp/snapshots/indent.R Show resolved Hide resolved
Comment on lines -700 to -701
else
if (condition2) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like, this is not a valid top level if statement

Comment on lines +735 to +736
{
if (cond1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I carefully retained all the existing indentation here, I just bumped it out by 1 ident with the addition of the braces to make it valid R code

crates/ark/src/lsp/diagnostics.rs Show resolved Hide resolved
crates/ark/src/lsp/diagnostics.rs Show resolved Hide resolved
crates/ark/src/lsp/diagnostics.rs Outdated Show resolved Hide resolved
Comment on lines +595 to +602
// TODO: This should be a syntax check, as the grammar should not allow
// two `Argument` nodes side by side
fn check_call_like_next_sibling(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 65 to 67
// When we hit an `ERROR` node, i.e. a syntax error, it often has its own children
// which can also be `ERROR`s. The goal is to target the deepest (most precise) `ERROR`
// nodes and only report syntax errors for those.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Special handling of ERROR nodes that may have ERROR children is here

@DavisVaughan DavisVaughan changed the title Bump tree-sitter 4 - Syntax vs Semantic Diagnostics Bump tree-sitter-r 4 - Syntax vs Semantic Diagnostics Sep 13, 2024
@DavisVaughan DavisVaughan force-pushed the feature/bump-tree-sitter-3 branch from d1ba3a4 to fbf2300 Compare September 16, 2024 17:55
Copy link
Contributor

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! It's nice to see we can get this far with the TS parser.

My main comment is about potentially excessive tree traversals in the syntactic pass, I think we could do better and avoid some unnecessary passes.

crates/ark/src/treesitter.rs Outdated Show resolved Hide resolved
Comment on lines 417 to 418
pub(crate) fn node_find_string<'a>(node: &'a Node) -> Option<Node<'a>> {
if node.is_string() {
// Already on a string
return Some(*node);
}
// If we are on one of the following, we return the string parent:
// - Anonymous node inside a string, like `"'"`
// - `NodeType::StringContent`
// - `NodeType::EscapeSequence`
node.find_parent(|parent| parent.is_string())
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just use .ancestors() since the starting node is inclusive.

Do we want .strict_ancestors() for the non-inclusive case? I would expect .ancestors() to be exclusive but we probably should follow rust-analyzer here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh hmm .ancestors() not being exclusive is surprising to me, I was not expecting node to be an "ancestor" of itself.

The unwrapped call makes more sense since it explicitly takes first, making it slightly less surprising when you also get first back on the first iteration

std::iter::successors(Some(*self), |p| p.parent())

crates/ark/src/lsp/completions/sources/unique/string.rs Outdated Show resolved Hide resolved
crates/ark/src/lsp/diagnostics.rs Show resolved Hide resolved
crates/ark/src/lsp/diagnostics_syntactic.rs Outdated Show resolved Hide resolved
crates/ark/src/lsp/diagnostics_syntactic.rs Outdated Show resolved Hide resolved
crates/ark/src/lsp/diagnostics_syntactic.rs Outdated Show resolved Hide resolved
crates/ark/src/lsp/diagnostics_syntactic.rs Outdated Show resolved Hide resolved
@DavisVaughan DavisVaughan force-pushed the feature/bump-tree-sitter-3 branch from fbf2300 to 8d55790 Compare September 20, 2024 13:24
Base automatically changed from feature/bump-tree-sitter-3 to main September 20, 2024 13:28
@DavisVaughan DavisVaughan force-pushed the feature/bump-tree-sitter-4 branch from 90138d5 to a1d9544 Compare September 20, 2024 13:38
@DavisVaughan DavisVaughan merged commit 0223ef4 into main Sep 23, 2024
3 checks passed
@DavisVaughan DavisVaughan deleted the feature/bump-tree-sitter-4 branch September 23, 2024 17:49
@github-actions github-actions bot locked and limited conversation to collaborators Sep 23, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants