Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using tree-sitter queries for language indents instead of TOML #114

Closed
hovsater opened this issue Jun 4, 2021 · 14 comments · Fixed by #1562
Closed

Using tree-sitter queries for language indents instead of TOML #114

hovsater opened this issue Jun 4, 2021 · 14 comments · Fixed by #1562
Labels
A-tree-sitter Area: Tree-sitter C-discussion Category: Discussion or questions that doesn't represent real issues C-enhancement Category: Improvements

Comments

@hovsater
Copy link
Contributor

hovsater commented Jun 4, 2021

neovim solves indentation using indent queries, see Rust's indents.scm for an example. Given we're already leveraging tree-sitter for syntax highlighting, wouldn't it make sense to use it for indentation as well?

A big benefit of doing so, would be the ability to re-use already written indents.scm queries written by the nvim-treesitter project.

@hovsater hovsater changed the title Is there a reason to why we use TOML for indents instead of another tree-sitter query? Using tree-sitter queries for language indents instead of TOML Jun 4, 2021
@archseer
Copy link
Member

archseer commented Jun 5, 2021

I initially tried doing that, but there's some tradeoffs:

We need to traverse the nodes from the innermost one up to the root and check whether each node is considered an indentation level. tree-sitter queries can take a range to scan, but since we go all the way up to root that usually means querying the whole document.

We're only able to get an iterator over all the matches on the QueryCursor so we can't traverse the tree directly in the same way we can just do it manually. We will also see nodes that could sit in a different subtree than the one we're looking at. nvim-treesitter circumvents this by holding a map of node ids, with true if the node matched, then still manually traverse the tree and check if the nodes are on that list. However, this will include all the nodes in the document, and there might be hundreds of them! Usually we only need to traverse a couple.

I think the behavior also might not fully line up with nvim-treesitter's implementation, so while their queries are a good start they might require some tweaking.

I tried looking at the parsed Query itself to see whether we could pull the scopes out, but that wasn't possible.

I considered using the .scm syntax and just implementing a dumb parser that looks for a [ ... ] @indent, but I figured it would be confusing if users tried adding more complex queries and it broke.

@pickfire
Copy link
Contributor

pickfire commented Jun 5, 2021

We need to traverse the nodes from the innermost one up to the root and check whether each node is considered an indentation level. tree-sitter queries can take a range to scan, but since we go all the way up to root that usually means querying the whole document.

What are the performance implication of that. How large of a code file will it encountered an issue? Maybe it is insignificant, how much memory does it take?

@archseer
Copy link
Member

archseer commented Jun 6, 2021

If doing this on a large file, there will be a lot of nodes. That's why tree-sitter exposes the iterator, so it's not rendered into memory at once. We could do some measurements though.

@pickfire
Copy link
Contributor

pickfire commented Jun 6, 2021

Yeah, then we use the iterator. Will that be an issue?

@hovsater
Copy link
Contributor Author

hovsater commented Jun 6, 2021

I think it's worth experimenting with at least. It's a major advantage being able to use the already written indents.scm. Perhaps some benchmarks would be in order? 🙂

@cessen
Copy link
Contributor

cessen commented Jun 11, 2021

A cheaper way to detect indentation than traversing tree-sitter is to write a bit of language-agnostic heuristic code. In Led I do this:
https://github.com/cessen/led/blob/27572c8838a1c664ee378a19358604063881cc1d/src/editor/mod.rs#L164-L241

The basic idea is to track differences in leading whitespace between adjacent lines, and build a histogram of the increases (not decreases). And you essentially just pick the most numerous item from the histogram as your indentation.

My implementation is pretty naive, and there's still a lot that can be improved about it (e.g. skipping blank lines, maybe slightly favoring more common indentations, etc.). But it works reliably even as-is, and I've basically had zero issues with it.

At the very least, something like this as a fallback for languages that don't have a tree-sitter implementation would be nice.

@pickfire
Copy link
Contributor

In kakoune https://github.com/mawww/kakoune/blob/master/rc/filetype/rust.kak#L118 is done, although it can handle most cases, there are a lot of cases still it cannot be handled, especially when the long lines are wrapped. I think a tree-sitter based approach could solve this issue.

@archseer
Copy link
Member

Yeah I think a general purpose fallback would be useful. I've left a placeholder in the code:

} else {
// TODO: heuristics for non-tree sitter grammars
0
}

@cessen
Copy link
Contributor

cessen commented Jun 12, 2021

@archseer

Yeah I think a general purpose fallback would be useful. I've left a placeholder in the code:

Awesome! Mind if I take a crack at implementing that bit? Since I already implemented it in Led, it should be pretty quick.

@cessen
Copy link
Contributor

cessen commented Jun 12, 2021

Oh, looking at the actual code, that's not what I thought it was. I'm talking about detecting the indentation style of files on load (e.g. tabs vs spaces, and if spaces how many), which then get stored as a (user-modifiable) setting in the buffer.

Does Helix have anything like that, or is something like that planned?
I'll open a separate issue.

(Edit: opened #239)

@kirawi kirawi added A-tree-sitter Area: Tree-sitter C-discussion Category: Discussion or questions that doesn't represent real issues C-enhancement Category: Improvements labels Aug 22, 2021
@VarLad
Copy link
Contributor

VarLad commented Aug 28, 2021

Any update on this?

This would be really convenient, since a lot of languages would get indent and highlight features

@archseer
Copy link
Member

@VarLad

We already use tree-sitter queries for highlights (check runtime/queries) locals and injections, this issue is specifically about the indents being toml instead of scm.

I explained my reasoning above: #114 (comment)

Even if the queries were in .scm format they'd still be a separate set of queries from highlights.scm because they query for slightly different sets of nodes.

@kirawi kirawi mentioned this issue Nov 6, 2021
@kirawi
Copy link
Member

kirawi commented Nov 6, 2021

A cheaper way to detect indentation than traversing tree-sitter is to write a bit of language-agnostic heuristic code. In Led I do this: https://github.com/cessen/led/blob/27572c8838a1c664ee378a19358604063881cc1d/src/editor/mod.rs#L164-L241

The basic idea is to track differences in leading whitespace between adjacent lines, and build a histogram of the increases (not decreases). And you essentially just pick the most numerous item from the histogram as your indentation.

My implementation is pretty naive, and there's still a lot that can be improved about it (e.g. skipping blank lines, maybe slightly favoring more common indentations, etc.). But it works reliably even as-is, and I've basically had zero issues with it.

At the very least, something like this as a fallback for languages that don't have a tree-sitter implementation would be nice.

This could probably be converted into a separate issue.

@hovsater
Copy link
Contributor Author

@archseer I believe this can be closed, right? I don't think we will pursue this further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tree-sitter Area: Tree-sitter C-discussion Category: Discussion or questions that doesn't represent real issues C-enhancement Category: Improvements
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants