Using tree-sitter queries for language indents instead of TOML #114

hovsater · 2021-06-04T16:23:50Z

neovim solves indentation using indent queries, see Rust's indents.scm for an example. Given we're already leveraging tree-sitter for syntax highlighting, wouldn't it make sense to use it for indentation as well?

A big benefit of doing so, would be the ability to re-use already written indents.scm queries written by the nvim-treesitter project.

The text was updated successfully, but these errors were encountered:

archseer · 2021-06-05T04:09:54Z

I initially tried doing that, but there's some tradeoffs:

We need to traverse the nodes from the innermost one up to the root and check whether each node is considered an indentation level. tree-sitter queries can take a range to scan, but since we go all the way up to root that usually means querying the whole document.

We're only able to get an iterator over all the matches on the QueryCursor so we can't traverse the tree directly in the same way we can just do it manually. We will also see nodes that could sit in a different subtree than the one we're looking at. nvim-treesitter circumvents this by holding a map of node ids, with true if the node matched, then still manually traverse the tree and check if the nodes are on that list. However, this will include all the nodes in the document, and there might be hundreds of them! Usually we only need to traverse a couple.

I think the behavior also might not fully line up with nvim-treesitter's implementation, so while their queries are a good start they might require some tweaking.

I tried looking at the parsed Query itself to see whether we could pull the scopes out, but that wasn't possible.

I considered using the .scm syntax and just implementing a dumb parser that looks for a [ ... ] @indent, but I figured it would be confusing if users tried adding more complex queries and it broke.

pickfire · 2021-06-05T12:09:24Z

We need to traverse the nodes from the innermost one up to the root and check whether each node is considered an indentation level. tree-sitter queries can take a range to scan, but since we go all the way up to root that usually means querying the whole document.

What are the performance implication of that. How large of a code file will it encountered an issue? Maybe it is insignificant, how much memory does it take?

archseer · 2021-06-06T02:17:49Z

If doing this on a large file, there will be a lot of nodes. That's why tree-sitter exposes the iterator, so it's not rendered into memory at once. We could do some measurements though.

pickfire · 2021-06-06T05:23:00Z

Yeah, then we use the iterator. Will that be an issue?

hovsater · 2021-06-06T06:25:02Z

I think it's worth experimenting with at least. It's a major advantage being able to use the already written indents.scm. Perhaps some benchmarks would be in order? 🙂

cessen · 2021-06-11T16:42:20Z

A cheaper way to detect indentation than traversing tree-sitter is to write a bit of language-agnostic heuristic code. In Led I do this:
https://github.com/cessen/led/blob/27572c8838a1c664ee378a19358604063881cc1d/src/editor/mod.rs#L164-L241

The basic idea is to track differences in leading whitespace between adjacent lines, and build a histogram of the increases (not decreases). And you essentially just pick the most numerous item from the histogram as your indentation.

My implementation is pretty naive, and there's still a lot that can be improved about it (e.g. skipping blank lines, maybe slightly favoring more common indentations, etc.). But it works reliably even as-is, and I've basically had zero issues with it.

At the very least, something like this as a fallback for languages that don't have a tree-sitter implementation would be nice.

pickfire · 2021-06-12T06:52:36Z

In kakoune https://github.com/mawww/kakoune/blob/master/rc/filetype/rust.kak#L118 is done, although it can handle most cases, there are a lot of cases still it cannot be handled, especially when the long lines are wrapped. I think a tree-sitter based approach could solve this issue.

archseer · 2021-06-12T07:10:02Z

Yeah I think a general purpose fallback would be useful. I've left a placeholder in the code:

helix/helix-core/src/indent.rs

Lines 144 to 147 in c754df1

    
           } else { 
        
               // TODO: heuristics for non-tree sitter grammars 
        
               0 
        
           }

cessen · 2021-06-12T16:02:45Z

@archseer

Yeah I think a general purpose fallback would be useful. I've left a placeholder in the code:

Awesome! Mind if I take a crack at implementing that bit? Since I already implemented it in Led, it should be pretty quick.

cessen · 2021-06-12T18:26:45Z

Oh, looking at the actual code, that's not what I thought it was. I'm talking about detecting the indentation style of files on load (e.g. tabs vs spaces, and if spaces how many), which then get stored as a (user-modifiable) setting in the buffer.

~~Does Helix have anything like that, or is something like that planned?~~
I'll open a separate issue.

(Edit: opened #239)

VarLad · 2021-08-28T12:47:33Z

Any update on this?

This would be really convenient, since a lot of languages would get indent and highlight features

archseer · 2021-08-28T12:52:56Z

@VarLad

We already use tree-sitter queries for highlights (check runtime/queries) locals and injections, this issue is specifically about the indents being toml instead of scm.

I explained my reasoning above: #114 (comment)

Even if the queries were in .scm format they'd still be a separate set of queries from highlights.scm because they query for slightly different sets of nodes.

kirawi · 2021-11-06T21:45:15Z

A cheaper way to detect indentation than traversing tree-sitter is to write a bit of language-agnostic heuristic code. In Led I do this: https://github.com/cessen/led/blob/27572c8838a1c664ee378a19358604063881cc1d/src/editor/mod.rs#L164-L241

The basic idea is to track differences in leading whitespace between adjacent lines, and build a histogram of the increases (not decreases). And you essentially just pick the most numerous item from the histogram as your indentation.

My implementation is pretty naive, and there's still a lot that can be improved about it (e.g. skipping blank lines, maybe slightly favoring more common indentations, etc.). But it works reliably even as-is, and I've basically had zero issues with it.

At the very least, something like this as a fallback for languages that don't have a tree-sitter implementation would be nice.

This could probably be converted into a separate issue.

hovsater · 2022-01-15T15:11:40Z

@archseer I believe this can be closed, right? I don't think we will pursue this further.

hovsater changed the title ~~Is there a reason to why we use TOML for indents instead of another tree-sitter query?~~ Using tree-sitter queries for language indents instead of TOML Jun 4, 2021

kirawi added A-tree-sitter Area: Tree-sitter C-discussion Category: Discussion or questions that doesn't represent real issues C-enhancement Category: Improvements labels Aug 22, 2021

kirawi mentioned this issue Nov 6, 2021

glsl support #993

Merged

Triton171 mentioned this issue Feb 10, 2022

Indentation rework #1562

Merged

pokey mentioned this issue Mar 23, 2022

Change scopeType matchers to rely on tree-sitter style scheme queries cursorless-dev/cursorless#616

Open

archseer closed this as completed in #1562 Mar 30, 2022

valpackett mentioned this issue Apr 29, 2022

fix(docs): cleanup obsolete indents.toml mentions #2327

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using tree-sitter queries for language indents instead of TOML #114

Using tree-sitter queries for language indents instead of TOML #114

hovsater commented Jun 4, 2021 •

edited

Loading

archseer commented Jun 5, 2021 •

edited

Loading

pickfire commented Jun 5, 2021

archseer commented Jun 6, 2021

pickfire commented Jun 6, 2021

hovsater commented Jun 6, 2021

cessen commented Jun 11, 2021

pickfire commented Jun 12, 2021

archseer commented Jun 12, 2021

cessen commented Jun 12, 2021

cessen commented Jun 12, 2021 •

edited

Loading

VarLad commented Aug 28, 2021

archseer commented Aug 28, 2021

kirawi commented Nov 6, 2021

hovsater commented Jan 15, 2022

Using tree-sitter queries for language indents instead of TOML #114

Using tree-sitter queries for language indents instead of TOML #114

Comments

hovsater commented Jun 4, 2021 • edited Loading

archseer commented Jun 5, 2021 • edited Loading

pickfire commented Jun 5, 2021

archseer commented Jun 6, 2021

pickfire commented Jun 6, 2021

hovsater commented Jun 6, 2021

cessen commented Jun 11, 2021

pickfire commented Jun 12, 2021

archseer commented Jun 12, 2021

cessen commented Jun 12, 2021

cessen commented Jun 12, 2021 • edited Loading

VarLad commented Aug 28, 2021

archseer commented Aug 28, 2021

kirawi commented Nov 6, 2021

hovsater commented Jan 15, 2022

hovsater commented Jun 4, 2021 •

edited

Loading

archseer commented Jun 5, 2021 •

edited

Loading

cessen commented Jun 12, 2021 •

edited

Loading