Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use a script instead of submodules for grammars #1560

Closed
wants to merge 20 commits into from

Conversation

the-mikedavis
Copy link
Member

@the-mikedavis the-mikedavis commented Jan 22, 2022

closes #1549

there's still a bunch to do before this is ready:

  • grammars.cmd for windows
  • replace any existing documentation for fetching and adding new grammars
  • fall back to a regular clone when the remote doesn't support cloning at exact revisions (I believe most/all popular hosted git remotes support fetch-at-revision these days)
  • a check/status command that checks if you're currently synced
  • a fetch command that downloads a new grammar repository (maybe this isn't necessary, you can always just git clone)
  • a save command that updates revisions.txt with your current local grammar set (useful for adding new grammars) (eh, I think it makes sense to download new grammars using the revisions.txt file declaratively)

But the results are already pretty nice. By taking advantage of shallow fetches at exact revisions, we get down to ~30s to fetch all grammars and ~1.5s to re-run sync (idempotently). For perspective:

  • a full recursive clone of helix currently takes ~660s (11 minutes)
  • git submodules update --init takes ~250s (a little over 4 minutes)
  • git submodule update --init --depth 1 takes ~80s

Ideally I'd like this script to allow one to hack on a grammar as they're integrating it, so it's meant to not care about other remotes and to fail when checking out a revision that would overwrite local modifications.

@Omnikar
Copy link
Contributor

Omnikar commented Jan 22, 2022

You'll also need to change the documentation that mentions the submodule system, such as README.md and the adding languages docs.

@archseer
Copy link
Member

For perspective, a regular recursive clone of helix currently takes me ~660s (11 minutes) on a very good connection and the first run of git submodules update --init takes ~250s (a little over 4 minutes).

Note: You can use git submodule update --init --depth 1 to do a shallow clone there too (or git clone --shallow-submodules -j8 https://github.com/helix-editor/helix as per README). I do agree that the new scripts will be more flexible though

scripts/revisions.txt Outdated Show resolved Hide resolved
@archseer
Copy link
Member

archseer commented Jan 23, 2022

Bonus points:

  • convert the helix-syntax/build.rs into a cargo xtask action
  • move the helix-syntax/src/lib.rs into helix-core/src/syntax.rs and delete the helix-syntax crate

This would make it easier to build specific grammars and avoid issues with helix-syntax cache where one needs to cargo clean -p helix-syntax right now to trigger a grammar build.

We could also turn the build.rs into a shell script, that would allow users to compile grammars without needing a Rust installation (all we're really doing is invoking the C compiler)

@dead10ck
Copy link
Member

dead10ck commented Jan 24, 2022

There seem to be a few people that are in favor of this PR, so could someone help me understand? It seems like this approach is also:

  • Requiring all grammars be built
  • Pinning down all the grammars to a specific commit with a bespoke config format
  • Implementing a bespoke CLI for updating grammars, except with no built in notification when your local repos are out of sync with the config. The user has to explicitly opt into even checking if their local subrepos are in sync.
  • Needs at least two implementations of the same said CLI

Is this actually faster than git submodule update --init --depth 1?

In general, I like the idea of moving away from submodules because they can be a huge pain, but if the purpose is to improve flexibility, then why pin versions? It seems like we're just reimplementing git submodules here in bash. I understand the reasons in general for pinning versions, as it provides a reproducible set of deps that were tested against, but they naturally come with maintenance overhead. And will this really improve that?

It seems to me like between flexibility and reproducibility, this is like the worst of both worlds.

@archseer
Copy link
Member

Rather than improve performance, the goal is to have the set of grammars easier to configure. Using submodules isn't scaling for us anymore: CI tests now frequently fail on fetching these submodules due to network issues because we have so many.

With a script the user can:

  • Build grammars more selectively
  • Add new grammars locally without having to upstream them
  • Build the grammars without requiring a Rust compiler / having to rebuild Helix
  • The user has to explicitly opt into even checking if their local subrepos are in sync.

I think most users already ignore the git submodule warnings. We frequently get questions on why X doesn't work because they forgot to pull the grammars, and PRs often include accidental submodule reverts.

but if the purpose is to improve flexibility, then why pin versions?

We're pinning versions because grammars often have breaking changes that break existing highlights. If they add new grammar items we also need to add them to our queries.

@archseer
Copy link
Member

  • Needs at least two implementations of the same said CLI

We could make this a part of cargo xtask but that would bring back the compiler dependency. We can keep the scripts small though, all we really need is fetching the dependencies then running a C compiler. (https://github.com/tree-sitter/tree-sitter/tree/master/script also has Windows versions of various scripts)

@dead10ck
Copy link
Member

dead10ck commented Jan 24, 2022

Rather than improve performance, the goal is to have the set of grammars easier to configure. Using submodules isn't scaling for us anymore: CI tests now frequently fail on fetching these submodules due to network issues because we have so many.

Does this approach help? Don't we have to clone just as many repos? Or is it that right now, we must clone everything, but this way we don't?

With a script the user can:

  • Build grammars more selectively

How will this work? Will the user edit the revisions.txt that's checked in and make dirty builds? Or are we going to make some kind of mechanism to override this file out of tree?

  • Add new grammars locally without having to upstream them

Can't they do this now? If I'm not mistaken, users are currently able to do this by just building the .so, and putting that and the queries in the runtime folder, right? Would this change that?

If you mean that the user workflow for adding a new language becomes just adding a line to revisions.txt, then that's not really true either. I don't disagree with this decision, but helix implements its queries differently than other popular tree-sitter editors/plugins, and so a consequence of this is that the user can't simply add another tree-sitter grammar repo and have it work; they also need to learn tree-sitter from the ground up and write all the queries in a way that's compatible just with helix. The user could do this, of course, but I don't see how the way helix manages its dependencies has any bearing on how easy it is to add other languages to their own runtime.

This might be going off topic a bit, but this is also why I think having languages be plugins would be the best long term solution. The tree-sitter dependency management, build process, and helix tree-sitter queries can live completely out of tree. This is how vim, VSCode, and probably others do it.

  • Build the grammars without requiring a Rust compiler / having to rebuild Helix

Does how we fetch and manage the grammars have a bearing on how we build them? To be honest, I'm not really sure why the Rust toolchain is involved in them at all. Is that a vestige of when the grammars were statically linked into the hx binary?

I think most users already ignore the git submodule warnings. We frequently get questions on why X doesn't work because they forgot to pull the grammars, and PRs often include accidental submodule reverts.

Yeah I understand users will be users, but this means that even for devs who do pay attention, it now becomes much more easy to not realize that your grammars are out of date. There is nothing to tell you so after doing a git pull of the latest master, for example. Just all of a sudden the highlighting and tree-sitter navigation is broken and you have no idea it's because the grammar and queries were updated. Now a dev has to always think about updating their grammars, whereas now, git tells you when you have to.

but if the purpose is to improve flexibility, then why pin versions?

We're pinning versions because grammars often have breaking changes that break existing highlights. If they add new grammar items we also need to add them to our queries.

That certainly makes sense, but then what kind of flexibility are we after?

@dead10ck
Copy link
Member

Don't get me wrong here, I hate git submodules with a burning passion, and I would love to get rid of them. I would not be loathe to work with bash instead of submodules. But as much as I hate them, they do provide some measure of benefit, and just based on the albeit limited experience I have with the repo so far, this particular way of replacing them seems like an inequitable trade-off, and it seems like there are other desires at play that have varying relevance to dependency management.

@archseer
Copy link
Member

archseer commented Jan 24, 2022

Does this approach help? Don't we have to clone just as many repos? Or is it that right now, we must clone everything, but this way we don't?

You can modify the list of repositories before running the script, or use a custom list.

We can also avoid git altogether by fetching https://github.com/<repo>/archive/<commit_sha>.zip for grammars hosted on github. That's what NixOS does for fetchFromGithub to speed things up. It would make #1536 even easier to do.

Can't they do this now? If I'm not mistaken, users are currently able to do this by just building the .so, and putting that and the queries in the runtime folder, right? Would this change that?

Right, but nobody actually does that since you need to figure out the correct compiler incantation to do so. Having a script for it would make this a viable option.

This is how vim, VSCode, and probably others do it.

Neovim does, but it requires you to have a properly set up compiler so it can compile the grammars, there's some downsides there (as opposed to fetching them from a release). VSCode doesn't actually use tree-sitter.

What we really want in the long term is tree-sitter/tree-sitter#930 stretch goal where grammars are .wasm files. Then all the .wasm grammars can be downloaded and checked against a checksum.

There is nothing to tell you so after doing a git pull of the latest master, for example.

We could still hook into this in build.rs or add a Makefile or some sort of build wrapper. The current process is not without problems: If you pull in new grammars, they won't get rebuilt since helix-syntax/src doesn't get changed. You need to cargo clean -p helix-syntax to trigger a grammar build. That might be fixable with https://doc.rust-lang.org/cargo/reference/build-scripts.html#rerun-if-changed though

@dead10ck
Copy link
Member

You can modify the list of repositories before running the script, or use a custom list.

Hm, if we're editing the file that's checked in, this will cause probably more accidental commits than submodules. Maybe the script could check for a ~/.config/helix/revisions.txt instead first?

This is how vim, VSCode, and probably others do it.

Neovim does, but it requires you to have a properly set up compiler so it can compile the grammars, there's some downsides there (as opposed to fetching them from a release). VSCode doesn't actually use tree-sitter.

Is this different than the reality now? If your grammars don't all have prebuilt binaries (do any of them?), then your machine must be capable of compiling them.

What we really want in the long term is tree-sitter/tree-sitter#930 stretch goal where grammars are .wasm files. Then all the .wasm grammars can be downloaded and checked against a checksum.

This would be nice, and very similar in spirit to the plugin system imagined for helix, but even when/if that lands, are we going to be able to expect all tree-sitter grammars from the community that we depend on to compile and distribute prebuilt wasm files? There are dozens. There isn't going to be a tree-sitter distribution platform. The more likely scenario is that many will still have to be compiled from source, right? I imagine the same will be true of the plugin system.

@the-mikedavis
Copy link
Member Author

The script ends up being about twice as fast as submodules, even with the --depth=1 (I think submodules are doing a shallow clone and then a shallow fetch, so twice the work).

I think you're right @dead10ck that this change doesn't really significantly improve the grammars situation, plus it's not great that it introduces a bespoke CLI and there's no git helper to tell you you're out of date. But this is meant to be a stepping stone for future improvements, namely the one people are most interested in afaict: being able to have a per-user subset of grammars. For that, I think having the script look for ~/.config/helix/revisions.txt is a reasonable next step.

... it now becomes much more easy to not realize that your grammars are out of date... Just all of a sudden the highlighting and tree-sitter navigation is broken and you have no idea it's because the grammar and queries were updated

Thanks for bringing this up, this will be a problem. There needs to be a way to specify a version of the queries used with a particular version of the grammar or else query analysis will fail when the grammar makes a breaking change to the set of nodes. The most minor "fix" to this would be to improve the error message that happens when query analysis fails and have it say that you need to ./scripts/grammars sync.

I think a more maintainable solution would be to do what nvim-treesitter does and have the queries in a repository with the revisions, so when you install/update a language it pulls in a version of the grammar and a version of the queries that work together. This approach might also work well in a WASM-based future: we could have a github actions automation that compiles each language's grammar into the necessary WASM and then the user simply downloads the WASM and the queries (by hand, with a script, or maybe a built-in command).

Alternatively, it wouldn't be much work to use something more formal
like a CSV rather than this SSV (space separated values) document.
Since there isn't much information to track here, I thought it'd be
wise to keep the format as simple as possible
macos does not have `readarray`, so we need to do this with a regular
array construction

it's a bit less readable but it seems to work the same
Comment on lines +1 to +53
https://github.com/tree-sitter/tree-sitter-agda ca69cdf485e9ce2b2ef0991a720aa88d87d30231
https://github.com/tree-sitter/tree-sitter-bash a8eb5cb57c66f74c63ab950de081207cccf52017
https://github.com/tree-sitter/tree-sitter-c f05e279aedde06a25801c3f2b2cc8ac17fac52ae
https://github.com/uyha/tree-sitter-cmake f6616f1e417ee8b62daf251aa1daa5d73781c596
https://github.com/stsewd/tree-sitter-comment 5dd3c62f1bbe378b220fe16b317b85247898639e
https://github.com/tree-sitter/tree-sitter-cpp e8dcc9d2b404c542fd236ea5f7208f90be8a6e89
https://github.com/tree-sitter/tree-sitter-c-sharp 53a65a908167d6556e1fcdb67f1ee62aac101dda
https://github.com/tree-sitter/tree-sitter-css 94e10230939e702b4fa3fa2cb5c3bc7173b95d07
https://github.com/UserNobody14/tree-sitter-dart 6a25376685d1d47968c2cef06d4db8d84a70025e
https://github.com/camdencheek/tree-sitter-dockerfile 7af32bc04a66ab196f5b9f92ac471f29372ae2ce
https://github.com/elixir-lang/tree-sitter-elixir f5d7bda543da788bd507b05bd722627dde66c9ec
https://github.com/elm-tooling/tree-sitter-elm bd50ccf66b42c55252ac8efc1086af4ac6bab8cd
https://github.com/ram02z/tree-sitter-fish 04e54ab6585dfd4fee6ddfe5849af56f101b6d4f
https://github.com/the-mikedavis/tree-sitter-git-commit 066e395e1107df17183cf3ae4230f1a1406cc972
https://github.com/the-mikedavis/tree-sitter-git-config 0e4f0baf90b57e5aeb62dcdbf03062c6315d43ea
https://github.com/the-mikedavis/tree-sitter-git-diff c12e6ecb54485f764250556ffd7ccb18f8e2942b
https://github.com/the-mikedavis/tree-sitter-git-rebase 332dc528f27044bc4427024dbb33e6941fc131f2
https://github.com/theHamsta/tree-sitter-glsl 88408ffc5e27abcffced7010fc77396ae3636d7e
https://github.com/tree-sitter/tree-sitter-go 0fa917a7022d1cd2e9b779a6a8fc5dc7fad69c75
https://github.com/bkegley/tree-sitter-graphql 5e66e961eee421786bdda8495ed1db045e06b5fe
https://github.com/tree-sitter/tree-sitter-haskell b6ec26f181dd059eedd506fa5fbeae1b8e5556c8
https://github.com/tree-sitter/tree-sitter-html d93af487cc75120c89257195e6be46c999c6ba18
https://github.com/tree-sitter/tree-sitter-java bd6186c24d5eb13b4623efac9d944dcc095c0dad
https://github.com/tree-sitter/tree-sitter-javascript 4a95461c4761c624f2263725aca79eeaefd36cad
https://github.com/tree-sitter/tree-sitter-json 65bceef69c3b0f24c0b19ce67d79f57c96e90fcb
https://github.com/tree-sitter/tree-sitter-julia 12ea597262125fc22fd2e91aa953ac69b19c26ca
https://github.com/latex-lsp/tree-sitter-latex 7f720661de5316c0f8fee956526d4002fa1086d8
https://github.com/Julian/tree-sitter-lean d98426109258b266e1e92358c5f11716d2e8f638
https://github.com/cbarrete/tree-sitter-ledger 0cdeb0e51411a3ba5493662952c3039de08939ca
https://github.com/benwilliamgraham/tree-sitter-llvm 3b213925b9c4f42c1acfe2e10bfbb438d9c6834d
https://github.com/Flakebi/tree-sitter-llvm-mir 06fabca19454b2dc00c1b211a7cb7ad0bc2585f1
https://github.com/nvim-treesitter/tree-sitter-lua 6f5d40190ec8a0aa8c8410699353d820f4f7d7a6
https://github.com/alemuller/tree-sitter-make a4b9187417d6be349ee5fd4b6e77b4172c6827dd
https://github.com/MDeiml/tree-sitter-markdown ad8c32917a16dfbb387d1da567bf0c3fb6fffde2
https://github.com/cstrahan/tree-sitter-nix 50f38ceab667f9d482640edfee803d74f4edeba5
https://github.com/tree-sitter/tree-sitter-ocaml 23d419ba45789c5a47d31448061557716b02750a
https://github.com/ganezdragon/tree-sitter-perl 0ac2c6da562c7a2c26ed7e8691d4a590f7e8b90a
https://github.com/tree-sitter/tree-sitter-php 57f855461aeeca73bd4218754fb26b5ac143f98f
https://github.com/yusdacra/tree-sitter-protobuf 19c211a01434d9f03efff99f85e19f967591b175
https://github.com/tree-sitter/tree-sitter-python d6210ceab11e8d812d4ab59c07c81458ec6e5184
https://github.com/tree-sitter/tree-sitter-regex e1cfca3c79896ff79842f057ea13e529b66af636
https://github.com/tree-sitter/tree-sitter-ruby dfff673b41df7fadcbb609c6338f38da3cdd018e
https://github.com/tree-sitter/tree-sitter-rust a360da0a29a19c281d08295a35ecd0544d2da211
https://github.com/tree-sitter/tree-sitter-scala 0a3dd53a7fc4b352a538397d054380aaa28be54c
https://github.com/Himujjal/tree-sitter-svelte 349a5984513b4a4a9e143a6e746120c6ff6cf6ed
https://github.com/tree-sitter/tree-sitter-swift a22fa5e19bae50098e2252ea96cba3aba43f4c58
https://github.com/Flakebi/tree-sitter-tablegen 568dd8a937347175fd58db83d4c4cdaeb6069bd2
https://github.com/ikatyang/tree-sitter-toml 7cff70bbcbbc62001b465603ca1ea88edd668704
https://github.com/tree-sitter/tree-sitter-tsq b665659d3238e6036e22ed0e24935e60efb39415
https://github.com/tree-sitter/tree-sitter-typescript 3e897ea5925f037cfae2e551f8e6b12eec2a201a
https://github.com/ikatyang/tree-sitter-vue 91fe2754796cd8fba5f229505a23fa08f3546c06
https://github.com/szebniok/tree-sitter-wgsl f00ff52251edbd58f4d39c9c3204383253032c11
https://github.com/ikatyang/tree-sitter-yaml 0e36bed171768908f331ff7dff9d956bae016efb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this just introduce an additional step to manually edit this file and put in the submodule? I think this may be harder for new contributors to use rather than just git submodule add.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding it here removes any submodule steps, so no need to git submodule add and set it as shallow. It's not really an improvement though: you end up still needing to ./scripts/grammars sync once you add it to the file, so it ends up being the same number of steps to add a new grammar.

One area where it helps a lot is when a grammar changes remotes (like #830). The script handles that just with an update to revisions.txt but with submodules it's a total nightmare, you have to go edit multiple files under .git/ to fix the submodule. This is usually only relevant if you need to fork a grammar and make some changes when adding it, I don't know how common that is for others 🙃

The main thing the script improves is being able to have a per-user subset of grammars (albeit it's not implemented in this branch), which I can't think of a way to do well with submodules.

@the-mikedavis
Copy link
Member Author

I ported the helix-syntax/build.rs builder to an xtask instead of adding a build command for the scripts: some grammars have subdirectories to build which I think need a more expressive language to describe than what revisions.txt can offer.

Still to do: update the flake to read revisions.txt and fetch the grammars.

@archseer
Copy link
Member

archseer commented Feb 2, 2022

need a more expressive language to describe than what revisions.txt can offer.

Here's how the code looks inside the tree-sitter CLI if it helps: https://github.com/tree-sitter/tree-sitter/blob/15860392c5952e3ee6f93524fc75de5ef14493c0/cli/loader/src/lib.rs#L301-L444

It seems to assume a certain structure, but I'm guessing that doesn't work for everything.

@the-mikedavis
Copy link
Member Author

I think tree-sitter ends up consulting a grammar repository's package.json to detect repositories that have grammars in subdirectories (see here), and I think in helix that package.json info ends up rewritten in languages.toml.

Maybe it's better to have the revisions and remotes that are currently in revisions.txt be in languages.toml instead, so all of your language configuration goes in one place? Writing a toml parser in bash sounds like a fun project but writing a toml parser in windows batch script would probably drive me crazy 😅: maybe the fetching part of this script should be an xtask as well? It seems a bit wasteful to use xtask for basically shelling out to git, but I think it'd be easier to maintain cross-platform. Plus it'd be reusable when/if adding this functionality to helix runtime or making a new CLI to do it.

What do you think?

@the-mikedavis the-mikedavis deleted the md-script branch February 14, 2022 03:32
the-mikedavis added a commit to the-mikedavis/helix that referenced this pull request Mar 6, 2022
build_grammars adapts the functionality that previously came from
helix-syntax to be used at runtime from the command line flags.

fetch_grammars wraps command-line git to perform the same actions
previously done in the scripts in helix-editor#1560.
the-mikedavis added a commit to the-mikedavis/helix that referenced this pull request Mar 6, 2022
build_grammars adapts the functionality that previously came from
helix-syntax to be used at runtime from the command line flags.

fetch_grammars wraps command-line git to perform the same actions
previously done in the scripts in helix-editor#1560.
the-mikedavis added a commit to the-mikedavis/helix that referenced this pull request Mar 8, 2022
build_grammars adapts the functionality that previously came from
helix-syntax to be used at runtime from the command line flags.

fetch_grammars wraps command-line git to perform the same actions
previously done in the scripts in helix-editor#1560.
the-mikedavis added a commit to the-mikedavis/helix that referenced this pull request Mar 8, 2022
build_grammars adapts the functionality that previously came from
helix-syntax to be used at runtime from the command line flags.

fetch_grammars wraps command-line git to perform the same actions
previously done in the scripts in helix-editor#1560.
the-mikedavis added a commit to the-mikedavis/helix that referenced this pull request Mar 9, 2022
build_grammars adapts the functionality that previously came from
helix-syntax to be used at runtime from the command line flags.

fetch_grammars wraps command-line git to perform the same actions
previously done in the scripts in helix-editor#1560.
archseer pushed a commit that referenced this pull request Mar 10, 2022
build_grammars adapts the functionality that previously came from
helix-syntax to be used at runtime from the command line flags.

fetch_grammars wraps command-line git to perform the same actions
previously done in the scripts in #1560.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

investigate replacing the submodules system for grammars
5 participants