Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spellchecking integration #11660

Open
the-mikedavis opened this issue Sep 8, 2024 · 28 comments
Open

Spellchecking integration #11660

the-mikedavis opened this issue Sep 8, 2024 · 28 comments
Labels
C-enhancement Category: Improvements

Comments

@the-mikedavis
Copy link
Member

I've switched https://github.com/helix-editor/spellbook (Hunspell-like spellchecking library) to public and published a v0.1.0 release. It's still a work-in-progress but it's matured to the point where the check interface works robustly for at least en_US. Other languages should work too but there may be bugs with more advanced rules dictionaries can use. suggest functionality is what I'm working on now but now that check is in place it's a good time to start thinking about the integration with Helix which is also a potentially large amount of work.

I'm using it in my driver branch in (commit) where the integration is very naive: it checks all words in the viewport render and highlights misspelled words.

@pascalkuthe and I have chatted about this and here's what we're thinking for a proper integration:

  • Load dictionaries off the main thread: spellbook::Dictionary::new can take tens of milliseconds depending on the size of a dictionary.
  • Read dictionary files (like you might find at https://github.com/LibreOffice/dictionaries) from a runtime directory. We could try to find system-installed dictionaries to be a bit fancier but that can be done as a follow-up if at all.
  • Spellcheck open documents also off the main thread since we're checking the full document.
  • Use tree-sitter queries to tag nodes which should be checked. It might also be useful to specify a tokenization strategy when tagging nodes: one mode for prose and another for programming languages.
  • When a document changes, recheck incrementally: only recheck around and within the change's range.

Interacting with the spellchecker should be similar to interacting with LSP code-actions. <space>a should pull up LSP code actions and actions from the spellchecker: adding a misspelled word to a personal dictionary and suggestions from suggest. Adding to a personal dictionary can use the spellbook::Dictionary::add function and append a line with the word to a file in $XDG_STATE_HOME/helix/personal-dictionary/<locale>.

@the-mikedavis the-mikedavis added the C-enhancement Category: Improvements label Sep 8, 2024
@sbromberger
Copy link
Contributor

sbromberger commented Sep 8, 2024

Not that I'm against this idea at all - I'm really happy that we're considering adding spell checking as a first class feature - but what would be the advantage of using an integrated spell checker over an LSP-based solution? I've been using vale now for a while and (aside from the fact that it doesn't allow you to add to local dictionaries) it's pretty darn good. I am similarily enthusiastic about Harper, but it appears to have some different limitations.

Will there be plans for grammar checking as well?

@kirawi
Copy link
Member

kirawi commented Sep 8, 2024

You would be able to leverage tree-sitter for spell checking in programming languages.

@sbromberger
Copy link
Contributor

You would be able to leverage tree-sitter for grammar checking in programming languages.

What about natural languages (markdown / text / LaTeX?)

@kirawi
Copy link
Member

kirawi commented Sep 9, 2024

It should work as normal with LSPs now.

@sbromberger
Copy link
Contributor

Then I guess I'm not seeing as much of an advantage if I have to use an LSP in any case to get grammar checking. Most grammar checkers (vale, harper) do spell checking as well....

@kirawi
Copy link
Member

kirawi commented Sep 9, 2024

Yeah, it's most beneficial for programming languages.

@summersz
Copy link

summersz commented Sep 9, 2024

Looking forward to this. I have used CSpell in the past so having a different strategy/config for code and prose sounds like a good idea and is a limitation of the lsp alternatives.

@pascalkuthe
Copy link
Member

Yeah helix having a universally parser/sytanx tree woth tree sitter that we can use for spellchecker is something an LSP can't really match.

Spellchecking is also a fairly basic/common feature that is nice to have on core withlut having to use/setup a third party LSP.

@justinlovinger
Copy link

Vale and Harper can already spellcheck comments in code. typos can additionally spellcheck variable-names.

@justinlovinger
Copy link

justinlovinger commented Sep 9, 2024

I have concerns this feature will not age well and will eventually become a "thing new users should turn off and replace". Linting natural language is incredibly complicated, and even the best solutions today are far from ideal. I think spelling and grammar checking should be left to integration with external tools, not a part of Helix itself.

P.S. When I first switched from Vim to Helix, I wanted integrated spellchecking. Now that I have experience with the superior quality of language-server-based spellchecking, I never want to go back to a simple dictionary-check.

@the-mikedavis
Copy link
Member Author

Nothing's stopping you from continuing to use something else via LSP if you choose. You shouldn't need a language server installed for basic spellchecking (word checking, not grammar) though.

In any case this issue is about the technical aspects of integration. Discussion on the concept of integrating spellchecking is in #3637

@jsco2t
Copy link

jsco2t commented Sep 9, 2024

I'm a relatively new user to Helix - so take that as the important context in this statement. I spent a non-trivial bit of time setting up LSP's (in my case Vale) just so that I could get spell-checking in Markdown docs.

I get the concerns about the feature potentially not aging well. I think the simple solution to that (similar to what vim/nvim does) is to allow the built-in spell-checker to be turned off if needed.

For me - having a built-in spellchecker would have been a productivity/usability boost as a new user (a user who also sucks at spelling occasionally)

(also just realized this comment was probably off-topic per @the-mikedavis above - my apologies)

@pascalkuthe
Copy link
Member

I am unconvinced by typos(-ls) approach. They aim for low false positives rate and only check for known typos instead of checking words against dictonaries which is why they can check programming languages without parsing the sourccode. However I find the false negative rate way too high and find the results with traditional spell checkers much better.

With helix we are in a position where we do have a universal parser for every language and that means we can enable a much more reliable spellchecker. This is something that is impractical (and inefficient) for a LS to replicate.

@justinlovinger
Copy link

@jsco2t I responded to you on the discussion thread: #3637 (comment).

@pascalkuthe I responded to you on the discussion thread: #3637 (comment).

@hnorkowski
Copy link
Contributor

I'm using it in my driver branch in (commit) where the integration is very naive: it checks all words in the viewport render and highlights misspelled words.

I wanted to try it but there is a private dependency in it:

skidder = { git = "https://github.com/helix-editor/tree-sitter-helix.git" }

@the-mikedavis
Copy link
Member Author

Yeah that commit is on my driver branch which has a bunch of other changes. I pushed up another branch off of master: commit. You'll need a dictionary downloaded into $HELIX_RUNTIME/dictionaries/en_US/en_US.{dic,aff} to run that commit. You can download one from http://wordlist.aspell.net/dicts/ or copy the right files out of https://github.com/LibreOffice/dictionaries or https://searchfox.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/utf8

@hnorkowski
Copy link
Contributor

@the-mikedavis Is it possible to underline the spelling issues in a different color than the LSP issues?

@the-mikedavis
Copy link
Member Author

I was imagining that there would be diagnostics (same as LSP diagnostics) for misspelled words. They could be at a different or maybe customizable severity which would affect how they're displayed. In the linked commit/branch I'm using diagnostic.error only for simplicity

@mrnossiom
Copy link

mrnossiom commented Nov 9, 2024

I don't know if you've already started this implementation. I wanted to do something similar last year as an LSP for spellchecking support called lspelling. (Also made a custom incomplete hunspell port as the spelling checker engine called ruspell). I had arrived to the same conclusion as the one listed in the initial message.

In the wordc part of lspelling, I would parse sources files with TS, fragment the relevant bits with a TS query keeping prose distinct, process fragments with the chosen spellchecking backend (e.g. spellbook).

Also, if the TS strategy is great to avoid spellchecking keywords, it would still be useful to have tailored dictionaries for language and programming terms. CSpell had done a great job if I recall correctly.

I would be happy to help on this feature

@cgahr
Copy link
Contributor

cgahr commented Dec 7, 2024

The spellbook branch looks already really usable to me. Highlight spelling mistakes alone is a game changer for me.

You can download one from http://wordlist.aspell.net/dicts/ or copy the right files out of https://github.com/LibreOffice/dictionaries or https://searchfox.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/utf8

Expanding on this a bit:

Is it possible to configure which treesitter objects (?) are spellchecked? For example, what if I only want spellchecking of i.e. variable names?

EDIT: where do I have to put personal-dictionary.txt?
EDIT2: the personal dictionary should be under $XDG_STATE_HOME/state/helix/personal-dictionary.txt, normally ~/.local/state/helix/personal-dictionary.txt

@the-mikedavis
Copy link
Member Author

I believe that branch should be creating the personal dictionary file for you if you use the add_word_to_personal_dictionary command. I should update that branch now that Spellbook has suggestions support - you should be able to get a popup of suggestions (same UI as code actions) for spelling errors.

That branch is really hacky though - there isn't any interaction with tree-sitter to (dis)qualify nodes to check yet and the checking is run per-render on the visible viewport.

@cgahr
Copy link
Contributor

cgahr commented Dec 10, 2024

I believe that branch should be creating the personal dictionary file for you if you use the add_word_to_personal_dictionary command.

Yes and no: if the directory ~/.local/state/helix exists, the file is created automatically. If the directory does not exist, add_word_to_personal_dict fails with an OS error 2 (I think)

That branch is really hacky though - there isn't any interaction with tree-sitter to (dis)qualify nodes to check yet and the checking is run per-render on the visible viewport.

Yes, I am aware! However, and that's kudos to you, it already works really and is much more performant than for example https://github.com/ltex-plus/ltex-ls-plus and does what it's supposed to.

From a users perspective, only treesitter is missing to decrease the number of false-positives.

Great work already!

@summersz
Copy link

summersz commented Dec 11, 2024

I have just started testing this branch too, and I agree it is working really well.

Is it possible to add words to a dict on a project level?

I have used cSpell in the past and that only checks words that are greater than 3 characters long which removes a lot of false negatives. It would be useful if that could be added as a configurable option.

If a single word within a kebab-case string is identified as misspelled it highlights the whole string. Not sure if this is intended, but I would expect it just to highlight the misspelled word(s) like it is doing for other cases.

When using the add_word_to_personal_dictionary command on a highlighted word within a camel or pascal case string it passes the whole string to the command and not just the highlighted segment.

@dpc
Copy link
Contributor

dpc commented Dec 11, 2024

This will be a bit off-topic for PR, but relevant to the topic of "false negatives". typos takes a different approach where it matches again common mispellings instead of against a dictionary with words that are spelled correctly. This catches most of the typos, with almost no false positives.

@cgahr
Copy link
Contributor

cgahr commented Dec 12, 2024

Another feature that would be nice to have are custom affixes. For example, my custom dictionary contains

gridpoint/S
hyperparameter/S
invariant/S

and uses existing suffixes to define the plural.

Now I want add words with the prefix pre, like

precompute/DG
preprocess/DG

Of course I can specify them manually (which is probably most of the time sufficient), but it would be nice to able to add a custom .aff file, that extends the existing one. This would allow users to write e.g.

compute/DGO
process/DGO

with the custom prefix

PFX O Y 1
PFX O   0     pre         .

In the best case, the words are also merged with existing affixes. In the case of compute, we would have compute/ADSGO.

@pascalkuthe
Copy link
Member

This will be a bit off-topic for PR, but relevant to the topic of "false negatives". typos takes a different approach where it matches again common mispellings instead of against a dictionary with words that are spelled correctly. This catches most of the typos, with almost no false positives.

We are aware of typos but intentionally choose a more traditionam dictionary approach because typoes has way too many false negatives. I would rather add to my personal dictionary and use TS for eliminating false negatives.

@dpc
Copy link
Contributor

dpc commented Dec 14, 2024

@pascalkuthe Yeah, I was not suggesting that approach in Helix. I think typos shines in CIs to enforce some base level of quality, where false positives are very very annoying. During text editing most people doesn't mind few false positives. Having said that on Helix Matrix channel someone linked https://github.com/tekumara/typos-lsp and I just configured it in Helix for my .md files editing. I'll see how well it works in practice.

@hholst80
Copy link

Having a spellcheck in parity with Vim would require no manual setup to provision a user dictionary. just press zg/zw will add/remove the word under the cursor to the user dictionary. if it does not exist, it will be created. I think Helix should be no less user friendly than that! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: Improvements
Projects
None yet
Development

No branches or pull requests