Add spellchecking #2
Comments
Hi, thanks! That's great to hear. I thought LanguageTool used HunSpell which is why I did not consider porting it since HunSpell already has good Python ports. However, reading the docs they say:
so porting that functionality to Rust might actually be interesting. Thanks for bringing that up; I'll mark it as an enhancement for now and take a closer look once some more important things w.r.t. speed and rule coverage are done. |
In the meantime you could use one of the Python wrappers for HunSpell before applying NLPRule. For separation of concerns I do not want to depend on HunSpell in NLPRule. |
Sounds good! I'm looking for speedy libraries though, so while yours is appealing, HunSpell is less so. I'll wait however long it takes for you to do your Rust magic ;) |
Just tracking some info here. There is some information at https://dev.languagetool.org/hunspell-support#morfologik. Specifically, LT uses a dictionary file that looks like this:
As far as I can tell this is the only resource used. It is parsed by the Morfologik library: https://github.com/morfologik/morfologik-stemming/tree/master/morfologik-speller. |
I am now certain that this is functionality nlprule should have. Some rules suppress misspellings, so integrating spellchecking into NLPRule is definitely better than running spellchecking as a separate step in the pipeline. I'm still not sure when I'll get around to implementing this, but it will be done. |
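To illustrate why suppression favours an integrated design, here is a minimal sketch. Everything in it is hypothetical: the tiny `DICTIONARY`, the `suppressed_spans` rule for "et al." and the `misspellings` helper are illustrative names, not part of nlprule or LanguageTool. It only shows that a spellchecker which can see rule output reports fewer false positives than one run as a blind separate step.

```python
import re

# Hypothetical toy dictionary; a real checker would load a large word list.
DICTIONARY = {"we", "cite", "smith", "and", "others"}


def suppressed_spans(text):
    """Hypothetical rule: tokens inside 'et al.' should not be spellchecked."""
    return [m.span() for m in re.finditer(r"\bet al\.", text)]


def misspellings(text, suppress=True):
    """Return unknown words, optionally skipping spans a rule has suppressed."""
    spans = suppressed_spans(text) if suppress else []
    found = []
    for m in re.finditer(r"[A-Za-z]+", text):
        if m.group().lower() in DICTIONARY:
            continue
        # Skip the word if it lies fully inside a suppressed span.
        if any(start <= m.start() and m.end() <= end for start, end in spans):
            continue
        found.append(m.group())
    return found
```

With `suppress=False` the checker behaves like a separate pipeline step and also flags the tokens inside "et al."; with `suppress=True` only the genuine typo remains.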
Thanks! |
Imho this is mixing concerns; making srx a pluggable module and just operating on tokenized streams might be a better idea. Then one can use |
First off, for context, I think a custom implementation of the SymSpell algorithm for spellchecking will be best. The hard constraint is that, using the Python bindings, enabling spellchecking has to be as simple as:

    from nlprule import Tokenizer, Rules
    # not necessarily exactly like this but equivalently easy
    tokenizer = Tokenizer.load("en")
    rules = Rules.load("en", tokenizer, spellcheck=True)

Additionally, the APIs of the Python lib and Rust crate are the same at the moment; it would be good to keep it that way.

Mixing of concerns is a valid point, but in my opinion it is not an issue if it is (a) cleanly separated within the crate (easy) and (b) the effect of including spellchecking if disabled is negligible (not so easy). Regarding (b), I think the key issue is the size of the binary, but it should be possible to store the data in the

It would of course be possible to split nlprule into one high-level crate (equiv. to the Python API) and many subcrates for spellchecking, rule-based checking, sentence segmentation etc., but that would open up a bunch of other issues (the public API of each crate (e.g. if the compilation step happens in another crate, lots of things that should be private have to be public), how to separate the binaries, ...) and I don't think it is necessary if (b) can be done. |
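For readers unfamiliar with SymSpell, the following is a rough sketch of its core idea (the symmetric delete index), not the implementation planned for nlprule. The class name `SymSpellSketch`, its `suggest` method and the word counts are made up for illustration, and plain Levenshtein distance is used for verification where SymSpell proper uses a Damerau-Levenshtein variant.

```python
from collections import defaultdict


def deletes(word, max_distance):
    """All strings obtainable from `word` by deleting up to `max_distance` characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def edit_distance(a, b):
    """Plain Levenshtein distance, used to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class SymSpellSketch:
    """Hypothetical minimal checker: index deletes of dictionary words once."""

    def __init__(self, word_counts, max_distance=2):
        self.word_counts = dict(word_counts)
        self.max_distance = max_distance
        self.index = defaultdict(set)
        for word in self.word_counts:
            for variant in deletes(word, max_distance):
                self.index[variant].add(word)

    def suggest(self, word):
        """Return known words within max_distance, closest and most frequent first."""
        if word in self.word_counts:
            return [word]
        candidates = set()
        # Deletes of the input are matched against deletes of dictionary words.
        for variant in deletes(word, self.max_distance):
            candidates |= self.index.get(variant, set())
        scored = [(edit_distance(word, c), -self.word_counts[c], c) for c in candidates]
        return [c for d, _, c in sorted(scored) if d <= self.max_distance]
```

The speed comes from doing only hash lookups on deletions at query time; the cost is the memory taken by the precomputed index, which is the same data-size concern raised in (b) above.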
I thought about this a bit more and I do now think that something like this could make sense and be worth doing. It would certainly need a significant amount of work though. I'll open another issue for modularizing the crate. I'll still implement spellchecking the way I mentioned above, though, because I want the functionality as soon as possible. I have a mostly working version locally where the |
There is now a draft PR with an implementation of this: #51. |
I love this library already, I've been looking for something like this for a project of mine for months now! However, I saw the README said this about the project:
"and without all the extra stuff LanguageTool does such as spellchecking, n-gram based error detection, etc."
It would be super nice to have the spellchecking part of LanguageTool in this library, as spellchecking is one of the most used features in many, if not all, general-purpose NLP libraries. I'm only good at Python though, so I personally can't help until I focus more on improving my Rust :(