Like pymorphy2, this implementation is based on the OpenCorpora morphological dictionary and the dawgdic library. Unlike pymorphy2, it supports Python 3.10+ and the latest versions of the OpenCorpora dictionary. It is also small: about 300 lines of Python and about 200 lines of Cython, with only three dependencies (dawgdic, pandas, and lxml). Performance and features: ~150K words/sec, ~11 MB of RAM (~5M word forms), and support for user-defined lemmas. Future plans: contextual disambiguation and spelling correction.
git clone https://github.com/vaaliferov/161_lemma && cd 161_lemma
python3 -m venv env && source env/bin/activate && pip install .
wget https://opencorpora.org/files/export/dict/dict.opcorpora.xml.zip
unzip dict.opcorpora.xml.zip && rm dict.opcorpora.xml.zip
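On systems without wget or unzip, the same download step can be done from Python. The following is a minimal sketch that uses only the standard library. The helper names extract_dictionary and fetch_dictionary are hypothetical and are not part of this package.

```python
import urllib.request
import zipfile
from pathlib import Path

# Same archive URL as in the wget command above.
DICT_URL = "https://opencorpora.org/files/export/dict/dict.opcorpora.xml.zip"


def extract_dictionary(zip_path, out_dir="."):
    """Unpack the archive into out_dir and return the extracted file paths."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return [Path(out_dir) / name for name in archive.namelist()]


def fetch_dictionary(url=DICT_URL, out_dir="."):
    """Download the archive, unpack it, then delete the archive.

    Hypothetical helper: mirrors the wget / unzip / rm commands above.
    """
    zip_path = Path(out_dir) / "dict.opcorpora.xml.zip"
    urllib.request.urlretrieve(url, zip_path)
    try:
        return extract_dictionary(zip_path, out_dir)
    finally:
        zip_path.unlink()  # remove the archive even if extraction fails

# Usage (performs a network download):
# fetch_dictionary()
```

Calling fetch_dictionary() in the project directory should leave dict.opcorpora.xml next to the sources, which is where the build step expects it.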
from lemma.build import build
params = {
    'out_dir': 'data',
    'dict_path': 'dict.opcorpora.xml',
    'custom_dict_path': 'dict.custom.json'
}
build(**params)  # takes ~25 min
from lemma.morph import Morph
morph = Morph('data')
print(morph.search('озера'))
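To check the lookup speed on your own machine, a small timing loop is enough. This is a rough sketch, not a benchmark shipped with the package: measure_throughput is a hypothetical helper, and it accepts any callable so that it can be tried without a built dictionary.

```python
import time


def measure_throughput(search, words):
    """Return lookups per second for the callable `search` over `words`."""
    start = time.perf_counter()
    for word in words:
        search(word)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very short runs.
    return len(words) / elapsed if elapsed > 0 else float("inf")

# Usage with the Morph object created above (requires a built 'data' directory):
# words = ['озера'] * 100_000
# print(f"{measure_throughput(morph.search, words):,.0f} words/sec")
```

Repeating a single word mostly measures the best case. A list of distinct word forms taken from real text gives a more representative figure.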