Lemmatizer (Russian Language)

Like pymorphy2, this implementation is based on the OpenCorpora morphological dictionary and the dawgdic library. Unlike pymorphy2, it supports Python 3.10+ and the latest versions of the OpenCorpora dictionary. The implementation is small: about 300 lines of Python and about 200 lines of Cython, with only three dependencies (dawgdic, pandas, and lxml). Performance and features: ~150K words/sec, ~11 MB of RAM (~5M word forms), and support for user-defined lemmas. Future plans: contextual disambiguation and spelling correction.
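
To illustrate the general idea of dictionary-based lemmatization, here is a minimal sketch. It is not this package's code: it assumes the OpenCorpora XML layout of `lemma` elements holding one `l` (lemma) entry and several `f` (form) entries, uses a tiny hand-written sample instead of the real dictionary, parses with the standard-library `xml.etree` rather than lxml, and stores the index in a plain `dict` where the real implementation uses a DAWG. The names `build_index` and `lookup` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in the assumed OpenCorpora layout (not real dictionary data).
SAMPLE_XML = """
<dictionary>
  <lemmata>
    <lemma id="1" rev="1">
      <l t="озеро"><g v="NOUN"/></l>
      <f t="озеро"><g v="sing"/><g v="nomn"/></f>
      <f t="озера"><g v="sing"/><g v="gent"/></f>
    </lemma>
    <lemma id="2" rev="1">
      <l t="сталь"><g v="NOUN"/></l>
      <f t="сталь"><g v="sing"/><g v="nomn"/></f>
      <f t="стали"><g v="sing"/><g v="gent"/></f>
    </lemma>
    <lemma id="3" rev="1">
      <l t="стать"><g v="INFN"/></l>
      <f t="стать"/>
      <f t="стали"><g v="past"/><g v="plur"/></f>
    </lemma>
  </lemmata>
</dictionary>
"""


def build_index(xml_text):
    """Map every word form to the set of lemmas it can belong to."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text)
    for lemma in root.iter("lemma"):
        base = lemma.find("l").get("t")  # normal form of this lemma
        for form in lemma.iter("f"):
            index[form.get("t")].add(base)
    return index


def lookup(index, word):
    """Return all candidate lemmas for a word, or an empty list if unknown."""
    return sorted(index.get(word.lower(), ()))


index = build_index(SAMPLE_XML)
print(lookup(index, "озера"))  # one candidate lemma
print(lookup(index, "стали"))  # two candidates: an ambiguous form
```

The second lookup returns more than one lemma, which is the kind of ambiguity that contextual disambiguation would have to resolve.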

Install

git clone https://github.com/vaaliferov/161_lemma && cd 161_lemma
python3 -m venv env && source env/bin/activate && pip install .

Download dictionary

wget https://opencorpora.org/files/export/dict/dict.opcorpora.xml.zip
unzip dict.opcorpora.xml.zip && rm dict.opcorpora.xml.zip

Build necessary files

from lemma.build import build

params = {
    'out_dir': 'data',
    'dict_path': 'dict.opcorpora.xml',
    'custom_dict_path': 'dict.custom.json'
}

build(**params)  # takes about 25 minutes

Test lemmatization

from lemma.morph import Morph

morph = Morph('data')
print(morph.search('озера'))
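
To check the throughput figure on your own machine, a small timing helper along these lines may be useful. This is a sketch, not part of the package: `measure_throughput` is a hypothetical name, and a plain dictionary lookup stands in for the lemmatizer so the example runs without the built data files.

```python
import time


def measure_throughput(search, words, repeats=5):
    """Return the best observed words-per-second rate of `search` over `words`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in words:
            search(word)
        best = min(best, time.perf_counter() - start)
    return len(words) / best


# Stand-in lookup; replace with the real lemmatizer's search method.
table = {"озера": "озеро"}
rate = measure_throughput(table.get, ["озера"] * 100_000)
print(f"{rate:,.0f} words/sec")
```

With the built files in place, passing `morph.search` and a list of real tokens instead of the stand-in should give a number comparable to the one quoted above, depending on hardware and on the mix of words.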
