Autocorrect

Spelling corrector in python. Currently supports English, Polish, Turkish, Russian, Ukrainian, Czech, Portuguese, Greek, Italian, Vietnamese, French and Spanish, but you can easily add new languages.

Based on: https://github.com/phatpiglet/autocorrect and Peter Norvig's spelling corrector.

Installation

pip install autocorrect

Examples

>>> from autocorrect import Speller
>>> spell = Speller()
>>> spell("I'm not sleapy and tehre is no place I'm giong to.")
"I'm not sleepy and there is no place I'm going to."

>>> spell = Speller('pl')
>>> spell('ptaaki latatją kluczmm')
'ptaki latają kluczem'

Speed

%timeit spell("I'm not sleapy and tehre is no place I'm giong to.")
373 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit spell("There is no comin to consiousnes without pain.")
150 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you see, for some words correction can take ~200ms. If speed is important for your use case (e.g. chatbot) you may want to use option 'fast':

spell = Speller(fast=True)
%timeit spell("There is no comin to consiousnes without pain.")
344 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now, the correction should always work in microseconds, but words with double typos (like 'consiousnes') won't be corrected.

OCR

When cleaning up OCR, replacements are the large majority of errors. If this is the case, you may want to use the option 'only_replacements':

spell = Speller(only_replacements=True)

Custom word sets

If you wish to use your own set of words for autocorrection, you can pass an nlp_data argument:

spell = Speller(nlp_data=your_word_frequency_dict)

Where your_word_frequency_dict is a dictionary which maps words to their average frequencies in your text. If you want to change the default word set only a bit, you can just edit spell.nlp_data parameter, after spell was initialized.

Adding new languages

First, define special letters, by adding entries in word_regexes and alphabets dicts in autocorrect/constants.py.

Now, you need a bunch of text. Easiest way is to download wikipedia. For example for Russian you would go to: https://dumps.wikimedia.org/ruwiki/latest/ and download ruwiki-latest-pages-articles.xml.bz2

bzip2 -d ruiwiki-latest-pages-articles.xml.bz2

After that:

First, edit the autocorrect.constants dictionaries in order to accommodate regexes and dictionaries for your language.

Then:

>>> from autocorrect.word_count import count_words
>>> count_words('ruwiki-latest-pages-articles.xml', 'ru')

tar -zcvf autocorrect/data/ru.tar.gz word_count.json

For the correction to work well, you need to cut out rarely used words. First, in test_all.py, write test words for your language, and add them to optional_language_tests the same way as it's done for other languages. It's good to have at least 30 words. Now run:

python test_all.py find_threshold ru

and see which threshold value has the least badly corrected words. After that, manually delete all the words with less occurences than the threshold value you found, from the file in hi.tar.gz (it's already sorted so it should be easy).

To distribute this language support to others, you will need to upload your tar.gz file to IPFS (for example with Pinata, which will pin this file so it doesn't disappear), and then add it's path to ipfs_paths in constants.py. (tip: first put this file inside the folder, and upload the folder to IPFS, for the downloaded file to have the correct filename)

Good luck!

Name		Name	Last commit message	Last commit date
Latest commit History 252 Commits
.github/workflows		.github/workflows
autocorrect		autocorrect
.LICENSE_OLD		.LICENSE_OLD
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py
test_all.py		test_all.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Autocorrect

Installation

Examples

Speed

OCR

Custom word sets

Adding new languages

About

Releases 6

Packages

Contributors 14

Languages

License

filyp/autocorrect

Folders and files

Latest commit

History

Repository files navigation

Autocorrect

Installation

Examples

Speed

OCR

Custom word sets

Adding new languages

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 6

Packages 0

Contributors 14

Languages

Packages