wiktionary -> rdbms (only sqlite supported at this time)
Note: most of this document is copied directly from my .js project that does something similar.
This is a Python project (a proof of concept building on my earlier .js attempt) that takes a Wiktionary archive dump and parses/normalizes it into a relational database.
I wanted a word list (for many reasons) that was current, expansive, and descriptive, and I found it frustrating that this data was not readily available anywhere else. Hacker lists, frequency lists, etc. are missing word-class information and are usually truncated or limited to some percentile of usage.
Wiktionary as a source is huge, and probably as good an authority as any on my native language (English), and maybe on any other language. It is constantly updated; a living authority.
If Webster et al. had a convenient way to grep their data I might've gone with one of them.
Now you can make your own lists.
I am not a linguist (I just play one on TV).
- This was only tested with English input; however, there is no reason other input languages shouldn't work.
- xml parsing in python is excruciatingly slow (but a huge thanks to lxml for making it bearable)
lxml is an amazing project. Before stumbling on lxml I thought I was doing something incomprehensibly wrong with XML parsing in Python: it was excruciatingly slow, and for the size of the file it was completely unacceptable. After scouring forums and Stack Overflow I found posters with the same concern, which ultimately led me to the lxml project.
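To illustrate (a minimal sketch, not the project's actual parser): lxml's streaming iterparse walks the multi-gigabyte dump one page element at a time instead of loading it all into memory. The namespace URI and file path below are assumptions and vary by dump version.

from lxml import etree

# MediaWiki export namespace; the exact version suffix varies by dump (assumption).
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_page_titles(path):
    # 'end' events fire once a <page> element has been fully read
    for _, page in etree.iterparse(path, events=("end",), tag=NS + "page"):
        yield page.findtext(NS + "title")
        # release memory for elements that have already been processed
        page.clear()
        while page.getprevious() is not None:
            del page.getparent()[0]

if __name__ == "__main__":
    for n, title in enumerate(iter_page_titles("out/pages-meta-current.xml")):
        print(title)
        if n >= 9:
            break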
- setup env
cpython
> python -m venv /path/to/new/virtual/environment
pypy
> pypy/python -m venv .pypyvenv
- activate env
source .venv/bin/activate
- install requirements
python -m pip install -r requirements.txt
- Run all tests:
python -m unittest
For type checking, see the mypy configuration docs (mypy_path): https://mypy.readthedocs.io/en/stable/config_file.html#confval-mypy_path
- Go and download *pages-meta-current.xml for your language. Note: the dumps contain words for all languages; only the page content is localised. Example:
wget --directory-prefix=./out https://dumps.wikimedia.org/enwiktionary/20230801/enwiktionary-20230801-pages-meta-current.xml.bz2
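The processing example below passes a plain .xml file, so the archive needs decompressing first. A minimal sketch using Python's standard bz2 module, with paths matching the wget example above (the chunked copy is just one way to do it):

import bz2
import shutil

src = "out/enwiktionary-20230801-pages-meta-current.xml.bz2"
dst = "out/enwiktionary-20230801-pages-meta-current.xml"

# stream-decompress in 1 MiB chunks to keep memory usage flat
with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
    shutil.copyfileobj(fin, fout, length=1024 * 1024)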
- Help
python src/cli.py --help
- process all English entries
python src/cli.py --language english --sqlite out/words.db mediawiki_archive.xml
Dump all words in the database:
-- include labels for information available about the word
SELECT w.word, GROUP_CONCAT(DISTINCT labels.label)
FROM word w
JOIN (
SELECT w.id AS wordid, c.category AS label
FROM word w
JOIN word_languages wl ON wl.wordid = w.id
JOIN languages_categories lc ON lc.wordlangid = wl.id
JOIN category c ON c.id = lc.categoryid
UNION
SELECT w.id AS wordid, cs.section AS label
FROM word w
JOIN word_languages wl ON wl.wordid = w.id
JOIN languages_categories lc ON lc.wordlangid = wl.id
JOIN category c ON c.id = lc.categoryid
JOIN languages_category_section lcs ON lcs.langcatid = lc.id
JOIN category_section cs ON cs.id = lcs.sectionid
) labels ON labels.wordid = w.id
GROUP BY w.id
ORDER BY w.word ASC
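A minimal usage sketch with Python's standard sqlite3 module: run the query above (or a simpler one, as here) against out/words.db and write a plain word list. The output path and format are my own choices, not part of the project.

import sqlite3

# paste the labelled query above in place of this simpler one if you want labels too
query = "SELECT word FROM word ORDER BY word ASC"

with sqlite3.connect("out/words.db") as conn:
    with open("out/wordlist.txt", "w", encoding="utf-8") as out:
        for (word,) in conn.execute(query):
            out.write(word + "\n")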
- parsing of wikitext is poor
- there is no frequency data
- storage accuracy of encoded character data has not been exhaustively verified
should do, might do, won't do, etc.
- full wikitext markup processing
- wikitext directive/template processing
- front end and/or api
Feel free to open issues or submit a pull request.