Simple script that generates word frequency lists for Latin texts
Requirements: spaCy with the la_core_web_lg LatinCy pipeline, XlsxWriter
pip install -U XlsxWriter
pip install -U pip setuptools wheel
pip install -U spacy
pip install https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl
Example:
python frequency.py --stopwords=stopwords.txt --coverage=80 --output_type=excel --output=output.xlsx documents
Collects lemma counts for all plaintext files in the "documents" directory, using the supplied stop-word file. Writes the most frequent lemmata to an Excel file, adding entries until at least 80% coverage is reached.
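The coverage cutoff described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name select_by_coverage is hypothetical, the lemma counts are made-up sample data, and it assumes that "coverage" means the share of all counted tokens accounted for by the listed lemmata.

```python
from collections import Counter


def select_by_coverage(counts, coverage):
    """Return (lemma, count) pairs, most frequent first, until their
    cumulative share of all counted tokens reaches `coverage` percent.

    Hypothetical helper; assumes coverage is measured against the total
    number of counted tokens.
    """
    total = sum(counts.values())
    selected = []
    running = 0
    for lemma, count in counts.most_common():
        # Stop as soon as the lemmata collected so far reach the target.
        if total == 0 or running * 100 >= coverage * total:
            break
        selected.append((lemma, count))
        running += count
    return selected


# Made-up sample counts for illustration only.
counts = Counter({"sum": 50, "dico": 30, "res": 15, "rex": 5})
top = select_by_coverage(counts, 80)
print(top)
```

With these sample counts, the two most frequent lemmata account for 80 of 100 tokens, so the list stops after them.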
- filename/folder: Path of a file or folder to process (required).
- --stopwords=filename: Path of a text file containing a list of stop words (one entry per line). If not supplied, the default stop-word list of the spaCy model is used. Point this to an empty file if you don't want to use any stop words.
- --output=filename: Where to store the output. If not supplied, output will be printed to stdout.
- --output_type=excel/csv: What kind of output to generate. Only applies if --output is specified; defaults to "csv".
- --coverage=n: Lists the most common lemmata in descending order until at least "n" percent of vocabulary coverage is achieved. Takes precedence over --top.
- --top=n: Lists the "n" most frequent lemmata. If neither --coverage nor --top is supplied, all lemmata are listed.
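The options above, including the rule that --coverage takes precedence over --top, could be wired up with argparse roughly as shown here. This is a sketch under assumptions: build_parser and selection_mode are hypothetical names, and the choice of float for --coverage and int for --top is a guess, not something the script is known to do.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Generate word frequency lists for Latin texts."
    )
    parser.add_argument("path", help="file or folder to process")
    parser.add_argument("--stopwords", default=None,
                        help="text file with one stop word per line")
    parser.add_argument("--output", default=None,
                        help="output file; stdout if omitted")
    parser.add_argument("--output_type", choices=["excel", "csv"],
                        default="csv", help="output format")
    # Assumed types: a percentage for --coverage, a whole number for --top.
    parser.add_argument("--coverage", type=float, default=None)
    parser.add_argument("--top", type=int, default=None)
    return parser


def selection_mode(args):
    """Decide how many lemmata to list; --coverage wins over --top."""
    if args.coverage is not None:
        return ("coverage", args.coverage)
    if args.top is not None:
        return ("top", args.top)
    return ("all", None)


# Both limits given: the coverage limit is the one that applies.
args = build_parser().parse_args(["--coverage=80", "--top=10", "documents"])
mode = selection_mode(args)
print(mode, args.output_type)
```

Keeping the precedence rule in one small function makes it easy to check each combination of options separately from the counting code.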