katharinaost/latin-frequencies

latin-frequencies

Simple script that generates word frequency lists for Latin texts

Prerequisites

spaCy with the LatinCy la_core_web_lg pipeline, and XlsxWriter:

  pip install -U XlsxWriter
  pip install -U pip setuptools wheel
  pip install -U spacy
  pip install https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl
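
To confirm that the installation worked, a small check like the following may help. It is only a sketch and not part of this repository: the helper name `missing_modules` is hypothetical, and the import names (`spacy`, `xlsxwriter`, `la_core_web_lg`) are assumed to match the packages installed above.

```python
from importlib.util import find_spec

# Import names assumed to correspond to the packages installed above.
REQUIRED = ["spacy", "xlsxwriter", "la_core_web_lg"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All prerequisites are importable.")
```

The check only looks the modules up without importing them, so it stays fast even though the language model is large.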

Usage

Example:

  python frequency.py --stopwords=stopwords.txt --coverage=80 --output_type=excel --output=output.xlsx documents

Collects lemma counts for all (plaintext) files in the "documents" directory, using the supplied stop-word file. Writes the most frequent lemmata, in descending order, to an Excel file until at least 80% coverage is reached.
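
The coverage cutoff can be illustrated with a small worked example. This is a sketch under stated assumptions, not the code in frequency.py: the lemmata are given as a plain list instead of coming from the spaCy pipeline, the function names are hypothetical, and coverage is assumed to mean the cumulative share of all counted (non-stop-word) tokens.

```python
from collections import Counter


def load_stopwords(lines):
    """Build a stop-word set from lines (one entry per line, blanks ignored)."""
    return {line.strip() for line in lines if line.strip()}


def count_lemmata(lemmata, stopwords):
    """Count lemmata, skipping any that appear in the stop-word set."""
    return Counter(lemma for lemma in lemmata if lemma not in stopwords)


def top_until_coverage(counts, coverage):
    """Return (lemma, count, cumulative %) rows until `coverage` % is reached.

    Assumption: coverage is measured against the total of counted tokens.
    """
    total = sum(counts.values())
    rows, running = [], 0
    for lemma, n in counts.most_common():
        running += n
        percent = 100 * running / total
        rows.append((lemma, n, percent))
        if percent >= coverage:
            break
    return rows


# Toy input standing in for lemmata produced by the pipeline.
lemmata = ["sum", "et", "rex", "sum", "rex", "sum", "urbs", "et", "bellum", "sum"]
counts = count_lemmata(lemmata, load_stopwords(["et"]))
rows = top_until_coverage(counts, 80)

for lemma, n, percent in rows:
    print(f"{lemma}\t{n}\t{percent:.1f}%")
```

With "et" removed as a stop word, eight tokens remain. "sum" and "rex" together cover 75%, so "urbs" is also listed and brings coverage to 87.5%, the first value at or above 80%.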

[Screenshot: Excel output]

Arguments

  • filename/folder: Path of a file or folder to process (required).
  • --stopwords=filename: Path of a text file containing a list of stop words (one entry per line). If not supplied, the default stop-word list of the spaCy model is used. To disable stop words entirely, point this to an empty file.
  • --output=filename: Where to store the output. If not supplied, output is printed to stdout.
  • --output_type=excel/csv: What kind of output to generate. Applies only if --output is specified; defaults to "csv".
  • --coverage=n: Lists the most frequent lemmata in descending order until at least "n" percent of vocabulary coverage is reached. Takes precedence over --top.
  • --top=n: Lists the "n" most frequent lemmata. If neither --coverage nor --top is supplied, all lemmata are listed.
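
The arguments above, including the precedence of --coverage over --top, could be modelled with argparse roughly as follows. This is a hedged sketch and not the script's actual parser: `build_parser`, `selection_mode`, the positional name `path`, and the numeric types are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Declare the documented options (names taken from the list above)."""
    parser = argparse.ArgumentParser(prog="frequency.py")
    parser.add_argument("path", help="file or folder to process")
    parser.add_argument("--stopwords", metavar="filename")
    parser.add_argument("--output", metavar="filename")
    parser.add_argument("--output_type", choices=["excel", "csv"], default="csv")
    parser.add_argument("--coverage", type=float)  # assumed to be a percentage
    parser.add_argument("--top", type=int)
    return parser


def selection_mode(args):
    """Decide which lemmata to list; --coverage wins over --top."""
    if args.coverage is not None:
        return ("coverage", args.coverage)
    if args.top is not None:
        return ("top", args.top)
    return ("all", None)


# Both --coverage and --top are given here to show the precedence rule.
args = build_parser().parse_args(
    ["--stopwords=stopwords.txt", "--coverage=80", "--top=50",
     "--output_type=excel", "--output=output.xlsx", "documents"]
)
mode = selection_mode(args)
print(mode)
```

Leaving the defaults for --coverage and --top as None makes it possible to tell "not supplied" apart from an explicit value, which is what the fallback to listing all lemmata relies on.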
