QubitPi/wilhelm-vocabulary

license: apache-2.0
pretty_name: Wilhelm Vocabulary
language:
  - en
  - de
  - la
  - grc
configs:
  - config_name: Graph Data
    data_files:
      - split: German
        path: german-graph-data.jsonl
      - split: Latin
        path: latin-graph-data.jsonl
      - split: AncientGreek
        path: ancient-greek-graph-data.jsonl
tags:
  - Natural Language Processing
  - NLP
  - Vocabulary
  - German
  - Latin
  - Ancient Greek
  - Knowledge Graph
size_categories:
  - 1K<n<10K

Wilhelm Vocabulary

wilhelm-vocabulary is the data source for the flashcard contents on wilhelmlang.com. Specifically, it is a dataset compiled by hand from the accumulated notes of my own daily language studies.

The data is available on 🤗 Hugging Face Datasets

from datasets import load_dataset
dataset = load_dataset("QubitPi/wilhelm-vocabulary")

Tip

If dataset = load_dataset("QubitPi/wilhelm-vocabulary") throws an error, please upgrade the datasets package to its latest version

In addition, a Docker image is provided for exploring the vocabulary in the Neo4J browser, backed by a Neo4J database. To pull the image and run the container:

docker run \
    --publish=7474:7474 \
    --publish=7687:7687 \
    --env=NEO4J_AUTH=none \
    --env=NEO4J_ACCEPT_LICENSE_AGREEMENT=yes \
    -e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
    --env NEO4J_browser_remote__content__hostname__whitelist=https://raw.githubusercontent.com \
    --env NEO4J_browser_post__connect__cmd="style https://raw.githubusercontent.com/QubitPi/wilhelm-vocabulary/refs/heads/master/graphstyle.grass" \
    jack20191124/wilhelm-vocabulary

Note

The image is based on Neo4J Enterprise 5.23.0.

  • When the container starts, open Neo4J in a browser at http://localhost:7474
  • Both bolt:// and neo4j:// protocols are fine.
  • Choose No authentication for Authentication type
  • Then hit Connect as shown below

Connecting to Neo4J Docker

The following queries offer a quick way to explore the vocabulary as a graph:

  • Search for all Synonyms: MATCH (term:Term)-[r]-(synonym:Term) WHERE r.name = "synonym" RETURN term, r, synonym

  • Finding all gerunds: MATCH (source)-[link:RELATED]->(target) WHERE link.name = "gerund of" RETURN source, link, target;

  • Expanding a word "nämlich" (reveals its relationship to other languages):

    MATCH (term:Term{label:'nämlich'})
    CALL apoc.path.expand(term, "LINK", null, 1, 3)
    YIELD path
    RETURN path, length(path) AS hops
    ORDER BY hops;

    Expanding "nämlich"

  • In German, "rice" and "travel" are related:

    MATCH (term:Term{label:'die Reise'})
    CALL apoc.path.expand(term, "LINK", null, 1, 3)
    YIELD path
    RETURN path, length(path) AS hops
    ORDER BY hops;

    Declension sharing

  • MATCH (term:Term{label:'die Schwester'}) CALL apoc.path.expand(term, "LINK", null, 1, -1) YIELD path RETURN path, length(path) AS hops ORDER BY hops;

  • How German, Latin, and Ancient Greek express the conjunction "but":

    MATCH (node{label:"δέ"})
    CALL apoc.path.expand(node, "LINK", null, 1, 4)
    YIELD path
    RETURN path, length(path) AS hops
    ORDER BY hops;

    Conjunction - but
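
The expansion queries above differ only in the starting label and the hop range, so they can be generated from one template. The following is a minimal sketch, not part of this repository: the helper name expand_query is hypothetical, and it only builds the Cypher text and a parameter map. Sending them to Neo4J would need a driver, which is not shown here.

```python
def expand_query(label, min_hops=1, max_hops=3):
    """Build the apoc.path.expand query used above for an arbitrary term label.

    Returns the Cypher text and a parameter map; values are passed as
    parameters instead of being spliced into the query string.
    """
    cypher = (
        "MATCH (term:Term {label: $label}) "
        'CALL apoc.path.expand(term, "LINK", null, $min_hops, $max_hops) '
        "YIELD path "
        "RETURN path, length(path) AS hops "
        "ORDER BY hops"
    )
    params = {"label": label, "min_hops": min_hops, "max_hops": max_hops}
    return cypher, params


query, params = expand_query("die Reise")
print(query)
print(params)
```

Passing -1 as max_hops mirrors the unbounded expansion used in the "die Schwester" query.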

Development

Environment Setup

Get the source code:

git clone git@github.com:QubitPi/wilhelm-vocabulary.git
cd wilhelm-vocabulary

It is strongly recommended to work in an isolated environment. Install virtualenv and create an isolated Python environment with:

python3 -m pip install --user -U virtualenv
python3 -m virtualenv .venv

To activate this environment:

source .venv/bin/activate

or, on Windows

.venv\Scripts\activate

Tip

To deactivate this environment, use

deactivate

Installing Dependencies

pip3 install -r requirements.txt

Data Format

The raw data is written in YAML format, because

  1. it is machine-readable so that it can be consumed quickly in data pipelines
  2. it is human-readable and, thus, easy to read and modify
  3. it supports multi-line values, which is very handy for language data

The YAML data files are

These YAML files are then transformed to Hugging Face Datasets formats in CI/CD

Encoding Table in YAML

To encode the inflections that are common in most Indo-European languages, an application-specific YAML layout like the following is employed throughout this repository:

  - term: der Gegenstand
    definition:
      - object
      - thing
    declension:
      - ["",         singular,                    plural      ]
      - [nominative, Gegenstand,                  Gegenstände ]
      - [genitive,   "Gegenstandes, Gegenstands", Gegenstände ]
      - [dative,     Gegenstand,                  Gegenständen]
      - [accusative, Gegenstand,                  Gegenstände ]

Note

  • A list under declension is a table row
  • All rows have the same number of columns
  • Each element of the list corresponds to a table cell

The declension (inflection) table above is equivalent to

            singular                   plural
nominative  Gegenstand                 Gegenstände
genitive    Gegenstandes, Gegenstands  Gegenstände
dative      Gegenstand                 Gegenständen
accusative  Gegenstand                 Gegenstände
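
As a minimal sketch of how such a table might be consumed (this is not the repository's own loader), the rows can be turned into a case-by-number lookup once the YAML has been parsed into Python lists. The function name declension_lookup is hypothetical.

```python
def declension_lookup(rows):
    """Turn a list-of-rows declension table into {case: {column: form}}.

    The first row is the header; its first cell is the empty corner cell.
    """
    header, *body = rows
    if any(len(row) != len(header) for row in body):
        raise ValueError("all rows must have the same number of columns")
    columns = header[1:]
    return {row[0]: dict(zip(columns, row[1:])) for row in body}


rows = [
    ["", "singular", "plural"],
    ["nominative", "Gegenstand", "Gegenstände"],
    ["genitive", "Gegenstandes, Gegenstands", "Gegenstände"],
    ["dative", "Gegenstand", "Gegenständen"],
    ["accusative", "Gegenstand", "Gegenstände"],
]
table = declension_lookup(rows)
print(table["dative"]["plural"])
```

The length check enforces the rule that all rows have the same number of columns.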

Data Pipeline

Data pipeline

Caution

When the graph database is Neo4J, all constraints relating to the Term node must be dropped using:

SHOW CONSTRAINTS;
DROP CONSTRAINT constraint_name;

This is because some vocabulary has multiple grammatical forms and is spread out across multiple entries. Because these entries share many properties, they often trigger constraint violations in Neo4J on load.
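
When there are several constraints, the DROP statements can be generated rather than typed by hand. The sketch below assumes the rows returned by SHOW CONSTRAINTS have already been fetched as dictionaries with the name and labelsOrTypes columns that Neo4J 5 reports. The helper name and the constraint names in the sample data are hypothetical.

```python
def drop_term_constraints(constraint_rows, label="Term"):
    """Return one DROP CONSTRAINT statement per constraint attached to `label`."""
    statements = []
    for row in constraint_rows:
        # labelsOrTypes may be missing or null for some rows.
        if label in (row.get("labelsOrTypes") or []):
            # Backticks keep unusual constraint names valid as identifiers.
            statements.append(f"DROP CONSTRAINT `{row['name']}`")
    return statements


# Hypothetical rows, shaped like the output of SHOW CONSTRAINTS.
sample_rows = [
    {"name": "term_label_unique", "labelsOrTypes": ["Term"]},
    {"name": "definition_unique", "labelsOrTypes": ["Definition"]},
]
for statement in drop_term_constraints(sample_rows):
    print(statement)
```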

How Data (Vocabulary) is Stored in a Graph Database

Why Graph Database

Graph data representation assumes universal connectivity among world entities. This applies well to the realm of languages. Multilanguage learners have already seen that Indo-European languages are similar in many respects. These similarities not only reflect historical facts of philology but also present a great opportunity for multilanguage learners to take advantage of them and study much more efficiently. What has been missing is a way to connect the dots, and a graph database does this by visually presenting these enlightening links between related languages in a natural way.

Base Schema

vocabulary:
  - term: string
    definition: list
    audio: string

The audio field is a URL that points to a .mp3 or .ogg file containing the pronunciation of the word.

The meaning of a word is called the definition. A term has a natural relationship to its definition(s). For example, the German noun "Ecke" has at least 4 definitions:

Relationship between term and definition(s)

Graph data generated by wilhelm-data-loader

Tip

The parenthesized value at the beginning of each definition item plays an important role: it is the label of the relationship between term and definition in the graph database, as dumped by the data loader. For example, both German words

- term: denn
  definition:
    - (adv.) then, thus
    - (conj.) because

and

 - term: nämlich
   definition:
     - (adj.) same
     - (adv.) namely
     - (adv.) because

can mean "because", each as a different part of speech. This is visualized as follows:

error loading example.png

Visualizing synonyms this way is a big advantage for the human brain, which is exceedingly good at memorizing patterns.
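
To illustrate how the parenthesized prefix could be separated from the definition text, here is a minimal sketch. It is not the actual wilhelm-data-loader code; the function name and the fallback label used when no prefix is present are assumptions.

```python
import re

# "(adv.) then, thus" -> label "adv.", text "then, thus"
LABELLED = re.compile(r"^\((?P<label>[^)]+)\)\s*(?P<text>.+)$")


def split_definition(item, default_label="definition"):
    """Split a definition item into (relationship label, definition text).

    `default_label` is a hypothetical fallback for items without a prefix.
    """
    item = item.strip()
    match = LABELLED.match(item)
    if match is None:
        return default_label, item
    return match.group("label"), match.group("text").strip()


denn = ["(adv.) then, thus", "(conj.) because"]
naemlich = ["(adj.) same", "(adv.) namely", "(adv.) because"]

for term, items in (("denn", denn), ("nämlich", naemlich)):
    for item in items:
        label, text = split_definition(item)
        if text == "because":
            print(term, label, text)
```

Both terms end up linked to the same "because" definition, through relationships with different labels.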

Languages

German

Pronoun

The declension table of a pronoun follows:

declension:
  - ["",         masculine, feminine, neuter, plural]
  - [nominative, █████████, ████████, ██████, ██████]
  - [genitive,   █████████, ████████, ██████, ██████]
  - [dative,     █████████, ████████, ██████, ██████]
  - [accusative, █████████, ████████, ██████, ██████]

Noun

A term with a definite article (der/die/das) signifies a noun, whose entry follows this template with a declension table:

- term:
  definition:
  audio:
  declension:
    - ["",         singular, plural]
    - [nominative, ████████, ██████]
    - [genitive,   ████████, ██████]
    - [dative,     ████████, ██████]
    - [accusative, ████████, ██████]

For example:

  - term: das Gespräch
    definition: the conversation
    audio: https://upload.wikimedia.org/wikipedia/commons/f/f5/De-Gespr%C3%A4ch.ogg
    declension:
      - ["",         singular,                plural    ]
      - [nominative, Gespräch,                Gespräche ]
      - [genitive,   "Gespräches, Gesprächs", Gespräche ]
      - [dative,     Gespräch,                Gesprächen]
      - [accusative, Gespräch,                Gespräche ]

Tip

The declension tables for nouns are almost all sourced from Wiktionary; the few that are not present in Wiktionary come from Verbformen.

Caution

Adjectival nouns, however, do NOT follow the template above but employ the following template:

declension:
  strong:
    - ["",         singular, plural]
    - [nominative, ████████, ██████]
    - [genitive,   ████████, ██████]
    - [dative,     ████████, ██████]
    - [accusative, ████████, ██████]
  weak:
    - ["",         singular, plural]
    - [nominative, ████████, ██████]
    - [genitive,   ████████, ██████]
    - [dative,     ████████, ██████]
    - [accusative, ████████, ██████]
  mixed:
    - ["",         singular, plural]
    - [nominative, ████████, ██████]
    - [genitive,   ████████, ██████]
    - [dative,     ████████, ██████]
    - [accusative, ████████, ██████]
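
Because a declension field can therefore be either a single table or a mapping of strong/weak/mixed tables, code that reads it has to handle both shapes. A minimal sketch under that assumption follows. The helper name and the "default" key are hypothetical, and the example forms of "Deutsche(r)" are illustrative and show only the nominative row.

```python
def declension_tables(declension):
    """Normalize a declension field to {table name: rows}.

    Regular nouns carry one table (a list of rows); adjectival nouns carry
    a mapping with strong, weak, and mixed tables.
    """
    if isinstance(declension, dict):
        return dict(declension)
    return {"default": declension}  # "default" is a made-up name


regular = [
    ["", "singular", "plural"],
    ["nominative", "Gespräch", "Gespräche"],
]
adjectival = {
    "strong": [["", "singular", "plural"], ["nominative", "Deutscher", "Deutsche"]],
    "weak": [["", "singular", "plural"], ["nominative", "Deutsche", "Deutschen"]],
    "mixed": [["", "singular", "plural"], ["nominative", "Deutscher", "Deutschen"]],
}

for declension in (regular, adjectival):
    for name, rows in declension_tables(declension).items():
        print(name, rows[1])
```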

Verb

The conjugation is the inflection paradigm of a German verb. An entry with a conjugation field denotes a verb; its definition also begins with the infinitive form, i.e. "to ..."

verbformen.com was chosen for the comprehensive inflection information it provides on German vocabulary.

There are 3 persons, 2 numbers, and 4 moods (indicative, conditional, imperative, and subjunctive) to consider in conjugation. There are 6 tenses in German: the present and past are conjugated, and there are four compound tenses. There are two categories of verbs in German: weak and strong [1]. In addition, strong verbs are grouped into 7 "classes".

The conjugation tables for German verbs on Wiktionary are hard for a German beginner to interpret. Netzverb Dictionary is the best German dictionary focused on vocabulary inflections. Search for "aufwachsen" and we will see much more intuitive conjugation tables.

This pretty much serves our needs, but what sets Netzverb apart from the alternatives is that every verb comes with

  1. A printable version that looks much better than the browser's Control+P export

    • There is also a "Sentences with German verb aufwachsen" section with a link that offers a good number of conjugated examples, which helps us get familiar with the inflections of the verb
  2. An on-the-fly generated flashcard sheet, which allows us to make better use of our spare moments

  3. A YouTube video that offers audio of almost every conjugated form, which helps a lot with pronunciation

The entry for a German verb, hence, has an extra verbformen field that includes the links to the 3 pieces of information above

- term:
  definition:
  audio:
  verbformen:
    video: 
    conjugation:
    flashcards:

For example:

- term: aufwachsen
  definition: to grow up
  audio: https://upload.wikimedia.org/wikipedia/commons/f/f0/De-aufwachsen.ogg
  verbformen:
    video: https://youtu.be/LCtUrSn030A
    conjugation: https://www.verbformen.com/conjugation/aufwachsen.pdf
    flashcards: https://www.verbformen.com/conjugation/worksheets-exercises/lernkarten/aufwachsen.pdf

Important

Note that some verbformen verbs do not have videos, in which case the video field is omitted.
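
Since the video field is optional, code that reads a verb entry should not assume that all three links exist. A minimal sketch, assuming the entry has already been parsed into a Python dict; the helper name verbformen_links is hypothetical.

```python
def verbformen_links(entry):
    """Collect the verbformen links that are actually present in an entry."""
    links = entry.get("verbformen") or {}
    kinds = ("video", "conjugation", "flashcards")
    return {kind: links[kind] for kind in kinds if links.get(kind)}


aufwachsen = {
    "term": "aufwachsen",
    "definition": "to grow up",
    "verbformen": {
        "video": "https://youtu.be/LCtUrSn030A",
        "conjugation": "https://www.verbformen.com/conjugation/aufwachsen.pdf",
        "flashcards": "https://www.verbformen.com/conjugation/worksheets-exercises/lernkarten/aufwachsen.pdf",
    },
}
# Same entry with the video link dropped, to mimic a verb without a video.
without_video = {
    **aufwachsen,
    "verbformen": {k: v for k, v in aufwachsen["verbformen"].items() if k != "video"},
}

print(sorted(verbformen_links(aufwachsen)))
print(sorted(verbformen_links(without_video)))
```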

Ancient Greek

Unless otherwise mentioned, we are always talking about Attic Greek.

Note

Ancient Greek vocabulary comes from the following sources

Diacritic Mark Convention

We employ the following 3 diacritic signs only in vocabulary:

  1. the acute (ά)
  2. the circumflex (ᾶ), and
  3. the grave (ὰ)

This is known as the medium diacritics convention, the same one used in the Loeb Classical Library prints from Harvard. Note, however, that the commonly cited Wiktionary uses full diacritics, including the breve diacritic mark; we don't do that.
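
When a form is copied from a source that writes breves, the mark can be removed with Unicode normalization. This is a minimal sketch using only Python's standard unicodedata module; the function name is hypothetical. It strips only the breve and leaves the macron alone, since forms such as τέχνᾱς in this document keep it.

```python
import unicodedata

BREVE = "\u0306"  # COMBINING BREVE


def strip_breve(text):
    """Remove breve marks while keeping accents and breathings."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(BREVE, ""))


# ᾰ̓γᾰθός spelled with breves, written with escapes to avoid ambiguity
word = "\u1fb0\u0313\u03b3\u1fb0\u03b8\u03cc\u03c2"
print(strip_breve(word))  # ἀγαθός
```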

Pronoun

The sources of pronouns and their declensions are the following

Tip

More grammar about pronouns can be found in these great articles from Ancient Greek for Everyone above

The declension table of a pronoun follows:

declension:
  - ["",         singular, plural]
  - [nominative, ████████, ██████]
  - [genitive,   ████████, ██████]
  - [dative,     ████████, ██████]
  - [accusative, ████████, ██████]
  - [vocative,   N/A,      N/A   ]

Noun

The vocabulary entry for each noun puts its nominative and genitive forms, together with an article that indicates the noun's gender, in its term attribute. The English meaning(s) come as a list under the definition attribute. For example:

  - term: τέχνη τέχνης, ἡ
    definition:
      - art
      - skill
      - craft
    declension class: 1st

The vocabulary entry above consists of the following 5 items:

  1. τέχνη: nominative singular

  2. τέχνης: genitive singular

  3. ἡ: nominative feminine singular of the article, which shows that the gender of the noun is feminine. Gender is indicated by the appropriate form of the definite article "the":

    • ὁ for the masculine nouns
    • ἡ for the feminine nouns
    • τό for the neuter nouns
  4. a list of English meanings of the word

  5. the noun employs the first declension. The 3 classes of declensions are

    1. first declension (1st)
    2. second declension (2nd)
    3. third declension (3rd)

The declension of the entry is not shown because, to decline any noun, we can take the genitive singular, remove the genitive singular ending to get the stem, and then add the proper set of endings to the stem based on its declension class [2].

For example, to decline τέχνη τέχνης, ἡ (art), take the genitive singular τέχνης, remove the genitive singular ending -ης, and add the appropriate endings to the stem, which gives the following paradigm:

Case        Singular  Plural
nominative  τέχνη     τέχναι
genitive    τέχνης    τεχνῶν
dative      τέχνῃ     τέχναις
accusative  τέχνην    τέχνᾱς
vocative    τέχνη     τέχναι
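
The stem-plus-endings procedure can be sketched in a few lines. This is a deliberately narrow illustration, not a general declension engine: it covers only first-declension nouns in -η whose accent stays on the stem, as in τέχνη. The ending table and function names are assumptions made for this example. The only accent rule implemented is the one visible in the paradigm above, where the genitive plural is accented on the ending (-ῶν).

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# (singular, plural) endings for first-declension nouns in -η.
ETA_ENDINGS = {
    "nominative": ("η", "αι"),
    "genitive": ("ης", "ῶν"),
    "dative": ("ῃ", "αις"),
    "accusative": ("ην", "ᾱς"),
    "vocative": ("η", "αι"),
}


def nfc(text):
    return unicodedata.normalize("NFC", text)


def strip_acute(text):
    """Drop acute accents, e.g. from the stem before the accented -ῶν ending."""
    decomposed = unicodedata.normalize("NFD", text)
    return nfc(decomposed.replace(ACUTE, ""))


def decline_eta_noun(genitive_singular):
    """Decline a noun like τέχνη from its genitive singular (τέχνης)."""
    genitive_singular = nfc(genitive_singular)
    if not genitive_singular.endswith(nfc("ης")):
        raise ValueError("expected a genitive singular ending in -ης")
    stem = genitive_singular[:-2]  # remove the ending -ης
    paradigm = {}
    for case, (singular, plural) in ETA_ENDINGS.items():
        plural_stem = strip_acute(stem) if case == "genitive" else stem
        paradigm[case] = (nfc(stem + singular), nfc(plural_stem + plural))
    return paradigm


paradigm = decline_eta_noun("τέχνης")
for case, (singular, plural) in paradigm.items():
    print(case, singular, plural)
```

Nouns accented on the ending, or belonging to other declension classes, would need their own ending sets and accent handling.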

Adjective Declension

Declension template:

declension:
  - ["",         singular,  singular, singular, dual,      dual,     dual,   plural,    plural,   plural]
  - ["",         masculine, feminine, neuter,   masculine, feminine, neuter, masculine, feminine, neuter]
  - [nominative, █████████, ████████, ████████, █████████, ████████, ██████, █████████, ████████, ██████]
  - [genitive,   █████████, ████████, ████████, █████████, ████████, ██████, █████████, ████████, ██████]
  - [dative,     █████████, ████████, ████████, █████████, ████████, ██████, █████████, ████████, ██████]
  - [accusative, █████████, ████████, ████████, █████████, ████████, ██████, █████████, ████████, ██████]
  - [vocative,   █████████, ████████, ████████, █████████, ████████, ██████, █████████, ████████, ██████]

Verb Conjugation

The Greek verb has 6 principal parts. All 6 must be learned whenever a new verb is encountered:

  1. (first person singular) present indicative active
  2. (first person singular) future indicative active
  3. (first person singular) aorist indicative active
  4. (first person singular) perfect indicative active
  5. (first person singular) perfect indicative passive
  6. (first person singular) aorist indicative passive

Tip

The minimum number of forms which one must know in order to generate all possible forms of a verb are called the principal parts of that verb.

From the 6 forms above, various verb forms (i.e. stems and endings) can be derived by rules [3].

In practice, however, obtaining precise and complete principal parts for some verbs has proven impossible. Best efforts have been made to find them, with URL references provided in a references list field for each verb entry. Reconstructed principal parts are also recorded here, together with a list of references that validate the reconstruction. The entry of a verb thus has the form:

- term: string
  definition: list
  conjugation:
    principal parts:
      - ["",                                                 Attic, (Possibly other dialects)]
      - [(first person singular) present indicative active,  █████, ...                      ]
      - [(first person singular) future indicative active,   █████, ...                      ]
      - [(first person singular) aorist indicative active,   █████, ...                      ]
      - [(first person singular) perfect indicative active,  █████, ...                      ]
      - [(first person singular) perfect indicative passive, █████, ...                      ]
      - [(first person singular) aorist indicative passive,  █████, ...                      ]
    references: list

For example:

  - term: λέγω
    definition:
      - to say, speak
      - to pick up
    conjugation:
      wiktionary: https://en.wiktionary.org/wiki/λέγω#Verb_2
      principal parts:
        - ["",                                                 Attic    , Koine          ]
        - [(first person singular) present indicative active,  λέγω     , λέγω           ]
        - [(first person singular) future indicative active,   λέξω     , ἐρῶ            ]
        - [(first person singular) aorist indicative active,   ἔλεξα    , εἶπον/εἶπα     ]
        - [(first person singular) perfect indicative active,  (missing), εἴρηκα         ]
        - [(first person singular) perfect indicative passive, λέλεγμαι , λέλεγμαι       ]
        - [(first person singular) aorist indicative passive,  ἐλέχθην  , ἐρρέθην/ἐρρήθην]
      references:
        - https://en.wiktionary.org/wiki/λέγω#Inflection
        - http://atticgreek.org/downloads/allPPbytypes.pdf
        - https://books.openbookpublishers.com/10.11647/obp.0264/ch25.xhtml
        - https://www.billmounce.com/greek-dictionary/lego
        - https://koine-greek.fandom.com/wiki/Λέγω

Latin

Note

The vocabulary and declensions come from the following sources

vocabulary:
  - term: string
    definition: list

Classical Hebrew (Coming Soon)

The vocabulary is presented to help read and understand Biblical Hebrew. Accompanying audio helps with the pronunciation.

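Korean

Chinese speakers have a natural advantage in learning Korean, and Korean itself is a fairly simple language, so grammar and vocabulary are combined here. Each entry consists only of a term (Korean) and a definition (Chinese):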

vocabulary:
  - term: string
    definition: list of strings
    example:
      - Korean: 제가 아무렴 그쪽 편에 서겠어요
        Chinese: 我无论如何都会站在你这边
      - Korean: ...
        Chinese: ...

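There is no need to spend much effort memorizing the simple grammar and vocabulary; what remains is to keep practicing listening, speaking, reading, and writing with Korean-subtitled dramas. The sentences under example all come from native Korean corpora.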

Note

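Korean does not belong to the Sino-Tibetan language family. Because the family it does belong to is very small, it cannot be linked sufficiently to other languages, so its data is not stored in the graph database for analysis for the time being.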

License

The use and distribution terms for wilhelm-vocabulary are covered by the Apache License, Version 2.0.

Footnotes

  1. https://en.wikipedia.org/wiki/German_verbs#Conjugation

  2. Greek: An Intensive Course, 2nd Revised Edition, Hansen & Quinn, p.20

  3. Greek: An Intensive Course, 2nd Revised Edition, Hansen & Quinn, p.44
