license | pretty_name | language | configs | tags | size_categories |
---|---|---|---|---|---|
apache-2.0 | Wilhelm Vocabulary | | | | |
wilhelm-vocabulary is the data source behind the flashcard contents on wilhelmlang.com. Specifically, it is a data source built manually from the accumulated daily language studies of the author.
The data is available on 🤗 Hugging Face Datasets:
from datasets import load_dataset
dataset = load_dataset("QubitPi/wilhelm-vocabulary")
Tip
If dataset = load_dataset("QubitPi/wilhelm-vocabulary") throws an error, please upgrade the datasets package to its latest version.
In addition, a Docker image is provided for exploring the vocabulary in the Neo4J browser, backed by a Neo4J database. To get the image and run the container, simply do:
docker run \
--publish=7474:7474 \
--publish=7687:7687 \
--env=NEO4J_AUTH=none \
--env=NEO4J_ACCEPT_LICENSE_AGREEMENT=yes \
-e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
--env NEO4J_browser_remote__content__hostname__whitelist=https://raw.githubusercontent.com \
--env NEO4J_browser_post__connect__cmd="style https://raw.githubusercontent.com/QubitPi/wilhelm-vocabulary/refs/heads/master/graphstyle.grass" \
jack20191124/wilhelm-vocabulary
Note
The image is based on Neo4J Enterprise 5.23.0.
- When the container starts, access Neo4J through a browser at http://localhost:7474
- Both bolt:// and neo4j:// protocols are fine.
- Choose No authentication for Authentication type
- Then hit Connect as shown below
We have offered some queries that can be used to quickly explore the vocabulary in graph representations:
- Search for all synonyms:
MATCH (term:Term)-[r]-(synonym:Term) WHERE r.name = "synonym" RETURN term, r, synonym
- Find all gerunds:
MATCH (source)-[link:RELATED]->(target) WHERE link.name = "gerund of" RETURN source, link, target;
- Expand the word "nämlich" (reveals its relationship to other languages):
MATCH (term:Term{label:'nämlich'}) CALL apoc.path.expand(term, "LINK", null, 1, 3) YIELD path RETURN path, length(path) AS hops ORDER BY hops;
- In German, "rice" and "travel" are related:
MATCH (term:Term{label:'die Reise'}) CALL apoc.path.expand(term, "LINK", null, 1, 3) YIELD path RETURN path, length(path) AS hops ORDER BY hops;
-
MATCH (term:Term{label:'die Schwester'}) CALL apoc.path.expand(term, "LINK", null, 1, -1) YIELD path RETURN path, length(path) AS hops ORDER BY hops;
- How German, Latin, and Ancient Greek express the conjunction "but":
MATCH (node{label:"δέ"}) CALL apoc.path.expand(node, "LINK", null, 1, 4) YIELD path RETURN path, length(path) AS hops ORDER BY hops;
Get the source code:
git clone git@github.com:QubitPi/wilhelm-vocabulary.git
cd wilhelm-vocabulary
It is strongly recommended to work in an isolated environment. Install virtualenv and create an isolated Python environment with:
python3 -m pip install --user -U virtualenv
python3 -m virtualenv .venv
To activate this environment:
source .venv/bin/activate
or, on Windows:
.venv\Scripts\activate
Tip
To deactivate this environment, use
deactivate
With the environment activated, install the dependencies:
pip3 install -r requirements.txt
The raw data is written in YAML format, because
- it is machine-readable so that it can be consumed quickly in data pipelines
- it is human-readable and, thus, easy to read and modify
- it supports multi-line values, which is very handy for language data
The YAML data files are
These YAML files are then transformed into Hugging Face Datasets formats in CI/CD.
To encode the inflections that are common in most Indo-European languages, an application-specific YAML structure like the following is employed throughout this repository:
- term: der Gegenstand
  definition:
    - object
    - thing
  declension:
    - ["",         singular,                    plural      ]
    - [nominative, Gegenstand,                  Gegenstände ]
    - [genitive,   "Gegenstandes, Gegenstands", Gegenstände ]
    - [dative,     Gegenstand,                  Gegenständen]
    - [accusative, Gegenstand,                  Gegenstände ]
Note
- A list under declension is a table row
- All rows have the same number of columns
- Each element of the list corresponds to a table cell
The declension (inflection) table above is equivalent to
 | singular | plural |
---|---|---|
nominative | Gegenstand | Gegenstände |
genitive | Gegenstandes, Gegenstands | Gegenstände |
dative | Gegenstand | Gegenständen |
accusative | Gegenstand | Gegenstände |
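The correspondence between the YAML rows and the rendered table can be sketched in Python. This is only an illustration, not the repository's loader: the function name `declension_to_table` is hypothetical, and the rows are written as a Python literal instead of being parsed from YAML so that the example has no dependencies.

```python
def declension_to_table(rows):
    """Render a declension (a list of equally long rows) as a pipe-separated table."""
    width = len(rows[0])
    # The Note above requires every row to have the same number of columns.
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of columns")
    header, *body = rows
    lines = [" | ".join(header), "|".join(["---"] * width)]
    lines += [" | ".join(row) for row in body]
    return "\n".join(lines)


# The "der Gegenstand" declension from the entry above.
declension = [
    ["", "singular", "plural"],
    ["nominative", "Gegenstand", "Gegenstände"],
    ["genitive", "Gegenstandes, Gegenstands", "Gegenstände"],
    ["dative", "Gegenstand", "Gegenständen"],
    ["accusative", "Gegenstand", "Gegenstände"],
]

print(declension_to_table(declension))
```

Each inner list becomes one table row and each element one cell; a ragged table is rejected rather than silently padded.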
Caution
When the graph database is Neo4J, all constraints relating to the Term node must be dropped using:
SHOW CONSTRAINTS
DROP CONSTRAINT constraint_name;
This is because certain vocabulary has multiple grammatical forms and is therefore spread out over multiple entries. Because these entries share many common properties, they often trigger constraint violations in Neo4J on load.
Graph data representation assumes universal connectivity among world entities. This applies well to the realm of languages. Multilanguage learners have already seen that Indo-European languages are similar in many aspects. These similarities not only reflect historical facts of philology but also give multilanguage learners a great opportunity to take advantage of them and study much more efficiently. What's missing is connecting the dots with graph databases that visually present these enlightening links between related languages in a natural way.
vocabulary:
  - term: string
    definition: list
    audio: string
The audio field is a URL that points to a .mp3 or .ogg file containing the pronunciation of the word.
The meaning of a word is called the definition. A term has a natural relationship to its definition(s). For example, the German noun "Ecke" has at least 4 definitions.
Tip
The parenthesized value at the beginning of each definition item plays an essential role: it is the label of the relationship between term and definition in the graph database populated by the data loader. For example, both German words
- term: denn
  definition:
    - (adv.) then, thus
    - (conj.) because
and
- term: nämlich
  definition:
    - (adj.) same
    - (adv.) namely
    - (adv.) because
can mean "because" while acting as different parts of speech. This is visualized as follows:
Visualizing synonyms this way is a big advantage for the human brain, which is exceedingly good at memorizing patterns.
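How the parenthesized label could be separated from the meaning, and how shared meanings such as "because" surface, can be sketched as follows. This is an assumption-laden illustration and not the actual data loader: the names `split_definition` and `shared_meanings` are hypothetical, and splitting a definition item on commas into individual meanings is an assumption made only for this example.

```python
import re

# "(adv.) namely" -> label "adv.", text "namely"
LABEL_PATTERN = re.compile(r"^\((?P<label>[^)]+)\)\s*(?P<text>.+)$")


def split_definition(item):
    """Split a definition item into (label, text); label is None when absent."""
    match = LABEL_PATTERN.match(item.strip())
    if match is None:
        return None, item.strip()
    return match.group("label"), match.group("text").strip()


def shared_meanings(entries):
    """Map each meaning carried by more than one (term, label) pair to those pairs."""
    index = {}
    for entry in entries:
        for item in entry["definition"]:
            label, text = split_definition(item)
            # Assumption: commas separate individual meanings within one item.
            for meaning in (part.strip() for part in text.split(",")):
                index.setdefault(meaning, []).append((entry["term"], label))
    return {meaning: pairs for meaning, pairs in index.items() if len(pairs) > 1}


entries = [
    {"term": "denn", "definition": ["(adv.) then, thus", "(conj.) because"]},
    {"term": "nämlich", "definition": ["(adj.) same", "(adv.) namely", "(adv.) because"]},
]

print(shared_meanings(entries))
```

For the two entries above, the only shared meaning is "because", reached from "denn" as a conjunction and from "nämlich" as an adverb.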
The declension table of a pronoun follows:
declension:
  - ["",         masculine, feminine, neuter, plural]
  - [nominative, ████████,  ████████, ██████, ██████]
  - [genitive,   ████████,  ████████, ██████, ██████]
  - [dative,     ████████,  ████████, ██████, ██████]
  - [accusative, ████████,  ████████, ██████, ██████]
A term with a definite article of der/die/das signifies a noun, whose entry follows this template, including the declension table:
- term:
  definition:
  audio:
  declension:
    - ["",         singular, plural]
    - [nominative, ████████, ██████]
    - [genitive,   ████████, ██████]
    - [dative,     ████████, ██████]
    - [accusative, ████████, ██████]
For example:
- term: das Gespräch
  definition: the conversation
  audio: https://upload.wikimedia.org/wikipedia/commons/f/f5/De-Gespr%C3%A4ch.ogg
  declension:
    - ["",         singular,                plural    ]
    - [nominative, Gespräch,                Gespräche ]
    - [genitive,   "Gespräches, Gesprächs", Gespräche ]
    - [dative,     Gespräch,                Gesprächen]
    - [accusative, Gespräch,                Gespräche ]
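A small consistency check for entries that follow the noun template can be sketched as below. It is a hypothetical helper (the name `check_noun_entry` and its messages are not part of the repository) and it only covers the plain table layout; the entry is given as a Python dict mirroring the YAML above.

```python
GERMAN_ARTICLES = ("der ", "die ", "das ")
EXPECTED_HEADER = ["", "singular", "plural"]
EXPECTED_CASES = ["nominative", "genitive", "dative", "accusative"]


def check_noun_entry(entry):
    """Return a list of problems found in a German noun entry (empty if none)."""
    problems = []
    if not entry.get("term", "").startswith(GERMAN_ARTICLES):
        problems.append("term does not start with der/die/das")
    table = entry.get("declension")
    if not isinstance(table, list):
        # Other layouts (for example a mapping of several tables) are not covered here.
        return problems + ["declension is not a plain table"]
    if not table or table[0] != EXPECTED_HEADER:
        problems.append("unexpected header row")
    if [row[:1] for row in table[1:]] != [[case] for case in EXPECTED_CASES]:
        problems.append("unexpected case rows")
    if any(len(row) != len(EXPECTED_HEADER) for row in table):
        problems.append("rows differ in length")
    return problems


gespraech = {
    "term": "das Gespräch",
    "definition": "the conversation",
    "declension": [
        ["", "singular", "plural"],
        ["nominative", "Gespräch", "Gespräche"],
        ["genitive", "Gespräches, Gesprächs", "Gespräche"],
        ["dative", "Gespräch", "Gesprächen"],
        ["accusative", "Gespräch", "Gespräche"],
    ],
}

print(check_noun_entry(gespraech))
```

Returning a list of problems instead of raising makes it easy to report every issue of a hand-edited entry in one pass.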
Tip
The declension tables for nouns are almost all sourced from Wiktionary; a small number, when not present in Wiktionary, come from Verbformen.
Caution
Adjectival nouns, however, do NOT follow the template above but employ the following template:
declension:
  strong:
    - ["",         singular, plural]
    - [nominative, ████████, ██████]
    - [genitive,   ████████, ██████]
    - [dative,     ████████, ██████]
    - [accusative, ████████, ██████]
  weak:
    - ["",         singular, plural]
    - [nominative, ████████, ██████]
    - [genitive,   ████████, ██████]
    - [dative,     ████████, ██████]
    - [accusative, ████████, ██████]
  mixed:
    - ["",         singular, plural]
    - [nominative, ████████, ██████]
    - [genitive,   ████████, ██████]
    - [dative,     ████████, ██████]
    - [accusative, ████████, ██████]
The conjugation is the inflection paradigm for a German verb. An entry with a conjugation field denotes a verb; its definition also begins with the infinitive form, i.e. "to ..."
verbformen.com was chosen for the comprehensive inflection information it provides on German vocabulary.
There are 3 persons, 2 numbers, and 4 moods (indicative, conditional, imperative and subjunctive) to consider in conjugation. There are 6 tenses in German: the present and past are conjugated, and there are four compound tenses. There are two categories of verbs in German: weak and strong [1]. In addition, strong verbs are grouped into 7 "classes".
The conjugation table of a German verb on Wiktionary is hard for a German beginner to interpret. Netzverb Dictionary is the best German dictionary targeting vocabulary inflections. Search for "aufwachsen" and we will see much more intuitive conjugation tables listed.
This pretty much serves our needs, but what sets Netzverb apart from the alternatives is that every verb comes with
- A printable version that looks much better than the browser's Control+P export
- A "Sentences with German verb aufwachsen" section with a link that offers a large number of conjugated examples, familiarizing us with the inflections of the verb
- An on-the-fly generated flashcard sheet that lets us make better use of our random free time
- A YouTube video that offers audio of almost every conjugated form, which helps a lot with pronunciation
The entry for a German verb, hence, has an extra verbformen field that includes the links to the 3 pieces of information above:
- term:
  definition:
  audio:
  verbformen:
    video:
    conjugation:
    flashcards:
For example:
- term: aufwachsen
  definition: to grow up
  audio: https://upload.wikimedia.org/wikipedia/commons/f/f0/De-aufwachsen.ogg
  verbformen:
    video: https://youtu.be/LCtUrSn030A
    conjugation: https://www.verbformen.com/conjugation/aufwachsen.pdf
    flashcards: https://www.verbformen.com/conjugation/worksheets-exercises/lernkarten/aufwachsen.pdf
Important
Note that some verbformen verbs do not have videos, in which case the video field does not exist.
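Because the video field is optional and a definition may be written either as a single string or as a list, code that consumes verb entries has to tolerate both. A minimal sketch, with hypothetical helper names (`definitions`, `looks_like_verb`, `verbformen_links`) and the "to ..." check used purely as a heuristic taken from the description above:

```python
def definitions(entry):
    """Return the definition(s) of an entry as a list, whether stored as a string or a list."""
    value = entry.get("definition", [])
    return [value] if isinstance(value, str) else list(value)


def looks_like_verb(entry):
    """Heuristic: a verb's definition starts with the infinitive marker "to "."""
    return any(text.startswith("to ") for text in definitions(entry))


def verbformen_links(entry):
    """Collect the three verbformen links; a missing link is reported as None."""
    links = entry.get("verbformen") or {}
    return {key: links.get(key) for key in ("video", "conjugation", "flashcards")}


aufwachsen = {
    "term": "aufwachsen",
    "definition": "to grow up",
    "verbformen": {
        "video": "https://youtu.be/LCtUrSn030A",
        "conjugation": "https://www.verbformen.com/conjugation/aufwachsen.pdf",
        "flashcards": "https://www.verbformen.com/conjugation/worksheets-exercises/lernkarten/aufwachsen.pdf",
    },
}

# Hypothetical variant of the same entry without a video, to exercise the optional field.
without_video = {
    **aufwachsen,
    "verbformen": {k: v for k, v in aufwachsen["verbformen"].items() if k != "video"},
}

print(looks_like_verb(aufwachsen), verbformen_links(without_video)["video"])
```

Using `dict.get` keeps the absence of a video from raising a `KeyError`.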
Unless otherwise mentioned, we are always talking about Attic Greek.
Note
Ancient Greek vocabulary comes from the following sources:
- Greek Core Vocabulary of Dickinson College
- Aristotle - Logic I: Categories, On Interpretation, Prior Analytics
We employ the following 3 diacritic signs only in vocabulary:
- the acute (ά)
- the circumflex (ᾶ), and
- the grave (ὰ)
This convention is called medium diacritics and is the same one used in Loeb Classical Library prints from Harvard. Note, however, that the commonly cited Wiktionary uses full diacritics, including the breve diacritic mark; we do not.
The sources of pronouns and their declensions are the following:
- Wiktionary
- Greek: An Intensive Course, 2nd Revised Edition
  - Unit 6, Section 49. The Relative Pronoun
Tip
More grammar about pronouns can be found in the articles from Ancient Greek for Everyone.
The declension table of a pronoun follows:
declension:
  - ["",         singular, plural]
  - [nominative, ████████, ██████]
  - [genitive,   ████████, ██████]
  - [dative,     ████████, ██████]
  - [accusative, ████████, ██████]
  - [vocative,   N/A,      N/A   ]
The vocabulary entry for each noun consists of its nominative and genitive forms and an article that indicates the noun's gender, all in its term attribute. The English meaning(s) come as a list under the definition attribute. For example:
- term: τέχνη τέχνης, ἡ
  definition:
    - art
    - skill
    - craft
  declension class: 1st
The vocabulary entry above consists of the following 5 items:
- τέχνη: nominative singular
- τέχνης: genitive singular
- ἡ: nominative feminine singular of the article, which shows that the gender of the noun is feminine. Gender will be indicated by the appropriate form of the definite article "the":
  - ὁ for the masculine nouns
  - ἡ for the feminine nouns
  - τό for the neuter nouns
- a list of English meanings of the word
- the noun employs the first declension. The 3 classes of declensions are
  - first declension (1st)
  - second declension (2nd)
  - third declension (3rd)
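Reading the three pieces packed into a noun's term can be sketched as below. The function name `parse_greek_noun` is hypothetical, and the sketch assumes the term always has the shape "nominative genitive, article" as in the τέχνη example; Unicode is normalized to NFC so that visually identical accented letters compare equal.

```python
import unicodedata

# The article at the end of the term indicates the gender of the noun.
GENDER_BY_ARTICLE = {
    unicodedata.normalize("NFC", article): gender
    for article, gender in {"ὁ": "masculine", "ἡ": "feminine", "τό": "neuter"}.items()
}


def parse_greek_noun(term):
    """Split 'nominative genitive, article' into its parts and derive the gender."""
    normalized = unicodedata.normalize("NFC", term)
    forms, article = normalized.rsplit(",", 1)
    nominative, genitive = forms.split()
    article = article.strip()
    return {
        "nominative": nominative,
        "genitive": genitive,
        "article": article,
        "gender": GENDER_BY_ARTICLE[article],
    }


print(parse_greek_noun("τέχνη τέχνης, ἡ"))
```

A term that does not match the assumed shape raises an exception (`ValueError` or `KeyError`) rather than producing a partial result.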
The declension of the entry is not shown because, to decline any noun, we can take the genitive singular, remove the genitive singular ending to get the stem, and then add the proper set of endings to the stem based on its declension class [2].
For example, to decline τέχνη τέχνης, ἡ, (art), take the genitive singular τέχνης, remove the genitive singular ending -ης, and add the appropriate endings to the stem, which gives the following paradigm:
Case | Singular | Plural |
---|---|---|
nominative | τέχνη | τέχναι |
genitive | τέχνης | τεχνῶν |
dative | τέχνῃ | τέχναις |
accusative | τέχνην | τέχνᾱς |
vocative | τέχνη | τέχναι |
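The stem-plus-endings procedure for this particular paradigm can be sketched in Python. This is a deliberately narrow illustration, not a general Greek declension engine: the endings are copied from the τέχνη table, the function name `decline_eta_noun` is hypothetical, and the only accent rule handled is that an ending carrying its own circumflex (the genitive plural -ῶν) removes the acute from the stem. Nouns with other endings or accent patterns are not covered.

```python
import unicodedata

# (singular, plural) endings, copied from the τέχνη paradigm.
ETA_ENDINGS = {
    "nominative": ("η", "αι"),
    "genitive": ("ης", "ῶν"),
    "dative": ("ῃ", "αις"),
    "accusative": ("ην", "ᾱς"),
    "vocative": ("η", "αι"),
}

ACUTE = "\u0301"       # combining acute accent
CIRCUMFLEX = "\u0342"  # combining Greek perispomeni (circumflex)


def nfc(text):
    return unicodedata.normalize("NFC", text)


def nfd(text):
    return unicodedata.normalize("NFD", text)


def decline_eta_noun(genitive_singular):
    """Decline a noun like τέχνη from its genitive singular (for example τέχνης)."""
    word = nfc(genitive_singular)
    if not word.endswith("ης"):
        raise ValueError("expected a genitive singular ending in an unaccented -ης")
    stem = word[:-2]  # remove the genitive singular ending
    unaccented_stem = nfc(nfd(stem).replace(ACUTE, ""))
    table = {}
    for case, endings in ETA_ENDINGS.items():
        forms = []
        for ending in endings:
            # An ending with its own circumflex takes the accent away from the stem.
            base = unaccented_stem if CIRCUMFLEX in nfd(ending) else stem
            forms.append(nfc(base + ending))
        table[case] = tuple(forms)
    return table


for case, (singular, plural) in decline_eta_noun("τέχνης").items():
    print(case, singular, plural)
```

Normalizing to NFC on the way in and out avoids mismatches between the visually identical "tonos" and "oxia" encodings of accented Greek vowels.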
Declension template:
declension:
  - ["",         singular,  singular, singular, dual,      dual,     dual,   plural,    plural,   plural]
  - ["",         masculine, feminine, neuter,   masculine, feminine, neuter, masculine, feminine, neuter]
  - [nominative, █████████, ████████, ████████, █████████, ████████, ██████, █████████, ████████, ██████]
  - [genitive,   █████████, ████████, ████████, █████████, ████████, ██████, █████████, ████████, ██████]
  - [dative,     █████████, ████████, ████████, █████████, ████████, ██████, █████████, ████████, ██████]
  - [accusative, █████████, ████████, ████████, █████████, ████████, ██████, █████████, ████████, ██████]
  - [vocative,   █████████, ████████, ████████, █████████, ████████, ██████, █████████, ████████, ██████]
The Greek verb has 6 principal parts. All 6 must be learned whenever a new verb is encountered:
- (first person singular) present indicative active
- (first person singular) future indicative active
- (first person singular) aorist indicative active
- (first person singular) perfect indicative active
- (first person singular) perfect indicative passive
- (first person singular) aorist indicative passive
Tip
The minimum set of forms that one must know in order to generate all possible forms of a verb is called the principal parts of that verb.
From the 6 forms above, various verb forms (i.e. stems & endings) can be derived by rules [3].
In practice, however, obtaining precise and complete principal parts for some verbs has proven impossible. Best efforts have been made to find them, with URL references provided in a references list field for each verb entry. Reconstructed principal parts are also recorded here, with a list of references that validate the reconstruction. In conclusion, the entry of a verb has the form of:
- term: string
definition: list
conjugation:
principal parts:
- ["", Attic, (Possibly other dialects)]
- [(first person singular) present indicative active, █████, ... ]
- [(first person singular) future indicative active, █████, ... ]
- [(first person singular) aorist indicative active, █████, ... ]
- [(first person singular) perfect indicative active, █████, ... ]
- [(first person singular) perfect indicative passive, █████, ... ]
- [(first person singular) aorist indicative passive, █████, ... ]
references: list
For example:
- term: λέγω
definition:
- to say, speak
- to pick up
conjugation:
wiktionary: https://en.wiktionary.org/wiki/λέγω#Verb_2
principal parts:
- ["", Attic , Koine ]
- [(first person singular) present indicative active, λέγω , λέγω ]
- [(first person singular) future indicative active, λέξω , ἐρῶ ]
- [(first person singular) aorist indicative active, ἔλεξα , εἶπον/εἶπα ]
- [(first person singular) perfect indicative active, (missing), εἴρηκα ]
- [(first person singular) perfect indicative passive, λέλεγμαι , λέλεγμαι ]
- [(first person singular) aorist indicative passive, ἐλέχθην , ἐρρέθην/ἐρρήθην]
references:
- https://en.wiktionary.org/wiki/λέγω#Inflection
- http://atticgreek.org/downloads/allPPbytypes.pdf
- https://books.openbookpublishers.com/10.11647/obp.0264/ch25.xhtml
- https://www.billmounce.com/greek-dictionary/lego
- https://koine-greek.fandom.com/wiki/Λέγω
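Since some principal parts cannot be found, it can be useful to list the gaps per dialect. A minimal sketch, assuming the principal parts table has already been loaded into a list of rows and that absent forms are marked with the literal string "(missing)" as in the λέγω example; the name `missing_parts` is hypothetical.

```python
MISSING = "(missing)"


def missing_parts(principal_parts, dialect):
    """Return the labels of the principal parts marked as missing for one dialect."""
    header, *rows = principal_parts
    column = header.index(dialect)  # raises ValueError for an unknown dialect
    return [row[0] for row in rows if row[column].strip() == MISSING]


# The principal parts of λέγω from the entry above.
lego = [
    ["", "Attic", "Koine"],
    ["(first person singular) present indicative active", "λέγω", "λέγω"],
    ["(first person singular) future indicative active", "λέξω", "ἐρῶ"],
    ["(first person singular) aorist indicative active", "ἔλεξα", "εἶπον/εἶπα"],
    ["(first person singular) perfect indicative active", "(missing)", "εἴρηκα"],
    ["(first person singular) perfect indicative passive", "λέλεγμαι", "λέλεγμαι"],
    ["(first person singular) aorist indicative passive", "ἐλέχθην", "ἐρρέθην/ἐρρήθην"],
]

print(missing_parts(lego, "Attic"))
```

The first row acts as the header, so the dialect name selects the column to inspect.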
Note
The vocabulary and declensions come from the following sources
- Latin Core Vocabulary of Dickinson College
- Wiktionary
vocabulary:
- term: string
definition: list
The vocabulary is presented to help read and understand Biblical Hebrew. Complementary audio helps with pronunciation.
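Chinese speakers have a natural advantage in learning Korean, and Korean itself is a fairly simple language, so grammar and vocabulary are combined here; each entry consists only of a term (Korean) and a definition (Chinese):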
vocabulary:
  - term: string
    definition: list of strings
    example:
      - Korean: 제가 아무렴 그쪽 편에 서겠어요
        Chinese: 我无论如何都会站在你这边
      - Korean: ...
        Chinese: ...
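There is no need to spend much effort memorizing the simple grammar and vocabulary; what remains is to keep practicing listening, speaking, reading, and writing with Korean-subtitled dramas. The sentences in example all come from native Korean corpora.
Note
Korean does not belong to the Sino-Tibetan language family. Because the family it belongs to is very small, it cannot form enough links with other languages, so for now its data is not loaded into the graph database for analysis.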
The use and distribution terms for wilhelm-vocabulary are covered by the Apache License, Version 2.0.
Footnotes
- Greek: An Intensive Course, 2nd Revised Edition, Hansen & Quinn, p.20
- Greek: An Intensive Course, 2nd Revised Edition, Hansen & Quinn, p.44