This program extracts information from the Czech Wiktionary project into RDF, a machine-readable format. For now it focuses on inflected forms of words.
Information being extracted:
- parts of speech (POS)
- declension, conjugation and degrees sections
- pronunciation
- gender and animacy
The RDF-Wiktionary project enables extraction of unstructured data into a structured RDF format.
Available RDF output formats are RDF/XML, Turtle and N-Triples.
You can choose the format by passing -x, -t or -n, respectively, as the format argument.
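The flag-to-format mapping described above can be sketched as a small lookup. This is only an illustration and not the program's actual code: the class name `FormatFlags` and the method `formatFor` are hypothetical, and only the three flags and format names come from this README.

```java
import java.util.Map;

public class FormatFlags {
    // Flag-to-format table as described in this README (-x, -t, -n).
    // The class and method names are hypothetical, chosen for illustration.
    private static final Map<String, String> FORMATS = Map.of(
        "-x", "RDF/XML",
        "-t", "Turtle",
        "-n", "N-Triples");

    /** Returns the RDF format name for a flag, or throws if the flag is unknown. */
    public static String formatFor(String flag) {
        String format = FORMATS.get(flag);
        if (format == null) {
            throw new IllegalArgumentException("Unknown format flag: " + flag);
        }
        return format;
    }

    public static void main(String[] args) {
        // Print the format selected by the Turtle flag.
        System.out.println(formatFor("-t"));
    }
}
```

Rejecting unknown flags early, as this sketch does, avoids writing an output file in an unintended format.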
The program executable is located in out/artifacts/.
There are several possible scenarios for running this program.
They always start with java -jar cswiktionary2rdf.jar
- Download the dump file from the internet into the default download directory:
  java -jar cswiktionary2rdf.jar -d
- Download the dump file into a specified directory:
  java -jar cswiktionary2rdf.jar -d <dir path>
- Extract a specified dump file into a specified output file:
  java -jar cswiktionary2rdf.jar -e <format> <dump file> <output file>
Example of the execution:
java -jar cswiktionary2rdf.jar -e -t dump.xml output.ttl
To view this help again, use the -h parameter.
The RDF dataset generated by this program can be used for various purposes:
- find missing or incorrect information on Wiktionary through SPARQL queries
- get semantics and context of a word (even through an inflected form of a word) from the Czech DBpedia (through rdfs:seeAlso links generated from External Links section)
- search any word from Wiktionary pages, and get structured RDF data as a result
The third bullet point has been partially implemented in the web app Enhanced Wiktionary Search Engine (in Czech).
It lists all the pages where the desired word form appears, and if the form appears in one of the tables of inflected forms, it also gives a description of the form with its properties such as case, number, gender and more.
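As a rough illustration of the rdfs:seeAlso use case listed above, the following sketch scans N-Triples output for rdfs:seeAlso links of a given subject using only the Java standard library. It is a minimal sketch under stated assumptions: the class `SeeAlsoScan`, the method `seeAlsoTargets` and all `example.org` IRIs are hypothetical and do not reflect the vocabulary this program actually emits. Only the rdfs:seeAlso IRI is the standard RDF Schema one. The regular expression handles only triples whose three terms are all IRIs, so a real application would more likely use an RDF library with a full parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeeAlsoScan {
    // Standard IRI of the rdfs:seeAlso property.
    static final String SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso";

    // Matches only N-Triples lines whose subject, predicate and object are IRIs;
    // lines with literals or blank nodes are ignored by this simplified pattern.
    private static final Pattern IRI_TRIPLE =
        Pattern.compile("^\\s*<([^>]*)>\\s+<([^>]*)>\\s+<([^>]*)>\\s*\\.\\s*$");

    /** Returns the object IRIs of all rdfs:seeAlso triples for the given subject. */
    public static List<String> seeAlsoTargets(List<String> lines, String subject) {
        List<String> targets = new ArrayList<>();
        for (String line : lines) {
            Matcher m = IRI_TRIPLE.matcher(line);
            if (m.matches() && m.group(1).equals(subject) && m.group(2).equals(SEE_ALSO)) {
                targets.add(m.group(3));
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; the example.org IRIs are placeholders,
        // not the IRIs generated by this program.
        List<String> sample = List.of(
            "<http://example.org/word/pes> <" + SEE_ALSO + "> <http://example.org/resource/Pes> .",
            "<http://example.org/word/pes> <http://example.org/prop/gender> \"masculine\" .");
        System.out.println(seeAlsoTargets(sample, "http://example.org/word/pes"));
    }
}
```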