CsWiktionary2RDF - Czech Wiktionary Extractor

This program extracts information from the Czech Wiktionary project into RDF, a machine-readable format. For now it focuses on inflected forms of words.

Information being extracted:

parts of speech (POS)
declension, conjugation and degrees sections
pronunciation
gender and animacy

Using the program

The RDF-Wiktionary project enables extraction from unstructured data into a structured RDF format.

Available RDF output formats are RDF/XML, Turtle and N-Triples.

Choose the format by passing -x, -t or -n, respectively, as the format argument.
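
The flag-to-format mapping described above can be sketched in Java as follows. This is only an illustration: the class RdfFormatFlags and the method formatFor are hypothetical names, not taken from the program's source code, and the real argument handling may differ.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of cswiktionary2rdf itself.
class RdfFormatFlags {
    // Mapping of command-line flags to output formats, as listed in this README.
    private static final Map<String, String> FORMATS = Map.of(
            "-x", "RDF/XML",
            "-t", "Turtle",
            "-n", "N-Triples");

    // Returns the format name for a flag, or fails for an unknown flag.
    static String formatFor(String flag) {
        String format = FORMATS.get(flag);
        if (format == null) {
            throw new IllegalArgumentException("Unknown format flag: " + flag);
        }
        return format;
    }

    public static void main(String[] args) {
        System.out.println(formatFor("-t"));
    }
}
```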

The program executable is located in out/artifacts/. There are several possible scenarios for running this program.

They always start with java -jar cswiktionary2rdf.jar:

Download the dump file from the internet into the default download directory:
- java -jar cswiktionary2rdf.jar -d
Download the dump file into a specified directory:
- java -jar cswiktionary2rdf.jar -d <dir path>
Extract a specified dump file into a specified output file:
- java -jar cswiktionary2rdf.jar -e <format> <dump file> <output file>

Example of the execution:

java -jar cswiktionary2rdf.jar -e -t dump.xml output.ttl

To view this help again, use the -h parameter.
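
If the extractor needs to be started from another Java program, the extraction call shown above could be wrapped roughly like this. It is a minimal sketch that assumes cswiktionary2rdf.jar sits in the working directory and that java is on the PATH; the class ExtractorLauncher and its methods are hypothetical and not part of the project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical launcher for illustration; not part of cswiktionary2rdf itself.
class ExtractorLauncher {
    // Assumption: the jar has been copied into the current working directory.
    static final String JAR = "cswiktionary2rdf.jar";

    // Builds the same command line as the README example for the -e scenario.
    static List<String> buildExtractCommand(String formatFlag, String dumpFile, String outputFile) {
        return List.of("java", "-jar", JAR, "-e", formatFlag, dumpFile, outputFile);
    }

    // Starts the command, forwards its output to this console and returns the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> command = buildExtractCommand("-t", "dump.xml", "output.ttl");
        System.out.println(String.join(" ", command));
        // Only launch the extractor when the jar is actually present.
        if (Files.exists(Path.of(JAR))) {
            System.out.println("Exit code: " + run(command));
        }
    }
}
```

Keeping the command construction separate from the process launch makes it possible to check the arguments without running the extractor.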

Use-cases for the RDF dataset

The RDF dataset generated by this program can be used for various purposes:

find missing or incorrect information on Wiktionary through SPARQL queries
get semantics and context of a word (even through an inflected form of a word) from the Czech DBpedia (through rdfs:seeAlso links generated from the External Links section)
search any word from Wiktionary pages, and get structured RDF data as a result

The third bullet point has been partially implemented in the web app Enhanced Wiktionary Search Engine (in Czech).

It lists all the pages where the desired word form appears, and if the form appears in one of the tables of inflected forms, it also gives a description of the form with its properties, such as case, number and gender.
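
As an illustration of the rdfs:seeAlso use case mentioned above, the following sketch collects the link targets from N-Triples output using only the Java standard library. It assumes the dataset was exported with -n and that the seeAlso objects are IRIs; the class SeeAlsoLinks and the sample triples are made up for this example, and the real dataset's IRIs may look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of cswiktionary2rdf itself.
class SeeAlsoLinks {
    // Full IRI of the rdfs:seeAlso property.
    static final String SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso";

    // Matches an N-Triples line whose subject, predicate and object are all IRIs.
    private static final Pattern IRI_TRIPLE =
            Pattern.compile("\\s*<([^>]*)>\\s+<([^>]*)>\\s+<([^>]*)>\\s*\\.\\s*");

    // Returns the object IRIs of all rdfs:seeAlso triples, in input order.
    static List<String> seeAlsoTargets(List<String> lines) {
        List<String> targets = new ArrayList<>();
        for (String line : lines) {
            Matcher m = IRI_TRIPLE.matcher(line);
            if (m.matches() && m.group(2).equals(SEE_ALSO)) {
                targets.add(m.group(3));
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Made-up sample triples; example.org IRIs are placeholders.
        List<String> sample = List.of(
                "<http://example.org/word/pes> <" + SEE_ALSO + "> <http://example.org/resource/Pes> .",
                "<http://example.org/word/pes> <http://www.w3.org/2000/01/rdf-schema#label> \"pes\"@cs .");
        System.out.println(seeAlsoTargets(sample));
    }
}
```

For anything beyond a quick check, a full RDF library and SPARQL queries are a better fit than line-based matching, since this sketch ignores literals, blank nodes and escaped characters.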

