A program for extracting information from Czech Wiktionary into RDF format

martin-lukas/cswiktionary2rdf

CsWiktionary2RDF - Czech Wiktionary Extractor

This program extracts information from the Czech Wiktionary project into RDF, a machine-readable format. For now, it focuses on the inflected forms of words.

Information being extracted:

  • parts of speech (POS)
  • declension, conjugation and degrees sections
  • pronunciation
  • gender and animacy

Using the program

The RDF-Wiktionary project enables the extraction of unstructured data into a structured RDF format.

The available RDF output formats are RDF/XML, Turtle and N-Triples.

You choose the format by passing -x, -t or -n, respectively, as the format argument.
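As a rough illustration of the flag-to-format correspondence described above, the following sketch maps each flag to its format name and a file extension. The class `FormatFlags` and the enum `RdfFormat` are hypothetical names introduced here and are not part of cswiktionary2rdf; the extensions are the conventional ones for each format, which the program itself may or may not enforce.

```java
// Hypothetical helper, NOT part of cswiktionary2rdf: it only mirrors the
// flag/format pairs listed in this README.
final class FormatFlags {

    enum RdfFormat {
        RDF_XML("-x", "RDF/XML", ".rdf"),
        TURTLE("-t", "Turtle", ".ttl"),
        N_TRIPLES("-n", "N-Triples", ".nt");

        final String flag;       // command-line flag from the README
        final String label;      // human-readable format name
        final String extension;  // conventional file extension (assumption)

        RdfFormat(String flag, String label, String extension) {
            this.flag = flag;
            this.label = label;
            this.extension = extension;
        }

        // Looks up the format for a flag; rejects anything other than -x, -t, -n.
        static RdfFormat fromFlag(String flag) {
            for (RdfFormat format : values()) {
                if (format.flag.equals(flag)) {
                    return format;
                }
            }
            throw new IllegalArgumentException("Unknown format flag: " + flag);
        }
    }

    public static void main(String[] args) {
        RdfFormat format = RdfFormat.fromFlag("-t");
        System.out.println(format.label + " -> output" + format.extension);
    }
}
```

A lookup like this can be handy in a wrapper script that needs to pick a matching output file name before calling the extractor.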

The program executable is located in out/artifacts/. There are several possible scenarios for running this program.

They always start with java -jar cswiktionary2rdf.jar:

  1. Download the dump file from the internet into default download directory:

    • java -jar cswiktionary2rdf.jar -d
  2. Download the dump file into a specified directory:

    • java -jar cswiktionary2rdf.jar -d <dir path>
  3. Extract a specified dump file into a specified output file:

    • java -jar cswiktionary2rdf.jar -e <format> <dump file> <output file>

Example of the execution:

  • java -jar cswiktionary2rdf.jar -e -t dump.xml output.ttl

To view this help again, use the -h parameter.
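If the program is driven from other Java code, the three scenarios above could be assembled as argument lists along these lines. This is a minimal sketch: the class `WiktionaryCommands` and its methods are hypothetical, and it assumes that cswiktionary2rdf.jar is in the working directory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper, NOT part of cswiktionary2rdf: it only builds the
// command lines documented in this README.
final class WiktionaryCommands {

    // Assumes the jar sits in the current working directory.
    private static final List<String> BASE =
            List.of("java", "-jar", "cswiktionary2rdf.jar");

    private static List<String> command(String... args) {
        List<String> full = new ArrayList<>(BASE);
        full.addAll(List.of(args));
        return full;
    }

    // Scenario 1: download the dump into the default download directory.
    static List<String> download() {
        return command("-d");
    }

    // Scenario 2: download the dump into a specified directory.
    static List<String> download(String dirPath) {
        return command("-d", dirPath);
    }

    // Scenario 3: extract a dump file into an output file in the given format.
    static List<String> extract(String formatFlag, String dumpFile, String outputFile) {
        return command("-e", formatFlag, dumpFile, outputFile);
    }

    public static void main(String[] args) {
        // Prints the same command as the README's execution example.
        System.out.println(String.join(" ", extract("-t", "dump.xml", "output.ttl")));
    }
}
```

To actually launch the program, a list built this way can be passed to `new ProcessBuilder(command).inheritIO().start()`.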

Use-cases for the RDF dataset

The RDF dataset generated by this program can be used for various purposes:

  • find missing or incorrect information on Wiktionary through SPARQL queries
  • get the semantics and context of a word (even through one of its inflected forms) from the Czech DBpedia (through rdfs:seeAlso links generated from the External Links section)
  • search any word from Wiktionary pages, and get structured RDF data as a result

The third bullet point has been partially implemented in a web app, Enhanced Wiktionary Search Engine (in Czech).

It lists all the pages where the desired word form appears, and if the form appears in one of the tables of inflected forms, it also describes the form with properties such as case, number and gender.
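The rdfs:seeAlso links mentioned among the use-cases can be collected from N-Triples output with plain Java, as the sketch below suggests. The class `SeeAlsoLinks` is hypothetical, the sample IRIs are made up for illustration and do not reflect the vocabulary this program actually emits, and the line splitting is a simplification rather than a full N-Triples parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of cswiktionary2rdf. It scans N-Triples lines
// for rdfs:seeAlso statements using naive whitespace splitting.
final class SeeAlsoLinks {

    static final String SEE_ALSO = "<http://www.w3.org/2000/01/rdf-schema#seeAlso>";

    // Returns {subject, object} pairs for every rdfs:seeAlso statement.
    static List<String[]> find(List<String> ntriplesLines) {
        List<String[]> links = new ArrayList<>();
        for (String line : ntriplesLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            // Subject and predicate contain no whitespace; the rest is the object.
            String[] parts = trimmed.split("\\s+", 3);
            if (parts.length < 3 || !parts[1].equals(SEE_ALSO)) {
                continue;
            }
            // Drop the statement-terminating " ." from the object.
            String object = parts[2].replaceFirst("\\s*\\.\\s*$", "");
            links.add(new String[] {unwrap(parts[0]), unwrap(object)});
        }
        return links;
    }

    // Removes the angle brackets around an IRI; other terms are left as-is.
    private static String unwrap(String term) {
        if (term.startsWith("<") && term.endsWith(">")) {
            return term.substring(1, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        // Made-up sample data; real output will use the program's own IRIs.
        List<String> sample = List.of(
                "<http://example.org/word/pes> " + SEE_ALSO + " <http://example.org/dbpedia/Pes> .",
                "<http://example.org/word/pes> <http://example.org/prop/label> \"pes\" .");
        for (String[] link : find(sample)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

For a real dataset, the lines could be read with `Files.readAllLines(Path.of("output.nt"))` after an extraction run with the -n flag. Anything beyond simple filtering is better served by a SPARQL engine or a proper RDF library.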
