This project implements NLTK taggers for the Slovene language.
##Requirements
The taggers require Python 2.7 and NLTK.
##Usage
Until these taggers are built into NLTK, you can download them from the slovene_taggers/ folder and use them in NLTK.
An example showing how to use the Slovene taggers is in the file example.py.
An explanation of the tags, written in Slovene, is in jos1M/josMSD-canon-sl.tbl.
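
For orientation, here is a minimal sketch of what loading and applying one of the taggers might look like. It is not a copy of example.py. It assumes the files in slovene_taggers/ are pickled NLTK tagger objects (the usual output of nltk-trainer), and the helper names `load_tagger` and `tag_sentence` as well as the file name in the comment are hypothetical.

```python
import pickle


def load_tagger(path):
    """Load a pickled tagger object from `path`.

    Assumption: the tagger was saved with pickle. Only unpickle files
    you trust, because unpickling can execute arbitrary code.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def tag_sentence(tagger, sentence):
    """Split `sentence` on whitespace and return (token, tag) pairs.

    `tagger` is any object with an NLTK-style `tag(tokens)` method.
    """
    tokens = sentence.split()
    return tagger.tag(tokens)


# Hypothetical usage; replace the placeholder with a real file name
# from slovene_taggers/:
# tagger = load_tagger("slovene_taggers/<tagger file>")
# print(tag_sentence(tagger, "To je stavek ."))
```

Whitespace splitting is used here only to keep the sketch free of dependencies; a proper tokenizer would normally handle punctuation that is attached to words.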
##Folders and files description
- evaluation/ : output of the evaluation scripts. graph.m is Octave code for plotting the evaluation results.
- jos100k/ : Slovene corpus from the JOS project with 100,000 tagged words.
- jos1M/ : Slovene corpus from the JOS project with one million tagged words.
- paper : the LaTeX paper about this project.
- pos/jos1M.pos : input file for the trainer program in trainer/.
- slovene_taggers/ : the result of this project. The Slovene taggers stored here can be used in NLTK.
- slides/ : presentation slides in Slovene.
- trainer/ : code forked from https://github.com/japerk/nltk-trainer. This trainer is used to train the taggers.
- evaluateTaggers.sh : commands for evaluating the accuracy of the taggers.
- evaluateTaggersSpeed.py : measures the time spent on tagging.
- example.py : example showing how to use the Slovene taggers in NLTK.
- generateTaggers.sh : commands for generating the taggers. Generation uses the data in pos/jos1M.pos and the program trainer/train_tagger.py.
- transformJOS.py : code for transforming all .xml corpora from jos1M/ into pos/jos1M.pos.
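
The evaluation files listed above measure tagger accuracy and tagging time. Their contents are not reproduced here, so the following is only a rough sketch of how such measurements can be made for any object with an NLTK-style `tag(tokens)` method. The function names `tagging_accuracy` and `tagging_time` are hypothetical and do not come from the repository.

```python
import time


def tagging_accuracy(tagger, gold_sentences):
    """Return the fraction of tokens whose predicted tag matches the gold tag.

    `gold_sentences` is a list of sentences, each a list of (token, tag) pairs.
    """
    correct = 0
    total = 0
    for gold in gold_sentences:
        tokens = [token for token, _ in gold]
        predicted = tagger.tag(tokens)
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    # Avoid division by zero on an empty test set.
    return float(correct) / total if total else 0.0


def tagging_time(tagger, sentences):
    """Return the wall-clock seconds spent tagging all sentences.

    `sentences` is a list of token lists (without tags).
    """
    start = time.time()
    for tokens in sentences:
        tagger.tag(tokens)
    return time.time() - start
```

Accuracy is counted per token rather than per sentence, which is the usual convention for part-of-speech taggers. The timing covers only the `tag` calls, not loading the tagger from disk.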