scfrank/de_charlm


This is a quick character language model (by default trained on German fiction text). It uses SRILM and KenLM; the setup follows http://victor.chahuneau.fr/notes/2012/07/03/kenlm.html

Installation/setup

Uses Python3. KenLM requires the python3-dev packages (available via apt-get).

Virtualenv setup, in your project directory:

virtualenv -p python3 venv
source venv/bin/activate

KenLM depends on Boost >= 1.36.0 (install it via your OS package manager). Then install the Python dependencies (Cython and KenLM):

pip install cython
git clone https://github.com/vchahun/kenlm.git
cd kenlm
./bjam                  # compile the LM estimation code
python setup.py install # install the Python module
cd -

(In principle KenLM can be compiled with --max-kenlm-order=12 to support better character language models, but I couldn't get that to work with the Python interface.)

Usage

You probably want to use the LMQuerier in query_lm.py:

from query_lm import LMQuerier

lmq = LMQuerier(lm_model)
lm_pplx = lmq.query_lm(example_string)

lmq.query_lm() "tokenises" the string (using char_string()) and scores it (using norm_score()); the argument should be a regular string (e.g. "hier ist eine Schur").
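The implementations of these helpers aren't shown here, but a minimal sketch of what they plausibly do is below. The <sp> space placeholder and the base-10 perplexity normalisation are assumptions for illustration, not taken from query_lm.py:

```python
import math

def char_string(text, lowercase=True):
    """Turn a string into space-separated characters, so the word-level
    LM tools see one character per 'token'. Literal spaces are replaced
    by a placeholder token (assumed here to be <sp>)."""
    if lowercase:
        text = text.lower()
    return " ".join("<sp>" if c == " " else c for c in text)

def norm_score(total_log10_prob, n_chars):
    """Length-normalised perplexity from a total log probability.
    KenLM reports scores in log base 10, hence the base here."""
    return 10 ** (-total_log10_prob / n_chars)
```

For example, char_string("Ab c") yields "a b <sp> c", which is the format the character LM was trained on.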

On the command line (where it will just do some basic testing), query_lm.py takes a model name:

$ python query_lm.py -h
usage: query_lm.py [-h] [--notlower] [--model MODEL]

Lowercasing the data is the default. If the model was estimated on non-lowercased data, you probably want to use the --notlower flag or set lowercase=False when initialising the LMQuerier.

The model can be an ARPA-format LM from SRILM or one that has been binarised with KenLM (see below).

Corpus

I'm using the later German novels in the txtLAB450 corpus (http://txtlab.org/?p=601). These are included in DE_19_novels.tar.

The tokchars.py script needs to be pointed at the correct directory (i.e. where you've unpacked the corpus).

Any other corpus can be used, but it should be relatively clean, since no tokenisation (apart from optional lowercasing) is performed. The novels corpus is not one sentence per line, which will add some noise, but hopefully nothing serious.

Building an LM

If you want to build an LM, you need SRILM (you're on your own here). The LM provided here was created with the following command:

ngram-count -order 12 -text de_chars -write-vocab de_char_vocab \
    -no-sos -no-eos -tolower -wbdiscount -lm char_de_wb_o12_low.lm

Note that Witten-Bell smoothing works with character distributions, whereas Kneser-Ney fails (which is why we can't use KenLM itself to estimate the LM).

After building the LM, use KenLM to binarise it:

./kenlm/bin/build_binary char_de_wb_o12_low.lm char_de_wb_o12_low.lm.klm
