How to use a custom dictionary in Nori

Nori, Lucene’s Korean analyzer, is built from a specific version of mecab-ko-dic. This document shows how to create a distribution that uses a custom dictionary. The procedure is manual and requires multiple steps, so don’t follow it if you only intend to add a few words to the existing dictionary. That can be done in the Elasticsearch plugin itself by providing a user dictionary. However, if your domain-specific vocabulary is large (several thousand entries), rebuilding the original dictionary with your extra rules can:

  • Lower the memory usage compared to the user-dic approach.

  • Speed up the creation of the analyzer/tokenizer.

  • Speed up the analysis.

Prerequisite

We’ll need to compile Lucene and Elasticsearch, so make sure that ant, gradle, and java are installed on your system.
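As a quick sanity check before starting, the sketch below reports which of the three tools cannot be found on PATH. The helper name check_tools is hypothetical, and the check only looks for the executables; it does not verify versions, which this document does not specify.

```shell
#!/bin/sh
# Print the names of the given tools that are not found on PATH,
# separated by spaces (prints an empty line when all are present).
check_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  # Strip the leading space left by the concatenation above.
  echo "${missing# }"
}

missing_tools=$(check_tools ant gradle java)
if [ -n "$missing_tools" ]; then
  echo "Missing build tools: $missing_tools"
else
  echo "ant, gradle and java are all on PATH"
fi
```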

Installing MeCab

First you need to install MeCab. You can download mecab-ko (a fork of MeCab for Korean) here. Extract the compressed archive and install MeCab by running the following commands from the directory where the archive was extracted:

$ ./configure
$ make
$ sudo make install

At this point you should be able to run:

$ mecab -v
mecab of 0.996/ko-0.9.2

Installing mecab-ko-dic

Download the latest version of mecab-ko-dic, the dictionary that Nori uses by default. Extract the compressed archive and go to the directory containing the source code:

$ tar xvf mecab-ko-dic-2.0.3-20170922.tar.gz
$ cd mecab-ko-dic-2.0.3-20170922

Run the following command to install the dictionary:

$ ./configure
$ make
$ sudo make install

Adding custom words

In this section we’ll see how to add custom words to the original distribution downloaded in the previous steps.

Create a file with a csv extension (custom-words.csv for instance) inside the user-dic directory of the mecab-ko-dic. The directory should now look like this:

$ ls
AUTHORS           EP.csv            IC.csv            MM.csv            NNBC.csv          Person-actor.csv  README            VX.csv            XSV.csv           config.log        install-sh        model.def         unk.def
COPYING           ETM.csv           INSTALL           Makefile          NNG.csv           Person.csv        Symbol.csv        Wikipedia.csv     aclocal.m4        config.status     left-id.def       pos-id.def        unk.dic
ChangeLog         ETN.csv           Inflect.csv       Makefile.am       NNP.csv           Place-address.csv VA.csv            XPN.csv           autogen.sh        configure         matrix.bin        rewrite.def       user-dic
CoinedWord.csv    Foreign.csv       J.csv             Makefile.in       NP.csv            Place-station.csv VCN.csv           XR.csv            char.bin          configure.ac      matrix.def        right-id.def
EC.csv            Group.csv         MAG.csv           NEWS              NR.csv            Place.csv         VCP.csv           XSA.csv           char.def          dicrc             missing           sys.dic
EF.csv            Hanja.csv         MAJ.csv           NNB.csv           NorthKorea.csv    Preanalysis.csv   VV.csv            XSN.csv           clean             feature.def       model.bin         tools
$ ls user-dic
README.md        custom-words.csv nnp.csv          person.csv       place.csv

Remove the other files, person.csv, place.csv and nnp.csv (they contain examples of custom entries):

$ rm user-dic/person.csv user-dic/place.csv user-dic/nnp.csv
$ ls user-dic
README.md        custom-words.csv

Warning
Removing the extra files is mandatory: place.csv contains an entry that Nori cannot parse.

Add your custom words in the csv file:

  • Add a proper noun:

대우,,,,NNP,*,F,대우,*,*,*,*
구글,,,,NNP,*,T,구글,*,*,*,*

  • Add a person:

까비,,,,NNP,인명,F,까비,*,*,*,*

  • Add place names:

세종,,,,NNP,지명,T,세종,*,*,*,*
세종시,,,,NNP,지명,F,세종시,Compound,*,*,세종/NNP/지명+시/NNG/*

If you need to add other parts of speech, you can check the full list here.
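Every sample entry above has exactly 12 comma-separated fields, so a malformed line is easy to catch before going further. The following is a minimal sketch of such a check; the helper name check_fields and the sample file are hypothetical, and it assumes that no field contains an embedded comma. Blank lines are reported as well.

```shell
#!/bin/sh
# Print the line number of every entry that does not have exactly
# 12 comma-separated fields (the field count of the sample entries).
check_fields() {
  awk -F',' 'NF != 12 { print NR }' "$1"
}

# Hypothetical sample file: line 1 is well formed, line 2 is truncated.
sample=$(mktemp)
printf '%s\n' \
  '대우,,,,NNP,*,F,대우,*,*,*,*' \
  '구글,,,,NNP,*,T' > "$sample"

check_fields "$sample"   # prints 2
rm -f "$sample"
```

Running check_fields user-dic/custom-words.csv from the mecab-ko-dic directory should print nothing when every entry has the expected shape.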

Run the following command:

$ ./tools/add-userdic.sh

At this point the mecab-ko-dic directory should look like this:

$ ls
AUTHORS               ETN.csv               MAG.csv               NNBC.csv              Place-address.csv     VCP.csv               XSV.csv               configure             missing               unk.def
COPYING               Foreign.csv           MAJ.csv               NNG.csv               Place-station.csv     VV.csv                aclocal.m4            configure.ac          model.bin             unk.dic
ChangeLog             Group.csv             MM.csv                NNP.csv               Place.csv             VX.csv                autogen.sh            dicrc                 model.def             user-custom-words.csv
CoinedWord.csv        Hanja.csv             Makefile              NP.csv                Preanalysis.csv       Wikipedia.csv         char.bin              feature.def           pos-id.def            user-dic
EC.csv                IC.csv                Makefile.am           NR.csv                README                XPN.csv               char.def              install-sh            rewrite.def
EF.csv                INSTALL               Makefile.in           NorthKorea.csv        Symbol.csv            XR.csv                clean                 left-id.def           right-id.def
EP.csv                Inflect.csv           NEWS                  Person-actor.csv      VA.csv                XSA.csv               config.log            matrix.bin            sys.dic
ETM.csv               J.csv                 NNB.csv               Person.csv            VCN.csv               XSN.csv               config.status         matrix.def            tools

Check that the user-custom-words.csv file is present and that it contains the expanded version of your custom entries:

$ ls user-custom-words.csv
$ cat user-custom-words.csv
대우,1783,3538,4394,NNP,*,F,대우,*,*,*,*
구글,1783,3539,3534,NNP,*,T,구글,*,*,*,*
까비,1785,3542,5464,NNP,인명,F,까비,*,*,*,*
세종,1786,3546,5188,NNP,지명,T,세종,*,*,*,*
세종시,1786,3545,5100,NNP,지명,F,세종시,Compound,*,*,세종/NNP/지명+시/NNG/*
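In the output above, the three empty fields after the surface form (the left context ID, right context ID and cost in MeCab’s dictionary format) have been filled with integers. A rough way to confirm that no entry was left unexpanded is sketched below; the helper name check_expanded is hypothetical.

```shell
#!/bin/sh
# Print the surface form of every entry whose fields 2-4 are not
# all integers, i.e. entries that were apparently not expanded.
check_expanded() {
  awk -F',' '
    $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $4 !~ /^-?[0-9]+$/ { print $1 }
  ' "$1"
}

# Only run the check when the generated file is in the current directory.
if [ -f user-custom-words.csv ]; then
  check_expanded user-custom-words.csv
fi
```

An empty result means that every entry carries numeric IDs and a cost.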

Create an archive of the modified dictionary by running the following command from the parent directory of mecab-ko-dic-2.0.3-20170922:

$ tar cvzf custom-mecab-ko-dic.tar.gz mecab-ko-dic-2.0.3-20170922
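Since the Lucene build will only see what is inside the archive, it can be worth confirming that the expanded entries were packaged. This is a small sketch that lists the archive and looks for user-custom-words.csv; the helper name archive_has_custom_words is hypothetical.

```shell
#!/bin/sh
# Succeed if the archive given as $1 lists a user-custom-words.csv file.
archive_has_custom_words() {
  tar tzf "$1" | grep 'user-custom-words\.csv$' >/dev/null
}

# Only run the check when the archive is in the current directory.
if [ -f custom-mecab-ko-dic.tar.gz ]; then
  if archive_has_custom_words custom-mecab-ko-dic.tar.gz; then
    echo "archive contains the expanded custom entries"
  else
    echo "user-custom-words.csv is missing from the archive"
  fi
fi
```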

We’ll use this archive in the next section to build the Lucene module.

Building Lucene’s binary dictionary

The Nori module uses a binary dictionary that is created from a mecab-ko-dic distribution. In this section we’ll create a binary dictionary for Lucene’s Korean module using the modified distribution. The dictionary is built from source and packaged inside the jar, so you need to check out Lucene. We’ll create a custom jar for Lucene 7.4.0 (not released yet, so we use branch_7x):

$ git clone -b branch_7x https://github.com/apache/lucene-solr.git

Now open lucene/analysis/nori/ivy.xml with your favorite editor and replace the line:

<artifact name="mecab-ko-dic" type=".tar.gz" url="https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.3-20170922.tar.gz" />

with:

<artifact name="mecab-ko-dic" type=".tar.gz" url="file:///change/me/custom-mecab-ko-dic.tar.gz" />
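If you prefer to script the substitution rather than edit the file by hand, something along these lines may work. It is only a sketch: the helper name patch_ivy is hypothetical, the archive path is assumed to be absolute and free of characters that are special to sed (such as | or &), and the file is rewritten through a temporary copy to avoid relying on a particular sed -i syntax.

```shell
#!/bin/sh
# Point the mecab-ko-dic artifact in an ivy.xml file at a local archive.
# $1: path to ivy.xml
# $2: absolute path to the custom archive (no '|', '&' or '\' in it)
patch_ivy() {
  tmp=$(mktemp)
  # Replace the bitbucket download URL with a file:// URL.
  sed "s|url=\"https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/[^\"]*\"|url=\"file://$2\"|" "$1" > "$tmp" \
    && cat "$tmp" > "$1"
  rm -f "$tmp"
}

# Example, using the placeholder path from the text above:
# patch_ivy lucene/analysis/nori/ivy.xml /change/me/custom-mecab-ko-dic.tar.gz
```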

This replaces the original dictionary with the one we modified in the previous steps. Go to lucene/analysis/nori and run:

$ ant regenerate

This will create a new binary dictionary from our custom dictionary in src/resources/org/apache/lucene/analysis/ko/dict/. Verify that the binary dictionary is present and differs from the original one:

$ git status .
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   ivy.xml
	modified:   src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$buffer.dat
	modified:   src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$fst.dat
	modified:   src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$targetMap.dat
	modified:   src/tools/java/org/apache/lucene/analysis/ko/util/TokenInfoDictionaryBuilder.java

no changes added to commit (use "git add" and/or "git commit -a")

We can now create a jar to distribute the module with our custom dictionary:

$ ant jar

The jar for the custom module can be found in lucene/build/analysis/nori/lucene-analyzers-nori-7.4.0-SNAPSHOT.jar, relative to the root of the Lucene checkout. Copy this file; we’ll need it in the next steps.

Building a custom plugin for Elasticsearch

In this section we are going to build a custom version of the Elasticsearch plugin for Nori that uses the Lucene module jar produced in the previous step. We’ll need access to the Elasticsearch source, so the first step is to check out the code of Elasticsearch 6.4.0 (not released yet, so we use the 6.x branch):

$ git clone -b 6.x https://github.com/elastic/elasticsearch

Now go to elasticsearch/plugins/analysis-nori and open the file build.gradle with your favorite editor. Replace the following line:

compile "org.apache.lucene:lucene-analyzers-nori:${versions.lucene}"

with:

compile files('/change/me/lucene-analyzers-nori-7.4.0-SNAPSHOT.jar')

This tells gradle to build the plugin against the modified jar we built in the previous step.

From the analysis-nori directory, run the following command to produce the custom distribution for our plugin:

$ gradle assemble
...
BUILD SUCCESSFUL in 2m 5s
28 actionable tasks: 28 executed
$ ls build/distributions
analysis-nori-6.4.0-SNAPSHOT-javadoc.jar analysis-nori-6.4.0-SNAPSHOT-sources.jar analysis-nori-6.4.0-SNAPSHOT.jar         analysis-nori-6.4.0-SNAPSHOT.pom         analysis-nori-6.4.0-SNAPSHOT.zip

If the command succeeds, you’ll find in build/distributions the zip distribution that you can install in Elasticsearch. Copy this file; we’ll need it in the next step.

Testing in Elasticsearch

Download the 6.4.0 version of Elasticsearch (not released yet, so we use 6.x).

Extract the distribution and run the following command from the Elasticsearch directory:

$ ./bin/elasticsearch-plugin install file:///change/me/analysis-nori-6.4.0-SNAPSHOT.zip

And we’re done. You can now start Elasticsearch and check whether the custom words are recognized:

$ ./bin/elasticsearch

Try the Nori analyzer with:

POST _analyze
{
	"text": "대우그룹",
	"analyzer": "nori",
	"explain": true
}
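From a plain terminal, the same request can be sent with curl. The sketch below assumes that Elasticsearch listens on its default address, http://localhost:9200, and that the text needs no JSON escaping; the helper names analyze_body and analyze are hypothetical.

```shell
#!/bin/sh
# Build the JSON body of the _analyze request for the text given as $1.
# The text is assumed to contain no characters that need JSON escaping.
analyze_body() {
  printf '{"text": "%s", "analyzer": "nori", "explain": true}' "$1"
}

# Send the request to the node at $2 (default: http://localhost:9200).
analyze() {
  curl -s -X POST "${2:-http://localhost:9200}/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1")"
}

# Example, with Elasticsearch running locally:
# analyze '대우그룹'
```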

The answer should look like this:

{
    "detail": {
        "custom_analyzer": false,
        "analyzer": {
            "name": "org.apache.lucene.analysis.ko.KoreanAnalyzer",
            "tokens": [
                {
                    "token": "대우",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "bytes": "[eb 8c 80 ec 9a b0]",
                    "leftPOS": "NNP(Proper Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "NNP(Proper Noun)",
                    "termFrequency": 1
                },
                {
                    "token": "그룹",
                    "start_offset": 2,
                    "end_offset": 4,
                    "type": "word",
                    "position": 1,
                    "bytes": "[ea b7 b8 eb a3 b9]",
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "NNG(General Noun)",
                    "termFrequency": 1
                }
            ]
        }
    }
}