About the dictionary
From:
- 辞書について (About the dictionary) | Chapter 07 | 実践:形態素解析 kagome v2 (Practical Morphological Analysis with kagome v2) @ Zenn.dev
- issue comment @ Issue #277
The kagome module provides dictionaries in a format that can be embedded in Go programs.
kagome supports two dictionaries, IPA and Uni, as standard.
$ go get github.com/ikawaha/kagome/v2
go: downloading github.com/ikawaha/kagome v1.11.2
go: downloading github.com/ikawaha/kagome/v2 v2.9.0
go: downloading github.com/ikawaha/kagome-dict v1.0.7
go: downloading github.com/ikawaha/kagome-dict/ipa v1.0.9
go: downloading github.com/ikawaha/kagome-dict/uni v1.1.8
go: added github.com/ikawaha/kagome-dict v1.0.7
go: added github.com/ikawaha/kagome-dict/ipa v1.0.9 // <-- IPADIC
go: added github.com/ikawaha/kagome-dict/uni v1.1.8 // <-- UniDIC
go: added github.com/ikawaha/kagome/v2 v2.9.0
A program can simply "import" one of these dictionaries to use and embed it. Once loaded into memory, the dictionary works as a singleton and can be shared by several morphological analyzers.
package main

import (
	"fmt"
	"log"

	"github.com/ikawaha/kagome-dict/ipa" // use and embed IPADIC
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// Create a new tokenizer using IPADIC with the OmitBosEos option,
	// which leaves the BOS/EOS marker tokens out of the result
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		log.Fatal(err)
	}
	// Segment the input into tokens
	seg := t.Wakati("すもももももももものうち")
	fmt.Printf("%#v\n", seg)
	// Output: []string{"すもも", "も", "もも", "も", "もも", "の", "うち"}
}
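The singleton behaviour described above can be pictured with a small standard-library sketch. It does not use kagome: `Dict`, `SharedDict`, `Analyzer`, and `NewAnalyzer` are hypothetical names invented for illustration, and `sync.Once` is only an assumption about how a load-once dictionary might be built, not a description of the dictionary packages' internals.

```go
package main

import (
	"fmt"
	"sync"
)

// Dict stands in for a loaded dictionary (hypothetical, not kagome's type).
type Dict struct {
	Words map[string]string
}

var (
	once      sync.Once
	shared    *Dict
	loadCount int // counts how many times the dictionary was actually loaded
)

// SharedDict loads the dictionary on the first call and returns the
// same pointer on every later call.
func SharedDict() *Dict {
	once.Do(func() {
		loadCount++
		shared = &Dict{Words: map[string]string{"すもも": "名詞"}}
	})
	return shared
}

// Analyzer is a hypothetical analyzer that only keeps a reference to the dictionary.
type Analyzer struct {
	dict *Dict
}

// NewAnalyzer builds an analyzer on top of an already loaded dictionary.
func NewAnalyzer(d *Dict) *Analyzer {
	return &Analyzer{dict: d}
}

func main() {
	a := NewAnalyzer(SharedDict())
	b := NewAnalyzer(SharedDict())
	// Both analyzers point at the same dictionary, which was loaded once.
	fmt.Println(a.dict == b.dict, loadCount) // true 1
}
```

In this sketch, creating a second analyzer costs almost nothing because the dictionary data is not copied.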
As already mentioned, kagome supports two standard dictionaries, IPADIC and UniDIC.
IPADIC is MeCab's so-called "standard dictionary", characterized by a more intuitive separation of morphological units than UniDIC. In contrast, UniDIC splits a sentence into smaller units intended for retrieving examples.
Both dictionaries are quite old. Although the figures are not directly comparable, IPADIC has a vocabulary of about 400,000 words and UniDIC about 750,000. IPADIC is therefore more suitable for memory-limited environments, while UniDIC's shorter lexical units make it more suitable for splitting words when searching.
Dictionary | Source | Go Package |
---|---|---|
IPADIC (MeCab) | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa |
UniDIC | UniDIC-mecab-2.1.2_src | github.com/ikawaha/kagome-dict/uni |
Both dictionaries are a "set of morphemes" of the "dict.Dict" type; only the information they contain differs.
// Dict as defined in the dict package of github.com/ikawaha/kagome-dict
// (referred to as dict.Dict from other packages)
type Dict struct {
	Morphs       Morphs
	POSTable     POSTable
	ContentsMeta ContentsMeta
	Contents     Contents
	Connection   ConnectionTable
	Index        IndexTable
	CharClass    CharClass
	CharCategory CharCategory
	InvokeList   InvokeList
	GroupList    GroupList
	UnkDict      UnkDict
}
That is, any package that provides a dictionary of the same dict.Dict type can be embedded as a system dictionary.
For example, NEologd and the Korean dictionary for MeCab are also available as such dictionaries, albeit on an "experimental" basis.
NEologd collects proper nouns from the Internet and covers a wide vocabulary, while the Korean dictionary is a Korean morphological dictionary built for MeCab.
Dictionary | Source | Go Package |
---|---|---|
IPADIC-NEologd (MeCab) | mecab-ipadic-neologd | github.com/ikawaha/kagome-ipa-neologd |
Korean (MeCab) | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko |
The usage is the same as before: fetch the dictionary package with go get and import it.
$ go get github.com/ikawaha/kagome-dict-ko
go: downloading github.com/ikawaha/kagome-dict-ko v1.1.0
go: added github.com/ikawaha/kagome-dict-ko v1.1.0
package main

import (
	"fmt"
	"log"

	ko "github.com/ikawaha/kagome-dict-ko" // use and embed Korean dict
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// Create a new tokenizer using the Korean dictionary with the OmitBosEos option
	t, err := tokenizer.New(ko.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		log.Fatal(err)
	}
	// Segment the input into tokens
	seg := t.Wakati("환영합니다, 한국에.")
	fmt.Printf("%#v\n", seg)
	// Output: []string{"환영", "합니다", ",", " ", "한국", "에", "."}
}
As already mentioned, IPADIC and UniDIC are both a "set of morphemes" of the dict.Dict type; only the information they contain differs.
However, although UniDIC has a larger registered vocabulary than IPADIC, many argue that it is less accurate than IPADIC.
$ # IPA DICT
$ echo "私は日本人です。" | kagome -sysdict ipa
私 名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
日本人 名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。 記号,句点,*,*,*,*,。,。,。
EOS
$ # Uni DICT
$ echo "私は日本人です。" | kagome -sysdict uni
私 代名詞,*,*,*,*,*,ワタクシ,私-代名詞,私,ワタクシ,私,ワタクシ,和,*,*,*,*
は 助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
日本 名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
人 接尾辞,名詞的,一般,*,*,*,ニン,人,人,ニン,人,ニン,漢,*,*,*,*
です 助動詞,*,*,*,助動詞-デス,終止形-一般,デス,です,です,デス,です,デス,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
Note the difference between "日本人" and "日本" + "人". It arises because the two dictionaries were built for different purposes.
UniDIC is a dictionary based on the "short units" (短単位) defined by NINJAL to facilitate the collection of examples for the BCCWJ.
- NINJAL (National Institute of Japanese Language and Linguistics)
- BCCWJ (Balanced Corpus of Contemporary Written Japanese)
These "short units" are known to be too short to be used in "natural language processing" for syntactic and semantic analysis.
It is therefore understandable why some people claim that UniDIC is less accurate than IPADIC. In this respect, IPADIC is faster and more convenient for most use cases.
An advantage of UniDIC is the "consistency" in word segmentation.
The difference between the two dictionaries, IPA and Uni, is illustrated by a well-known example.
"りんごジュースを飲んだ。" vs "リンゴジュースを飲んだ。"
Both are correct and mean the same thing: "I drank apple juice".
And here comes the problem.
$ # IPA DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict ipa
りん 副詞,助詞類接続,*,*,*,*,りん,リン,リン
ご 接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ
ジュース 名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん 動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ 助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。 記号,句点,*,*,*,*,。,。,。
EOS
$ # UNI DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict uni
りんご 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
ジュース 名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん 動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ 助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
Note the difference between "りん, ご" and "りんご".
IPADIC analyzed "りんご" as an adverb + prefix (副詞 + 接頭詞) combination, whereas UniDIC analyzed it as a noun (名詞).
The simplest solution, apart from registering a user dictionary, is to use katakana notation.
$ # IPADICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict ipa
リンゴ 名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
ジュース 名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん 動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ 助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。 記号,句点,*,*,*,*,。,。,。
EOS
$ # UniDICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict uni
リンゴ 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
ジュース 名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん 動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ 助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
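The katakana workaround can also be applied in code before the text reaches the tokenizer. The following is a minimal sketch using only the standard library; `toKatakana` and `normalizeWords` are hypothetical helpers, not part of kagome. It relies on the Unicode hiragana range U+3041 to U+3096 lying exactly 0x60 below the corresponding katakana, and it rewrites only an explicit word list, because converting a whole sentence would also rewrite particles such as "を".

```go
package main

import (
	"fmt"
	"strings"
)

// toKatakana converts every hiragana rune in s to its katakana counterpart.
// Hiragana U+3041..U+3096 map to katakana U+30A1..U+30F6 by adding 0x60.
func toKatakana(s string) string {
	return strings.Map(func(r rune) rune {
		if r >= 0x3041 && r <= 0x3096 {
			return r + 0x60
		}
		return r
	}, s)
}

// normalizeWords rewrites only the listed words to katakana and leaves
// the rest of the text untouched.
func normalizeWords(text string, words []string) string {
	for _, w := range words {
		text = strings.ReplaceAll(text, w, toKatakana(w))
	}
	return text
}

func main() {
	fmt.Println(normalizeWords("りんごジュースを飲んだ。", []string{"りんご"}))
	// prints: リンゴジュースを飲んだ。
}
```

A plain substring replacement like this can also hit the same characters inside an unrelated longer word, so the word list should be chosen with care.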
For a human reader, however, "りんごジュース" is easier to read than "リンゴジュース" because the words are visually separated (a hiragana-katakana mixture vs. all katakana).
Note also that both dictionaries include both "りんご" and "リンゴ" as a noun (名詞).
$ # IPA DICT
$ echo "りんご" | kagome -sysdict ipa
りんご 名詞,一般,*,*,*,*,りんご,リンゴ,リンゴ
EOS
$ echo "リンゴ" | kagome -sysdict ipa
リンゴ 名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
EOS
$ # UNI DICT
$ echo "りんご" | kagome -sysdict uni
りんご 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
EOS
$ echo "リンゴ" | kagome -sysdict uni
リンゴ 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
EOS
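In the UniDIC output above, both spellings carry the same value "林檎" in the eighth feature column, which appears to be the lemma. The sketch below parses such output lines with the standard library only and reads that column. `parseLine` and `uniLemma` are hypothetical helpers, and the column position and the count of 17 features are assumptions taken from these sample lines, not from a documented schema.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLine splits one line of the output shown above into the surface form
// and its comma-separated features. Lines without exactly two
// whitespace-separated fields (such as "EOS") are reported as not ok.
func parseLine(line string) (surface string, features []string, ok bool) {
	fields := strings.Fields(line)
	if len(fields) != 2 {
		return "", nil, false
	}
	return fields[0], strings.Split(fields[1], ","), true
}

// uniLemma returns feature column 8 (index 7) of a UniDIC-style line.
// It refuses shorter feature lists, such as the 9-column IPADIC lines,
// where index 7 holds something else.
func uniLemma(features []string) (string, bool) {
	if len(features) < 17 {
		return "", false
	}
	return features[7], true
}

func main() {
	lines := []string{
		"りんご 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*",
		"リンゴ 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*",
		"EOS",
	}
	for _, l := range lines {
		surface, feats, ok := parseLine(l)
		if !ok {
			continue // skip EOS and malformed lines
		}
		if lemma, ok := uniLemma(feats); ok {
			fmt.Printf("%s -> %s (%d features)\n", surface, lemma, len(feats))
		}
	}
}
```

Indexing documents by this column instead of the surface form would let the two spellings be treated as the same word.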
The difference is that IPADIC attempted to interpret them grammatically, while UniDIC interpreted them in short units.
- "日本人" (noun) vs "日本, 人" (noun + suffix)
- "りん, ご, ジュース" (adverb + prefix + noun) vs "りんご, ジュース" (noun + noun)
In both cases, the latter segmentation yields units suitable for search engines and similar systems.
This means that "short units" are effective in unifying the units by which examples are looked up in search engines and other information retrieval systems.
Thus, UniDIC is the better fit for word search purposes.
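The effect on search can be sketched with a tiny inverted index. The token lists below are copied from the IPADIC and UniDIC outputs shown earlier rather than produced by a tokenizer, and `buildIndex` is a hypothetical helper that uses exact token matching only; real search engines typically do more than this.

```go
package main

import "fmt"

// buildIndex maps each token to the IDs of the documents that contain it.
func buildIndex(docs [][]string) map[string][]int {
	index := make(map[string][]int)
	for id, tokens := range docs {
		seen := make(map[string]bool) // avoid listing a document twice per token
		for _, tok := range tokens {
			if !seen[tok] {
				seen[tok] = true
				index[tok] = append(index[tok], id)
			}
		}
	}
	return index
}

func main() {
	// Document 0: 私は日本人です。  Document 1: りんごジュースを飲んだ。
	ipaDocs := [][]string{
		{"私", "は", "日本人", "です", "。"},
		{"りん", "ご", "ジュース", "を", "飲ん", "だ", "。"},
	}
	uniDocs := [][]string{
		{"私", "は", "日本", "人", "です", "。"},
		{"りんご", "ジュース", "を", "飲ん", "だ", "。"},
	}
	ipaIndex := buildIndex(ipaDocs)
	uniIndex := buildIndex(uniDocs)

	for _, q := range []string{"日本", "りんご"} {
		fmt.Printf("%s: ipa=%v uni=%v\n", q, ipaIndex[q], uniIndex[q])
	}
	// prints:
	// 日本: ipa=[] uni=[0]
	// りんご: ipa=[] uni=[1]
}
```

With the IPADIC segmentation neither query matches any token, while the shorter UniDIC units let both queries find their documents.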