A tokenizer for Go based on a dictionary and a bigram language model. (It currently supports Chinese segmentation only.)
I wanted a simple tokenizer with no unnecessary overhead that uses only the standard library, follows good practices, and is well tested.
- Support Maximum Matching Method
- Support Minimum Matching Method
- Support Reverse Maximum Matching
- Support Reverse Minimum Matching
- Support Bidirectional Maximum Matching
- Support Bidirectional Minimum Matching
- Support stop-token filtering
- Support custom word filters
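To illustrate what the matching methods in this list do, here is a minimal sketch of forward and reverse maximum matching. It is not gotokenizer's implementation: `forwardMaxMatch`, `reverseMaxMatch`, the `maxLen` parameter, and the tiny in-memory dictionary are all hypothetical names and data chosen for the example.

```go
package main

import "fmt"

// forwardMaxMatch is a hypothetical helper, not part of gotokenizer's API.
// It scans left to right and, at each position, takes the longest dictionary
// word of at most maxLen runes. Unknown runes become single-rune tokens.
func forwardMaxMatch(text string, dict map[string]bool, maxLen int) []string {
	if maxLen < 1 {
		maxLen = 1
	}
	runes := []rune(text)
	var tokens []string
	for i := 0; i < len(runes); {
		end := i + maxLen
		if end > len(runes) {
			end = len(runes)
		}
		// Shrink the window until it is a dictionary word or one rune is left.
		for end > i+1 && !dict[string(runes[i:end])] {
			end--
		}
		tokens = append(tokens, string(runes[i:end]))
		i = end
	}
	return tokens
}

// reverseMaxMatch is the same idea (also hypothetical), scanning right to left.
func reverseMaxMatch(text string, dict map[string]bool, maxLen int) []string {
	if maxLen < 1 {
		maxLen = 1
	}
	runes := []rune(text)
	var tokens []string
	for j := len(runes); j > 0; {
		start := j - maxLen
		if start < 0 {
			start = 0
		}
		// Shrink the window from the left until it is a dictionary word.
		for start < j-1 && !dict[string(runes[start:j])] {
			start++
		}
		// Prepend so the result stays in reading order.
		tokens = append([]string{string(runes[start:j])}, tokens...)
		j = start
	}
	return tokens
}

func main() {
	// Assumed toy dictionary; a real one would be loaded from a file.
	dict := map[string]bool{"研究": true, "研究生": true, "生命": true, "起源": true}
	text := "研究生命起源"
	fmt.Println(forwardMaxMatch(text, dict, 3)) // [研究生 命 起源]
	fmt.Println(reverseMaxMatch(text, dict, 3)) // [研究 生命 起源]
}
```

The two directions can disagree on the same input, as they do here, which is the usual motivation for a bidirectional method that runs both and picks one result by some heuristic.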
go get -u github.com/xujiajun/gotokenizer
package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"

	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	// NewMaxMatch uses NumAndLetterWordFilter as its default word filter.
	mm := gotokenizer.NewMaxMatch(dictPath)
	// Load the dictionary before tokenizing.
	mm.LoadDict()

	fmt.Println(mm.Get(text)) //[gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

	// Enable stop-token filtering and load the stop-token list.
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDicPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	mm.StopTokens.Load(stopTokenDicPath)

	fmt.Println(mm.Get(text)) //[gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>

	// GetFrequency counts how often each token occurs in the text.
	fmt.Println(mm.GetFrequency(text)) //map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>
}
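The effect of stop-token filtering and frequency counting in the example above can be approximated with plain maps. The following is only a sketch under assumptions: `filterStopTokens` and `frequency` are hypothetical helpers, not functions from gotokenizer, and the token slice and stop list are hand-written rather than loaded from the data files.

```go
package main

import "fmt"

// filterStopTokens (hypothetical) drops every token found in the stop set.
func filterStopTokens(tokens []string, stop map[string]bool) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if !stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

// frequency (hypothetical) counts how many times each token occurs.
func frequency(tokens []string) map[string]int {
	freq := make(map[string]int)
	for _, t := range tokens {
		freq[t]++
	}
	return freq
}

func main() {
	// Assumed input: tokens as a segmenter might return them.
	tokens := []string{"支持", "stopToken", "过滤", "和", "自定义", "word", "过滤", "功能", "。"}
	// Assumed stop list; a real one would come from a stop-token file.
	stop := map[string]bool{"和": true, "。": true}

	kept := filterStopTokens(tokens, stop)
	fmt.Println(kept)                  // [支持 stopToken 过滤 自定义 word 过滤 功能]
	fmt.Println(frequency(kept)["过滤"]) // 2
}
```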
For more examples, see the tests.
If you'd like to help out with the project, you can open a pull request.
gotokenizer is open-source software licensed under the Apache-2.0 license.
This package is inspired by the following: