A Chinese word processing toolkit
- Although not the fastest, FoolNLTK is probably the most accurate open source Chinese word segmenter available
- Trained with a BiLSTM model
- High accuracy in word segmentation, part-of-speech tagging, and entity recognition
- User-defined dictionary
- Ability to train your own models
- Allows for batch processing
2020/2/16 update: training with a BERT model and exporting the model for deployment are now supported; see the Chinese-language training documentation
To download and build FoolNLTK, type:
git clone https://github.com/rockyzhengwu/FoolNLTK.git
cd FoolNLTK/train
For detailed instructions
- Only tested in Linux Python 3 environment.
pip install foolnltk
import fool
text = "一个傻子在北京"
print(fool.cut(text))
# ['一个', '傻子', '在', '北京']
For command-line word segmentation, specify the -b parameter to increase the number of lines segmented per run.
python -m fool [filename]
The dictionary has one entry per line: a word followed by its weight. The higher the weight and the longer the word, the more likely the word is to appear in the segmentation. The weight should be greater than 1.
难受香菇 10
什么鬼 10
分词工具 10
北京 10
北京天安门 10
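As a minimal sketch of producing a file in this format, the snippet below writes the entries above to disk. The helper name write_userdict, the temporary path, and the UTF-8 encoding are assumptions made for illustration; they are not part of FoolNLTK.

```python
import os
import tempfile


def write_userdict(entries, path):
    """Write (word, weight) pairs as 'word weight' lines, one entry per line.

    Hypothetical helper, not part of FoolNLTK. UTF-8 is assumed.
    """
    with open(path, "w", encoding="utf-8") as f:
        for word, weight in entries:
            # The format described above asks for weights greater than 1.
            if weight <= 1:
                raise ValueError(f"weight for {word!r} should be greater than 1")
            f.write(f"{word} {weight}\n")


entries = [
    ("难受香菇", 10),
    ("什么鬼", 10),
    ("分词工具", 10),
    ("北京", 10),
    ("北京天安门", 10),
]

dict_path = os.path.join(tempfile.mkdtemp(), "userdict.txt")
write_userdict(entries, dict_path)

# Read the file back to confirm the layout.
with open(dict_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The resulting path can then be passed to fool.load_userdict.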
To load the dictionary:
import fool
fool.load_userdict(path)
text = ["我在北京天安门看你难受香菇", "我在北京晒太阳你在非洲看雪"]
print(fool.cut(text))
#[['我', '在', '北京', '天安门', '看', '你', '难受', '香菇'],
# ['我', '在', '北京', '晒太阳', '你', '在', '非洲', '看', '雪']]
To delete the dictionary:
fool.delete_userdict()
import fool
text = ["一个傻子在北京"]
print(fool.pos_cut(text))
#[[('一个', 'm'), ('傻子', 'n'), ('在', 'p'), ('北京', 'ns')]]
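The result is a list with one entry per input sentence, each holding (word, tag) pairs. A small sketch of post-processing it is shown below, using the sample output above as literal data so it runs without FoolNLTK installed. The helper name words_with_tag_prefix is hypothetical, and reading tags that start with "n" as noun-like is an assumption based on the sample.

```python
# Sample output copied from the example above.
tagged = [[("一个", "m"), ("傻子", "n"), ("在", "p"), ("北京", "ns")]]


def words_with_tag_prefix(tagged_sentences, prefix):
    """Return, per sentence, the words whose tag starts with `prefix`."""
    return [
        [word for word, tag in sentence if tag.startswith(prefix)]
        for sentence in tagged_sentences
    ]


# Assumption: tags beginning with "n" (here "n" and "ns") mark noun-like words.
nouns = words_with_tag_prefix(tagged, "n")
print(nouns)  # [['傻子', '北京']]
```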
import fool
text = ["一个傻子在北京","你好啊"]
words, ners = fool.analysis(text)
print(ners)
#[[(5, 8, 'location', '北京')]]
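Each entity in the sample output is a 4-tuple. A minimal sketch of grouping entities by type follows, using the printed sample as literal data so it runs without FoolNLTK installed. Reading the fields as (start, end, label, text) is an assumption inferred from the sample, and entities_by_label is a hypothetical helper.

```python
# Sample output copied from the example above.
ners = [[(5, 8, "location", "北京")]]


def entities_by_label(ner_results):
    """Group entity text by label across all sentences."""
    grouped = {}
    for sentence_entities in ner_results:
        # Assumed field order: start offset, end offset, label, entity text.
        for _start, _end, label, surface in sentence_entities:
            grouped.setdefault(label, []).append(surface)
    return grouped


grouped = entities_by_label(ners)
print(grouped)  # {'location': ['北京']}
```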
- If any model files are missing, try looking in sys.prefix, which is usually under /usr/local/.
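To see where sys.prefix points in the current environment and to search it for a likely model directory, a sketch like the following may help. The helper find_dirs, the search fragment "fool", and the depth limit are assumptions for illustration; the actual directory name used by the package may differ.

```python
import os
import sys


def find_dirs(root, fragment, max_depth=3):
    """List directories under `root` whose name contains `fragment`.

    Only directories at most `max_depth` levels below `root` are reported,
    which keeps the walk short on large prefixes.
    """
    root = os.path.abspath(root)
    base = root.count(os.sep)
    matches = []
    for current, dirs, _files in os.walk(root):
        matches.extend(os.path.join(current, d) for d in dirs if fragment in d)
        if current.count(os.sep) - base >= max_depth - 1:
            dirs[:] = []  # stop descending below the depth limit
    return matches


print("sys.prefix:", sys.prefix)
# "fool" is a guessed name fragment, not a documented location.
for path in find_dirs(sys.prefix, "fool"):
    print("candidate:", path)
```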