# TinySegmenter

TinySegmenter, a super-compact Japanese tokenizer, was originally created in JavaScript by Taku Kudo (c) 2008 and released under the terms of the new BSD license. For details, see here.

The tinysegmenter port for Python 2.x was written by Masato Hagiwara. For more about his work, see here.

This tinysegmenter has been modified by Tatsuro Yasukawa for distribution on both Python 3.x and Python 2.x. It has also been made faster, thanks to @chezou, @cocoatomo, and @methane.

See more information about tinysegmenter here.

## Installation

```
pip install tinysegmenter3
```
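If the install succeeded, the tokenizer should be importable right away. A one-off sanity check (the sample sentence here is arbitrary):

```python
import tinysegmenter

# A short Japanese sentence; expect a list of segmented tokens.
print(tinysegmenter.tokenize('これはテストです。'))
```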

## Usage

```python
import tinysegmenter

statement = '私はpython大好きStanding Engineerです.'
tokenized_statement = tinysegmenter.tokenize(statement)
print(tokenized_statement)
# ['私', 'は', 'python', '大好き', 'Standing', ' Engineer', 'です', '.']
```
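Because `tokenize` returns a plain list of strings, it composes directly with the standard library. A minimal sketch (Python 3; the file path is hypothetical) that counts the most frequent tokens in a text file:

```python
from collections import Counter

import tinysegmenter

# Read a UTF-8 Japanese text file (the path is just an example).
with open('example.txt', encoding='utf-8') as f:
    text = f.read()

# tokenize() returns a plain list of surface strings.
tokens = tinysegmenter.tokenize(text)

# Count tokens, skipping pure-whitespace entries.
counts = Counter(t for t in tokens if t.strip())

for token, count in counts.most_common(10):
    print(token, count)
```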

## Test Text

The test text (in the `tests` directory) is *The Time Machine* by H. G. Wells, translated into Japanese by Hiroo Yamagata under the CC BY-SA 2.0 license.

## How to Run the Tests

Install the requirements from `requirements.txt`:

```
pip install -r requirements.txt
```

then run:

```
./runtests.sh
```