A Linguistic Tokenizer for English

Purpose

Morphology is the study of the internal structure of words and forms a core part of linguistic study today.

cane, caning, can, canning
transdisciplinary
deanonymization
ignorantly

Output, with each segment tagged as a prefix (P), root (R), or suffix (S):

morph(P)olog(R)y(S) is(R) the(R) study(R) of(R) the(R) interne(R)al(S) struct(R)ure(S) of(R) word(R)s(S) and(R) forms(R) a(R) core(R) part(R) of(R) lingu(R)ist(S)ic(S) study(R) to(P)day(R) .

cane(R) , cane(R)ing(S) , can(R) , can(R)ing(S)
trans(P)discipline(R)ary(S)
de(P)anonym(R)ize(S)ation(S)
ignore(R)ant(S)ly(S)

Features

Tokenizes English words into meaningful parts (morphemes) by brute-force exploration of the possible segmentations.
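
To make "brute-force exploring possible segmentations" concrete, here is a minimal sketch of the general idea. It is not lingutok's actual implementation: the `LEXICON` table and the `segmentations` function are hypothetical names invented for illustration, and the sketch ignores spelling changes (such as recovering `title` from `titled`) that the real tokenizer handles.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon (not lingutok's vocabulary): morpheme -> tag,
# where P = prefix, R = root, S = suffix.
LEXICON: Dict[str, str] = {
    "un": "P",
    "check": "R",
    "point": "R",
    "checkpoint": "R",
    "ed": "S",
    "s": "S",
}

Segmentation = List[Tuple[str, str]]


def segmentations(word: str) -> List[Segmentation]:
    """Return every way to split `word` into consecutive lexicon entries."""
    if not word:
        # The empty string has exactly one segmentation: no pieces at all.
        return [[]]
    results: List[Segmentation] = []
    # Try every possible first piece, then recurse on the remainder.
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in LEXICON:
            for rest in segmentations(word[end:]):
                results.append([(piece, LEXICON[piece])] + rest)
    return results


if __name__ == "__main__":
    for word in ("checkpoints", "unchecked"):
        for seg in segmentations(word):
            print(word, "->", " ".join(f"{m}({t})" for m, t in seg))
```

An ambiguous word such as `checkpoints` yields more than one candidate here, so a real tokenizer needs some way to rank candidates; how lingutok does that is not covered by this sketch.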

Installation

sudo pip3 install git+https://github.com/ecchochan/lingutok.git

Usage

import lingutok
lingutok.load()

encoded = lingutok.tokenize('Happy Birthday to me')
>> Encoded('happy (R) birth (R) day (S) to (R) me (R)')

encoded.ids
>> [9366, 7154, 5864, 12623, 10309]

encoded.offsets
>> [0, 6, 6, 15, 18]

encoded.offsets_span
>> [(0, 6), (6, 15), (6, 15), (15, 18), (18, 20)]

encoded.casing
>> [True, True, True, False, False]

encoded.size
>> 5

encoded.text
>> 'Happy Birthday to me'


encoded2 = lingutok.tokenize('Untitled Last Checkpoints')
>> Encoded('un (P) title (R) ed (S) last (R) check (R) point (R) s (S)')
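
The offsets can be used to map each token back to the text it came from. The sketch below does not import lingutok; it copies the values printed above for 'Happy Birthday to me' and assumes that `offsets_span` holds half-open character ranges into `encoded.text` and that `casing` marks tokens whose source word is capitalized. The `align` helper is a hypothetical name, not part of the library.

```python
from typing import List, Tuple

# Values copied from the output shown above for 'Happy Birthday to me'.
text = "Happy Birthday to me"
tokens = ["happy", "birth", "day", "to", "me"]
offsets_span = [(0, 6), (6, 15), (6, 15), (15, 18), (18, 20)]
casing = [True, True, True, False, False]


def align(
    text: str,
    tokens: List[str],
    spans: List[Tuple[int, int]],
    casing: List[bool],
) -> List[Tuple[str, str, bool]]:
    """Pair each token with the source substring its span covers."""
    rows = []
    for token, (start, end), cased in zip(tokens, spans, casing):
        # Spans in the example include trailing whitespace, so strip it.
        surface = text[start:end].strip()
        rows.append((token, surface, cased))
    return rows


rows = align(text, tokens, offsets_span, casing)
for token, surface, cased in rows:
    print(f"{token!r} <- {surface!r} (capitalized: {cased})")
```

Note that `birth` and `day` share the span (6, 15): both morphemes come from the single source word 'Birthday'.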

Customization

import lingutok

# Set the data directory
lingutok.set_root('/mnt/d/gits/test/resources')

# Generate the vocab file
data_dir = './'
data_fn = 'my-lingutok'
lingutok.generate_trie(data_dir, data_fn)

# Load the vocab files
lingutok.load(data_dir, data_fn)
