
Sentence splitting errors and different output compared to NLTK #16

Open · Scarfmonster opened this issue Jan 25, 2020 · 0 comments
rust-punkt and NLTK Punkt (with realign_boundaries off) produce different results from exactly the same model. NLTK Punkt correctly identifies abbreviations and does not split on them, while rust-punkt, with the same model, splits sentences at almost every period.
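For context, the rust-punkt side of the comparison looks roughly like this. This is a minimal sketch based on the crate's README; `TrainingData::polish()` is an assumption, mirroring the README's `TrainingData::english()` constructor for the bundled NLTK-derived models:

```rust
extern crate punkt;

use punkt::params::Standard;
use punkt::{SentenceTokenizer, TrainingData};

fn main() {
    // Sample sentence from this report; the real input was a longer Polish text.
    let doc = "Choć zapis pól ($X) może kojarzyć się z zapisem określającym \
               zmienne (jak np. w perlu), to jednak określa pola bieżącego rekordu.";

    // Assumption: the crate exposes a constructor for its bundled Polish
    // model, analogous to TrainingData::english() from the README.
    let data = TrainingData::polish();

    for sent in SentenceTokenizer::<Standard>::new(doc, &data) {
        println!("{:?}", sent);
    }
}
```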

To test this, I loaded the JSON model shipped with rust-punkt into NLTK:

import json
from collections import defaultdict

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# model_path points at one of rust-punkt's bundled JSON models;
# text is the Polish sample being tokenized.
with open(model_path, mode='r', encoding='UTF8') as model_file:
    model = json.load(model_file)

# Copy the model's parameters into NLTK's Punkt data structures.
params = PunktParameters()
params.sent_starters = set(model['sentence_starters'])
params.abbrev_types = set(model['abbrev_types'])
params.collocations = set(tuple(t) for t in model['collocations'])
params.ortho_context = defaultdict(int, model['ortho_context'])

punkt = PunktSentenceTokenizer(params)
punkt.tokenize(text, realign_boundaries=False)

The output from NLTK Punkt:

Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np. w perlu), to jednak określa pola bieżącego rekordu.

While rust-punkt produced:

Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np.
w perlu), to jednak określa pola bieżącego rekordu.
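In English, the sample reads: "Although the field notation ($X) may bring to mind the notation for variables (as e.g. in Perl), it in fact denotes the fields of the current record." The spurious split happens right after "np." (the Polish abbreviation for na przykład, i.e. "e.g."), which is presumably listed in the model's abbrev_types: NLTK honours it, while rust-punkt starts a new sentence there.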
