Improve sentence tokenization #5

pgaskin · 2017-10-19T14:38:08Z

Should use a tokenizer library to handle edge cases.

pgaskin · 2017-10-19T14:39:58Z

https://github.com/diasks2/pragmatic_segmenter/blob/master/README.md#the-golden-rules
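To make the edge cases concrete, here is a minimal sketch in Go of a naive splitter that ends a sentence at every `.`, `!`, or `?` followed by a space. It is not kepubify's actual code; `naiveSplit` is a hypothetical name used only for illustration.

```go
package main

import "strings"

// naiveSplit is a hypothetical helper, not part of kepubify. It ends a
// sentence after '.', '!' or '?' when the next byte is a space or the
// text ends. It knows nothing about abbreviations, quotes, or ellipses.
func naiveSplit(text string) []string {
	var out []string
	start := 0
	for i := 0; i < len(text); i++ {
		c := text[i]
		if c != '.' && c != '!' && c != '?' {
			continue
		}
		// Split only when the terminator is followed by a space or ends the text.
		if i+1 == len(text) || text[i+1] == ' ' {
			if s := strings.TrimSpace(text[start : i+1]); s != "" {
				out = append(out, s)
			}
			start = i + 1
		}
	}
	// Keep any trailing text that has no terminator.
	if s := strings.TrimSpace(text[start:]); s != "" {
		out = append(out, s)
	}
	return out
}
```

With this approach, `naiveSplit("Dr. Smith arrived. He sat down.")` returns three segments, the first being just `Dr.`, where a reader would count two sentences. Abbreviations like this are the sort of case the linked golden rules enumerate.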

pgaskin · 2017-12-14T21:35:17Z

Decided against doing this for now because of performance, file size, and HTML tag issues.

- Improved robustness
  - More is implemented directly in the HTML parser and renderer (see my fork of x/net/html)
  - Better support for XHTML and HTML5 (rather than using a bunch of workarounds)
  - No more regexps for modifying HTML
- Better smart punctuation
  - More punctuation supported
  - More robust (won't apply to everything unconditionally)
  - Now off by default
- Faster and more efficient (15-30% faster, 50-70% less memory)
  - Fewer memory allocations and copies due to use of readers and writers rather than storing the entire file in memory multiple times
  - Stack-based span adding algorithm (rather than recursive, which has more runtime and memory overhead)
  - Use byte arrays or runes rather than strings where possible
  - Better parallel processing of content files
  - Eliminated memory, goroutine, and file descriptor leaks
- Cleaner and better code
  - Easier to extend
  - More stable API
  - More complete unit tests
- More accurate sentence splitting and segment numbering (checked against 3 recent free books)
  - Better match Kobo's behavior by preserving, but not wrapping (in a koboSpan), TextNodes with only whitespace. Previous versions of kepubify collapsed them to a single space, which still works, but is less efficient and slightly different from what Kobo does (although it renders the same).
  - Fixed some edge cases where the segment counter could be incorrectly incremented.
  - Also increment paragraph counter for tables (this case was missing before).
  - Don't increment paragraph counter if no spans were added (i.e. an empty or whitespace-only paragraph element) (this case was missing before).
- Smaller binary size
- Also run tests on Windows

closes #47, fixes #45, fixes #35, better fix for #36, #29, #28, #26, #21, #14, #10, #5, and #2
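As a rough illustration of the stack-based approach and the whitespace rule described above, here is a minimal sketch in Go. It is an assumption-laden simplification, not the actual kepubify code: the `node` type and `collectSegments` function are hypothetical stand-ins for a parsed HTML tree, and the sketch only collects the text nodes that would be wrapped instead of inserting koboSpan elements.

```go
package main

import "strings"

// node is a hypothetical, simplified stand-in for an HTML parse tree node.
// A node with no children is treated as a text node.
type node struct {
	text     string  // content of a text node
	children []*node // child nodes of an element node
}

// collectSegments walks the tree in document order using an explicit stack
// instead of recursion and returns the text nodes that would be wrapped.
// Whitespace-only text nodes are skipped: they stay in the tree unchanged
// and are not counted as segments.
func collectSegments(root *node) []string {
	var segments []string
	stack := []*node{root}
	for len(stack) > 0 {
		// Pop the most recently pushed node.
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]

		if len(n.children) == 0 {
			if strings.TrimSpace(n.text) != "" {
				segments = append(segments, n.text)
			}
			continue
		}
		// Push children in reverse so the first child is popped first.
		for i := len(n.children) - 1; i >= 0; i-- {
			stack = append(stack, n.children[i])
		}
	}
	return segments
}
```

The explicit stack keeps traversal depth out of the call stack, which is one way to avoid the per-call overhead that the recursive version is said to have.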

pgaskin added the bug and enhancement labels Oct 19, 2017

pgaskin self-assigned this Oct 19, 2017

pgaskin closed this as completed Dec 14, 2017
