Setting -m in container does not suppress punctuation-based sentence splitting #95

Closed
pirolen opened this issue Mar 18, 2024 · 4 comments


pirolen commented Mar 18, 2024

I'm passing a plain text file, one sentence per line.
Running docker run -t -i proycon/ucto -L rus -m (etc.), however, still does punctuation-based sentence splitting :-/

Member

proycon commented Mar 18, 2024

Indeed, this looks like a regression bug. Reproduces on ucto v0.30 and v0.31.

@proycon proycon added the bug label Mar 18, 2024
Contributor

kosloot commented Mar 19, 2024

Hmm...
Interestingly enough, NONE of the tests in the test suite exercises this option, so this regression was never detected.
Will have a look into it.

MAYBE it is enough to add the -n option TOO? Then you almost get what you want, but it also splits single lines that have embedded punctuation.

@kosloot kosloot self-assigned this Mar 19, 2024
Contributor

kosloot commented Mar 19, 2024

OK, I did some research, and this regression was introduced in v0.15 (sic). It dates from May 15, 2019.
Quite embarrassing.

Contributor

kosloot commented Mar 19, 2024

A fix is now in Git. All seems to work now.
NOTE: I'm quite sure you will need -n together with -m to enjoy the full flavor.

Side-note: this fix revealed a failing test for Frog too. Solved that on the fly
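
For readers landing here, a minimal sketch of the combined invocation suggested above (-m for one sentence per line on input, -n for one sentence per line on output), plus a rough sanity check. The file names, the /data mount point and the helper names run_ucto and same_line_count are assumptions for illustration, not part of ucto or of this thread. The line-count comparison is only a heuristic: it assumes the input has no empty lines and that a build containing the fix is used.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when both files have the same number of lines.
same_line_count() {
    [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ]
}

# Hypothetical wrapper around the container from this issue.
# Assumes input.txt is in the current directory, which is mounted
# at /data (an arbitrary mount point chosen for this sketch).
run_ucto() {
    docker run --rm -v "$PWD:/data" proycon/ucto \
        -L rus -m -n /data/input.txt /data/output.txt
}
```

Usage would be along the lines of: run_ucto && same_line_count input.txt output.txt && echo "no extra sentence splits". If -m is honoured, each input line should come out as exactly one output line.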

@kosloot kosloot closed this as completed Mar 19, 2024