Update Latin langdata #23

Merged: 1 commit, Feb 21, 2018
4 changes: 0 additions & 4 deletions lat/desired_characters

This file was deleted.

25 changes: 25 additions & 0 deletions lat/lat.config
@@ -0,0 +1,25 @@
# Tesseract Latin training - http://ryanfb.github.io/latinocr/
# Build from the https://github.com/ryanfb/latinocr-lat/ repository
# commit: b6885bca0fa755fbed2bbb36d3f5cebf866a15e0

# New segsearch produces better results
enable_new_segsearch 1

# Increase penalty for incorrect punctuation, important as
# diacritics can easily be misrecognised as punctuation
language_model_penalty_punc 0.35

# Increase minimum linesize. This minimises cases of accents
# being incorrectly recognised as separate lines.
textord_min_linesize 2.25

# Also helps to ensure that accents aren't incorrectly recognised
# as separate lines
textord_occupancy_threshold 0.7

# Helps to ensure rows don't overlap
textord_excess_blobsize 0.6

# Disable rare, variant, macron characters
# (can be enabled with tessedit_char_unblacklist)
tessedit_char_blacklist ĀāĒēĪīŌōŪū
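
The comment on the last setting says the macron characters can be re-enabled with tessedit_char_unblacklist. As a minimal sketch of how that might look from a script, the Python below reads `name value` lines in the style of lat.config and builds a `tesseract` command line that passes the blacklisted characters back through `-c tessedit_char_unblacklist=...`. The helper names (`parse_config`, `build_command`), the file names `page.png` and `page`, and the assumption that a `lat` traineddata file is installed are all illustrative and not part of this pull request; the parser is a simplification, not Tesseract's own config reader.

```python
# Hypothetical helpers for illustration; not part of the langdata repository.

SAMPLE_CONFIG = """\
# Increase minimum linesize.
textord_min_linesize 2.25

# Disable rare, variant, macron characters
tessedit_char_blacklist ĀāĒēĪīŌōŪū
"""


def parse_config(text):
    """Parse 'name value' lines into a dict, skipping blanks and '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        params[name] = value.strip()
    return params


def build_command(image, out_base, unblacklist=""):
    """Build a tesseract argument list; the image and output names are placeholders."""
    cmd = ["tesseract", image, out_base, "-l", "lat"]
    if unblacklist:
        # -c sets a single config variable on the command line.
        cmd += ["-c", "tessedit_char_unblacklist=" + unblacklist]
    return cmd


params = parse_config(SAMPLE_CONFIG)
cmd = build_command("page.png", "page", unblacklist=params["tessedit_char_blacklist"])
print(" ".join(cmd))
```

The command is only printed here, not executed, so the sketch runs without Tesseract installed. Passing the list to `subprocess.run(cmd, check=True)` would be one way to invoke it, assuming `tesseract` is on the PATH.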