Ground truth: spaces before and after text? #335
Comments
OK, the *.gt.txt files in ocrd-testset.zip contain no spaces before or after the text, but they do end with \n.
@jbarth-ubhd, I see neither a space nor a newline at the end of the *.gt.txt files.
I do see newlines (4th line below):
Ground truth line text must not have spaces before or after the text.
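To apply the rule above to existing files, a minimal sketch might look like the following. It is not part of tesstrain; the function names `strip_gt_text` and `fix_gt_files` are hypothetical, and the `trailing_newline` switch is an assumption, since this thread does not settle whether a final \n should be kept.

```python
from pathlib import Path


def strip_gt_text(text: str, trailing_newline: bool = False) -> str:
    """Remove leading/trailing whitespace (spaces, tabs, newlines) from a GT line."""
    stripped = text.strip()
    # Whether a final newline is wanted is an open question in this thread,
    # so it is left to the caller (assumption, not a tesstrain requirement).
    return stripped + "\n" if trailing_newline else stripped


def fix_gt_files(directory: str, pattern: str = "*.gt.txt",
                 trailing_newline: bool = False) -> int:
    """Rewrite every matching file in place; return how many files changed."""
    changed = 0
    for path in sorted(Path(directory).glob(pattern)):
        original = path.read_text(encoding="utf-8")
        cleaned = strip_gt_text(original, trailing_newline)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Calling `fix_gt_files("data/my-ground-truth")` (a hypothetical directory) would rewrite only the files that actually carry surrounding whitespace.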
I just tried it again with #7 and hocr-extract-images (https://github.com/ocropus/hocr-tools/blob/master/hocr-extract-images). The generated .exp0.gt.txt files contain spaces before and after the text:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
bump
Then I assume that the original data (hOCR) already contains such spaces. Do you have a link to an example?
See https://digi.ub.uni-heidelberg.de/diglitData/v/tesstrain-issue-335.zip for a complete test environment; main script is
The spaces before and after the line occur if your hOCR file is indented. hocr-extract-images uses a regex to replace one or more whitespace characters with a single space (see line 20). I am not sure whether they had indentation in mind, though.
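The effect described above can be reproduced with a short sketch. This is not the actual hocr-extract-images source, only an illustration of the behaviour the comment describes; `collapse_whitespace` and `clean_line` are hypothetical names.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space, as described above."""
    return re.sub(r"\s+", " ", text)


def clean_line(text: str) -> str:
    """Collapse whitespace, then drop what remains at either end."""
    return collapse_whitespace(text).strip()


# Text of an indented hOCR element: the newline plus indentation before and
# after the words each collapse to a single space instead of disappearing.
indented = "\n      Hello   world\n    "
with_spaces = collapse_whitespace(indented)   # ' Hello world '
without_spaces = clean_line(indented)         # 'Hello world'
```

Adding a `strip()` after the substitution is one way to avoid the surrounding spaces regardless of how the hOCR file is indented.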
I've created *.exp0.gt.txt files as a base for manual ground truth creation using Shreeshrii's shell script, and the files contain a space before and after the text (no newlines etc.). Example:
... but The-Hallucination-Effect states
»Example 2: Your training text frequently includes a Space at the beginning of your sentences or at the end. Might result in slow training, non-convergence & even model corruption.«
My Question: Spaces or not?
The single-line images are cropped very tightly, with no blank space before or after the text; example: