Line height parameters broke hOCR output #225
zdenop
added a commit
that referenced
this issue
Feb 16, 2016
INCOMPATIBLE fix to hOCR line height information - fixes #225.
zdenop
pushed a commit
that referenced
this issue
Feb 16, 2016
This fixes the duplicate line IDs caused by inserting height information into the middle of the ID and it moves the line height info into the title attribute like everything else, rather than using non-standard HTML attributes (which won't validate). This change may break consumers of the HTML output, but 3.04 has only been in the wild for 6 months and the current HTML is invalid, so I believe the benefit outweighs the cost for the fix.
zvezdochiot
pushed a commit
to ImageProcessing-ElectronicPublications/tesseract
that referenced
this issue
Mar 28, 2021
…r#225. This fixes the duplicate line IDs caused by inserting height information into the middle of the ID and it moves the line height info into the title attribute like everything else, rather than using non-standard HTML attributes (which won't validate). This change may break consumers of the HTML output, but 3.04 has only been in the wild for 6 months and the current HTML is invalid, so I believe the benefit outweighs the cost for the fix.
zvezdochiot
pushed a commit
to ImageProcessing-ElectronicPublications/tesseract
that referenced
this issue
Mar 28, 2021
INCOMPATIBLE fix to hOCR line height information - fixes tesseract-ocr#225.
Commit 438edd6 from PR #27 has a couple of problems with it.
Most seriously, the new information is inserted into the middle of the element ID, causing lots of duplicated IDs on the page and corrupted ascender height information.
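One way to confirm the duplicated-ID symptom is to collect every id attribute in the generated HTML and count repeats. The following is a minimal sketch using only the Python standard library; the sample markup and the helper names (IdCollector, duplicate_ids) are made up for illustration and are not actual Tesseract output or API.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects the value of every id attribute seen in start tags."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value is not None:
                self.ids.append(value)


def duplicate_ids(html_text):
    """Return the sorted list of id values that occur more than once."""
    collector = IdCollector()
    collector.feed(html_text)
    collector.close()
    counts = Counter(collector.ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical markup, written only to exercise the check.
sample_html = """
<div class='ocr_page' id='page_1'>
  <span class='ocr_line' id='line_1'>first line</span>
  <span class='ocr_line' id='line_1'>second line</span>
  <span class='ocr_line' id='line_3'>third line</span>
</div>
"""

print(duplicate_ids(sample_html))
```

An empty list means every ID on the page is unique, which is what valid HTML requires.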
The other issue is that it uses non-standard attributes, which won't validate. The hOCR spec places all its information in the title attribute, and I believe it makes most sense to use that mechanism for the extended information, using the x_ prefix to avoid collisions with future extensions to the spec.
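To illustrate the proposal, here is a rough sketch of how a consumer might read properties from an hOCR title attribute and pick out the x_-prefixed extensions. The property names and values in the sample string (x_size, x_ascenders, x_descenders) are assumptions chosen for illustration, and the parser is deliberately simplistic: it assumes semicolon-separated properties with whitespace-separated values and does not handle quoted values that contain semicolons.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into {property_name: [values...]}.

    Simplifying assumption: properties are separated by ';' and no value
    contains a quoted semicolon.
    """
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if not parts:
            continue  # skip empty segments such as a trailing ';'
        props[parts[0]] = parts[1:]
    return props


# Illustrative title string; names and numbers are assumptions.
sample_title = "bbox 36 92 580 122; baseline 0 -6; x_size 29; x_ascenders 7; x_descenders 6"

props = parse_hocr_title(sample_title)

# Extension properties are the ones carrying the x_ prefix.
extensions = {name: values for name, values in props.items() if name.startswith("x_")}

print(props["bbox"])
print(extensions)
```

Because the extended values live in the same attribute as bbox and baseline, a consumer that ignores unknown property names keeps working, and the element ID is left untouched.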