Home
Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".
Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.
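As a rough illustration of how these formats are selected on the command line, the sketch below builds a `tesseract` argument list for each one. It assumes the stock config files (`txt`, `hocr`, `pdf`, `tsv`, `alto`) that ship with recent Tesseract releases and the `textonly_pdf` parameter for invisible-text-only PDF; the helper name `build_tesseract_command` and the format keys are hypothetical, and nothing is executed here.

```python
from typing import List

# Mapping from a format label (hypothetical keys chosen for this sketch)
# to the trailing arguments passed to the tesseract CLI.
# Assumption: the config files below are present in the local tessdata/configs.
FORMAT_ARGS = {
    "text": ["txt"],
    "hocr": ["hocr"],
    "pdf": ["pdf"],
    "textonly-pdf": ["-c", "textonly_pdf=1", "pdf"],
    "tsv": ["tsv"],
    "alto": ["alto"],
}


def build_tesseract_command(image: str, output_base: str, fmt: str,
                            lang: str = "eng") -> List[str]:
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG [configs]."""
    if fmt not in FORMAT_ARGS:
        raise ValueError(f"unknown output format: {fmt!r}")
    return ["tesseract", image, output_base, "-l", lang, *FORMAT_ARGS[fmt]]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even when Tesseract is not installed.
    print(" ".join(build_tesseract_command("page.png", "page", "hocr")))
```

The returned list can be handed to `subprocess.run(...)` on a machine where Tesseract is installed; the output file extension is added by Tesseract itself.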
## About hOCR
https://github.com/kba/hocrjs
https://github.com/not-implemented/hocr-proofreader
https://github.com/ultrasaurus/hocr-javascript
## Invisible-text-only PDF (dual-layer PDF)
https://www.pdflib.com/pdflib-cookbook/text_output/invisible_text/
Place an image and create invisible text on top of it with the "textrendering" parameter set to 3. The most common scenario for this is "scanned page with invisible OCR text" (which has been retrieved from the scanned page in an earlier step with OCR).
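At the PDF level, text rendering mode 3 corresponds to the `3 Tr` operator in a page content stream. The sketch below is not PDFlib code; it only assembles the raw text operators to show what "invisible text" looks like inside a page. The function name `invisible_text_ops` and the font resource name `/F1` are assumptions made for illustration, and a complete PDF would still need the image, font, and page objects around this fragment.

```python
def invisible_text_ops(text: str, x: float, y: float, font_size: float) -> str:
    """Build PDF content-stream operators that place `text` invisibly at (x, y)."""
    # Escape characters that are special inside a PDF literal string.
    escaped = (
        text.replace("\\", "\\\\")
            .replace("(", "\\(")
            .replace(")", "\\)")
    )
    return "\n".join([
        "BT",                          # begin text object
        f"/F1 {font_size:g} Tf",       # select font (resource name assumed)
        "3 Tr",                        # rendering mode 3: neither fill nor stroke
        f"{x:g} {y:g} Td",             # move to the text position
        f"({escaped}) Tj",             # show the string
        "ET",                          # end text object
    ])


if __name__ == "__main__":
    print(invisible_text_ops("scanned word", 72, 720, 12))
```

Because the text is never painted, the scanned image drawn underneath stays visible while the OCR text remains selectable and searchable.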
TSV: tab-separated values.
ALTO: Analyzed Layout and Text Object https://github.com/kermitt2/pdfalto
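For the TSV output, a minimal sketch of reading word-level rows with Python's standard `csv` module follows. The sample data is made up for illustration, and the column names (`level`, `left`, `top`, `width`, `height`, `conf`, `text`, and so on) are assumed to match the header Tesseract writes; check them against a real file before relying on this.

```python
import csv
import io

# Made-up sample in the shape of Tesseract's TSV output (not real OCR results).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\tHello\n"
    "5\t1\t1\t1\t1\t2\t104\t92\t72\t24\t91.2\tworld\n"
)


def read_words(tsv_text: str) -> list:
    """Return word-level rows (level 5) with numeric box and confidence fields."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # OCR text may contain stray quote characters
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row["level"] != "5" or not text:
            continue  # skip page/block/paragraph/line rows and empty words
        words.append({
            "text": text,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "conf": float(row["conf"]),
        })
    return words


if __name__ == "__main__":
    for word in read_words(SAMPLE_TSV):
        print(word)
```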