Home

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".

Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.
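As a rough illustration of how these output formats are selected, the sketch below builds `tesseract` command lines in Python. It assumes the usual CLI shape `tesseract imagename outputbase [options] [configfile]`, where the trailing config name (`hocr`, `pdf`, `tsv`, `alto`) picks the renderer and `-c textonly_pdf=1` requests the invisible-text-only PDF. The helper name `build_command`, the `FORMAT_ARGS` table, and the file names are hypothetical, and the `alto` config is only available in builds that include ALTO support.

```python
# Extra arguments appended after the output base for each output format.
# Plain text is Tesseract's default, so it needs no config name.
# NOTE: this mapping is an illustrative assumption, not an official API.
FORMAT_ARGS = {
    "txt": [],
    "hocr": ["hocr"],
    "pdf": ["pdf"],
    "textonly_pdf": ["-c", "textonly_pdf=1", "pdf"],
    "tsv": ["tsv"],
    "alto": ["alto"],  # requires a Tesseract build with ALTO output
}


def build_command(image, output_base, fmt, lang="eng"):
    """Return the argument list for one tesseract invocation (hypothetical helper)."""
    if fmt not in FORMAT_ARGS:
        raise ValueError(f"unknown output format: {fmt!r}")
    return ["tesseract", image, output_base, "-l", lang, *FORMAT_ARGS[fmt]]


# To actually run it (needs tesseract on PATH and a real image file):
#   import subprocess
#   subprocess.run(build_command("page.png", "page", "hocr"), check=True)
```

Tesseract appends the file extension to the output base itself, so the same base name can be reused for several formats.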

## About hOCR
- https://github.com/kba/hocrjs
- https://github.com/not-implemented/hocr-proofreader
- https://github.com/ultrasaurus/hocr-javascript

## Invisible-text-only PDF (dual-layer PDF)
https://www.pdflib.com/pdflib-cookbook/text_output/invisible_text/
Place an image and create invisible text on top of it with the "textrendering" parameter set to 3. The most common scenario for this is "scanned page with invisible OCR text" (which has been retrieved from the scanned page in an earlier step with OCR).
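At the PDF level, "textrendering 3" corresponds to the `Tr` operator with mode 3 (neither fill nor stroke), which makes the text selectable and searchable but not visible. The sketch below only generates the content-stream fragment for one run of invisible text; it is not a full PDF writer. The function names, the font resource name `/F1`, and the coordinates are assumptions for illustration, and the text is assumed to be plain ASCII.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, y, font="/F1", size=12):
    """Return content-stream operators that draw `text` invisibly at (x, y).

    Hypothetical helper: `font` must name a font resource that the page
    dictionary actually defines; "/F1" is only a placeholder.
    """
    return (
        "BT\n"                                  # begin text object
        f"{font} {size} Tf\n"                   # select font and size
        "3 Tr\n"                                # text rendering mode 3: invisible
        f"{x} {y} Td\n"                         # move to the text position
        f"({escape_pdf_string(text)}) Tj\n"     # show the string
        "ET\n"                                  # end text object
    )


print(invisible_text_ops("Hello (OCR)", 72, 720))
```

In a scanned-page PDF, these operators would sit in the same page content stream as the image, positioned so that each word overlaps its location in the scan.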

TSV: tab-separated values
ALTO: Analyzed Layout and Text Object https://github.com/kermitt2/pdfalto
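To make the TSV format concrete, here is a minimal sketch that reads Tesseract-style TSV with Python's `csv` module and keeps only word-level rows. It assumes the usual column header (`level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`) and that level 5 marks words. The sample rows and the helper name `read_words` are made up for illustration, not real OCR output.

```python
import csv
import io

# Hypothetical sample in the shape of Tesseract's TSV output.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t104\t92\t64\t24\t91\tworld\n"
)


def read_words(tsv_text):
    """Return word-level rows as dicts with text, confidence and bounding box."""
    # QUOTE_NONE keeps quote characters in recognized text from being
    # interpreted as CSV quoting.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        if row["level"] != "5":  # skip page, block, paragraph and line rows
            continue
        left, top = int(row["left"]), int(row["top"])
        words.append({
            "text": row["text"] or "",
            "conf": float(row["conf"]),
            # (x0, y0, x1, y1) in image pixels
            "bbox": (left, top, left + int(row["width"]), top + int(row["height"])),
        })
    return words


for word in read_words(SAMPLE_TSV):
    print(word)
```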
