From ecfee53bac59e5467f8a8fb5026928a1d39b5c6d Mon Sep 17 00:00:00 2001
From: Stefan Weil
Date: Thu, 4 Oct 2018 12:05:49 +0200
Subject: [PATCH] Don't set page segmentation mode for hocr, pdf and tsv configs

Setting the page segmentation mode in those config files gives
unexpected results: the text recognized when no config or only txt
is given changes if both txt and any of hocr, pdf or tsv is chosen.

In a test set of nearly 200 pages from historical books, using
segmentation mode 1 is typically slightly better than the default,
but there are also cases where it is much worse. Therefore the user
should be able to decide which page segmentation mode is best.

Old results for hocr, pdf or tsv now need an explicit `--psm 1`
for reproduction.

Signed-off-by: Stefan Weil
---
 tessdata/configs/hocr | 1 -
 tessdata/configs/pdf  | 1 -
 tessdata/configs/tsv  | 1 -
 3 files changed, 3 deletions(-)

diff --git a/tessdata/configs/hocr b/tessdata/configs/hocr
index 9f63e41ebe..5ab372eaf8 100644
--- a/tessdata/configs/hocr
+++ b/tessdata/configs/hocr
@@ -1,3 +1,2 @@
 tessedit_create_hocr 1
-tessedit_pageseg_mode 1
 hocr_font_info 0
diff --git a/tessdata/configs/pdf b/tessdata/configs/pdf
index 0d5f0f14cd..59645d71ce 100644
--- a/tessdata/configs/pdf
+++ b/tessdata/configs/pdf
@@ -1,2 +1 @@
 tessedit_create_pdf 1
-tessedit_pageseg_mode 1
diff --git a/tessdata/configs/tsv b/tessdata/configs/tsv
index 11cd6fc97a..dc52478177 100644
--- a/tessdata/configs/tsv
+++ b/tessdata/configs/tsv
@@ -1,2 +1 @@
 tessedit_create_tsv 1
-tessedit_pageseg_mode 1