RFC: Multipage Feature #928

Shreeshrii · 2017-05-17T14:47:18Z

@stweil - How would I start Tesseract to process page1.png and page2.png in a single run?

@amitdo - Prepare a text file that has the path to each image:

path/to/1.png
path/to/2.png
path/to/3.tiff

Save it, and then give its name as input file to Tesseract.

tesseract savedlist output
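
The same steps can be scripted. The following is a minimal Python sketch, assuming the `tesseract` executable is on `PATH`; the helper names `write_image_list`, `build_command` and `run_tesseract` are made up for illustration and are not part of Tesseract.

```python
import subprocess
from pathlib import Path


def write_image_list(image_paths, list_path):
    """Write one image path per line, the list-file format described above."""
    text = "".join(f"{path}\n" for path in image_paths)
    Path(list_path).write_text(text, encoding="utf-8")
    return list_path


def build_command(list_path, output_base):
    """Return the argument vector for `tesseract <listfile> <outputbase>`."""
    return ["tesseract", str(list_path), str(output_base)]


def run_tesseract(list_path, output_base):
    """Run the real binary; raises if tesseract or an image is missing."""
    subprocess.run(build_command(list_path, output_base), check=True)
```

Calling `write_image_list(["path/to/1.png", "path/to/2.png", "path/to/3.tiff"], "savedlist")` followed by `run_tesseract("savedlist", "output")` would correspond to the command shown above.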

@stweil - Thank you, good to know that. It looks like the ChangeLog, other documentation and the program help text need an update.

Currently all pages are written to one output file (per format). Some formats include page information (hOCR, PDF). Others like TXT don't, but could use a page separator character (ASCII 0x0C = FF). Would it help to support an output base parameter with placeholders like page number or image base name to generate one output file per input image?

The multi-page feature was added in 2014 by commit 25a8c7b.

Shreeshrii · 2017-05-17T14:51:15Z

@stweil

Would it help to support an output base parameter with placeholders like page number or image base name to generate one output file per input image?

I can think of two cases for using this.

A document for which each page is available as a separate image.
Separate document images which need to be run in a batch.

For the first, current processing is sufficient, maybe with an additional page separator character.

For the second, it would be useful to have the option you suggest.

stweil · 2017-05-17T14:57:21Z

An implementation could also get pairs of image file name and output base from the file list instead of image files only. That would allow more flexibility.

Shreeshrii · 2017-05-18T08:30:44Z

savedlist

Sachitra_Saraswati_Prasad_004932_HR-g4_page0001_1L.tif
Sachitra_Saraswati_Prasad_004932_HR-g4_page0002_2R.tif
Sachitra_Saraswati_Prasad_004932_HR-g4_page0004_2R.tif
Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_1L.tif
Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_2R.tif

tesseract savedlist output

Current implementation creates output file output.txt with concatenated OCR text.

Console displays filenames being processed

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 0 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0001_1L.tif
Page 1 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0002_2R.tif
Page 2 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0004_2R.tif
Page 3 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_1L.tif
Page 4 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_2R.tif

tesseract savedlist savedlist

Suggested option for additional implementation, to create separate output files for each line in the savedlist list file.

The syntax of the command could be different - either not specifying the outputbase or using the same name as the listfile or something else - whatever is easy to implement.

The output files will be based on the filenames in the listfile: same basename, with a changed extension.

The listfile in the above two cases can be created with, e.g., ls -1 *.tif > savedlist

option suggested above by @stweil

get pairs of image file name and output base from the file list instead of image files only.

This would require change to listfile format.

Optionally, if output base is not given separately in listfile, then it will create file based on filenames in list.
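
To make the proposal concrete, here is a rough Python sketch of how such a list file could be interpreted by a wrapper script. The tab-separated `image<TAB>output_base` line format and the names `parse_list_line` and `plan_jobs` are assumptions for illustration only; Tesseract itself does not read list files in this format.

```python
from pathlib import Path


def parse_list_line(line):
    """Split one list-file line into (image, output_base).

    Assumed, hypothetical format: `image<TAB>output_base`, or just `image`.
    Without an explicit output base, the image name minus its extension
    is used, as suggested in the comment above.
    """
    image, _, output_base = line.rstrip("\n").partition("\t")
    if not output_base:
        output_base = str(Path(image).with_suffix(""))
    return image, output_base


def plan_jobs(lines):
    """Return one (image, output_base) pair per non-empty line."""
    return [parse_list_line(line) for line in lines if line.strip()]
```

A wrapper could then start one `tesseract image output_base` process per pair, at the cost of paying the startup time for every image.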

Shreeshrii · 2017-05-18T08:33:51Z

@stweil Please compare the time taken for the list option vs multipage tif vs processing each file separately in a loop to see what may be a recommended option.

Here is the result of my single test using Hindi traineddata.

 ----------------------------------------------------
Sachitra_Saraswati_Prasad_004932_HR-g4_page0001_1L.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1

real    0m24.880s
user    1m11.375s
sys     0m0.813s
Sachitra_Saraswati_Prasad_004932_HR-g4_page0002_2R.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1

real    0m37.305s
user    1m49.734s
sys     0m0.891s
Sachitra_Saraswati_Prasad_004932_HR-g4_page0004_2R.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1

real    0m43.672s
user    2m1.156s
sys     0m1.078s
Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_1L.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1

real    0m39.755s
user    1m46.578s
sys     0m0.922s
Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_2R.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1

real    0m23.214s
user    1m6.891s
sys     0m0.781s
-------------------------------------------------
list.txt
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 0 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0001_1L.tif
Page 1 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0002_2R.tif
Page 2 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0004_2R.tif
Page 3 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_1L.tif
Page 4 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_2R.tif

real    3m26.132s
user    9m32.531s
sys     0m2.594s

---------------------------------------------------
multitest.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Page 2
Page 3
Page 4
Page 5

real    3m35.027s
user    9m44.297s
sys     0m2.781s

----------------------

Shreeshrii · 2017-05-18T08:39:23Z

Page numbering in this option starts with 0. Should be changed to 1 (similar to multipage tif).

If a multipage tiff is listed as one of the files in the listfile, only its first page is processed.

amitdo · 2017-05-18T08:51:55Z

CC: @jbreiden

jbarlow83 · 2017-05-18T21:04:40Z

The sum of the 5 individual tesseract processes is ~2m 49s real time, quicker than batching images in a single process. That's not quite what we were expecting to see. Anyone know why?
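
For reference, the comparison can be reproduced from the `real` times posted earlier in this thread. This is only an arithmetic sketch; the helper name `to_seconds` is made up for illustration.

```python
import re


def to_seconds(stamp):
    """Convert a `time` value such as '3m26.132s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


# 'real' times copied from the benchmark output earlier in this thread.
individual = ["0m24.880s", "0m37.305s", "0m43.672s", "0m39.755s", "0m23.214s"]
list_run = "3m26.132s"
multipage_run = "3m35.027s"

total_individual = sum(to_seconds(s) for s in individual)
print(f"five separate runs: {total_individual:.1f} s")
print(f"list file run:      {to_seconds(list_run):.1f} s")
print(f"multipage TIFF run: {to_seconds(multipage_run):.1f} s")
```

The five separate runs add up to about 168.8 s, against about 206.1 s for the list file and 215.0 s for the multipage TIFF.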

Shreeshrii · 2017-05-19T03:29:57Z

@jbarlow83

I ran the test just once under WSL on Windows 10 for language Hindi - there may have been other processes running at the same time which might have impacted the numbers. Hence my request to @stweil to test and compare the features.

Do you find individual files to get processed faster?

amitdo · 2017-05-19T08:29:35Z

@jbarlow83

This might be related to the adaptive learning that Tesseract does.

stweil · 2017-05-19T09:51:26Z

Here are my results for a simple hello world image.

Summary for default language (identical to -l eng):

PNG and TIFF show similar performance
TIFF multi-page and list of single page TIFF show similar performance and are much faster than calling Tesseract for each single page
LSTM takes more CPU time than old engine for this example

Summary for language with very large traineddata:

no text recognized
old engine takes much longer (otherwise similar to the results above)
LSTM comparable to result with default language (but no text recognized)

LSTM (--oem 1)

PNG (ten times)
user 0.23
user 0.23
user 0.23
user 0.23
user 0.23
user 0.24
user 0.24
user 0.25
user 0.25
user 0.25

real 2.76
user 2.42
sys 0.77

TIFF (ten times single page)
user 0.23
user 0.24
user 0.24
user 0.24
user 0.24
user 0.25
user 0.25
user 0.25
user 0.26
user 0.27

real 2.77
user 2.50
sys 0.74

TIFF (ten pages)
real 0.43
user 0.76
sys 0.06

TIFF (list with ten single page images)
real 0.43
user 0.73
sys 0.09

Old engine (--oem 0)

PNG (ten times)
real 2.70
user 1.98
sys 0.69

TIFF (ten times single page)
real 2.68
user 1.94
sys 0.70

TIFF (ten pages)
real 0.51
user 0.42
sys 0.08

TIFF (list with ten single page images)
real 0.52
user 0.43
sys 0.08

LSTM engine with large traineddata (--oem 1 -l mya)

PNG (ten times)
real 2.21
user 1.87
sys 1.17

TIFF (ten times single page)
real 2.20
user 1.82
sys 1.20

TIFF (ten pages)
real 0.49
user 1.12
sys 0.11

TIFF (list with ten single page images)
real 0.47
user 1.04
sys 0.11

Old engine with large traineddata (--oem 0 -l mya)

PNG (ten times)
real 18.42
user 16.26
sys 2.10

TIFF (ten times single page)
real 19.07
user 16.20
sys 2.28

TIFF (ten pages)
real 10.69
user 10.46
sys 0.22

TIFF (list with ten single page images)
real 10.78
user 10.53
sys 0.23

jbreiden · 2017-05-19T22:30:33Z

Amitdo, thanks for adding me to this Request For Comments.

I think it is a very good idea to change the text output format to use the form feed character (U+000C) to mark page boundaries. Hopefully this is very easy.

Reasonable people can disagree, but I don't think Tesseract should support an output base parameter with placeholders like page number. There are a lot of combinations already between inputs and outputs: single page input images, multipage input images, lists of images in a file, lists of images on stdin, streaming, various output formats. Combinations are tricky, and that is a big reason why we still haven't restored the "OCR to memory buffer" feature that has been mentioned so many times.

If this is mostly about increasing throughput by eliminating initialization time, a common thing to do is to create an "OCR service" where a warmed up Tesseract daemon runs all the time. This type of program would make calls to libtesseract, but is otherwise a separate program, not an additional feature of tesseractmain.cpp.
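
The general shape of such a service can be sketched in Python as a worker that pays the initialization cost once and then serves many requests. Everything here is illustrative: `init_engine` is a hypothetical callable standing in for whatever libtesseract binding a real service would use, and `serve` and `start_worker` are made-up names.

```python
import queue
import threading


def serve(init_engine, jobs, results):
    """Initialize the engine once, then reuse it for every queued job.

    `init_engine` is a hypothetical stand-in for the expensive setup
    (for example loading traineddata); it must return a function that
    maps an image path to recognized text.
    """
    recognize = init_engine()  # paid once, not once per image
    while True:
        image = jobs.get()
        if image is None:  # sentinel: shut the worker down
            break
        results.put((image, recognize(image)))


def start_worker(init_engine):
    """Start the warm worker in a background thread."""
    jobs, results = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=serve, args=(init_engine, jobs, results), daemon=True
    )
    worker.start()
    return worker, jobs, results
```

A client would put image paths on `jobs`, read `(image, text)` pairs from `results`, and send `None` to stop the worker.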

Shreeshrii · 2017-05-25T13:14:25Z

tesseract writes the file names to console, these can be combined with the output.

tesseract list.txt stdout > output.txt 2>&1

or

tesseract list.txt stdout -c include_page_breaks=1 > output.txt 2>&1
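
The same redirection can be done from a script. A small Python sketch, assuming `tesseract` is on `PATH` when it is actually used; the function name `run_with_merged_output` is made up for illustration.

```python
import subprocess


def run_with_merged_output(command, output_path):
    """Run `command`, writing stdout and stderr to one file (like `> file 2>&1`)."""
    with open(output_path, "wb") as out:
        completed = subprocess.run(command, stdout=out, stderr=subprocess.STDOUT)
    return completed.returncode
```

For example, `run_with_merged_output(["tesseract", "list.txt", "stdout"], "output.txt")` mirrors the first command above. How the file names and the OCR text interleave in the file depends on how the process buffers its two streams.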

Shreeshrii · 2017-06-08T07:50:43Z

I think it is a very good idea to change the text output format to use the form feed character (U+000C) to mark page boundaries. Hopefully this is very easy.

@stweil PR, please!

stweil · 2017-06-12T09:04:56Z

Yesterday I had a look at the implementation to see where I could add the page separator and found that it is already there:

The parameter include_page_breaks enables a page separator string in output text after each image / page. It is disabled by default.

The parameter page_separator sets the string used as page separator. It is set to the form feed character by default.

So the desired behavior is achieved by tesseract multipage.tif /tmp/multipage -c include_page_breaks=1. It adds the FF character after each page (also after the last page which would not be necessary).
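
Text produced this way can be cut back into pages by splitting on the form feed. A minimal Python sketch, assuming the default separator; the name `split_pages` is made up for illustration.

```python
FORM_FEED = "\f"  # ASCII 0x0C, the default page_separator per the comment above


def split_pages(text, separator=FORM_FEED):
    """Split OCR text into pages on the page separator.

    A separator after the last page yields a trailing empty chunk,
    which is dropped.
    """
    pages = text.split(separator)
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```

Dropping the trailing chunk accounts for the separator that is written after the last page.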

I noticed that Tesseract also adds an empty line at the end of each page. Do we need / want that? I'd prefer to get rid of it.

Shreeshrii · 2017-06-22T09:23:49Z

I think the question is whether adding of page breaks should be the default in text mode, similar to HOCR or PDF.

If FF is added after each page then the empty line may not be required.

stweil · 2017-06-23T12:45:22Z

I suggest removing the include_page_breaks parameter, removing the empty line at the end of each page, and always using the page_separator parameter. Then each page will be terminated by the FF character by default for text output. Setting page_separator to the LF character would restore the old behaviour, setting it to an empty string would omit page separators.
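
The three cases can be modelled in a few lines of Python. This is only a sketch of the proposed semantics, not Tesseract's code, and `render_text` is a made-up name.

```python
def render_text(pages, page_separator="\f"):
    """Terminate every page with `page_separator` (the proposed behaviour)."""
    return "".join(page + page_separator for page in pages)


pages = ["first page\n", "second page\n"]
default_output = render_text(pages)        # FF after each page
old_style_output = render_text(pages, "\n")  # empty line after each page
plain_output = render_text(pages, "")      # no page separators
```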

Would that be fine for everybody? @theraysmith?

Shreeshrii · 2017-09-11T13:36:08Z

8bb5a89 by @stweil

Don't add empty line to text output
Empty lines in text output are needed to separate paragraphs,
but there should not be an empty line at the end of the text.

What about the other changes?

stweil · 2017-09-11T14:00:22Z

There was no answer to my previous suggestion. If people agree, I'll prepare a pull request which removes include_page_breaks and which always uses the page_separator parameter.

amitdo · 2017-09-16T20:01:18Z

4c7c960

https://web.archive.org/web/20160626112213/http://code.google.com/p/tesseract-ocr/issues/detail?id=1417

https://groups.google.com/d/msg/tesseract-dev/VsgJ9R-cTQ0/OMeDjYWoAdQJ

amitdo · 2017-09-16T20:06:31Z

My question is: Are you sure that any text editor can handle form feed?

stweil · 2017-09-16T20:12:50Z

The ones which I know (more than 10) can handle form feed. So do all printers (which really do a form feed).

amitdo · 2017-09-16T20:33:53Z

Including Notepad?

See the discussion which led to the patch:

https://groups.google.com/forum/#!msg/tesseract-dev/VsgJ9R-cTQ0/OMeDjYWoAdQJ

stweil · 2017-09-17T19:43:32Z

Notepad cannot be used reasonably with text files which use the common LF line endings – it expects CRLF. So it does not work with text files generated by Tesseract, and FF is only an additional detail. Maybe that's why I did not count Notepad as an editor.

As I suggested to keep the page_separator parameter, it would still be possible to use the tricks mentioned in the discussion which you cited.

amitdo · 2017-09-17T19:54:39Z

There was no answer to my previous suggestion. If people agree, I'll prepare a pull request which removes include_page_breaks and which always uses the page_separator parameter.

I agree :-)

jbreiden · 2017-09-19T02:35:11Z

I agree.

Shreeshrii · 2017-09-19T06:43:17Z

Thanks!

Shreeshrii mentioned this issue May 17, 2017

RFC: Add initial support for traineddata files in compressed archive formats (don't merge) #911

Closed

amitdo mentioned this issue May 26, 2017

Multi-Page Input for the CLI mittagessen/kraken#43

Closed

stweil mentioned this issue Sep 19, 2017

Remove Tesseract parameter "include_page_breaks" and use FF by default #1140

Merged

Shreeshrii closed this as completed Sep 19, 2017

amitdo mentioned this issue Jan 11, 2018

Create a PDF with multiple pages? #1268

Closed

amitdo added the RFC label Mar 21, 2021
