RFC: Multipage Feature #928

Closed
Shreeshrii opened this issue May 17, 2017 · 26 comments

@Shreeshrii
Collaborator

Copied from: #911 (comment)

@stweil - How would I start Tesseract to process page1.png and page2.png in a single run?

@amitdo - Prepare a text file that has the path to each image:

path/to/1.png
path/to/2.png
path/to/3.tiff

Save it, and then give its name as input file to Tesseract.

tesseract savedlist output
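
A minimal sketch of how such a list file could be produced from a script, assuming the one-path-per-line format shown above. The helper names `write_list_file` and `tesseract_command` are made up for illustration and are not part of Tesseract.

```python
def write_list_file(image_paths, list_path):
    """Write one image path per line, the list format described above."""
    lines = [str(p) for p in image_paths]
    with open(list_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines


def tesseract_command(list_path, output_base):
    """Build the argv for `tesseract <listfile> <outputbase>`."""
    return ["tesseract", str(list_path), str(output_base)]
```

If the `tesseract` binary is on the PATH, the returned argv could be passed to `subprocess.run(..., check=True)` to start the single run.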

@stweil - Thank you, good to know that. It looks like the ChangeLog, other documentation and the program help text need an update.

Currently all pages are written to one output file (per format). Some formats include page information (hOCR, PDF). Others like TXT don't, but could use a page separator character (ASCII 0x0C = FF). Would it help to support an output base parameter with placeholders like page number or image base name to generate one output file per input image?

The multi-page feature was added in 2014 by commit 25a8c7b.

@Shreeshrii
Collaborator Author

@stweil

Would it help to support an output base parameter with placeholders like page number or image base name to generate one output file per input image?

I can think of two cases for using this.

  1. A document for which each page is available as a separate image.
  2. Separate document images which need to be processed in a batch.

For the first, current processing is sufficient, maybe with an additional page separator character.

For the second, it would be useful to have the option you suggest.

@stweil
Member

stweil commented May 17, 2017

An implementation could also get pairs of image file name and output base from the file list instead of image files only. That would allow more flexibility.

@Shreeshrii
Collaborator Author

savedlist

Sachitra_Saraswati_Prasad_004932_HR-g4_page0001_1L.tif
Sachitra_Saraswati_Prasad_004932_HR-g4_page0002_2R.tif
Sachitra_Saraswati_Prasad_004932_HR-g4_page0004_2R.tif
Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_1L.tif
Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_2R.tif
  1. tesseract savedlist output

Current implementation creates output file output.txt with concatenated OCR text.

Console displays filenames being processed

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 0 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0001_1L.tif
Page 1 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0002_2R.tif
Page 2 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0004_2R.tif
Page 3 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_1L.tif
Page 4 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_2R.tif
  2. tesseract savedlist savedlist

Suggested option for additional implementation, to create separate output files for each line in the savedlist list file.

The syntax of the command could be different - either not specifying the outputbase or using the same name as the listfile or something else - whatever is easy to implement.

The output files would be named after the filenames in the listfile: same basename, with a changed extension.

The listfile in the above two cases can be created with, e.g., ls -1 *.tif > savedlist
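
Until something like this exists, a wrapper script can get one output file per image. The sketch below is only an illustration under stated assumptions: `output_base_for` and `per_image_commands` are hypothetical helper names, and the output base is simply the image basename without its extension (Tesseract appends the format extension itself).

```python
from pathlib import Path


def output_base_for(image_path, out_dir="."):
    """Return an output base: the image basename without extension, in out_dir."""
    return str(Path(out_dir) / Path(image_path).stem)


def per_image_commands(list_path, out_dir="."):
    """Build one `tesseract <image> <outputbase>` argv per list-file line."""
    commands = []
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        image = line.strip()
        if image:  # skip blank lines
            commands.append(["tesseract", image, output_base_for(image, out_dir)])
    return commands
```

Note that this starts Tesseract once per image, so it does not share initialization between images the way a single run over the list file does.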

  3. Option suggested above by @stweil:

get pairs of image file name and output base from the file list instead of image files only.

This would require change to listfile format.

Optionally, if output base is not given separately in listfile, then it will create file based on filenames in list.
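
To make the idea concrete, here is a sketch of how one line of such an extended listfile could be parsed. The format is purely hypothetical: it assumes the image path and the optional output base are separated by a tab, which is not something Tesseract supports today, and `parse_list_line` is an invented name.

```python
from pathlib import Path


def parse_list_line(line):
    """Parse 'image<TAB>outputbase' (hypothetical format); the base is optional."""
    image, _, base = line.strip().partition("\t")
    if not base:
        # No output base given: fall back to the image path without extension.
        base = str(Path(image).with_suffix(""))
    return image, base
```

A line with only an image path keeps working, so existing listfiles would stay valid under this assumed format.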

@Shreeshrii
Collaborator Author

@stweil Please compare the time taken for the list option vs multipage tif vs processing each file separately in a loop to see what may be a recommended option.

Here is the result of my single test using Hindi traineddata.

 ----------------------------------------------------
Sachitra_Saraswati_Prasad_004932_HR-g4_page0001_1L.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1

real    0m24.880s
user    1m11.375s
sys     0m0.813s
Sachitra_Saraswati_Prasad_004932_HR-g4_page0002_2R.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1

real    0m37.305s
user    1m49.734s
sys     0m0.891s
Sachitra_Saraswati_Prasad_004932_HR-g4_page0004_2R.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1

real    0m43.672s
user    2m1.156s
sys     0m1.078s
Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_1L.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1

real    0m39.755s
user    1m46.578s
sys     0m0.922s
Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_2R.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1

real    0m23.214s
user    1m6.891s
sys     0m0.781s
-------------------------------------------------
list.txt
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 0 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0001_1L.tif
Page 1 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0002_2R.tif
Page 2 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0004_2R.tif
Page 3 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_1L.tif
Page 4 : Sachitra_Saraswati_Prasad_004932_HR-g4_page0005_2R.tif

real    3m26.132s
user    9m32.531s
sys     0m2.594s

---------------------------------------------------
multitest.tif
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Page 2
Page 3
Page 4
Page 5

real    3m35.027s
user    9m44.297s
sys     0m2.781s

----------------------

@Shreeshrii
Collaborator Author

Page numbering with the list option starts at 0. It should be changed to start at 1 (as with a multipage TIFF).

If a multipage tiff is listed as one of the files in the listfile, only its first page is processed.

@amitdo
Collaborator

amitdo commented May 18, 2017

CC: @jbreiden

@jbarlow83

The sum of the 5 individual tesseract processes is ~2m 49s real time, quicker than batching the images in a single process (3m 26s for the list, 3m 35s for the multipage TIFF). That's not quite what we were expecting to see. Does anyone know why?
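
The comparison can be checked by adding up the `real` values reported in the test above. This is a small sketch; the values are copied from that output and `parse_real` is just an illustrative helper.

```python
import re


def parse_real(value):
    """Convert a `time` value such as '3m26.132s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# `real` times of the five separate runs, copied from the test output above.
separate_runs = ["0m24.880s", "0m37.305s", "0m43.672s", "0m39.755s", "0m23.214s"]
total_separate = sum(parse_real(v) for v in separate_runs)

# `real` times of the single runs over the list file and the multipage TIFF.
list_run = parse_real("3m26.132s")
multipage_run = parse_real("3m35.027s")
```

Comparing `total_separate` with `list_run` and `multipage_run` gives the gap being asked about.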

@Shreeshrii
Collaborator Author

@jbarlow83

I ran the test just once under WSL on Windows 10 for the Hindi language; other processes running at the same time may have affected the numbers. Hence my request to @stweil to test and compare the features.

Do you find that individual files get processed faster?

@amitdo
Collaborator

amitdo commented May 19, 2017

@jbarlow83

This might be related to the adaptive learning that Tesseract does.

@stweil
Member

stweil commented May 19, 2017

Here are my results for a simple hello world image.

Summary for default language (identical to -l eng):

  • PNG and TIFF show similar performance
  • TIFF multi-page and list of single page TIFF show similar performance and are much faster than calling Tesseract for each single page
  • LSTM takes more CPU time than old engine for this example

Summary for language with very large traineddata:

  • no text recognized
  • old engine takes much longer (otherwise similar to the results above)
  • LSTM comparable to result with default language (but no text recognized)

LSTM (--oem 1)

PNG (ten times)
user 0.23
user 0.23
user 0.23
user 0.23
user 0.23
user 0.24
user 0.24
user 0.25
user 0.25
user 0.25

real 2.76
user 2.42
sys 0.77

TIFF (ten times single page)
user 0.23
user 0.24
user 0.24
user 0.24
user 0.24
user 0.25
user 0.25
user 0.25
user 0.26
user 0.27

real 2.77
user 2.50
sys 0.74

TIFF (ten pages)
real 0.43
user 0.76
sys 0.06

TIFF (list with ten single page images)
real 0.43
user 0.73
sys 0.09

Old engine (--oem 0)

PNG (ten times)
real 2.70
user 1.98
sys 0.69

TIFF (ten times single page)
real 2.68
user 1.94
sys 0.70

TIFF (ten pages)
real 0.51
user 0.42
sys 0.08

TIFF (list with ten single page images)
real 0.52
user 0.43
sys 0.08

LSTM engine with large traineddata (--oem 1 -l mya)

PNG (ten times)
real 2.21
user 1.87
sys 1.17

TIFF (ten times single page)
real 2.20
user 1.82
sys 1.20

TIFF (ten pages)
real 0.49
user 1.12
sys 0.11

TIFF (list with ten single page images)
real 0.47
user 1.04
sys 0.11

Old engine with large traineddata (--oem 0 -l mya)

PNG (ten times)
real 18.42
user 16.26
sys 2.10

TIFF (ten times single page)
real 19.07
user 16.20
sys 2.28

TIFF (ten pages)
real 10.69
user 10.46
sys 0.22

TIFF (list with ten single page images)
real 10.78
user 10.53
sys 0.23
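
To put the "much faster" in the summary into numbers, the ratio between ten separate single-page TIFF runs and one ten-page TIFF run can be computed from the `real` times above. This is only arithmetic on the posted figures for a hello world image, not a general performance claim.

```python
# (ten separate single-page TIFF runs, one ten-page TIFF run), real seconds,
# copied from the measurements above.
real_times = {
    "LSTM, default language": (2.77, 0.43),
    "old engine, default language": (2.68, 0.51),
    "LSTM, -l mya": (2.20, 0.49),
    "old engine, -l mya": (19.07, 10.69),
}


def speedup(separate, batched):
    """How many times faster the single batched run is."""
    return separate / batched


ratios = {name: round(speedup(*times), 2) for name, times in real_times.items()}
```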

@jbreiden
Contributor

Amitdo, thanks for adding me to this Request For Comments.

I think it is a very good idea to change the text output format to use the form feed character (U+000C) to mark page boundaries. Hopefully this is very easy.

Reasonable people can disagree, but I don't think Tesseract should support an output base parameter with placeholders like page number. There are a lot of combinations already between inputs and outputs: single page input images, multipage input images, lists of images in a file, lists of images on stdin, streaming, various output formats. Combinations are tricky, and they are a big reason why we still haven't restored the "OCR to memory buffer" feature that has been mentioned so many times.

If this is mostly about increasing throughput by eliminating initialization time, a common thing to do is to create an "OCR service" where a warmed up Tesseract daemon runs all the time. This type of program would make calls to libtesseract, but is otherwise a separate program, not an additional feature of tesseractmain.cpp.

@Shreeshrii
Collaborator Author

Shreeshrii commented May 25, 2017

Tesseract writes the file names to the console; these can be combined with the output.

tesseract list.txt stdout > output.txt 2>&1

or

tesseract list.txt stdout -c include_page_breaks=1 > output.txt 2>&1

@Shreeshrii
Collaborator Author

I think it is a very good idea to change the text output format to use the form feed character (U+000C) to mark page boundaries. Hopefully this is very easy.

@stweil PR, please!

@stweil
Member

stweil commented Jun 12, 2017

Yesterday I had a look at the implementation to see where I could add the page separator and found that it is already there:

The parameter include_page_breaks enables a page separator string in output text after each image / page. It is disabled by default.

The parameter page_separator sets the string used as page separator. It is set to the form feed character by default.

So the desired behavior is achieved by tesseract multipage.tif /tmp/multipage -c include_page_breaks=1. It adds the FF character after each page (also after the last page which would not be necessary).
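
For anyone consuming such output, splitting the text back into pages could look like the sketch below. It assumes the default form feed separator and takes into account that the separator also follows the last page; `split_pages` is an illustrative name.

```python
def split_pages(text, page_separator="\f"):
    """Split Tesseract text output into pages on the page separator."""
    pages = text.split(page_separator)
    # The separator after the last page leaves an empty trailing chunk.
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```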

I noticed that Tesseract also adds an empty line at the end of each page. Do we need / want that? I'd prefer to get rid of it.

@Shreeshrii
Collaborator Author

I think the question is whether adding page breaks should be the default in text mode, similar to hOCR or PDF.

If FF is added after each page then the empty line may not be required.

@stweil
Member

stweil commented Jun 23, 2017

I suggest removing the include_page_breaks parameter, removing the empty line at the end of each page, and always using the page_separator parameter. Then each page will be terminated by the FF character by default for text output. Setting page_separator to the LF character would restore the old behaviour; setting it to an empty string would omit page separators.
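
The proposed semantics can be modelled in a few lines. This is a toy model of the suggestion, not Tesseract's actual code; `render_text_output` is an invented name and the pages are plain strings.

```python
def render_text_output(pages, page_separator="\f"):
    """Terminate every page with the separator, as proposed above."""
    return "".join(page + page_separator for page in pages)


pages = ["first page\n", "second page\n"]
default_output = render_text_output(pages)        # FF after each page
old_style_output = render_text_output(pages, "\n")  # empty line after each page
plain_output = render_text_output(pages, "")      # no page separators at all
```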

Would that be fine for everybody? @theraysmith?

@Shreeshrii
Collaborator Author

8bb5a89 by @stweil

Don't add empty line to text output
Empty lines in text output are needed to separate paragraphs,
but there should not be an empty line at the end of the text.

What about the other changes?

@stweil
Member

stweil commented Sep 11, 2017

There was no answer to my previous suggestion. If people agree, I'll prepare a pull request which removes include_page_breaks and which always uses the page_separator parameter.

@amitdo
Collaborator

amitdo commented Sep 16, 2017

My question is: Are you sure that any text editor can handle form feed?

@stweil
Member

stweil commented Sep 16, 2017

The ones I know (more than 10) can handle form feed. So can all printers (which really do a form feed).

@amitdo
Collaborator

amitdo commented Sep 16, 2017

Including Notepad?

See the discussion which led to the patch:

https://groups.google.com/forum/#!msg/tesseract-dev/VsgJ9R-cTQ0/OMeDjYWoAdQJ

@stweil
Member

stweil commented Sep 17, 2017

Notepad cannot be used reasonably with text files which use the common LF line endings, because it expects CRLF. So it does not work with text files generated by Tesseract, and FF is only an additional detail. Maybe that's why I did not count Notepad as an editor.

As I suggested keeping the page_separator parameter, it would still be possible to use the tricks mentioned in the discussion which you cited.

@amitdo
Collaborator

amitdo commented Sep 17, 2017

There was no answer to my previous suggestion. If people agree, I'll prepare a pull request which removes include_page_breaks and which always uses the page_separator parameter.

I agree :-)

@jbreiden
Contributor

jbreiden commented Sep 19, 2017 via email

@Shreeshrii
Collaborator Author

Thanks!
