AnalyseLayout() for tesseract.js #656

ghost · 2022-09-01T05:47:33Z

Is your feature request related to a problem? Please describe.
Currently it is not possible to perform a fast document layout analysis.

Describe the solution you'd like
The function AnalyseLayout() is present in tesseract C++ and I have seen that there is something present in the tesseract.js-core inside the glue.js file:
https://github.com/naptha/tesseract.js-core/blob/82c349860e5d0cd81449761077d0d113fdf04c1b/javascript/glue.js#L1481

The AnalyseLayout() function performs a very fast analysis of the document and returns the document segmented into boxes.

Describe alternatives you've considered
It is possible to perform layout analysis with the regular worker.recognize() function by working with the TSV output, but this requires a full recognition pass, whereas AnalyseLayout() uses a different, much faster method and can define the zones on which to later perform a worker.recognize().

Additional context
Using gImageReader with tesseract

Balearica · 2022-09-02T03:31:11Z

I have no opposition to adding this, although probably won't have time personally (in the near future). Will likely require an interested user to develop an interface. As you note, the necessary API functions do appear to be exposed already (in the glue file), so it would presumably just require building an interface around that using JavaScript.

ghost · 2022-09-14T05:14:33Z

@Balearica Thank you for the reply. Could you briefly describe where and what should be modified or added?

I will see if I can do this. I would need some directions.

Balearica · 2022-09-14T07:03:37Z

@mattiaCanevascini I have never used this particular function, but I can speak to development more broadly. The first step in exposing a new feature is cloning Tesseract.js-core and familiarizing yourself with the examples.

For example, this is a basic recognition example in Tesseract.js-core. In contrast to the recognition example in Tesseract.js, you'll note that it uses lower-level functions (calls to methods of api and TessModule). Those are the building blocks for everything in this repo. Once you understand the examples, you can work to implement a proof-of-concept using additional functions from that repo.

Virtually every Tesseract API function is already included in Tesseract.js-core (including, as you note, api.AnalyseLayout ). What those functions lack is (1) documentation and (2) a user-friendly interface. Therefore, it's a matter of figuring out how the functions work, creating a user-friendly interface, and documenting it.

ghost · 2022-09-14T18:33:57Z

@Balearica thank you for the description. It was what I was looking for. I will try :)

Balearica · 2023-05-29T18:55:42Z

I added the ability to run layout analysis but not recognition to the master branch. It is included in releases starting at v4.1.0.

Running only layout analysis requires setting the output option for the recognize method. You need to (1) disable any outputs that require running recognition [notably the formats that are true by default] and (2) set the new layoutBlocks output format to true. An example is below.

await worker.recognize(files[0], undefined, {text: false, blocks: false, hocr: false, tsv: false, layoutBlocks: true});

The layoutBlocks output format is identical to the blocks output format in structure, and allows for retrieving bounding boxes for text blocks/paragraphs/lines/etc. Only blocks can be created if recognition has been run, and only layoutBlocks can be created if recognition has been skipped. With regard to content, the only difference should be that layoutBlocks has null values for all text and confidence fields.
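
For the use case in the original request (finding zones first and recognizing them afterwards), a minimal sketch might look like the following. It assumes that the result is exposed as data.layoutBlocks, that each block carries a bbox of the form {x0, y0, x1, y1} as in the blocks output, and that worker is an already initialized Tesseract.js worker (v4.1.0 or later). The helper names bboxToRectangle, blockRectangles and recognizeByBlock are hypothetical and not part of the library.

```javascript
// Hypothetical helper: convert a bounding box ({x0, y0, x1, y1}) into the
// {left, top, width, height} shape accepted by the `rectangle` option
// of worker.recognize().
function bboxToRectangle(bbox) {
  return {
    left: bbox.x0,
    top: bbox.y0,
    width: bbox.x1 - bbox.x0,
    height: bbox.y1 - bbox.y0,
  };
}

// Hypothetical helper: one rectangle per top-level layout block.
// Returns an empty array if no layout blocks are present.
function blockRectangles(layoutBlocks) {
  return (layoutBlocks || []).map((block) => bboxToRectangle(block.bbox));
}

// Sketch of a two-pass flow. `worker` is assumed to be an initialized
// Tesseract.js worker; `image` is any input that worker.recognize() accepts.
async function recognizeByBlock(worker, image) {
  // Pass 1: layout analysis only (all recognition-based outputs disabled).
  const layout = await worker.recognize(image, undefined, {
    text: false,
    blocks: false,
    hocr: false,
    tsv: false,
    layoutBlocks: true,
  });

  // Pass 2: run recognition on each detected zone separately.
  const results = [];
  for (const rectangle of blockRectangles(layout.data.layoutBlocks)) {
    const { data } = await worker.recognize(image, { rectangle });
    results.push({ rectangle, text: data.text });
  }
  return results;
}
```

Passing the worker in as a parameter keeps the sketch independent of how the worker is created, which differs between Tesseract.js versions.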

Balearica added the enhancement label Sep 2, 2022

This was referenced May 29, 2023

Recognition being run even when disabled #769

Closed

Add ability to analyse layout without running recognition #770

Merged

Balearica closed this as completed in #770 May 29, 2023
