
Remove old code which was used for Ocropus #2957

Merged: 1 commit into tesseract-ocr:master on Apr 27, 2020

Conversation

stweil (Member) commented Apr 27, 2020

Signed-off-by: Stefan Weil [email protected]

stweil (Member, Author) commented Apr 27, 2020

This should fix issue #2926.

zdenop merged commit 23be532 into tesseract-ocr:master on Apr 27, 2020

zdenop (Contributor) commented Apr 27, 2020

thanks

amitdo (Collaborator) commented Apr 27, 2020

SetFillLatticeFunc() was not part of the Ocropus cut-line.

amitdo (Collaborator) commented Apr 27, 2020

fill_lattice_

stweil (Member, Author) commented Apr 27, 2020

@amitdo, maybe the code which refers to fill_lattice_ could be removed, too. Or is there someone who uses that?

amitdo (Collaborator) commented Apr 27, 2020

Maybe that code is/was used by the Google-internal version.

amitdo (Collaborator) commented Apr 27, 2020

Here's another SetSomethingFunc:

void TessBaseAPI::SetProbabilityInContextFunc(ProbabilityInContextFunc f) { ... }

It sets the member probability_in_context_.
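
For reference, the hook type that this setter installs appears to be a member-function pointer on Dict; the exact typedef lives in dict.h and should be checked against the source, but it has roughly this shape:

```cpp
// Approximate shape of the hook type (verify against dict.h).
// It returns P(character | context); both strings are UTF-8.
typedef double (Dict::*ProbabilityInContextFunc)(const char* lang,
                                                 const char* context,
                                                 int context_bytes,
                                                 const char* character,
                                                 int character_bytes);
```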

https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/YsJogVVFbGk

Ray, 9/9/14:

There is no code in tesseract to implement or train an n-gram model, only to make use of one.

There is a "hook" function and code to set it to an external function.
TessBaseAPI::SetProbabilityInContextFunc is used to set it.

We experimented with it in Google with some existing Google-specific n-gram library. It didn't help very much in most languages, although it does in cases where the dictionary is particularly weak for some reason, like CJK.

A single function is required that returns the probability of the current "character" (in general a utf-8 string) given the context of a previous utf-8 string.
It is fairly simple to write and train, given appropriate training data.

The difficulty is with balancing the strength of the n-gram model and the shape classifier confidences. For speech recognition, where typical system accuracies are 80-90%, it is easy to combine language models with a poor-accuracy phoneme classifier. In OCR, where typical word accuracies exceed 95% (for alphabetic, word-segmented languages), language models of any kind can easily start replacing infrequent but correct results with more frequent but wrong ones, and nobody has properly solved the problem in a way that is mathematically and statistically rigorous. Looking at OCR output, it is easy to spot cases where language models could help, but not so easy to see the already-correct results that would be made worse.
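
To make the description above concrete, here is a minimal, self-contained sketch of the kind of single function Ray describes: it returns P(character | context) from a table of counts with add-one smoothing. It is purely illustrative; the class and method names are hypothetical, and it does not use Tesseract's actual ProbabilityInContextFunc typedef (which, as noted above, appears to be a Dict member-function pointer).

```cpp
#include <map>
#include <string>

// Hypothetical n-gram context model, for illustration only
// (not part of Tesseract).
class ContextModel {
 public:
  // Record one (context, character) observation during training.
  void Add(const std::string& context, const std::string& character) {
    ++counts_[context][character];
    ++totals_[context];
  }

  // P(character | context) with add-one smoothing. Both arguments are
  // UTF-8 strings, as in Ray's description; a real model would normally
  // key only on the last few code points of the context.
  double ProbabilityInContext(const std::string& context,
                              const std::string& character) const {
    double count = 0.0, total = 0.0;
    auto row = counts_.find(context);
    if (row != counts_.end()) {
      total = totals_.at(context);
      auto cell = row->second.find(character);
      if (cell != row->second.end()) count = cell->second;
    }
    const double kAlphabetSize = 256.0;  // assumed symbol inventory size
    return (count + 1.0) / (total + kAlphabetSize);
  }

 private:
  std::map<std::string, std::map<std::string, double>> counts_;
  std::map<std::string, double> totals_;
};
```

The hard part Ray points to is not this function but how its score is combined with the shape classifier's confidence; one common, generic recipe (not necessarily what Tesseract or the Google-internal code did) is a weighted log-linear mix, score = log P_shape + lambda * log P_ngram, where lambda must be tuned so the language model does not overrule infrequent but correct results.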

amitdo (Collaborator) commented May 4, 2020

@stweil,

GetPageRes() is used in unittest/baseapi_test.cc

stweil (Member, Author) commented May 4, 2020

Thank you. Sorry that I missed that. Pull request #2962 should fix it.
