
Remove old code which was used for Ocropus #2957

Merged: 1 commit into tesseract-ocr:master on Apr 27, 2020

Conversation

stweil (Member) commented Apr 27, 2020

Signed-off-by: Stefan Weil [email protected]

stweil (Member, Author) commented Apr 27, 2020

This should fix issue #2926.

zdenop merged commit 23be532 into tesseract-ocr:master on Apr 27, 2020

zdenop (Contributor) commented Apr 27, 2020

thanks

amitdo (Collaborator) commented Apr 27, 2020

SetFillLatticeFunc() was not part of the Ocropus cut-line.

amitdo (Collaborator) commented Apr 27, 2020

fill_lattice_

stweil (Member, Author) commented Apr 27, 2020

@amitdo, maybe the code which refers to fill_lattice_ could be removed, too. Or is there someone who uses that?

amitdo (Collaborator) commented Apr 27, 2020

Maybe that code is/was used by the Google-internal version.

amitdo (Collaborator) commented Apr 27, 2020

Here's another SetSomethingFunc:

void TessBaseAPI::SetProbabilityInContextFunc(ProbabilityInContextFunc f) { ... }

It sets the member probability_in_context_.
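
For reference, the hook type that this setter installs appears to be a member-function pointer on Dict; the exact typedef lives in dict.h and should be checked against the source, but it has roughly this shape:

```cpp
// Approximate shape of the hook type (verify against dict.h).
// It returns P(character | context); both strings are UTF-8.
typedef double (Dict::*ProbabilityInContextFunc)(const char* lang,
                                                 const char* context,
                                                 int context_bytes,
                                                 const char* character,
                                                 int character_bytes);
```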

https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/YsJogVVFbGk

Ray, 9/9/14:

There is no code in tesseract to implement or train an n-gram model, only to make use of one.

There is a "hook" function and code to set it to an external function.
TessBaseAPI::SetProbabilityInContextFunc is used to set it.

We experimented with it in Google with some existing Google-specific n-gram library. It didn't help very much in most languages, although it does in cases where the dictionary is particularly weak for some reason, like CJK.

A single function is required that returns the probability of the current "character" (in general a utf-8 string) given the context of a previous utf-8 string.
It is fairly simple to write and train, given appropriate training data.

The difficulty is with balancing the strength of the n-gram model and the shape classifier confidences. For speech recognition, where typical system accuracies are 80-90%, it is easy to combine language models with a poor-accuracy phoneme classifier. In OCR, where typical word accuracies exceed 95% (for alphabetic, word-segmented languages), language models of any kind can easily start replacing infrequent but correct results with more frequent but wrong ones, and nobody has properly solved the problem in a way that is mathematically and statistically rigorous. Looking at OCR output, it is easy to spot cases where language models could help, but not so easy to see the already-correct results that would be made worse.
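
To make the description above concrete, here is a minimal, self-contained sketch of the kind of single function Ray describes: it returns P(character | context) from a table of counts with add-one smoothing. It is purely illustrative; the class and method names are hypothetical, and it does not use Tesseract's actual ProbabilityInContextFunc typedef (which, as noted above, appears to be a Dict member-function pointer).

```cpp
#include <map>
#include <string>

// Hypothetical n-gram context model, for illustration only
// (not part of Tesseract).
class ContextModel {
 public:
  // Record one (context, character) observation during training.
  void Add(const std::string& context, const std::string& character) {
    ++counts_[context][character];
    ++totals_[context];
  }

  // P(character | context) with add-one smoothing. Both arguments are
  // UTF-8 strings, as in Ray's description; a real model would normally
  // key only on the last few code points of the context.
  double ProbabilityInContext(const std::string& context,
                              const std::string& character) const {
    double count = 0.0, total = 0.0;
    auto row = counts_.find(context);
    if (row != counts_.end()) {
      total = totals_.at(context);
      auto cell = row->second.find(character);
      if (cell != row->second.end()) count = cell->second;
    }
    const double kAlphabetSize = 256.0;  // assumed symbol inventory size
    return (count + 1.0) / (total + kAlphabetSize);
  }

 private:
  std::map<std::string, std::map<std::string, double>> counts_;
  std::map<std::string, double> totals_;
};
```

The hard part Ray points to is not this function but how its score is combined with the shape classifier's confidence; one common, generic recipe (not necessarily what Tesseract or the Google-internal code did) is a weighted log-linear mix, score = log P_shape + lambda * log P_ngram, where lambda must be tuned so the language model does not overrule infrequent but correct results.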

amitdo (Collaborator) commented May 4, 2020

@stweil,

GetPageRes() is used in unittest/baseapi_test.cc

stweil (Member, Author) commented May 4, 2020

Thank you. Sorry that I missed that. Pull request #2962 should fix it.
