Inconsistent OCR #350
Comments
That's unfortunately correct.
I think that's the best option at the moment. Because of how Tesseract handles OCR, there is no way to guarantee the same result across different scans (although it is usually very consistent). I can work on the fix today.
I’ve been poking around with
Is this a good route to pursue? Other packages include textract and tika, but they show similar problems (note also that tika runs through a server, which massively increases runtime).
@erica02139 has been editing the OCR text of some important handwritten documents by hand to get these documents included. How do we ensure that the edits don't get overwritten next time we run the OCR mechanism? I've made a few edits myself.
I've thought of that briefly. Erica's documents should be easy to track, and we could add a column to the metadata sheet that's checked if we have hand-corrected text for a document, so it doesn't get overwritten when the OCR is re-run. We could maybe even store the hand-corrected OCR in the Google Sheet.
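For what it's worth, the "checked column" idea could look roughly like the sketch below. Everything here is an assumption for illustration: the function name `merge_ocr_results`, the `hand_corrected` mapping (standing in for the new metadata column), and the use of plain dicts keyed by document ID instead of the real metadata sheet and text files.

```python
from typing import Dict, Mapping


def merge_ocr_results(
    existing: Mapping[str, str],
    fresh: Mapping[str, str],
    hand_corrected: Mapping[str, bool],
) -> Dict[str, str]:
    """Return the text to store per document after an OCR re-run.

    Documents flagged as hand-corrected keep their existing text;
    every other document takes the fresh OCR output.
    All names are hypothetical; `hand_corrected` stands in for the
    proposed metadata column.
    """
    merged = dict(existing)
    for doc_id, text in fresh.items():
        if hand_corrected.get(doc_id, False) and doc_id in existing:
            continue  # protect manual edits from being overwritten
        merged[doc_id] = text
    return merged
```

Only documents that are both flagged and already have stored text are skipped, so a flagged document with no text yet would still get an initial OCR pass.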
Example:
Search for "meme". http://127.0.0.1:8000/archives/doc/3_19_pmm_memo_re_709_1960_04_29_1_19 is the first result.
There is no "meme" in the text, only "memo". Highlighting the sentence "Status of programming memo and revision of machine shut-down date to late July." and copy-pasting it elsewhere gives the correct text.
Checked the data/processed_pdfs folder. It says "Status of programming meme", probably due to an OCR error.
Seems like PDF preview and search have different opinions on the OCR?
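One way to confirm where the two sources disagree is a word-level diff between the text copied from the PDF preview and the text stored for search. This is only a sketch using the standard library's `difflib`; the function name `word_differences` and the idea of passing both texts in as strings are assumptions, not how the project currently works.

```python
import difflib
from typing import List, Tuple


def word_differences(pdf_text: str, stored_text: str) -> List[Tuple[str, str]]:
    """Return (pdf_words, stored_words) pairs where the two texts disagree."""
    a = pdf_text.split()
    b = stored_text.split()
    # Compare word sequences rather than characters so each OCR
    # mismatch is reported as a whole word.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs
```

Run on the sentence from this example, with the PDF version first and the stored version second, it would report the single pair `("memo", "meme")`.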