Tesseract issue (again) #495

Ramon-zaro · 2018-01-26T06:45:43Z

Hi David,
As you might remember, I am running FSCrawler 2.3 on a Windows machine to index a large number of document files. It is working like a dream. I can't thank you enough for this tool!

I could not, however, get FSCrawler to recognize Tesseract as the release notes say it should. Tesseract runs well enough on its own from the command line.

Now, I know you mentioned earlier that you are not familiar with how Tika works under the hood and that you can't help on this issue. Is that still the case?
If so, could you just tell me which version of Tika you are using and how, so that I can ask the right questions to the Tika people?

shadiakiki1986 · 2018-01-26T07:06:24Z

In the logs of the most recent Travis build of docker-fscrawler (line 3343), you can see that Tesseract parsed the content of the PDF file properly with FSCrawler.

Also, in the same logs, here are the related Tika dependencies for FSCrawler 2.5-SNAPSHOT:

inflating: fscrawler-2.5-SNAPSHOT/lib/tika-core-1.16.jar  
inflating: fscrawler-2.5-SNAPSHOT/lib/tika-parsers-1.16.jar  
inflating: fscrawler-2.5-SNAPSHOT/lib/tika-langdetect-1.16.jar

shadiakiki1986 · 2018-01-26T07:10:27Z

Btw maybe if you update from 2.3 to 2.5 it'll fix your problem?
You may also need to update your Elasticsearch installation from 5 to 6.
What Elasticsearch version are you using with fscrawler 2.3?
Pay attention to compatibility between fscrawler 2.5 and older versions of ES < 6

dadoonet · 2018-01-26T08:07:38Z

FSCrawler 2.5 should work with Elasticsearch 5.x. Only the tests are not working well, AFAIK.

dadoonet · 2018-01-26T09:34:24Z

> I could not however get fscrawler to recognize tesseract as the release notes say it should. tesseract runs well enough on its own from command line.

This is strange. I think I should add some options to set exactly the path to tesseract instead of relying only on PATH.

Ramon-zaro · 2018-01-26T15:21:21Z

Thanks a lot for the responses.
I have no problem moving to 6.x and reindexing, provided 2.5 solves this issue.

Setting the path exactly should help. I seem to remember reading somewhere about changes in PATH behaviour in recent Tesseract releases.

## OCR Path

If your Tesseract application is not available in the default system PATH, you can define the path to use by setting the `fs.ocr.path` property in your `~/.fscrawler/test/_settings.json` file:

```json
{
  "name" : "test",
  "fs" : {
    "url" : "/path/to/data/dir",
    "ocr" : {
      "path": "/path/to/tesseract/executable"
    }
  }
}
```

When you set it, it's highly recommended to [set the data path for Tesseract](#ocr-data-path).

## OCR Data Path

Set the path to the 'tessdata' folder, which contains language files and config files, if Tesseract cannot be automatically detected. You can define the path to use by setting the `fs.ocr.data_path` property in your `~/.fscrawler/test/_settings.json` file:

```json
{
  "name" : "test",
  "fs" : {
    "url" : "/path/to/data/dir",
    "ocr" : {
      "path": "/path/to/tesseract/executable",
      "data_path": "/path/to/tesseract/tessdata"
    }
  }
}
```

Closes #495.
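Since the original report is about a Windows machine, here is a minimal Python sketch of how such a settings file could be generated so that backslashes in Windows paths end up correctly escaped in the JSON. The helper name `build_settings` and all the example paths are hypothetical and not taken from this thread. Whether FSCrawler then detects Tesseract on Windows is not confirmed here.

```python
import json


def build_settings(name, data_dir, tesseract_exe, tessdata_dir):
    """Return a settings dict using the fs.ocr.path and fs.ocr.data_path keys
    described above. This helper is illustrative, not part of FSCrawler."""
    return {
        "name": name,
        "fs": {
            "url": data_dir,
            "ocr": {
                "path": tesseract_exe,
                "data_path": tessdata_dir,
            },
        },
    }


# Hypothetical Windows locations; replace them with the real ones.
settings = build_settings(
    "test",
    r"C:\data\docs",
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files\Tesseract-OCR\tessdata",
)

# json.dumps escapes each backslash as a double backslash, as JSON requires.
text = json.dumps(settings, indent=2)
print(text)
```

Writing the file by hand with single backslashes would produce invalid JSON escapes, so generating it (or using forward slashes) avoids one possible source of error.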

Ramon-zaro · 2018-05-07T15:11:41Z

Just wanted to report that I have continued to face the same issue, even with the latest snapshot of FSCrawler 2.5 on a Windows machine.

I have set the path to the Tesseract executable in the FSCrawler settings, but FSCrawler still gives the message "But Tesseract is not installed so we won't run OCR".

Ref. @shadiakiki1986 @dadoonet

dadoonet · 2018-05-28T08:12:57Z

@Ramon-zaro Could you open a new issue and describe exactly your configuration file in it?
I did not test on Windows so that might be a problem.

TajinderSaini · 2019-04-03T09:09:41Z

Did Tesseract work on Windows with FSCrawler? I tried FSCrawler 2.5 and 2.6 with ES 6.5, but I get the same issue. Any ideas?
11:08:41,431 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated for PDF documents
11:08:41,446 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
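For anyone trying to narrow this down, a small Python sketch like the one below can show whether a child process is able to find and launch Tesseract at all, either through PATH or through an explicit path. It does not reproduce FSCrawler's or Tika's own detection logic, which is not shown in this thread. The helper name `probe_executable` is hypothetical, and the only Tesseract feature assumed is its `--version` flag.

```python
import shutil
import subprocess


def probe_executable(name_or_path, version_flag="--version"):
    """Return (resolved_path, first_line_of_version_output).

    resolved_path is None when the executable cannot be found on PATH
    (or at the explicit path given); the version line is None when the
    executable was found but could not be run.
    """
    resolved = shutil.which(name_or_path)
    if resolved is None:
        return None, None
    try:
        result = subprocess.run(
            [resolved, version_flag],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return resolved, None
    # Some tools print their version on stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    first_line = output.splitlines()[0] if output else ""
    return resolved, first_line


if __name__ == "__main__":
    path, version = probe_executable("tesseract")
    if path is None:
        print("tesseract was not found on PATH for this process")
    else:
        print(f"found: {path}")
        print(f"version output: {version}")
```

Running it from the same console and user account that starts FSCrawler shows the PATH that process would inherit. Passing the full path from `fs.ocr.path` instead of `"tesseract"` checks that exact file.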

dadoonet mentioned this issue Feb 19, 2018

Allow setting Tesseract path to executable and data #520

Merged

dadoonet closed this as completed in #520 Feb 19, 2018
