Tesseract issue (again) #495

Closed
Ramon-zaro opened this issue Jan 26, 2018 · 8 comments

Comments

@Ramon-zaro

Hi David,
As you might remember, I am running fscrawler 2.3 on a Windows machine to index a large number of document files. It is working like a dream. I can't thank you enough for this tool!

I could not, however, get fscrawler to recognize Tesseract as the release notes say it should. Tesseract runs well enough on its own from the command line.

Now, I know you mentioned earlier that you are not familiar with how Tika works under the hood and that you can't help with this issue. Is that still the case?
If so, could you just tell me which version of Tika you are using and how, so that I can ask the right questions to the Tika people?

@shadiakiki1986
Contributor

In the logs of the most recent Travis build of docker-fscrawler (line 3343), you can see that Tesseract parsed the content of the PDF file properly with fscrawler.

Also, in the same logs, for fscrawler 2.5-SNAPSHOT, here are the related Tika dependencies:

inflating: fscrawler-2.5-SNAPSHOT/lib/tika-core-1.16.jar  
inflating: fscrawler-2.5-SNAPSHOT/lib/tika-parsers-1.16.jar  
inflating: fscrawler-2.5-SNAPSHOT/lib/tika-langdetect-1.16.jar  

@shadiakiki1986
Contributor

Btw maybe if you update from 2.3 to 2.5 it'll fix your problem?
You may also need to update your Elasticsearch installation from 5 to 6.
What Elasticsearch version are you using with fscrawler 2.3?
Pay attention to compatibility between fscrawler 2.5 and older versions of ES < 6

@dadoonet
Owner

FSCrawler 2.5 should work with Elasticsearch 5.x. Only the tests are not working well, AFAIK.

@dadoonet
Owner

> I could not, however, get fscrawler to recognize Tesseract as the release notes say it should. Tesseract runs well enough on its own from the command line.

This is strange. I think I should add some options to set exactly the path to tesseract instead of relying only on PATH.
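
For anyone debugging the same symptom, a PATH-only lookup can be approximated with a few lines of Python. This is only a diagnostic sketch: it is not how FSCrawler or Tika perform the lookup, and the helper names `find_tesseract` and `tesseract_version` are made up for this example. Run it from the same shell or service account that starts fscrawler, since PATH can differ between the two.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of the `tesseract` executable, or None.

    `search_path` is an optional PATH-style string; when omitted, the
    PATH of the current process is searched.
    """
    return shutil.which("tesseract", path=search_path)


def tesseract_version(executable: str) -> str:
    """Run `<executable> --version` and return the first output line."""
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture both streams, wherever the version is printed
        universal_newlines=True,
        check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("tesseract was not found on PATH for this process")
    else:
        print(exe, "->", tesseract_version(exe))
```

If this prints "not found" while `tesseract` works in an interactive terminal, the two environments probably see different PATH values.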

@Ramon-zaro
Author

Thanks a lot for the responses.
I have no problem moving to 6.x and reindexing, provided 2.5 solves this issue.

Setting the path explicitly should help. I seem to remember reading somewhere about changes to PATH behaviour in recent Tesseract releases.

dadoonet added a commit that referenced this issue Feb 19, 2018
## OCR Path

If your Tesseract application is not available in the default system PATH, you can define the path to use
by setting the `fs.ocr.path` property in your `~/.fscrawler/test/_settings.json` file:

```json
{
  "name" : "test",
  "fs" : {
    "url" : "/path/to/data/dir",
    "ocr" : {
      "path": "/path/to/tesseract/executable"
    }
  }
}
```

When you set it, it's highly recommended to [set the data path for Tesseract](#ocr-data-path).

## OCR Data Path

Set the path to the 'tessdata' folder, which contains the language files and config files, if it
cannot be detected automatically. You can define the path to use
by setting the `fs.ocr.data_path` property in your `~/.fscrawler/test/_settings.json` file:

```json
{
  "name" : "test",
  "fs" : {
    "url" : "/path/to/data/dir",
    "ocr" : {
      "path": "/path/to/tesseract/executable",
      "data_path": "/path/to/tesseract/tessdata"
    }
  }
}
```

Closes #495.
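
Before starting a crawl, the two settings described above can be sanity-checked with a short Python sketch. The helpers `load_settings` and `check_ocr_settings` are hypothetical and not part of FSCrawler; they only verify that the configured locations exist on disk, and they make no assumption about whether `fs.ocr.path` should point at a file or a folder.

```python
import json
import os
from typing import List


def load_settings(filename: str) -> dict:
    """Parse a job settings file such as ~/.fscrawler/test/_settings.json."""
    with open(filename, encoding="utf-8") as handle:
        return json.load(handle)


def check_ocr_settings(settings: dict) -> List[str]:
    """Return a list of human-readable problems found in fs.ocr (empty if none)."""
    ocr = settings.get("fs", {}).get("ocr")
    if not ocr:
        return ["no fs.ocr section: Tesseract would have to be found on PATH"]

    problems = []
    path = ocr.get("path")
    if path is None:
        problems.append("fs.ocr.path is not set")
    elif not os.path.exists(path):
        problems.append("fs.ocr.path does not exist: " + path)

    data_path = ocr.get("data_path")
    if data_path is None:
        problems.append("fs.ocr.data_path is not set")
    elif not os.path.isdir(data_path):
        problems.append("fs.ocr.data_path is not a folder: " + data_path)
    return problems


if __name__ == "__main__":
    job_file = os.path.expanduser("~/.fscrawler/test/_settings.json")
    for problem in check_ocr_settings(load_settings(job_file)):
        print(problem)
```

On Windows, keep in mind that backslashes inside JSON strings must be escaped (`"C:\\Program Files\\..."`) or replaced with forward slashes. An unescaped `\P` makes `json.load` raise an error, while an unescaped `\t` is silently read as a tab character, so the stored path no longer matches the real one. Whether this plays any role in the problem reported here is not established.
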
@Ramon-zaro
Author

Ramon-zaro commented May 7, 2018

Just wanted to report that I have continued to face the same issue, even with the latest snapshot of FSCrawler 2.5 on a Windows machine.

I have set the path to the Tesseract executable in the fscrawler settings, but fscrawler still gives the message "But Tesseract is not installed so we won't run OCR".

Ref. @shadiakiki1986 @dadoonet

@dadoonet
Owner

@Ramon-zaro Could you open a new issue and describe exactly your configuration file in it?
I did not test on Windows so that might be a problem.

@TajinderSaini

Did Tesseract work on Windows with FSCrawler? I tried FSCrawler 2.5 and 2.6 with ES 6.5, but I get the same issue. Any ideas?
11:08:41,431 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated for PDF documents
11:08:41,446 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
